Eliminating duplicated records based on the Last Updated Date

darshthakkar
Valued Contributor

Continuing the discussion from Fetch data whose date is less than 10hrs from the existing date!:

Hi Team,

The discussion started on a different thread, but it made more sense to open a new topic, so here it is (thanks @koryknick for the suggestion).

Snowflake → Salesforce

In a nutshell, I’m fetching data from Snowflake and ingesting it into Salesforce with various joins, rules, and filter conditions. One of the conditions is to consider ONLY records that have been updated in the last 48 hours and update those in the Salesforce table. Existing records that have not been updated within the last 48 hours are eliminated (already achieved), and new records are ingested directly (already achieved).
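For readers following along, the 48-hour condition is essentially a timestamp comparison against "now minus 48 hours". A minimal sketch in Python, where the column name LAST_UPDATED_DATE and the sample rows are assumptions for illustration, not the actual pipeline fields:

```python
from datetime import datetime, timedelta, timezone

# Hypothetical rows as they might arrive from Snowflake; field names are assumed.
rows = [
    {"ID": "rec-001", "LAST_UPDATED_DATE": "2024-05-01T09:30:00Z"},
    {"ID": "rec-002", "LAST_UPDATED_DATE": "2024-04-20T14:00:00Z"},
]

cutoff = datetime.now(timezone.utc) - timedelta(hours=48)

def parse_ts(value: str) -> datetime:
    """Parse an ISO-8601 timestamp; adjust if the source emits a different format."""
    return datetime.fromisoformat(value.replace("Z", "+00:00"))

# Keep only records touched in the last 48 hours; everything else is dropped.
recent = [r for r in rows if parse_ts(r["LAST_UPDATED_DATE"]) >= cutoff]
print(recent)
```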

After writing the logic for records updated within the last 48 hours (thanks @bojanvelevski for your assistance on this one), I used a Copy snap and routed the data to two different channels: one takes a backup as a flat file that I can use for validation/troubleshooting, and the other updates Salesforce directly using the Salesforce Update snap.

After executing the pipeline and analyzing the flat file generated by each run, I found that records seen in the first run are still visible in subsequent runs (which is expected, since the condition is Last_Updated_Date < 48 hours), but this skews the actual count of records that were genuinely updated.
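One way to see how the counts get inflated is to diff the backup flat files from two consecutive runs. A rough sketch, assuming the backups are CSVs with an ID and a last-updated column (the file names and column names here are made up):

```python
import csv

def load_run(path: str) -> dict:
    """Map record ID -> last-updated timestamp from one run's backup file."""
    with open(path, newline="") as f:
        return {row["ID"]: row["LAST_UPDATED_DATE"] for row in csv.DictReader(f)}

previous = load_run("run_2024_05_01.csv")  # hypothetical file names
current = load_run("run_2024_05_02.csv")

# Records that reappear with an unchanged timestamp were merely carried over
# by the 48-hour window; new records or records with a newer timestamp are
# the ones that were genuinely inserted or updated.
carried_over = [rid for rid, ts in current.items() if previous.get(rid) == ts]
genuine = [rid for rid, ts in current.items() if previous.get(rid) != ts]

print(f"carried over: {len(carried_over)}, genuinely new/updated: {len(genuine)}")
```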

This will eventually be a scheduled job, so is there a way to check whether records were already updated in the previous batch and, if so, skip them and process only the newer records? The alarming part is that if we eliminate records simply because they appeared in a previous batch, and one of those records genuinely receives an update from the upstream system (i.e. Snowflake), it would be eliminated as well, which is not the expected behavior.

Any ideas are welcome here; even if we don’t reach a conclusion, brainstorming would help me a lot.
Thanks in advance to everyone for your time on this thread.

CC: @dmiller @robin

Regards,
Darsh

1 ACCEPTED SOLUTION

koryknick
Employee

@darshthakkar - Just FYI - the Salesforce Upsert snap returns a “created” field in the response which you can use to determine if the record was inserted vs. updated.
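Assuming each output document from the upsert carries that boolean “created” field (the field name is taken from the reply above; the surrounding record structure below is invented), separating inserts from updates downstream could look something like this plain-Python illustration:

```python
# Hypothetical upsert responses; only the "created" flag reflects the snap's
# described behavior, the rest of the structure is made up for illustration.
responses = [
    {"Id": "rec-001", "success": True, "created": True},   # inserted
    {"Id": "rec-002", "success": True, "created": False},  # updated
]

inserted = [r for r in responses if r.get("created")]
updated = [r for r in responses if not r.get("created")]

print(f"inserted: {len(inserted)}, updated: {len(updated)}")
```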

Regarding pulling modified records since the last execution, I mentioned the use of a control table in which you can store the last modification date pulled from the source that was successfully upserted to Salesforce. This can be done with a simple Aggregate snap to get the max date, then writing back to the control table only after all updates have been completed in Salesforce.
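The pattern described here is a classic high-water-mark: read the stored watermark, pull only rows modified after it, and advance the watermark only once the Salesforce load succeeds. A rough sketch in Python with the Snowflake connector, where ETL_CONTROL, SOURCE_TABLE, the column names, and the job name are all assumptions; in SnapLogic this would be Snowflake Select/Execute snaps plus an Aggregate (MAX) snap rather than hand-written code:

```python
import snowflake.connector  # assumption: the standard Snowflake Python connector is available

def fetch_changed_rows(conn, job_name="sf_upsert"):
    """Return rows modified after the stored watermark, plus the new max date."""
    cur = conn.cursor()
    cur.execute(
        "SELECT LAST_SUCCESSFUL_DATE FROM ETL_CONTROL WHERE JOB_NAME = %s",
        (job_name,),
    )
    watermark = cur.fetchone()[0]

    cur.execute(
        "SELECT ID, PAYLOAD, LAST_UPDATED_DATE FROM SOURCE_TABLE "
        "WHERE LAST_UPDATED_DATE > %s",
        (watermark,),
    )
    rows = cur.fetchall()
    new_watermark = max((r[2] for r in rows), default=watermark)
    return rows, new_watermark

def advance_watermark(conn, new_watermark, job_name="sf_upsert"):
    """Move the watermark forward ONLY after the Salesforce upsert succeeded,
    so a failed run is simply retried from the same point next time."""
    cur = conn.cursor()
    cur.execute(
        "UPDATE ETL_CONTROL SET LAST_SUCCESSFUL_DATE = %s WHERE JOB_NAME = %s",
        (new_watermark, job_name),
    )
    conn.commit()
```

Because the watermark only moves after a confirmed load, a record that genuinely changes again in Snowflake gets a newer LAST_UPDATED_DATE and is picked up by the next run, which addresses the concern about accidentally skipping real updates.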

Note that if you are using the Bulk API option on the Salesforce Upsert snap, you may need to follow up with a Salesforce Poller snap to get the final result of the updates. See the snap documentation for more details.
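Bulk API loads are asynchronous, so the upsert response only confirms that a job was submitted; the per-record results arrive once the job finishes. Purely to illustrate that poll-until-done idea (get_job_status is a made-up placeholder, not a real SnapLogic or Salesforce call; the state names follow the Salesforce Bulk API 2.0 convention):

```python
import time

def wait_for_bulk_job(job_id, get_job_status, interval_seconds=15, timeout_seconds=900):
    """Poll a hypothetical status function until the bulk job reaches a terminal state."""
    deadline = time.monotonic() + timeout_seconds
    while time.monotonic() < deadline:
        status = get_job_status(job_id)  # placeholder for the Poller's job lookup
        if status in ("JobComplete", "Failed", "Aborted"):
            return status
        time.sleep(interval_seconds)
    raise TimeoutError(f"Bulk job {job_id} did not finish within {timeout_seconds}s")
```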


6 REPLIES


darshthakkar
Valued Contributor

I concur with your suggestion; every statement in your last response made sense. Thank you, @koryknick!