Eliminating duplicated records based on the Last Updated Date

darshthakkar
Valued Contributor

Continuing the discussion from Fetch data whose date is less than 10hrs from the existing date!:

Hi Team,

The discussion started on a different thread, but it made more sense to open a new topic, hence this one (thanks @koryknick for your suggestion).

Snowflake → Salesforce

In a nutshell, I'm fetching data from Snowflake and ingesting it into Salesforce with various joins, rules, and filter conditions. One of the conditions is to update only records that have been modified in the last 48 hours. Existing records not updated within the last 48 hours are eliminated (already achieved), and new records are ingested directly (already achieved).
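
For reference, the 48-hour condition is essentially this shape in Snowflake SQL (table and column names here are placeholders, not my actual ones):

    -- Select only rows modified within the last 48 hours.
    SELECT *
    FROM MY_SOURCE_TABLE
    WHERE LAST_UPDATED >= DATEADD('hour', -48, CURRENT_TIMESTAMP());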

After writing the logic for records updated within the last 48 hours (thanks @bojanvelevski for your assistance on this one), I used a Copy snap to route the data to two different channels: one takes a backup as a flat file that I can use for validation/troubleshooting, and the other updates Salesforce directly using the Salesforce Update snap.

After executing the pipeline and analyzing the flat file generated after each execution, I found that records seen in the first run are still visible in subsequent runs (since the condition is Last_Updated date < 48 hours, they are bound to reappear), but this skews the actual count of records that were really updated.

This will eventually be a scheduled job, so is there a way to check whether a record was already updated in the last batch, skip it, and process only the newer records? The alarming part: if we eliminate records based on their appearance in previous batches, and one of those records genuinely receives an update from the upstream system (i.e. Snowflake), it would be eliminated too, which should not be the expected behavior.

Any ideas are welcome; even if we don't reach a conclusion, brainstorming would help me a lot.
Thanking everyone in advance for your time on this thread.

CC: @dmiller @robin

Regards,
Darsh

1 ACCEPTED SOLUTION

koryknick
Employee

@darshthakkar - Just FYI - the Salesforce Upsert snap returns a "created" field in the response, which you can use to determine whether a record was inserted or updated.

Regarding pulling modified records since the last execution, I mentioned using a control table in which you can store the last modification date pulled from the source that was successfully upserted to Salesforce. This can be done with a simple Aggregate snap to get the max date, then writing back to the control table only after all updates have completed in Salesforce.
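
As a rough sketch of that pattern (all names here are placeholders, and the exact write-back depends on how your pipeline stages the upserted rows):

    -- One row per pipeline; holds the watermark for the next run.
    CREATE TABLE IF NOT EXISTS SYNC_CONTROL (
        PIPELINE_NAME  VARCHAR,
        LAST_SYNCED_AT TIMESTAMP_NTZ
    );

    -- Run only after all Salesforce updates have completed successfully.
    -- STAGING_UPSERTED_ROWS stands in for wherever this batch was staged;
    -- in the pipeline itself, the max would come from the Aggregate snap.
    UPDATE SYNC_CONTROL
    SET LAST_SYNCED_AT = (SELECT MAX(LAST_UPDATED) FROM STAGING_UPSERTED_ROWS)
    WHERE PIPELINE_NAME = 'snowflake_to_salesforce';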

Note that if you are using the Bulk API option on the Salesforce Upsert snap, you may need to follow up with a Salesforce Poller snap to get the final result of the updates. See the snap documentation for more details.


6 REPLIES

koryknick
Employee

@darshthakkar - it sounds like what you really want when updating into Salesforce is the Salesforce Bulk Upsert snap, which can handle both new records (insert) and modified records (update). This does require that you specify the External-ID Field (i.e. primary key) of the target Salesforce object, but it handles all of the logic of determining whether the target record exists and performing the appropriate update or insert operation.

There is also a Salesforce Upsert snap, but that is generally used for transactional processing; based on your description, you are doing batch operations on a data set, which points you toward the Bulk Upsert snap.

Please refer to the online documentation for the Bulk Upsert snap for more details and an example using it.

Hi @koryknick,

Thanks for getting back to me so quickly. As I also need counts of records that were updated vs. created, I'm using two different snaps, Salesforce Update and Salesforce Create, instead of an upsert.

Moreover, the Salesforce Bulk snaps have not been set up in my organization, so I've been advised to use the conventional ones with Bulk API as an option.

I also observed that Salesforce Update was actually not updating any records, so I switched to Salesforce Upsert and it did the job for me.

Snaps for your reference:

[Screenshot: Upsert working vs Update]

[Screenshot: Bulk snaps not configured]

[Screenshot: Normal snap with Bulk API as an option]

However, the question remains about excluding duplicate records, since Salesforce would be hit multiple times for the same record whose update was already captured in the first batch.

With the condition set to include records updated in the last 48 hours, and batches running at a 6-hour interval (4 times a day), a record captured in batch 1 will reappear for the next 2 days and hit Salesforce even though it hasn't been updated again.

It could be the other way around as well: the record may actually have been updated, and the timestamp would help determine that. In the testing I was doing, I saw the same 4 records with no update. Would you suggest comparing all the fields in the target and source systems to gauge whether a record was actually updated?
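
Something like this is what I have in mind, purely as a sketch (it assumes a snapshot of the Salesforce data is available in Snowflake for comparison; all table and column names are placeholders):

    -- Flag a row as "really updated" only when the business fields differ,
    -- rather than trusting the timestamp alone. Snowflake's HASH() takes
    -- multiple arguments and handles NULLs.
    SELECT s.ID
    FROM SOURCE_ROWS s
    JOIN TARGET_SNAPSHOT t
      ON t.ID = s.ID
    WHERE HASH(s.FIELD_A, s.FIELD_B, s.FIELD_C)
       <> HASH(t.FIELD_A, t.FIELD_B, t.FIELD_C);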

Thanks!

koryknick
Employee

@darshthakkar - you are correct, the Salesforce Bulk snaps only use the Mutual Auth or OAuth2 account types.

Regarding selection of records from Snowflake based on timestamp, you could change the date-range selection so that modified records are only selected once. A typical pattern I've seen used successfully is to store the maximum last-modification timestamp that was updated into Salesforce, and use it as the filter criterion for the Snowflake query in the next pipeline execution. Obviously, you will need to store that value somewhere - perhaps in a configuration table in Snowflake?
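
Purely as an illustration (reusing the placeholder control table from the sketch above), the next execution's Snowflake query would then filter on the stored watermark:

    -- Select only rows modified since the last successful sync.
    SELECT *
    FROM MY_SOURCE_TABLE
    WHERE LAST_UPDATED > (
        SELECT LAST_SYNCED_AT
        FROM SYNC_CONTROL
        WHERE PIPELINE_NAME = 'snowflake_to_salesforce'
    );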

I hear you, thanks @koryknick.

We can follow either of these two approaches:

  1. Use a staging table for comparison (compare all the fields to be doubly sure)
  2. Update the query to consider only records updated in the last 6 hours, since the batch runs at a 6-hour interval (4 times a day)

Well, I wouldn't ask you for generic logic, as use cases differ with each client. There might be a future requirement to run the batch every hour, in which case I'll have to update the query to consider records updated in the last hour, or the last hour and a half (adding half an hour as a buffer).

The only reason to consider records updated in the last 48 hours was to assess risk in advance: let's assume SnapLogic is down (which is quite unusual in my experience so far!) or there are issues in any of the upstream or downstream systems; 48 hours was a safety net to catch all the records from the downtime.

I will brainstorm the logic again and put together something that considers exactly the modified records, plus some overlap with the previous batch run (if the batch interval is 6 hours, the query pulls records updated in the last 6.5 or 7 hours).
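
As a first cut, the buffered window might look like this (placeholder names again; the watermark check reuses the control-table idea from earlier in the thread):

    -- Pull a bit more than one batch interval (6.5h = 390 minutes) as a buffer,
    -- then drop anything at or before the stored watermark so an unchanged row
    -- is only sent once, while a genuinely re-modified row still comes through.
    SELECT *
    FROM MY_SOURCE_TABLE
    WHERE LAST_UPDATED >= DATEADD('minute', -390, CURRENT_TIMESTAMP())
      AND LAST_UPDATED > (
          SELECT LAST_SYNCED_AT
          FROM SYNC_CONTROL
          WHERE PIPELINE_NAME = 'snowflake_to_salesforce'
      );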

Let me know your thoughts on this one and thank you again for your time and assistance.
Enjoy the long weekend.

Regards,
Darsh