Hi, I have a requirement to load data from Salesforce to Redshift. It is supposed to be a DELTA load every 2 hours (meaning capture only the changes that have occurred in source data after last execution).What should be the best way to design this pipeline, considering that Salesforce source table has huge volume of data.
Here is a sample pipeline to do just that. The key concept is to extract SFDC data from where you left off by 1st finding the max LastModifiedDate and use it in your SOQL as a condition to pull the net change.
Note: all timestamp store in Redshift is default to UTC
This was very helpful. Thanks for your quick response.
My org uses one of the first Salesforce adaptations and is one of the largest configurations the company offers so these might be issues that affect larger instances of Salesforce. But in case you start noticing data sync issues/accuracy over time, here are some things you might want to check over on your syncs.
In regards to using “LastModifiedDate”, you might want to evaluate your Salesforce system for another datetime field called “systemmodstamp” that should be available on almost every object. I found this field to be much more accurate because it’s modified when the record is updated by internal and external apps that your org might be using in Salesforce. Through comparisons I was able to find in my environment where the LastModifiedDate was not updated but systemmodstamp was a lot more often.
Another tip I have is around deleted records, aka isdeleted field. You will learn that hard deleted records do not end up in SF Recycle bin so the isdeleted flag does not get marked and the record just disappears in SF. The only way to get a list of these is to use REST API or SOAP API calls to Salesforce for objects that your SF admin has enabled "Data Replication” option. More info is located here about that:
Unfortunately there are no Salesforce Snaps that allow you to use these SOAP and REST API functions. If you need to use them, you will have to use the REST API or SOAP API Snaps to get this data out of Salesforce. I’ve submitted last year to SnapLogic all these as product enhancements ideas but haven’t seen them on the product road map yet.
I had to learn all of this the hard way after running few audits after time. My Redshift Salesforce data was constantly out of synch because of these issues. These days I have 99.99% accuracy in audits and run data syncs between SF and Redshift every hour on some of our larger active SF objects and daily on others.