Forum Discussion

feenst
New Contributor III
4 months ago

Data reconciliation solutions?

One of my company's use cases for SnapLogic today is replication of data from Salesforce into internal Kafka topics for use throughout the enterprise.

There have been various instances of internal consumers of the Kafka data reporting missing records.

Investigations have found multiple causes for these data drops. Some of the causes are related to behavior that Salesforce describes as "Working As Designed".

Salesforce has recommended other replication architectures, but there are various concerns within my company about using them (license cost, platform load), and we might still end up with missing data.

So, we're looking into data reconciliation / auditing solutions. Are there any recommendations for a tool that can:

* Identify records in Salesforce that do not have a matching record (e.g. same timestamp) in Kafka

* Generate a message containing relevant metadata (e.g. record Id, Salesforce object, Kafka topic) to be sent to a REST endpoint / message queue for reprocessing (a rough sketch of what we have in mind follows)
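
To make that concrete, here is a rough Python sketch of the kind of check we have in mind. Everything in it is a placeholder rather than our actual setup: the simple-salesforce client, the kafka_landed table (assumed to be maintained by a consumer draining the topic), the endpoint URL, and the credentials.

```python
# Rough sketch of the reconciliation check described above.
# Assumptions: the Kafka topic has already been landed into a relational
# table (kafka_landed), both sides store SystemModstamp in the same
# string format, and all names/credentials/URLs are placeholders.
import sqlite3

import requests
from simple_salesforce import Salesforce

sf = Salesforce(username="user@example.com", password="...", security_token="...")

# 1. Pull (Id, SystemModstamp) for the object being audited.
sf_rows = sf.query_all("SELECT Id, SystemModstamp FROM Account")["records"]
expected = {(r["Id"], r["SystemModstamp"]) for r in sf_rows}

# 2. Pull the same keys from the table a Kafka consumer maintains.
db = sqlite3.connect("reconciliation.db")
landed = set(db.execute("SELECT record_id, system_modstamp FROM kafka_landed"))

# 3. Anything present in Salesforce but missing from Kafka gets a
#    reprocessing message carrying the metadata listed above.
for record_id, stamp in expected - landed:
    requests.post(
        "https://internal.example.com/reprocess",  # placeholder endpoint
        json={
            "recordId": record_id,
            "object": "Account",
            "topic": "salesforce.account",
            "systemModstamp": stamp,
        },
        timeout=10,
    )
```

In practice this would run per object on a schedule, and the timestamp comparison would need to tolerate any format differences between the two sides.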

6 Replies

  • Salesforce Subscriber Snap & Change Data Capture (CDC) API

    Our Salesforce Subscriber Snap is designed to work with Salesforce’s Change Data Capture (CDC) streaming API, which was their primary API for tracking real-time data changes for many years. However, the CDC API has limitations in usability, reliability, and scalability that make it difficult to implement robust pipelines guaranteeing no events are missed.

    In recent years, Salesforce introduced a new API to address these challenges, offering better scalability and reliability. However, because it is a completely different API, we have not yet developed a new version of the Salesforce Subscriber Snap to support it. While this has been discussed internally, there are currently no immediate plans to build an updated Snap for the new API.

    Handling Missed Events in Kafka

    For customers using Kafka to process Salesforce data changes, it’s important to note that Kafka is not a traditional database where records can be looked up by key. Instead, Kafka operates as an append-only log of events, meaning you consume events in the order they arrive rather than retrieving a specific record at will.

    The Salesforce CDC API events contain record identifiers, but they are intended to be processed in real time. If an event is missed, there is only a limited replay window (Salesforce retains change events for about 72 hours) to retrieve it, and there is no built-in mechanism to look up past events by key within Kafka. Because of this, a common approach is to consume the events as they arrive and store them in a target database (e.g., Snowflake) to maintain an up-to-date view of customer records.
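
    As a minimal sketch of that pattern (not a supported implementation): assuming a JSON event payload that carries the record Id and SystemModstamp, plus placeholder broker and topic names, a consumer that maintains such an up-to-date view could look like this in Python:

    ```python
    # Minimal sketch: consume CDC-style events from Kafka and upsert them
    # into a database keyed by record Id, so the table always reflects the
    # latest known state. Topic, broker, and payload shape are assumptions.
    import json
    import sqlite3

    from confluent_kafka import Consumer

    db = sqlite3.connect("replica.db")
    db.execute(
        "CREATE TABLE IF NOT EXISTS account ("
        "record_id TEXT PRIMARY KEY, system_modstamp TEXT, payload TEXT)"
    )

    consumer = Consumer({
        "bootstrap.servers": "broker:9092",
        "group.id": "sfdc-replica",
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe(["salesforce.account"])

    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        event = json.loads(msg.value())
        # Upsert: the newest event for an Id wins, turning the append-only
        # log into a current view of each record.
        db.execute(
            "INSERT INTO account (record_id, system_modstamp, payload) "
            "VALUES (?, ?, ?) "
            "ON CONFLICT(record_id) DO UPDATE SET "
            "system_modstamp = excluded.system_modstamp, "
            "payload = excluded.payload",
            (event["Id"], event.get("SystemModstamp"), json.dumps(event)),
        )
        db.commit()
    ```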

    • feenst
      New Contributor III

      Hi Scott,

      Thanks for the response.

      My organization is not using the CDC API today. We are using the REST and Bulk APIs to replicate data. Salesforce support and solution architects have recommended CDC instead of our current architecture.

      Do you have more information about the usability, reliability, and scalability limitations of CDC?

      Also, can you advise on what the new API is that Salesforce has introduced?

      Regarding handling missed events in Kafka: my organization uses Kafka to allow multiple internal applications to consume updates from Salesforce for objects of interest.

      My team is responsible for the replication (today using SnapLogic) as well as the Kafka platform. We are interested in validating that all of the expected records from Salesforce have successfully reached Kafka.

      If a record has not reached Kafka, we are interested in triggering an automated process to retrieve the missing record(s) and ensure that all of the latest, expected records have reached Kafka.

      So, I am seeking input from the community on any tools that might support this use case. If a tool requires data to be extracted from Kafka into a traditional database for the reconciliation / audit with Salesforce, that would be fine.

      • aleung
        Contributor III

        What would be the issue if you use SystemModStamp as your state?

        e.g., assuming a data warehouse is one of the consumers:

        Data Warehouse: select max(SystemModStamp) > SFDC REST SOQL where SystemModStamp > $SystemModStamp (via SnapLogic) > Kafka Producer > Data Warehouse
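
        In code, that loop might look roughly like the Python below; the warehouse table, object, topic, and credentials are all made-up names, and the warehouse is stood in by SQLite:

        ```python
        # Sketch of the flow above: the warehouse's own max(SystemModstamp)
        # is the replication state. Table, object, topic, and credentials
        # are placeholders; the table is assumed non-empty and to store
        # stamps as SOQL-style datetimes.
        import json
        import sqlite3

        from confluent_kafka import Producer
        from simple_salesforce import Salesforce

        warehouse = sqlite3.connect("warehouse.db")  # stand-in for the real DW
        sf = Salesforce(username="user@example.com", password="...", security_token="...")
        producer = Producer({"bootstrap.servers": "broker:9092"})

        # 1. The warehouse itself holds the state: the newest stamp it has seen.
        (last_stamp,) = warehouse.execute(
            "SELECT max(system_modstamp) FROM account"
        ).fetchone()

        # 2. Ask Salesforce only for rows modified after that point
        #    (SOQL datetime literals are unquoted).
        records = sf.query_all(
            "SELECT Id, Name, SystemModstamp FROM Account "
            f"WHERE SystemModstamp > {last_stamp}"
        )["records"]

        # 3. Publish to Kafka, keyed by record Id so consumers can de-duplicate.
        for rec in records:
            rec.pop("attributes", None)  # strip simple_salesforce metadata
            producer.produce("salesforce.account", key=rec["Id"], value=json.dumps(rec))
        producer.flush()

        # 4. A downstream consumer writes these rows back into the warehouse,
        #    which advances max(system_modstamp) for the next cycle.
        ```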


  • lazo_ilijoski
    New Contributor III

    Hi feenst,

    Different data replication strategies can be implemented based on business requirements, whether the need is for real-time, near-real-time, or scheduled synchronization.

    For scheduled data replication, SnapLogic provides a straightforward solution using the Salesforce Snap Pack. Data can be retrieved from any Salesforce object based on the LastModifiedDate (or SystemModstamp) field. At the end of each sync cycle, the most recent timestamp can be stored and used in the next cycle to fetch only those records that have been modified since the last successful run.
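
    As an illustrative sketch of that cycle (with the watermark kept in a local file and placeholder names; a production pipeline would keep this state in SnapLogic or a database):

    ```python
    # Minimal sketch of the sync cycle described above, with the watermark
    # kept in a local file. Paths, object, and credentials are placeholders.
    import pathlib

    from simple_salesforce import Salesforce

    STATE = pathlib.Path("last_sync_systemmodstamp.txt")

    sf = Salesforce(username="user@example.com", password="...", security_token="...")
    since = STATE.read_text().strip() if STATE.exists() else "1970-01-01T00:00:00Z"

    # Fetch only records modified since the last successful run, oldest first.
    records = sf.query_all(
        "SELECT Id, SystemModstamp FROM Contact "
        f"WHERE SystemModstamp > {since} ORDER BY SystemModstamp"
    )["records"]

    for rec in records:
        pass  # hand each record to the rest of the pipeline here

    # Persist the new high-water mark only after the cycle succeeds, so a
    # failed run is simply retried from the previous stamp.
    if records:
        STATE.write_text(records[-1]["SystemModstamp"])
    ```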

    For real-time or near-real-time replication, in addition to the Change Data Capture (CDC) feature, Outbound Messages in Salesforce can be used. Outbound Messages can be configured to trigger SnapLogic tasks; Ultra Tasks are ideal for real-time processing due to their low-latency response time and built-in retry capabilities. Similarly, HTTP Callouts from Salesforce can also be used to invoke SnapLogic tasks, typically implemented via @future methods or queueable Apex classes to ensure proper bulkification and asynchronous execution.

    We’ve successfully implemented these approaches multiple times and found them to be reliable and scalable.

    If this aligns with your needs, I’d be happy to set up a call to walk through the different architecture options for replicating Salesforce data using SnapLogic.

    Best regards, Lazo.

    • feenst
      New Contributor III

      Hi lazo_ilijoski,

      Thanks for the reply. My company today uses the following approach for replication:

      "For scheduled data replication, SnapLogic provides a straightforward solution using the Salesforce Snappack. Data can be retrieved from any Salesforce object by retrieving data based on LastModifiedDate (or SystemModstamp) field. At the end of each sync cycle, the most recent timestamp can be stored and used in the next cycle to fetch only those records that have been modified since the last successful run."

      I am also working with our Salesforce IT team on a PoC using Change Data Capture.

      However, my original question isn't about replication strategies; it's about data reconciliation / auditing solutions.

      Do you have any experience with data reconciliation / auditing solutions that support ingesting data from Salesforce and Kafka?