Forum Discussion

feenst
New Contributor III
4 months ago

Data reconciliation solutions?

One of my company's use cases for SnapLogic today is replication of data from Salesforce into internal Kafka topics for use throughout the enterprise.

There have been various instances of internal consumers of the Kafka data reporting missing records.

Investigations have found multiple causes for these data drops. Some of the causes are related to behavior that Salesforce describes as "Working As Designed".

Salesforce has recommended other replication architectures, but there are various concerns within my company about using them (license cost, platform load), and we might still end up with missing data.

So, we're looking into data reconciliation / auditing solutions. Are there any recommendations for a tool that can:

* Identify records in Salesforce that do not have a matching record (e.g. same timestamp) in Kafka

* Generate a message containing relevant metadata (e.g. record Id, Salesforce object, Kafka topic) to be sent to a REST endpoint / message queue for reprocessing (a rough sketch of what we have in mind follows)
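
To make that concrete, here is a rough Python sketch of the kind of check we have in mind. Everything in it is a placeholder rather than our actual setup: the simple-salesforce client, the kafka_landed table (assumed to be maintained by a consumer draining the topic), the endpoint URL, and the credentials.

```python
# Rough sketch of the reconciliation check described above.
# Assumptions: the Kafka topic has already been landed into a relational
# table (kafka_landed), both sides store SystemModstamp in the same
# string format, and all names/credentials/URLs are placeholders.
import sqlite3

import requests
from simple_salesforce import Salesforce

sf = Salesforce(username="user@example.com", password="...", security_token="...")

# 1. Pull (Id, SystemModstamp) for the object being audited.
sf_rows = sf.query_all("SELECT Id, SystemModstamp FROM Account")["records"]
expected = {(r["Id"], r["SystemModstamp"]) for r in sf_rows}

# 2. Pull the same keys from the table a Kafka consumer maintains.
db = sqlite3.connect("reconciliation.db")
landed = set(db.execute("SELECT record_id, system_modstamp FROM kafka_landed"))

# 3. Anything present in Salesforce but missing from Kafka gets a
#    reprocessing message carrying the metadata listed above.
for record_id, stamp in expected - landed:
    requests.post(
        "https://internal.example.com/reprocess",  # placeholder endpoint
        json={
            "recordId": record_id,
            "object": "Account",
            "topic": "salesforce.account",
            "systemModstamp": stamp,
        },
        timeout=10,
    )
```

In practice this would run per object on a schedule, and the timestamp comparison would need to tolerate any format differences between the two sides.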

6 Replies

  • Salesforce Subscriber Snap & Change Data Capture (CDC) API

    Our Salesforce Subscriber Snap is designed to work with Salesforce’s Change Data Capture (CDC) streaming API, which was their primary API for tracking real-time data changes for many years. However, the CDC API has limitations in usability, reliability, and scalability that make it difficult to implement robust pipelines guaranteeing no events are missed.

    In recent years, Salesforce introduced a new API to address these challenges, offering better scalability and reliability. However, because it is a completely different API, we have not yet developed a new version of the Salesforce Subscriber Snap to support it. While this has been discussed internally, there are currently no immediate plans to build an updated Snap for the new API.

    Handling Missed Events in Kafka

    For customers using Kafka to process Salesforce data changes, it’s important to note that Kafka is not a traditional database where records can be looked up by key. Instead, Kafka operates as an append-only log of events, meaning you consume events in the order they arrive rather than retrieving a specific record at will.

    The Salesforce CDC API events contain record identifiers, but they are intended to be processed in real time. If an event is missed, there is only a limited replay window (Salesforce retains change events for about 72 hours) to retrieve it, and there is no built-in mechanism to look up past events by key within Kafka. Because of this, a common approach is to consume the events as they arrive and store them in a target database (e.g., Snowflake) to maintain an up-to-date view of customer records.
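
    As a minimal sketch of that pattern (not a supported implementation): assuming a JSON event payload that carries the record Id and SystemModstamp, plus placeholder broker and topic names, a consumer that maintains such an up-to-date view could look like this in Python:

    ```python
    # Minimal sketch: consume CDC-style events from Kafka and upsert them
    # into a database keyed by record Id, so the table always reflects the
    # latest known state. Topic, broker, and payload shape are assumptions.
    import json
    import sqlite3

    from confluent_kafka import Consumer

    db = sqlite3.connect("replica.db")
    db.execute(
        "CREATE TABLE IF NOT EXISTS account ("
        "record_id TEXT PRIMARY KEY, system_modstamp TEXT, payload TEXT)"
    )

    consumer = Consumer({
        "bootstrap.servers": "broker:9092",
        "group.id": "sfdc-replica",
        "auto.offset.reset": "earliest",
    })
    consumer.subscribe(["salesforce.account"])

    while True:
        msg = consumer.poll(1.0)
        if msg is None or msg.error():
            continue
        event = json.loads(msg.value())
        # Upsert: the newest event for an Id wins, turning the append-only
        # log into a current view of each record.
        db.execute(
            "INSERT INTO account (record_id, system_modstamp, payload) "
            "VALUES (?, ?, ?) "
            "ON CONFLICT(record_id) DO UPDATE SET "
            "system_modstamp = excluded.system_modstamp, "
            "payload = excluded.payload",
            (event["Id"], event.get("SystemModstamp"), json.dumps(event)),
        )
        db.commit()
    ```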

    • feenst
      New Contributor III

      Hi Scott,

      Thanks for the response.

      My organization is not using the CDC API today. We are using the REST and Bulk APIs to replicate data. Salesforce support and solution architects have recommended CDC instead of our current architecture.

      Do you have more information about the usability, reliability, and scalability limitations of CDC?

      Also, can you advise on what the new API is that Salesforce has introduced?

      Regarding handling missed events in Kafka: my organization uses Kafka to allow multiple internal applications to consume updates from Salesforce for objects of interest.

      My team is responsible for the replication (today using SnapLogic) as well as the Kafka platform. We are interested in validating that all of the expected records from Salesforce have successfully reached Kafka.

      If a record has not reached Kafka, we are interested in triggering an automated process to retrieve the missing record(s) and ensure that all of the latest, expected records have reached Kafka.

      So, I am seeking input from the community on any tools that might support this use case. If a tool requires data to be extracted from Kafka into a traditional database for the reconciliation / audit with Salesforce, that would be fine.

      • aleung
        Contributor III

        What would be the issue if you use SystemModStamp as your state?

        e.g., assuming a data warehouse is one of the consumers:

        Data Warehouse: select max(SystemModStamp) > SFDC REST SOQL where SystemModStamp > $SystemModStamp (via SnapLogic) > Kafka Producer > Data Warehouse
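
        In code, that loop might look roughly like the Python below; the warehouse table, object, topic, and credentials are all made-up names, and the warehouse is stood in by SQLite:

        ```python
        # Sketch of the flow above: the warehouse's own max(SystemModstamp)
        # is the replication state. Table, object, topic, and credentials
        # are placeholders; the table is assumed non-empty and to store
        # stamps as SOQL-style datetimes.
        import json
        import sqlite3

        from confluent_kafka import Producer
        from simple_salesforce import Salesforce

        warehouse = sqlite3.connect("warehouse.db")  # stand-in for the real DW
        sf = Salesforce(username="user@example.com", password="...", security_token="...")
        producer = Producer({"bootstrap.servers": "broker:9092"})

        # 1. The warehouse itself holds the state: the newest stamp it has seen.
        (last_stamp,) = warehouse.execute(
            "SELECT max(system_modstamp) FROM account"
        ).fetchone()

        # 2. Ask Salesforce only for rows modified after that point
        #    (SOQL datetime literals are unquoted).
        records = sf.query_all(
            "SELECT Id, Name, SystemModstamp FROM Account "
            f"WHERE SystemModstamp > {last_stamp}"
        )["records"]

        # 3. Publish to Kafka, keyed by record Id so consumers can de-duplicate.
        for rec in records:
            rec.pop("attributes", None)  # strip simple_salesforce metadata
            producer.produce("salesforce.account", key=rec["Id"], value=json.dumps(rec))
        producer.flush()

        # 4. A downstream consumer writes these rows back into the warehouse,
        #    which advances max(system_modstamp) for the next cycle.
        ```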


  • lazo_ilijoski
    New Contributor III

    Hi feenst,

    Different data replication strategies can be implemented based on business requirements, whether the need is for real-time, near-real-time, or scheduled synchronization.

    For scheduled data replication, SnapLogic provides a straightforward solution using the Salesforce Snap Pack. Data can be retrieved from any Salesforce object based on the LastModifiedDate (or SystemModstamp) field. At the end of each sync cycle, the most recent timestamp can be stored and used in the next cycle to fetch only those records that have been modified since the last successful run.
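
    As an illustrative sketch of that cycle (with the watermark kept in a local file and placeholder names; a production pipeline would keep this state in SnapLogic or a database):

    ```python
    # Minimal sketch of the sync cycle described above, with the watermark
    # kept in a local file. Paths, object, and credentials are placeholders.
    import pathlib

    from simple_salesforce import Salesforce

    STATE = pathlib.Path("last_sync_systemmodstamp.txt")

    sf = Salesforce(username="user@example.com", password="...", security_token="...")
    since = STATE.read_text().strip() if STATE.exists() else "1970-01-01T00:00:00Z"

    # Fetch only records modified since the last successful run, oldest first.
    records = sf.query_all(
        "SELECT Id, SystemModstamp FROM Contact "
        f"WHERE SystemModstamp > {since} ORDER BY SystemModstamp"
    )["records"]

    for rec in records:
        pass  # hand each record to the rest of the pipeline here

    # Persist the new high-water mark only after the cycle succeeds, so a
    # failed run is simply retried from the previous stamp.
    if records:
        STATE.write_text(records[-1]["SystemModstamp"])
    ```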

    For real-time or near-real-time replication, in addition to the Change Data Capture (CDC) feature, Outbound Messages in Salesforce can be used. Outbound Messages can be configured to trigger SnapLogic tasks; Ultra Tasks are ideal for real-time processing due to their low-latency response time and built-in retry capabilities. Similarly, HTTP Callouts from Salesforce can also be used to invoke SnapLogic tasks, typically implemented via @future methods or queueable Apex classes to ensure proper bulkification and asynchronous execution.

    We’ve successfully implemented these approaches multiple times and found them to be reliable and scalable.

    If this aligns with your needs, I’d be happy to set up a call to walk through the different architecture options for replicating Salesforce data using SnapLogic.

    Best regards, Lazo.

    • feenst
      New Contributor III

      Hi lazo_ilijoski,

      Thanks for the reply. My company today uses the following approach for replication:

      "For scheduled data replication, SnapLogic provides a straightforward solution using the Salesforce Snappack. Data can be retrieved from any Salesforce object by retrieving data based on LastModifiedDate (or SystemModstamp) field. At the end of each sync cycle, the most recent timestamp can be stored and used in the next cycle to fetch only those records that have been modified since the last successful run."

      I am also working with our Salesforce IT team on a PoC using Change Data Capture.

      However, my original question isn't about replication strategies; it's about data reconciliation / auditing solutions.

      Do you have any experience with data reconciliation / auditing solutions that support ingesting data from Salesforce and Kafka?