
Data reconciliation solutions?

feenst
New Contributor III

One of my company's use cases for SnapLogic today is replication of data from Salesforce into internal Kafka topics for use throughout the enterprise.

There have been various instances of internal consumers of the Kafka data reporting missing records.

Investigations have found multiple causes for these data drops. Some of the causes are related to behavior that Salesforce describes as "Working As Designed".

Salesforce has recommended other replication architectures, but there are various concerns within my company about using them (license cost, platform load) ... and we might still end up with missing data.

So, we're looking into data reconciliation / auditing solutions. Are there any recommendations on a tool that can:

* Identify records in Salesforce that have no matching record (e.g. one with the same timestamp) in Kafka

* Generate a message containing relevant metadata (e.g. record Id, Salesforce object, Kafka topic) to be sent to a REST endpoint / message queue for reprocessing (see the sketch below for the kind of check intended)
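For illustration, here is a rough Python sketch of the kind of check we have in mind, assuming the Kafka topic has already been landed in a queryable store (an acceptable constraint, as noted below). The object, topic, endpoint, and `kafka_landed` table are placeholder names, not an existing tool:

```python
# Rough sketch of the reconciliation check described above, assuming the Kafka
# topic has already been landed in a queryable store (sqlite stands in here).
# SF_OBJECT, KAFKA_TOPIC, RECON_ENDPOINT, and the kafka_landed table are all
# hypothetical placeholders.
import sqlite3

import requests
from simple_salesforce import Salesforce

SF_OBJECT = "Account"
KAFKA_TOPIC = "sfdc.account"
RECON_ENDPOINT = "https://example.internal/reprocess"

# 1. Pull (Id, SystemModstamp) pairs from Salesforce.
sf = Salesforce(username="user", password="pw", security_token="token")
sf_rows = {
    r["Id"]: r["SystemModstamp"]
    for r in sf.query_all(f"SELECT Id, SystemModstamp FROM {SF_OBJECT}")["records"]
}

# 2. Pull the same pairs from wherever the topic has been landed.
db = sqlite3.connect("landed.db")
landed = dict(db.execute(
    "SELECT record_id, system_modstamp FROM kafka_landed WHERE sf_object = ?",
    (SF_OBJECT,),
))

# 3. Flag any record that is missing or has a stale timestamp for reprocessing.
for record_id, stamp in sf_rows.items():
    if landed.get(record_id) != stamp:
        requests.post(RECON_ENDPOINT, timeout=10, json={
            "recordId": record_id,
            "sfObject": SF_OBJECT,
            "kafkaTopic": KAFKA_TOPIC,
            "expectedSystemModstamp": stamp,
        })
```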

4 REPLIES

Scott
Admin

Salesforce Subscriber Snap & Change Data Capture (CDC) API

Our Salesforce Subscriber Snap is designed to work with Salesforce’s Change Data Capture (CDC) streaming API, which was their primary API for tracking real-time data changes for many years. However, this API has limitations in usability, reliability, and scalability that make it difficult to implement robust pipelines which guarantee no events are missed, so reliably capturing and processing all changes can be challenging.

In recent years, Salesforce introduced a new API to address these challenges, offering better scalability and reliability. However, because it is a completely different API, we have not yet developed a new version of the Salesforce Subscriber Snap to support it. While this has been discussed internally, there are currently no immediate plans to build an updated Snap for the new API.

Handling Missed Events in Kafka

For customers using Kafka to process Salesforce data changes, it’s important to note that Kafka is not a traditional database where records can be looked up by key. Instead, Kafka operates as an append-only log of events, meaning you consume events in the order they arrive rather than retrieving a specific record at will.

Salesforce CDC events contain record identifiers, but they are intended to be processed in real time. If an event is missed, there is a limited window to retrieve it, and there is no built-in mechanism to look up past events by key within Kafka. Because of this, a common approach is to consume the events as they arrive and store them in a target database (e.g., Snowflake) to maintain an up-to-date view of customer records.
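As a rough illustration of that consume-and-land pattern (a sketch only: kafka-python and sqlite stand in for the real client and target database, and the topic, table, and event field names are assumptions):

```python
# Sketch of the consume-and-land pattern: read events in arrival order from the
# append-only log and upsert them into a database keyed by record Id, so the
# table always holds the latest known state of each record.
import json
import sqlite3

from kafka import KafkaConsumer

db = sqlite3.connect("landed.db")
db.execute("""CREATE TABLE IF NOT EXISTS kafka_landed (
    record_id       TEXT PRIMARY KEY,
    sf_object       TEXT,
    system_modstamp TEXT,
    payload         TEXT)""")

consumer = KafkaConsumer(
    "sfdc.account",                       # hypothetical topic name
    bootstrap_servers="localhost:9092",
    group_id="sfdc-landing",
    value_deserializer=lambda b: json.loads(b.decode("utf-8")),
)

for msg in consumer:                      # events are consumed as they arrive
    event = msg.value                     # assumed shape: {"Id": ..., "object": ...}
    db.execute(
        """INSERT INTO kafka_landed (record_id, sf_object, system_modstamp, payload)
           VALUES (?, ?, ?, ?)
           ON CONFLICT(record_id) DO UPDATE SET
               system_modstamp = excluded.system_modstamp,
               payload = excluded.payload""",
        (event["Id"], event["object"], event.get("SystemModstamp"),
         json.dumps(event)),
    )
    db.commit()
```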

feenst
New Contributor III

Hi Scott,

Thanks for the response.

My organization is not using the CDC API today. We are using the REST and Bulk APIs to replicate data. Salesforce support and solution architects have recommended CDC instead of our current architecture.

Do you have more information about the usability, reliability, and scalability limitations of CDC?

Also, can you advise on what the new API is that Salesforce has introduced?

Regarding handling the missed events in Kafka, my organization uses Kafka to allow multiple internal applications to consume updates from Salesforce for objects of interest.

My team is responsible for the replication (today using SnapLogic) as well as the Kafka platform. We are interested in validating that all of the expected records from Salesforce have successfully reached Kafka.

If a record has not reached Kafka, we are interested in triggering an automated process to retrieve the missing record(s) and ensure that all of the latest, expected records have reached Kafka.

So, I am seeking input from the community on any tools that might support this use case. If the tool requires data to be extracted from Kafka into a traditional database for the reconciliation / audit with Salesforce, that would be fine.

aleung
Contributor III

What would be the issue if you used SystemModstamp as your state?

e.g., assuming the data warehouse is one of the consumers:
Data Warehouse: select MAX(SystemModstamp) > SFDC REST SOQL where SystemModstamp > $SystemModstamp (via SnapLogic) > Kafka Producer > Data Warehouse
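Rendered in Python for illustration (SnapLogic itself is elided; the credentials, topic, and warehouse table are placeholders):

```python
# Rough rendering of the chain above: read the SystemModstamp high-water mark
# from the warehouse, query Salesforce for anything newer, produce to Kafka.
import json
import sqlite3

from kafka import KafkaProducer
from simple_salesforce import Salesforce

db = sqlite3.connect("warehouse.db")      # stand-in for the real warehouse
(state,) = db.execute(
    "SELECT MAX(system_modstamp) FROM account_replica"
).fetchone()

sf = Salesforce(username="user", password="pw", security_token="token")
# SOQL datetime literals are unquoted, e.g. 2024-01-01T00:00:00Z
records = sf.query_all(
    f"SELECT Id, Name, SystemModstamp FROM Account WHERE SystemModstamp > {state}"
)["records"]

producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda d: json.dumps(d).encode("utf-8"),
)
for rec in records:
    rec.pop("attributes", None)           # drop simple_salesforce metadata
    producer.send("sfdc.account", rec)
producer.flush()                          # the warehouse then consumes the topic
```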


feenst
New Contributor III

I assume the above response implies some sort of custom data reconciliation happening within the data warehouse?

If so, a custom reconciliation is one option being considered, and would potentially be similar to what you've described.

However, the intention of this thread was to seek recommendations for any existing offering (commercial or open source) that can perform this type of reconciliation.

To the point of SystemModstamp specifically, it would not work for all objects.

1.) Not every object being replicated / needing reconciliation has SystemModstamp

2.) For objects that have SystemModstamp, it's almost never used as part of the query for the existing replication architecture

A reason for this is that updates in Salesforce that have nothing to do with a change in the data can result in SystemModstamp being updated on records.

During validation of the current replication architecture, when SystemModstamp was being used for replication, I observed retrieval being attempted for tens of millions of records of an object even though there was no change to the actual data being replicated.