Pipeline Execute snap
The Pipeline Execute snap is one of the most powerful tools in the SnapLogic platform. It allows you to call another pipeline from the current one, which lets you modularize code for specific functions (making your pipelines simpler to understand and maintain), run child pipelines concurrently to improve performance when processing large volumes of data, or limit the amount of data being processed by a single child pipeline.
Understanding some of this snap's configuration settings can be confusing until you are familiar with the concepts and terminology.
“Execute on” and “Snaplex path” are often left at their defaults, which means many pipelines miss out on excellent performance benefits.
“Reuse executions”, “Batch Size” and “Pool Size” seem to be the most often misunderstood, so we will also discuss those options in some detail.
The “Execute on” and “Snaplex path” options work together to tell SnapLogic where you want your child pipeline to run. Execute On lets you run the child on the same node as the parent (LOCAL_NODE), on any available node in the same Snaplex as the parent (LOCAL_SNAPLEX), or on a different Snaplex entirely by supplying the target in the Snaplex Path setting.
When considering the use of LOCAL_SNAPLEX vs LOCAL_NODE, think about how “chatty” the data movement is between the parent and child pipelines. Integrations with high volumes of data flowing between the parent and child should typically use LOCAL_NODE to keep network traffic low, whereas lower data volumes with heavier processing requirements on the Snaplex nodes should typically use LOCAL_SNAPLEX. For example, a parent that reads a list of files and sends each filename to the child for processing could use LOCAL_SNAPLEX. In contrast, a parent that reads all of the data from a source and calls a child pipeline to apply business logic and return the data to the parent could use LOCAL_NODE to keep network communication to a minimum.
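As a rough illustration of that rule of thumb (plain Python, not a SnapLogic API; the function name and the threshold are hypothetical):

```python
def choose_execute_on(docs_per_run_between_parent_and_child: int,
                      child_is_cpu_or_memory_heavy: bool) -> str:
    """Hypothetical helper that encodes the guidance above.

    Lots of data moving between parent and child -> keep it on one node.
    Heavy processing but little data movement    -> spread across the Snaplex.
    """
    if docs_per_run_between_parent_and_child > 100_000:  # illustrative threshold only
        return "LOCAL_NODE"     # minimize network traffic between parent and child
    if child_is_cpu_or_memory_heavy:
        return "LOCAL_SNAPLEX"  # let the Snaplex spread the work across its nodes
    return "LOCAL_NODE"         # when in doubt, the simplest option
```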
When the Reuse Executions option is enabled, SnapLogic sends all input documents to the child pipeline(s) as if they were part of the calling pipeline's data stream. In other words, if you do not configure any concurrency (Batch Size = 1 and Pool Size = 1) and enable Reuse Executions, all input documents are sent through a single child pipeline instance as one continuous stream. If Reuse Executions is left disabled, each input document is sent to its own child pipeline instance, starting as many child pipelines as there are input documents arriving at the Pipeline Execute snap.
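The difference is easy to picture with plain Python (not SnapLogic APIs; the child pipeline is simulated by an ordinary function):

```python
def child_pipeline(docs):
    """Stand-in for the child pipeline: processes whatever documents it is given."""
    return [f"processed {d}" for d in docs]

documents = [f"doc-{i}" for i in range(1, 6)]

# Reuse Executions enabled (Batch Size = 1, Pool Size = 1):
# one child instance is started and the whole stream flows through it.
reuse_output = child_pipeline(documents)

# Reuse Executions disabled (Batch Size = 1, Pool Size = 1):
# every input document gets its own child pipeline execution.
no_reuse_output = []
for doc in documents:
    no_reuse_output.extend(child_pipeline([doc]))  # 5 documents -> 5 child executions

assert reuse_output == no_reuse_output  # same results, very different execution overhead
```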
Pool Size tells SnapLogic how many instances of the child pipeline are allowed to run concurrently. If you configure Pool Size as 10 and send 1,000 input documents to the Pipeline Execute snap, then 10 child pipelines will be started at the same time.
With Reuse Executions disabled and a Pool Size > 1, each child will receive only one input document and complete successfully when that document is processed. SnapLogic will ensure that your number of child pipelines running concurrently matches your Pool Size until all input documents are consumed, starting a new child pipeline as each completes.
With Reuse Executions enabled and a Pool Size > 1, SnapLogic will start the number of child pipelines that matches your Pool Size and send input documents to them as quickly as they can consume them. Note that documents are fed to the child pipelines as each child consumes them (ring-buffer style), so the number of documents processed by each child is not expected to be equal; if one child processes more slowly than the others, it will end up handling fewer documents.
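A small simulation makes that uneven distribution visible (plain Python threads standing in for reused child pipelines; the per-document delays are artificial):

```python
import queue
import threading
import time

work = queue.Queue()
for i in range(100):
    work.put(f"doc-{i}")

counts = {}

def reused_child(name: str, delay: float) -> None:
    """A reused child pipeline: keeps pulling documents until the stream is exhausted."""
    while True:
        try:
            work.get_nowait()
        except queue.Empty:
            return
        time.sleep(delay)  # simulated per-document processing time
        counts[name] = counts.get(name, 0) + 1

# Pool Size = 3 with Reuse Executions enabled: three long-lived children share one stream.
children = [
    threading.Thread(target=reused_child, args=("child-1", 0.001)),
    threading.Thread(target=reused_child, args=("child-2", 0.001)),
    threading.Thread(target=reused_child, args=("child-3", 0.010)),  # a slower child
]
for t in children:
    t.start()
for t in children:
    t.join()

print(counts)  # the slower child ends up with noticeably fewer documents
```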
Batch Size tells SnapLogic the maximum number of documents a child pipeline will be given to process. If you configure Batch Size > 1, the Reuse Executions option is hidden; the two properties are mutually exclusive because they are very similar in nature. With Batch Size > 1, documents will be sent to the same child until that number of documents (or the end of the input data) is reached.
With Batch Size > 1 and Pool Size = 1, a child pipeline will receive up to the Batch Size number of documents. Once the child has finished processing its batch, it will complete successfully and a new child pipeline will be started to process the next batch of documents.
With Batch Size > 1 and Pool Size > 1, child pipelines will be started up to the Pool Size, and each child will receive up to the Batch Size number of documents. As each child finishes processing its batch, it will complete successfully and a new child pipeline will be started to process the next batch of documents.
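The dispatch behaviour can be sketched in plain Python (again, not SnapLogic APIs; `child_pipeline` is just a placeholder function):

```python
from itertools import islice

def child_pipeline(batch):
    """Placeholder for one child pipeline execution."""
    return [f"processed {d}" for d in batch]

def batches(stream, batch_size):
    """Yield consecutive chunks of at most batch_size documents."""
    it = iter(stream)
    while chunk := list(islice(it, batch_size)):
        yield chunk

documents = (f"doc-{i}" for i in range(1, 101))  # 100 input documents

# Batch Size = 30, Pool Size = 1: children run one after another, each completing
# after its batch (30, 30, 30, then the final 10 documents).
for batch in batches(documents, 30):
    results = child_pipeline(batch)
    print(f"{len(results)} documents handled by one child execution")

# With Pool Size > 1, the same batches are simply handed to up to Pool Size children
# running at the same time instead of strictly one after another.
```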
Pool Size allows you to process concurrently, thereby scaling horizontally. When data processing requires heavy CPU or memory usage, consider setting Execute On to LOCAL_SNAPLEX and Pool Size to the number of nodes in your Snaplex (or a small multiple of it). You may also consider a higher Pool Size when calling an external service that is slow but allows high concurrency, such as a third-party API.
Batch Size can effectively be used to “chunk” data processing, creating micro-batches for processing large data volumes. For example, one client was reading millions of messages from a source and had to write to their endpoint in batches of no more than 10,000 documents. Batch Size allowed them to read all data from the source and send the appropriate number of documents to the endpoint with no further configuration.
Another client used Batch Size to ensure that writing to a secure endpoint didn’t time out their credentials while trying to write large volumes of data.
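To put rough numbers on the first example (the 3 million total below is an assumed figure; the source only says “millions”):

```python
import math

total_documents = 3_000_000  # assumed source volume, for illustration only
batch_size = 10_000          # the endpoint's limit per write

child_executions = math.ceil(total_documents / batch_size)
print(child_executions)      # 300 child pipeline runs, each writing at most 10,000 documents
```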
Reuse Executions is used when you want to call a child as if the snaps were part of the parent. For example, if you have a child pipeline that contains common logic to be applied across all documents (such as data masking or standard business logic), you should call that re-usable pipeline with Reuse Executions enabled so the data passes through the child logic seamlessly and efficiently.
Note that the following are rough timings to simplify understanding; they assume 1,000 input documents, each taking about 10 seconds to process. In practice, child pipeline processing times will drift during execution due to factors such as network performance, CPU/memory contention with other processes, and endpoint availability.
Reuse Executions disabled, Batch Size = 1, Pool Size = 1 (one child pipeline per document, run one at a time):

| Child | Documents | Start | End |
| --- | --- | --- | --- |
| Child 1 | 1 document | 00:00 | 00:10 |
| Child 2 | 1 document | 00:11 | 00:20 |
| … | … | Starts as each preceding child ends | Ends when finished with its one document |
| Child 1,000 | 1 document | 166:40 | 166:50 |

Reuse Executions enabled, Pool Size = 1 (a single reused child pipeline processes the whole stream):

| Child | Documents | Start | End |
| --- | --- | --- | --- |
| Child 1 | 1,000 documents | 00:00 | ~166:40 |

Reuse Executions disabled, Batch Size = 1, Pool Size = 10 (one child per document, ten running at a time):

| Child | Documents | Start | End |
| --- | --- | --- | --- |
| Child 1 | 1 document | 00:00 | 00:10 |
| Child 2 | 1 document | 00:00 | 00:10 |
| … | 1 document | 00:00 | 00:10 |
| Child 10 | 1 document | 00:00 | 00:10 |
| Child 11 | 1 document | 00:11 | 00:20 |
| … | 1 document | Starts as soon as there are fewer than 10 children running | |
| Child 1,000 | 1 document | | ~16:40 |

Reuse Executions enabled, Pool Size = 10 (ten reused children share the stream):

| Child | Documents | Start | End |
| --- | --- | --- | --- |
| Child 1 | ~100 documents | 00:00 | ~16:40 |
| Child 2 | ~100 documents | 00:00 | ~16:40 |
| … | ~100 documents | 00:00 | ~16:40 |
| Child 10 | ~100 documents | 00:00 | ~16:40 |

Batch Size = 30, Pool Size = 10 (each child receives a batch of up to 30 documents, ten children at a time):

| Child | Documents | Start | End |
| --- | --- | --- | --- |
| Child 1 | 30 documents | 00:00 | 05:00 |
| Child 2 | 30 documents | 00:00 | 05:00 |
| … | 30 documents | 00:00 | 05:00 |
| Child 10 | 30 documents | 00:00 | 05:00 |
| Child 11 | 30 documents | 05:01 | 10:00 |
| … | 30 documents | 05:01 | 10:00 |
| Child 20 | 30 documents | 05:01 | 10:00 |
| Child 21 | 30 documents | 10:01 | 15:00 |
| … | 30 documents | 10:01 | 15:00 |
| Child 30 | 30 documents | 10:01 | 15:00 |
| Child 31 | 30 documents | 15:01 | 20:00 |
| Child 32 | 30 documents | 15:01 | 20:00 |
| Child 33 | 30 documents | 15:01 | 20:00 |
| Child 34 | 10 documents | 15:01 | 16:40 |
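The end times in these tables follow from simple arithmetic on the assumed workload (1,000 documents at roughly 10 seconds each), as this quick check shows:

```python
docs = 1_000
seconds_per_doc = 10

def mmss(seconds: float) -> str:
    return f"{int(seconds // 60)}:{int(seconds % 60):02d}"

# Pool Size = 1 (with or without reuse): everything runs sequentially.
print(mmss(docs * seconds_per_doc))        # 166:40

# Pool Size = 10, Batch Size = 1 or Reuse enabled: ten documents in flight at a time.
print(mmss(docs * seconds_per_doc / 10))   # 16:40

# Batch Size = 30, Pool Size = 10: 34 batches handled ten at a time, in four "waves".
batch_count = -(-docs // 30)               # 34 (ceiling division)
waves = -(-batch_count // 10)              # 4
print(mmss(waves * 30 * seconds_per_doc))  # 20:00 (the final, 10-document child finishes early, at ~16:40)
```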
Pipeline Execute is a very powerful snap, enabling modularization, concurrency, batching, and workload distribution with a small number of simple settings. When working with high data volumes and high concurrency, be sure to monitor your Snaplex resources to maintain the performance and stability of your system.