Pipeline Execute snap - An in-depth look

Pipeline Execute snap
This snap is one of the most powerful tools in the SnapLogic platform. It allows you to call another pipeline from the current one, which lets you modularize code for specific functions (making your pipelines simpler to understand and maintain), run child pipelines concurrently to improve performance when processing large volumes of data, or limit the amount of data processed by a single child pipeline.
Understanding some of the configuration settings of this snap can be a bit confusing until you are familiar with the concepts and terminology.
“Execute on” and “Snaplex path” are often left at their defaults, which can mean missing out on significant performance benefits.
“Reuse executions”, “Batch Size” and “Pool Size” seem to be the most often misunderstood, so we will also discuss those options in some detail.
Execute On / Snaplex Path
These two options work together to tell SnapLogic where you want your child pipeline to run. The options for Execute On are as follows:
- SNAPLEX_WITH_PATH (default)
  Runs the child pipeline on the snaplex identified by “Snaplex Path”. Choose this option only when the child pipeline needs to run on a different snaplex from the calling pipeline, which may be required for security or network access restrictions. Keep in mind that this can place higher demand on networking resources.
- LOCAL_SNAPLEX
  Runs the child pipeline on any available node in the same snaplex as the calling pipeline. This is an excellent option when running concurrent child pipelines (see Pool Size below) to optimize workload balancing.
- LOCAL_NODE
  Runs the child pipeline on the same node as the calling pipeline. This is the preferred option when calling child pipelines for code re-use or modularization rather than workload balancing.
When considering use of LOCAL_SNAPLEX vs LOCAL_NODE, think about how “chatty” the data movement is between the parent and child pipelines. Integrations with high volumes of data between the parent and child should typically use LOCAL_NODE to keep network traffic low; whereas lower data volumes and higher processing requirements on the snaplex nodes should typically use LOCAL_SNAPLEX. For example, a parent that reads a list of files and sends the filename to the child for processing could use LOCAL_SNAPLEX. In contrast, a parent that reads all of the data from a source and calls a child pipeline to apply business logic and return the data to the parent could use LOCAL_NODE to keep network communications to a minimum.
Reuse executions…
When enabled, SnapLogic will send all input documents to the child pipeline(s) as if they were part of the calling pipeline’s data stream. In other words, if you do not configure any concurrency (Batch Size = 1 and Pool Size = 1) and enable the Reuse Executions… option, all input documents will be sent to your child pipeline as a single thread. If Reuse Executions is left disabled, each input document will be sent to its own child pipeline instance, starting as many child pipelines as there are input documents sent to the Pipeline Execute snap.
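As a rough mental model only (plain Python, not SnapLogic code or internals), the difference looks like this, where the illustrative helper run_child stands in for one execution of the child pipeline:

```python
# Rough mental model only -- not SnapLogic internals.
documents = [f"doc-{i}" for i in range(1, 6)]

def run_child(docs):
    """Stands in for one execution of the child pipeline."""
    for doc in docs:
        print(f"child processing {doc}")

# Reuse Executions disabled (Batch Size = 1, Pool Size = 1):
# every input document gets its own child pipeline execution.
for doc in documents:
    run_child([doc])          # 5 documents -> 5 child executions

# Reuse Executions enabled (Batch Size = 1, Pool Size = 1):
# one child execution receives the whole document stream.
run_child(documents)          # 5 documents -> 1 child execution
```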
Pool Size
Pool Size tells SnapLogic how many instances of the child pipeline are allowed to run concurrently. If you configure Pool Size as 10 and send 1,000 input documents to the Pipeline Execute, then 10 child pipelines will be started at the same time.
With Reuse Executions disabled and a Pool Size > 1, each child will receive only one input document and complete successfully when that document is processed. SnapLogic will ensure that the number of child pipelines running concurrently matches your Pool Size until all input documents are consumed, starting a new child pipeline as each one completes.
With Reuse Executions enabled and a Pool Size > 1, SnapLogic will start the number of child pipelines that matches your Pool Size and send input documents to them as quickly as they can consume them. Note that documents are fed to the child pipelines according to ring buffer consumption, so the number of documents processed by each child will not necessarily be equal; a child that processes more slowly than the others will end up handling fewer documents.
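A minimal Python analogy of that last case (again, illustrative only, not how SnapLogic is implemented): the children behave like workers pulling from a shared queue, so a slower child simply finishes with fewer documents.

```python
# Rough mental model only -- not SnapLogic internals.
# Reuse Executions enabled, Pool Size = 3: children act like workers
# pulling from a shared queue, so a slower child handles fewer documents.
import queue
import threading
import time

docs = queue.Queue()
for i in range(1, 31):
    docs.put(f"doc-{i}")

counts = {}

def child(name, seconds_per_doc):
    counts[name] = 0
    while True:
        try:
            doc = docs.get_nowait()
        except queue.Empty:
            return                      # stream drained -> child completes
        time.sleep(seconds_per_doc)     # simulated processing time
        counts[name] += 1

threads = [
    threading.Thread(target=child, args=("child-1", 0.01)),
    threading.Thread(target=child, args=("child-2", 0.01)),
    threading.Thread(target=child, args=("child-3", 0.03)),   # slower child
]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(counts)   # the slower child-3 ends up with fewer documents
```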
Batch Size
Batch Size tells SnapLogic the maximum number of documents a child pipeline will be given to process. If you configure Batch Size > 1, the Reuse Executions option is hidden; these properties are mutually exclusive because they are very similar in nature. With Batch Size > 1, documents will be sent to the same child until that number of documents (or the end of the input data) is reached.
With Batch Size > 1 and Pool Size = 1, a child pipeline will receive up to Batch Size documents. Once the child has finished processing its batch, it will complete successfully and a new child pipeline will be started to process the next batch of documents.
With Batch Size > 1 and Pool Size > 1, child pipelines will be started up to the Pool Size, and each child will receive up to Batch Size documents. As each child finishes processing its batch, it will complete successfully and a new child pipeline will be started to process the next batch of documents.
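Sketched in plain Python (illustrative only, with a hypothetical batches helper), Batch Size behaves like chunking the input stream and handing each chunk to its own child execution, with Pool Size capping how many chunks are in flight at once:

```python
# Rough mental model only -- not SnapLogic internals.
BATCH_SIZE = 30

documents = [f"doc-{i}" for i in range(1, 101)]   # 100 input documents

def batches(docs, size):
    """Yield chunks of at most `size` documents."""
    for start in range(0, len(docs), size):
        yield docs[start:start + size]

# 100 documents with Batch Size = 30 -> 4 child executions (30, 30, 30, 10);
# Pool Size then limits how many of these run at the same time.
for n, chunk in enumerate(batches(documents, BATCH_SIZE), start=1):
    print(f"child {n}: {len(chunk)} documents")
```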
Why / When to use Pool Size
Pool Size allows you to process concurrently, thereby scaling horizontally. When data processing requires heavy CPU or Memory usage, consider setting Execute On to LOCAL_SNAPLEX, and Pool Size to the number of (or small multiple of) nodes in your snaplex. You may also consider using a higher Pool Size when calling an external service that is slow but allows high concurrency, such as a 3rd party API.
Why / When to use Batch Size
Batch Size can effectively be used to “chunk” data processing. This can be used to create micro-batches when processing large data volumes. For example, one client was reading millions of messages from a source and had to write to their endpoint in batches of no more than 10,000 documents. Batch Size allowed them to read all data from the source and send the appropriate number of documents to the endpoint with no further configuration.
Another client used Batch Size to ensure that writing to a secure endpoint didn’t time out their credentials while trying to write large volumes of data.
Why / When to use Reuse Executions?
Reuse Executions is used when you want to call a child as if the snaps were part of the parent. For example, if you have a child pipeline that contains common logic to be applied across all documents (such as data masking or standard business logic), you should call that re-usable pipeline with Reuse Executions enabled so the data passes through the child logic seamlessly and efficiently.
Concurrency Examples
Note that the following are rough timings to simplify understanding; in these examples, each document takes roughly 10 seconds to process. Actual child pipeline processing times will drift during execution due to several factors, such as network performance, CPU/memory contention with other processes, and endpoint availability.
1,000 input documents, Reuse Executions disabled, Pool Size = 1, Batch Size = 1
| Child | Documents | Start | End |
| --- | --- | --- | --- |
| Child 1 | 1 document | 00:00 | 00:10 |
| Child 2 | 1 document | 00:11 | 00:20 |
| … | … | As each preceding child ends | When finished with its one document |
| Child 1,000 | 1 document | 166:40 | 166:50 |
1,000 input documents, Reuse Executions enabled, Pool Size = 1, Batch Size = 1
| Child | Documents | Start | End |
| --- | --- | --- | --- |
| Child 1 | 1,000 documents | 00:00 | ~166:40 |
1,000 input documents, Reuse Executions disabled, Pool Size = 10, Batch Size = 1
| Child | Documents | Start | End |
| --- | --- | --- | --- |
| Child 1 | 1 document | 00:00 | 00:10 |
| Child 2 | 1 document | 00:00 | 00:10 |
| … | 1 document | 00:00 | 00:10 |
| Child 10 | 1 document | 00:00 | 00:10 |
| Child 11 | 1 document | 00:11 | 00:20 |
| … | 1 document | As soon as there are fewer than 10 children running | |
| Child 1,000 | 1 document | | ~16:40 |
1,000 input documents, Reuse Executions enabled, Pool Size = 10, Batch Size = 1
| Child | Documents | Start | End |
| --- | --- | --- | --- |
| Child 1 | ~100 documents | 00:00 | ~16:40 |
| Child 2 | ~100 documents | 00:00 | ~16:40 |
| … | ~100 documents | 00:00 | ~16:40 |
| Child 10 | ~100 documents | 00:00 | ~16:40 |
1,000 input documents, Pool Size = 10, Batch Size = 30 (Reuse Executions is hidden when Batch Size > 1)
| Child | Documents | Start | End |
| --- | --- | --- | --- |
| Child 1 | 30 documents | 00:00 | 05:00 |
| Child 2 | 30 documents | 00:00 | 05:00 |
| … | 30 documents | 00:00 | 05:00 |
| Child 10 | 30 documents | 00:00 | 05:00 |
| Child 11 | 30 documents | 05:01 | 10:00 |
| … | 30 documents | 05:01 | 10:00 |
| Child 20 | 30 documents | 05:01 | 10:00 |
| Child 21 | 30 documents | 10:01 | 15:00 |
| … | 30 documents | 10:01 | 15:00 |
| Child 30 | 30 documents | 10:01 | 15:00 |
| Child 31 | 30 documents | 15:01 | 20:00 |
| Child 32 | 30 documents | 15:01 | 20:00 |
| Child 33 | 30 documents | 15:01 | 20:00 |
| Child 34 | 10 documents | 15:01 | 16:40 |
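For planning purposes, the back-of-the-envelope arithmetic behind these tables can be sketched in a few lines of Python. The ~10 seconds per document is simply what the tables above assume, the helper rough_runtime is purely illustrative, and real runs will include pipeline start-up overhead and the drift described earlier; substitute your own observed per-document time.

```python
# Back-of-the-envelope arithmetic behind the tables above,
# assuming roughly 10 seconds of processing per document.
import math

SECONDS_PER_DOC = 10
DOCS = 1_000

def rough_runtime(pool_size, batch_size=1, reuse=False):
    """Approximate wall-clock seconds, ignoring pipeline start-up overhead."""
    if reuse:
        # documents are streamed to the pool and split roughly evenly
        docs_per_child = math.ceil(DOCS / pool_size)
        return docs_per_child * SECONDS_PER_DOC
    # without reuse, children run in waves of up to pool_size batches
    batches = math.ceil(DOCS / batch_size)
    waves = math.ceil(batches / pool_size)
    return waves * batch_size * SECONDS_PER_DOC

print(rough_runtime(pool_size=1))                    # ~10,000 s (~166:40)
print(rough_runtime(pool_size=10))                   # ~1,000 s  (~16:40)
print(rough_runtime(pool_size=10, reuse=True))       # ~1,000 s  (~16:40)
print(rough_runtime(pool_size=10, batch_size=30))    # ~1,200 s  (~20:00)
```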
Summary
Pipeline Execute is a very powerful snap, enabling modularization, concurrency, batching, and workload distribution with a small number of simple settings. When working with high data volumes and high concurrency, be sure to monitor your snaplex resources to maintain performance and stability of your system.
