cancel
Showing results forย 
Search instead forย 
Did you mean:ย 

Parallel Reused Pipeline Executes Uneven Load

jamesv
New Contributor II

Iโ€™ve been trying to build out a pipeline that reuses pipeline executes across a pool size of 3 to get input documents on an even spread between them to take advantage of multiple cores on similar work between differing input documents.

However, in practice, it seems like itโ€™s unevenly distributing the work - one such example Iโ€™m looking at right now is that two pipeline executes are getting 8 documents each, and the third is getting 25.

My input data is super simple - it just passes in a day to run across, which the internal logic takes use of and does a bunch of self-contained work before completing. The actual work done is not at all simple, but I would figure that when distributing work, it would want to distribute them as evenly as possible.

Is there a way to guarantee that this work does get evenly distributed without pre-aggregating the data to pass into each pipeline execute thread? That works for evenly distributing the work, but itโ€™s annoying as a pattern, plus doesnโ€™t take advantage of the fact that some of the days executing might run significantly faster than other days.

3 REPLIES 3

tstack
Former Employee

Are these local executions (i.e. the Snaplex property in the PipeExec is empty)?

For local executions, PipeExec will try to send input documents to the least-loaded child execution. So, an imbalance like this can occur when the child executions are able to process the inputs slightly faster than the incoming rate. Iโ€™d need to take a closer look at the execution stats of the parent and child executions to see if thatโ€™s really the case.

jamesv
New Contributor II

Yep, Snaplex property is empty.

Thereโ€™s only one node in this specific instance, too.

I think in this case I donโ€™t need to reuse executions, as itโ€™s been a while since I reexamined what I was doing in the intermediate pipeline execute which doesnโ€™t really need to be parallelized any more, so I might end up just getting rid of โ€œreuse pipeline executionsโ€ at this point!

tstack
Former Employee

The uneven distribution looks like a bug and one has been filed.

I think the problem is that PipeExec is unable to determine exactly how many docs are actively being worked on in some cases and it was assuming the pipeline was not under load when it really was. One case where this can happen is with snaps that work in batches. They consume multiple documents quickly and the platform cannot tell if the snap has finished working on the document or not. So, PipeExec thinks the child is free, when really the docs have only partially been processed.