Parallel Reused Pipeline Executes Uneven Load

I’ve been trying to build a pipeline that reuses Pipeline Execute child executions across a pool size of 3, spreading input documents evenly between them to take advantage of multiple cores for similar work on different input documents.

However, in practice it seems to be distributing the work unevenly. One example I’m looking at right now: two of the child executions got 8 documents each, while the third got 25.

My input data is super simple — each document just passes in a day to run across, which the internal logic makes use of to do a bunch of self-contained work before completing. The actual work done is not at all simple, but I would have figured that when distributing work, it would want to spread the documents as evenly as possible.

Is there a way to guarantee that this work gets evenly distributed without pre-aggregating the data passed to each Pipeline Execute child? Pre-aggregating does distribute the work evenly, but it’s an annoying pattern, and it doesn’t take advantage of the fact that some days might run significantly faster than others.

Are these local executions (i.e. the Snaplex property in the PipeExec is empty)?

For local executions, PipeExec will try to send input documents to the least-loaded child execution. So, an imbalance like this can occur when the child executions are able to process the inputs slightly faster than the incoming rate. I’d need to take a closer look at the execution stats of the parent and child executions to see if that’s really the case.
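To illustrate the mechanism (this is a toy simulation, not SnapLogic’s actual scheduler — the function and parameter names here are made up): if each incoming document is routed to the child with the fewest in-flight documents, and the children drain work faster than documents arrive, then every child reports zero in-flight at dispatch time and ties keep resolving to the same child.

```python
def dispatch(num_docs, num_children, drain_per_arrival):
    """Simulate least-loaded dispatch of documents to child executions."""
    in_flight = [0] * num_children  # docs each child is currently working on
    totals = [0] * num_children     # docs each child has received overall
    for _ in range(num_docs):
        # Children process some work between arrivals.
        for i in range(num_children):
            in_flight[i] = max(0, in_flight[i] - drain_per_arrival)
        # Route to the least-loaded child (first child wins ties).
        target = in_flight.index(min(in_flight))
        in_flight[target] += 1
        totals[target] += 1
    return totals

# Children keep up with the incoming rate: everything lands on child 0.
print(dispatch(num_docs=30, num_children=3, drain_per_arrival=1))  # [30, 0, 0]
# Children fall behind: the dispatcher spreads documents evenly.
print(dispatch(num_docs=30, num_children=3, drain_per_arrival=0))  # [10, 10, 10]
```

The skew in the second case disappears as soon as the children can’t keep up, which matches the “slightly faster than the incoming rate” condition described above.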

Yep, Snaplex property is empty.

There’s only one node in this specific instance, too.

I think in this case I don’t need to reuse executions. It’s been a while since I reexamined what I was doing in the intermediate Pipeline Execute, and it doesn’t really need to be parallelized any more, so I might just end up turning off “Reuse executions” at this point!

The uneven distribution looks like a bug and one has been filed.

I think the problem is that PipeExec is unable to determine exactly how many docs are actively being worked on in some cases, and it assumes the pipeline is not under load when it really is. One case where this can happen is with snaps that work in batches. They consume multiple documents quickly, and the platform cannot tell whether the snap has finished working on a document or not. So PipeExec thinks the child is free, when really the docs have only been partially processed.
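A toy model of that batching problem (illustrative only — the class and attribute names are made up, not SnapLogic internals): if the dispatcher can only see documents a child has not yet consumed from its input queue, a batching snap that pulls its whole input into a buffer makes the child look idle while the batch is still unprocessed.

```python
class Child:
    """A child execution whose batching snap hides its real load."""

    def __init__(self):
        self.queued = 0    # docs still in the input queue (visible to dispatcher)
        self.buffered = 0  # docs consumed into the batch, still unprocessed

    def accept(self, n):
        self.queued += n
        # The batching snap pulls everything into its buffer right away.
        self.buffered += self.queued
        self.queued = 0

    def visible_load(self):
        # What the dispatcher sees -- the buffered work is invisible.
        return self.queued

child = Child()
child.accept(25)
print(child.visible_load())  # 0  -- looks idle to the dispatcher
print(child.buffered)        # 25 -- actually has a full batch pending
```

Under this model, the dispatcher would keep sending documents to a child that already has a large backlog, which is consistent with the 8/8/25 split observed above.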
