12-06-2018 11:40 AM
I've been trying to build a pipeline that reuses Pipeline Execute instances across a pool size of 3, spreading input documents evenly between them to take advantage of multiple cores for similar work on differing input documents.
However, in practice it seems to be distributing the work unevenly. One example I'm looking at right now: two of the executions are getting 8 documents each, and the third is getting 25.
My input data is super simple: each document just passes in a day to run against, which the internal logic makes use of to do a bunch of self-contained work before completing. The actual work done is not at all simple, but I would expect that when distributing work, it would want to spread it as evenly as possible.
Is there a way to guarantee that this work gets evenly distributed without pre-aggregating the data to pass into each Pipeline Execute thread? That does distribute the work evenly, but it's an annoying pattern, and it doesn't take advantage of the fact that some of the days might run significantly faster than others.
12-06-2018 11:58 AM
Are these local executions (i.e. the Snaplex property in the PipeExec is empty)?
For local executions, PipeExec will try to send input documents to the least-loaded child execution. So, an imbalance like this can occur when the child executions are able to process the inputs slightly faster than they arrive. I'd need to take a closer look at the execution stats of the parent and child executions to see if that's really the case.
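To see why faster-than-arrival processing skews the spread, here is a minimal sketch of least-loaded routing. It is a hypothetical model, not SnapLogic's actual implementation: each child deterministically finishes one in-flight document every `periods[i]` arrivals, and the router always picks the child reporting the fewest in-flight documents (ties go to the lowest index). A child that drains faster than documents arrive keeps reporting zero load and so collects a disproportionate share.

```python
def simulate_least_loaded(num_docs, periods):
    """Toy model of least-loaded dispatch (illustrative, not PipeExec's code).

    periods[i]: child i finishes one in-flight doc every periods[i] arrivals.
    Returns how many documents each child ended up receiving.
    """
    in_flight = [0] * len(periods)  # load as the router sees it
    totals = [0] * len(periods)     # documents assigned per child
    for t in range(1, num_docs + 1):
        # Children complete work between arrivals.
        for i, p in enumerate(periods):
            if t % p == 0 and in_flight[i] > 0:
                in_flight[i] -= 1
        # Route to the least-loaded child (lowest index wins ties).
        child = min(range(len(in_flight)), key=lambda i: in_flight[i])
        in_flight[child] += 1
        totals[child] += 1
    return totals

# Child 0 drains every 2 arrivals, children 1 and 2 every 5:
print(simulate_least_loaded(10, [2, 5, 5]))  # → [6, 3, 1]
```

Even though the router is always doing the "fair" thing locally, the faster child repeatedly looks idle at decision time, so the totals end up lopsided, much like the 8/8/25 split described above.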
12-06-2018 01:00 PM
Yep, Snaplex property is empty.
Thereโs only one node in this specific instance, too.
I think in this case I don't actually need to reuse executions. It's been a while since I re-examined what I was doing in the intermediate Pipeline Execute, and it doesn't really need to be parallelized any more, so I might end up just turning off "reuse pipeline executions" at this point!
12-07-2018 10:07 AM
The uneven distribution looks like a bug, and one has been filed.
I think the problem is that PipeExec is unable to determine exactly how many documents are actively being worked on in some cases, so it was assuming a child pipeline was not under load when it really was. One case where this can happen is with snaps that work in batches: they consume multiple documents quickly, and the platform cannot tell whether the snap has actually finished working on a document. So PipeExec thinks the child is free, when really the documents have only been partially processed.
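The failure mode described above can be sketched as follows. This is a hypothetical model (the `batching_child` and the "never drains" behavior of the other children are simplifying assumptions, not the platform's real accounting): one child buffers every document the moment it arrives, so the router's visible in-flight count for it stays at zero even though none of that work is finished, and the router keeps favoring it.

```python
def route_with_batching(num_docs, pool_size, batching_child):
    """Toy model of the bug (illustrative, not PipeExec's code).

    The batching child consumes each document into a buffer immediately,
    so its reported load never rises; other children report each assigned
    doc as in flight (and, for simplicity, never drain).
    """
    reported = [0] * pool_size  # load visible to the router
    totals = [0] * pool_size    # documents assigned per child
    for _ in range(num_docs):
        child = min(range(pool_size), key=lambda i: reported[i])
        totals[child] += 1
        if child != batching_child:
            reported[child] += 1
        # else: the batching child swallowed the doc instantly, so its
        # reported load stays 0 even though the work is unfinished.
    return totals

# Child 2 batches; after one doc each to children 0 and 1, it looks
# permanently idle and absorbs everything else:
print(route_with_batching(9, 3, 2))  # → [1, 1, 7]
```

Once one child's consumed-but-unfinished documents become invisible to the load metric, "send to the least-loaded child" reliably picks that child, which matches the 8/8/25 pattern from the original post.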