Hi,
We are using the Cloudplex version of SnapLogic. We have a pipeline that downloads a very large weather data set. We had to build the pipeline to execute in “batches” using the Task Exe...
It would be good to point support to the particular execution where this occurred so that we can take a closer look.
Can you also point out the executions that failed or slowed down in your support case? The crash especially is very concerning and should be addressed.
The thread count for the nodes is updated frequently (a few times a minute). CPU usage is smoothed out over a minute or two, so short spikes are reduced. The problem is that it is hard to predict the resource impact of a pipeline, or what new pipelines may be executed in the near future. There is also communication overhead when distributing child executions to neighboring nodes.

Many times, a child pipeline takes very few resources and runs very quickly, so we don’t want to run it on a neighbor: the cost of communication negates any benefit from trying to distribute the load. Other times, as in your case, a pipeline with only a few Snaps may have a large resource impact on the node, and it would be good to distribute it. It’s a challenging balancing act that we’re still working on. But first we need to make sure executions are not failing due to overload, since there is no guarantee the resource scheduler will always be able to distribute executions.
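To make that trade-off concrete, here is a minimal sketch of the local-vs-remote decision. This is purely illustrative, not SnapLogic’s actual scheduler; the function name, the cost numbers, and the 90% eligibility cutoff are all made up for the example:

```python
def pick_node(nodes, est_cost, comm_overhead=0.5):
    """Illustrative sketch: run cheap children locally, distribute
    expensive ones to the least-loaded eligible node.

    nodes: dict of node name -> current load (0.0 to 1.0)
    est_cost: estimated resource cost of the child execution
    comm_overhead: assumed fixed cost of shipping work to a neighbor
    """
    # If the child is cheaper than the communication overhead,
    # distributing it can never pay off; keep it on the local node.
    if est_cost <= comm_overhead:
        return "local"
    # Otherwise choose the least-loaded node that is not overloaded.
    eligible = {name: load for name, load in nodes.items() if load < 0.9}
    if not eligible:
        return None  # no candidate: execution risks failing from overload
    return min(eligible, key=eligible.get)

nodes = {"local": 0.8, "neighbor": 0.2}
print(pick_node(nodes, est_cost=0.1))  # cheap child stays local
print(pick_node(nodes, est_cost=2.0))  # expensive child goes to the neighbor
```

The last branch also illustrates the failure mode mentioned above: when every node is over its limit, there is simply no safe placement, which is why overload protection has to come before smarter balancing.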
It would be nice if we could specify what kind of load balancing (least used, round robin, etc.) we wanted in the PipelineExec snap, triggered task definitions, etc. I think it would also be great to have a way to define, and then specify (preferred/required), logical groups of nodes within a Snaplex to execute on, which could be dynamically specified at runtime.
Tim - our testing shows that PipelineExec loads all children onto one node. To Chris’ point, I would think that another setting should be “multi-node”, where you can specify whether or not you want the child pipelines spread across multiple nodes. I would imagine that multi-node would require a bit more communication between nodes, but it would create better utilization.
Once we get this project live, I will circle back and re-do the testing so I can send you some screenshots / runtime data on the crashes. I did send the screenshot to support, which shows that even though we have 2 nodes, one has zero pipelines; all the pipelines were being allocated to the same node over a 2-minute period.
Are you referring to some executions in the attached screenshot? I took a look at the first set of STOPPED triggered executions in the screenshot and they were all scheduled to the same node because the other node in the Snaplex had reached its maximum memory usage limit. Therefore, the scheduler would not treat the other node as a possible candidate for running new executions.
You can specify that by setting the Snaplex property in PipeExec. The simplest way to get it to run on the same Snaplex as the parent is to make it an expression and set the value to pipe.plexPath. Which node is chosen still depends on the resource usage of the nodes, and the scheduler tends to prefer the local node.
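For example, the relevant PipeExec setting would look roughly like this (the layout below is an approximation, not an exact copy of the UI):

```
Pipeline Execute snap settings:
  Snaplex : = pipe.plexPath    (expression toggle enabled;
                                resolves to the parent's Snaplex at runtime)
```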
If you’re interested in trying another configuration: I think I made a mistake above in recommending Reuse with PipeExec. With Reuse disabled, a new child execution is started for each location and a new scheduling decision is made. Since the children run for a while and consume quite a few resources, the scheduler should have more data to make a better decision about where to execute each child. (Enabling Reuse would help avoid the overhead of starting a new execution, but that’s not very relevant here, since every location results in a lot of work being done.) Maybe try setting the pool size to 10 or 20 to start with and then dial it in after some testing.
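A rough sketch of the difference between the two modes (again illustrative, not actual SnapLogic internals; node placement is randomized here only to show that each non-reused child gets its own scheduling decision):

```python
import random

def run_children(locations, pool_size, reuse, nodes=("node-a", "node-b")):
    """Illustrative sketch of PipeExec child placement.

    reuse=True:  pool_size children are started once, and all
                 locations stream into them, so there are only
                 pool_size placement decisions in total.
    reuse=False: every location starts a fresh child, so the
                 scheduler makes a new placement decision per location.
    Returns a list of (location, node) pairs.
    """
    placements = []
    if reuse:
        pool = [random.choice(nodes) for _ in range(pool_size)]
        for i, loc in enumerate(locations):
            placements.append((loc, pool[i % pool_size]))
    else:
        for loc in locations:
            placements.append((loc, random.choice(nodes)))
    return placements
```

With Reuse enabled, the placement of the pool is fixed up front, so a node that becomes overloaded mid-run keeps receiving documents; with Reuse disabled, each new child can land wherever the scheduler currently prefers.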