Task Execute Timing out

Hi,
We are using the Cloudplex version of SnapLogic. We have a pipeline built that downloads a very large weather data set. We had to build the pipeline to execute in “batches” using the Task Execute snap because the 100mm+ row download fills up the temp file space if we do it all in one go…

I am getting timeouts after 15 minutes and would like to either increase that timeout parameter or make it so that the parent pipeline doesn’t think the task pipeline has timed out. I was told that having an “open view” in the task pipeline would keep the parent from thinking a timeout has happened, but this isn’t working.

Any ideas? Thanks

Can you talk more about how you’re using the Task Execute snap? How’re you batching the execution?

My first impression would be to try moving over to the Pipeline Execute snap. The snap is designed to execute other pipelines and seems like it could work for this case. Have you tried it yet?

Yes, a key gap in the Pipeline Execute is the simple Batch function that Task Execute has. I figured out how to simulate that functionality by adding a Group By N snap and a splitter to the sub-pipeline. This should be basic functionality for Pipeline Execute, imho.
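
For anyone reading along, here is a rough sketch, in plain Python rather than SnapLogic, of what that Group By N + splitter workaround amounts to; the batch size, the location list, and the print placeholder are purely illustrative:

```python
def group_by_n(documents, n):
    """Collect incoming documents into lists of at most n, like a Group By N snap."""
    batch = []
    for doc in documents:
        batch.append(doc)
        if len(batch) == n:
            yield batch
            batch = []
    if batch:
        yield batch  # final partial batch

def child_pipeline(batch):
    """Stand-in for the sub-pipeline: split the batch back into documents."""
    for doc in batch:                           # the splitter step
        print("pulling weather data for", doc)  # placeholder for the real download

locations = [f"location-{i}" for i in range(25)]
for batch in group_by_n(locations, 10):   # batch size is illustrative
    child_pipeline(batch)                 # one child run per batch
```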

Anyway, while this solves the timeout issue, I discovered the next issue: once you start a pipeline execution, even if your Snaplex has multiple nodes, all executions stay on the same node. So, to force SnapLogic to do workload management, I created a simple way to split the work in half and created 2 separate tasks that use a pipeline parameter to call different groups of data.

The problem I have now is that even when I start the 2 tasks a couple of minutes apart, they both go to the exact same node. I have tested multiple times, and occasionally I do get work put on the two nodes, but it isn’t consistent. I need the platform to consistently realize that one node is at 80+% utilization and use the node that is at 5%…

Any ideas?

Can you help us understand what you are trying to do overall? You’ve mentioned needing to use batches; can you give some more detail on why that is? It’s difficult to help without a better understanding of what you are trying to achieve.

Did you set the Snaplex property in the PipeExec snap? If the Snaplex property is left blank, it will only execute child pipelines on the local node.

The second node might have some issue; open a support case so we can take a closer look and find an explanation.

Hi,
We don’t set the Snaplex property, as we expect SnapLogic to balance the workload across the nodes within the production Snaplex. In our testing, it appears that SnapLogic will eventually choose a different node when a node is busy, but it takes some time for SnapLogic to realize that a node is busy.

A key use case of the batches functionality is to send a smaller number of records to a sub-pipeline. For example, in our case, we are pulling massive amounts of weather data for a list of 5-7k locations. If I send all 7k locations at once, the pull overwhelms the node, as the data pull side is about 260mm records. Therefore, we use a pipeline/sub-pipeline structure to pull 1000 locations at a time. However, this data must all be pulled within an hour, so I need to have multiple pulls going on at the same time. If I do it all in a serial pipeline, it takes 3 hours. However, by splitting the locations into 2 groups, I can run 2 nodes at once and get done in 1.5 hours…
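
To make the timing math concrete, here is a quick back-of-the-envelope sketch; the 7k locations, 1000-location batches, and 3-hour serial runtime come from the description above, while the parallel times simply assume ideal scaling with no scheduling overhead:

```python
# Back-of-the-envelope math for the batching/parallelism described above.
total_locations = 7_000
batch_size = 1_000          # locations per sub-pipeline call
serial_hours = 3.0          # observed single-node, serial runtime

batches = -(-total_locations // batch_size)   # ceiling division -> 7 batches
print(f"{batches} batches of up to {batch_size} locations")

for groups in (1, 2, 3):
    # ideal case: each group runs on its own node with no overhead
    print(f"{groups} parallel group(s): ~{serial_hours / groups:.1f} hours")
```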

The Snaplex property doesn’t help because you can only specify a Snaplex. We need better load balancing across our nodes so that the platform quickly looks at both nodes and decides which is the least used. Support is taking a look at this, and they pointed me to documentation which shows that load balancing is done by thread count. That is fine, but clearly from our testing it is only looking at something like 5+ minute resolution instead of in real time when the job is being prepared.

So, you’re getting a bunch of locations, iterating through each location pulling the weather data, and then doing something with that weather data. Correct? That flow should be achievable by putting the operation to get the locations into the parent pipeline and then feeding the locations into a PipeExec with Reuse enabled and a Pool Size that is greater than one (maybe 10?). The PipeExec will then distribute the locations to the child pipelines which pull the weather data and finish processing it. You shouldn’t need to use GroupByN to do any batching with that configuration. You can play with the pool size to control how many child executions are running in parallel depending on resource usage and maybe set the Snaplex property to distribute some child executions to other nodes in the Snaplex.
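
As a loose analogy only (plain Python, not how PipeExec is actually implemented), a Pool Size greater than one behaves like a fixed set of reusable workers consuming locations from a shared queue:

```python
from concurrent.futures import ThreadPoolExecutor

def child_execution(location):
    # placeholder for the child pipeline that pulls and processes weather data
    return f"processed {location}"

locations = [f"location-{i}" for i in range(100)]

# "Pool Size" = 10: at most ten child executions run at once, and each reused
# worker picks up the next location as soon as it finishes the current one.
with ThreadPoolExecutor(max_workers=10) as pool:
    results = list(pool.map(child_execution, locations))

print(len(results), "locations processed")
```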

If I send all 7k locations at once, the pull overwhelms the node

Can you elaborate on what you mean by “overwhelms” here? What happens exactly?

Hi,
Thanks for the insights. Yes, we tried the PipeExec with a pool. The problem is that the pool redlines the node and doesn’t distribute dynamically over the 2 nodes. We have basically figured out how to “hard force” resource loading.

We group the location data into 3 groups. We set up a pipeline parameter and we set up 3 tasks, which invoke the master pipeline with the parameter. We learned that SnapLogic takes a while to update the Open Threads count, which is used for resource loading, so by starting our 3 tasks 5 minutes apart (we were able to get successful resource loading with 3 minutes, but chose 5 to be safe), we observe the system properly utilizing the nodes. When we did 2 minutes or less, the 3 tasks all got assigned to the same node.
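
One way to picture that staggered start, assuming the tasks are exposed as triggered-task URLs, is a small driver script like the hypothetical sketch below; the URL, the "group" parameter name, and the token are placeholders, not our actual setup:

```python
import time
import requests

# Placeholder triggered-task URL and credentials; pipeline parameters can be
# passed to a triggered task as query-string arguments.
TASK_URL = "https://example.elastic.snaplogic.com/api/1/rest/slsched/feed/ORG/weather/pull-task"
HEADERS = {"Authorization": "Bearer <token>"}

for group in (1, 2, 3):
    requests.get(TASK_URL, params={"group": group}, headers=HEADERS, timeout=30)
    if group < 3:
        time.sleep(5 * 60)   # give the Open Threads count time to catch up
```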

In terms of “overwhelms”, there were several different tests:

  1. All 7k locations at once. The system chugged along with mem/CPU at 95%, which caused other jobs to fail. Then, after about 250mm records loaded, we got an out-of-local-storage crash.
  2. When we used Pipeline Execute, all the pipelines spawned and again pegged mem/CPU at 95% (at least that is what we saw on the monitors). We also noticed degraded performance (things seemed to slow down) with very large pull requests 1-2 hours in.

Our new approach is working. I hope that SnapLogic can update how fast the Open Threads variable is updated; this will improve resource load balancing for Cloudplex users!

Thanks for the time, Tim, I much appreciate your looking at this.

It would be good to point out to support the particular execution where this occurred so that we can take a closer look.

Can you point out the executions that failed or slowed down in your support case as well? Especially the crash; that is very concerning and should be addressed.

The thread count for the nodes is updated frequently (a few times a minute). The CPU usage is smoothed out over a minute or two so short spikes are reduced. The problem is that it is hard to predict the resource impact of a pipeline or what new pipelines may be executed in the near future. There is also communication overhead when distributing child executions to neighboring nodes. Many times, a child pipeline will be executed that takes very few resources and runs very quickly. So, we don’t want to run that on a neighbor since the cost of communication negates any benefits from trying to distribute the load. Other times, like in your case, a pipeline with only a few snaps may have a large resource impact on the node and it would be good to distribute. It’s a challenging balancing act that we’re still working on. But, first, we need to make sure executions are not failing due to overload since there is no guarantee the resource scheduler will always be able to distribute executions.

It would be nice if we could specify what kind of load balancing (least used, round robin, etc.) we wanted in the PipeExec snap, triggered task definitions, etc. I think it would also be great to have a way to define and then specify (preferred/required) logical groups of nodes within a Snaplex to execute on, which could be dynamically specified during runtime.

Hi Chris,
I agree.

Tim - our testing shows that the PipeExec loads all children onto one node. To Chris’ point, I would think that another setting should be “multi-node”, where you can specify whether you want the child pipelines spread across multiple nodes or not. I would imagine that multi-node would require a bit more communication between nodes, but it would create better utilization.

Once we get this project live, I will circle around and redo the testing so I can send you some screenshots / runtime data on the crashes. I did send the screenshot to support, which shows how, even though we have 2 nodes (one with zero pipelines), all the pipelines are being allocated to the same node over a 2-minute period.

Are you referring to some executions in the attached screenshot? I took a look at the first set of STOPPED triggered executions in the screenshot and they were all scheduled to the same node because the other node in the Snaplex had reached its maximum memory usage limit. Therefore, the scheduler would not treat the other node as a possible candidate for running new executions.

You can specify that by setting the Snaplex property in PipeExec. The simplest way to get it to run on the same Snaplex as the parent is to make it an expression and set the value to pipe.plexPath. But, it depends on the resource usage of the nodes and it tends to prefer the local node.

If you’re interested in trying another configuration, I think I made a mistake above in recommending to use Reuse with PipeExec. With Reuse disabled, a new child execution will be started for each location and a new scheduling decision will be made. Since the children run for awhile and consume quite a few resources, the scheduler should have more data to make a better decision on where to execute the child. (Enabling Reuse would help to avoid the overhead of starting a new execution, but that’s not very relevant here since every location results in a lot of work being done.) Maybe try setting the pool size to 10 or 20 to start with and then dial it in after some testing.