Task Execute Timing out

chthroop
New Contributor III
6 years ago
Hi,
We don’t set the Snaplex property, because we expect SnapLogic to balance the workload across the nodes within the production Snaplex. In our testing, SnapLogic does eventually choose a different node when one node is busy, but it takes some time for SnapLogic to realize that the node is busy.

A key use case of the batching functionality is to send a smaller number of records at a time to a sub-pipeline. In our case, we are pulling massive amounts of weather data for a list of 5-7k locations. If I send all 7k locations at once, the pull overwhelms the node, because the data-pull side is about 260 million records. Therefore, we use a pipeline/sub-pipeline structure to pull 1,000 locations at a time. However, all of this data must be pulled within an hour, so I need multiple pulls running at the same time. If I do it all in a serial pipeline, it takes 3 hours. By splitting the locations into 2 groups, I can run 2 nodes at once and finish in 1.5 hours.
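The arithmetic behind splitting the work can be sketched in Python. This is a back-of-the-envelope model, assuming run time scales linearly with the number of locations and each group runs on its own node without contention; the function names and location IDs are hypothetical, not SnapLogic APIs. With the 3-hour serial time above, two groups give 1.5 hours, and meeting a 1-hour window needs at least three groups.

```python
import math

def split_into_groups(items, n_groups):
    """Round-robin the items into n_groups near-equal groups, one group per parallel task."""
    return [items[g::n_groups] for g in range(n_groups)]

def batch(items, size):
    """Cut a group into consecutive batches of at most `size` items, one batch per child run."""
    return [items[i:i + size] for i in range(0, len(items), size)]

def groups_needed(serial_hours, deadline_hours):
    """Fewest parallel groups that meet the deadline, assuming ideal linear speedup."""
    return math.ceil(serial_hours / deadline_hours)

def estimated_hours(serial_hours, groups):
    """Wall-clock estimate: the largest group finishes last (time assumed proportional to locations)."""
    total = sum(len(g) for g in groups)
    return serial_hours * max(len(g) for g in groups) / total

locations = [f"loc-{i}" for i in range(7000)]  # hypothetical location IDs
two = split_into_groups(locations, 2)
print([len(g) for g in two])                    # [3500, 3500]
print([len(b) for b in batch(two[0], 1000)])    # [1000, 1000, 1000, 500]
print(round(estimated_hours(3.0, two), 2))      # 1.5
print(groups_needed(3.0, 1.0))                  # 3
```

Splitting into groups first and batching within each group keeps the groups even, so no single node is left with a noticeably longer tail.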

The Snaplex property doesn’t help, because you can only specify a Snaplex. We need better load balancing across our nodes, so that SnapLogic quickly checks both nodes and picks the least-used one. Support is looking into this, and they pointed me to documentation showing that load balancing is based on thread count. That is fine, but our testing clearly shows that it only refreshes at roughly 5+ minute resolution, rather than in real time when the job is being prepared.
- tstack
  Former Employee
  6 years ago
  So, you’re getting a bunch of locations, iterating through each location pulling the weather data, and then doing something with that weather data. Correct? That flow should be achievable by putting the operation to get the locations into the parent pipeline and then feeding the locations into a PipeExec with Reuse enabled and a Pool Size that is greater than one (maybe 10?). The PipeExec will then distribute the locations to the child pipelines which pull the weather data and finish processing it. You shouldn’t need to use GroupByN to do any batching with that configuration. You can play with the pool size to control how many child executions are running in parallel depending on resource usage and maybe set the Snaplex property to distribute some child executions to other nodes in the Snaplex.
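  The parent/child pattern described above can be approximated with a thread pool, as a loose analogy only: `ThreadPoolExecutor(max_workers=...)` stands in for Pool Size, and reusing the same worker threads across many locations stands in for Reuse. `child_pipeline` and `parent_pipeline` are hypothetical stand-ins, not SnapLogic APIs, and the sleep merely simulates the weather pull.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor

POOL_SIZE = 10  # plays the role of the Pipeline Execute "Pool Size" setting

_lock = threading.Lock()
_active = 0
peak_concurrency = 0  # highest number of child runs observed at the same time

def child_pipeline(location):
    """Hypothetical child pipeline: pull and process one location's weather data."""
    global _active, peak_concurrency
    with _lock:
        _active += 1
        peak_concurrency = max(peak_concurrency, _active)
    time.sleep(0.01)  # simulate the weather pull
    with _lock:
        _active -= 1
    return location, f"weather for {location}"

def parent_pipeline(locations, pool_size=POOL_SIZE):
    """Feed every location to a fixed pool of reused workers, at most pool_size at a time."""
    with ThreadPoolExecutor(max_workers=pool_size) as pool:
        return list(pool.map(child_pipeline, locations))  # results keep input order

results = parent_pipeline([f"loc-{i}" for i in range(50)])
print(len(results))                    # 50
print(results[0])                      # ('loc-0', 'weather for loc-0')
print(peak_concurrency <= POOL_SIZE)   # True
```

  Note that this bounds how many children run at once within a single process; it does not by itself move work to another machine, which is the limitation raised in the reply.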
  
  If I send all 7k locations at once, the pull overwhelms node
  
  Can you elaborate on what you mean by “overwhelms” here? What happens exactly?
  - chthroop
    New Contributor III
    6 years ago
    Hi,
    Thanks for the insights. Yes, we tried PipeExec with a pool. The problem is that the pool redlines the node and doesn’t distribute work dynamically over the 2 nodes. We have basically figured out how to “hard force” resource loading.
    
    We split the location data into 3 groups. We set up a pipeline parameter and 3 tasks, each of which invokes the master pipeline with the parameter set to its group. We learned that SnapLogic takes a while to update the Open Threads count, which is what it uses for resource loading. By starting our 3 tasks 5 minutes apart (we got successful resource loading with 3 minutes, but chose 5 to be safe), we see the system properly utilizing both nodes. When we spaced them 2 minutes apart or less, all 3 tasks were assigned to the same node.
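    The effect of a slowly refreshed Open Threads count can be illustrated with a toy model. It assumes the balancer picks the node with the fewest open threads as of the most recent metrics refresh, taken every `refresh_minutes`; the 3-minute interval, the 50 threads per task, and the node names are guesses for illustration, not documented SnapLogic values.

```python
def assign_tasks(start_minutes, refresh_minutes, nodes=("node-1", "node-2")):
    """Toy model: place each task on the node with the fewest open threads, but only as
    of the most recent metrics refresh (taken every refresh_minutes, starting at minute 0)."""
    threads_per_task = 50  # hypothetical open threads added by one running task
    placed = []  # (start_minute, node) for tasks already assigned
    for t in sorted(start_minutes):
        last_refresh = (t // refresh_minutes) * refresh_minutes
        # The balancer cannot see tasks that started after the last refresh.
        seen = {n: 0 for n in nodes}
        for start, node in placed:
            if start < last_refresh:
                seen[node] += threads_per_task
        target = min(nodes, key=lambda n: seen[n])  # ties go to the first node
        placed.append((t, target))
    return [node for _, node in placed]

print(assign_tasks([0, 1, 2], refresh_minutes=3))
# ['node-1', 'node-1', 'node-1']
print(assign_tasks([0, 5, 10], refresh_minutes=3))
# ['node-1', 'node-2', 'node-1']
```

    In this model, any spacing at least as long as the refresh interval lets each task see the ones started before it, while shorter spacing lets several tasks read the same stale snapshot and pile onto one node.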
    
    In terms of “overwhelms”, there were several different tests:
    
    All 7k locations at once: the system chugged along with memory/CPU at 95%, which caused other jobs to fail. Then, after about 250 million records were loaded, it crashed because it ran out of local storage.
    
    With Pipeline Execute, all the pipelines spawned and again pegged memory/CPU at 95% (at least as far as we could see on the monitors). We also noticed degraded performance (it seemed to slow down) on very large pull requests 1-2 hours in.
    
    Our new approach is working. I hope SnapLogic can make the Open Threads value update faster; that would improve resource load balancing for Cloudplex users!
    
    Thanks for your time, Tim. I much appreciate you looking at this.
