Snap peak memory usage and estimating required memory for a Snaplex node

The “Memory Allocated” section of the “Check Pipeline Execution Statistics” documentation page states, “Note that this number does not reflect the amount of memory that was freed and it is not the peak memory usage of the Snap. So, it is not necessarily a metric that can be used to estimate the required size of a Snaplex node.” I am interested in being able to determine the peak memory used by a Snap during execution, or at least the maximum that could be used. Is this possible?

Are the following valid?

Sort snap - the peak memory actually used can't be determined, but the maximum that could be used can be determined by looking at the Maximum memory % setting.

In-memory Lookup snap - The maximum that could be used can be determined by looking at the Maximum memory %, and the peak memory used would be the total memory allocated from the pipeline execution statistics.

Join snap (with unsorted input streams) - would this be similar to the Sort snap?

Aggregate snap - ???

Also, is there any way to estimate the required memory for a Snaplex node? Or any way to estimate how much of a workload a given amount of memory can support?


I have to ask, why do you think you need to know? Are you seeing memory-related issues right now when running pipelines? Ideally, this is not something that you should be worrying about.

Also, keep in mind that something like the Sort snap may not allocate that much memory since it is not creating documents. The memory for the documents is allocated earlier in the pipeline by Snaps like Parsers, DB Selects, and so on.

That being said, it’s a bit hard to know the peak usage for a single snap in isolation. You can probably get an idea of how much the whole pipeline is consuming by running only that pipeline on a Snaplex and observing the overall memory usage for the node that the pipeline ran on.

Note that the amount of memory used by a pipeline can vary depending on how fast it runs. For example, if a pipeline is mostly/all streaming (i.e. doesn’t use collecting snaps like Sort), it will consume more memory when the data sources are fast compared to when they are slow. In other words, faster data sources mean there will be more documents in flight, which means more memory consumption.
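To make the in-flight point concrete, here's a toy simulation (my own sketch, not SnapLogic internals): documents queue up between a source and a downstream Snap, and a faster source leaves more documents, and therefore more memory, live at once.

```python
from collections import deque

def simulate_in_flight(produce_per_tick, consume_per_tick, ticks=100):
    """Toy model of in-flight documents between an upstream source and a
    downstream consumer. Returns the peak queue depth observed."""
    queue = deque()
    peak = 0
    for _ in range(ticks):
        for _ in range(produce_per_tick):
            queue.append(object())  # one in-flight document
        peak = max(peak, len(queue))
        for _ in range(min(consume_per_tick, len(queue))):
            queue.popleft()
    return peak

# A source that outruns the consumer leaves far more documents in flight,
# so far more memory stays allocated while the pipeline runs.
fast_peak = simulate_in_flight(produce_per_tick=10, consume_per_tick=5)
slow_peak = simulate_in_flight(produce_per_tick=2, consume_per_tick=5)
```

Here `fast_peak` grows with every tick while `slow_peak` stays flat, which is the point: the same pipeline design can have very different memory footprints depending purely on endpoint speed.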

The Sort snap has some idea of how much memory the documents it has ingested are keeping alive, in order to know when to spill to disk. It might be a good idea for the snap to surface this number so that you can tune this property appropriately.

Otherwise, yes, the snap should limit its memory usage based on the Max mem property.
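For anyone curious what that spill-to-disk behavior looks like in principle, here's a rough external-sort sketch (an illustration of the general technique, not the Sort snap's actual code): it tracks the approximate bytes kept alive by buffered documents and spills a sorted run to disk whenever a memory budget, playing the role of the Max mem property, is exceeded.

```python
import heapq
import pickle
import sys
import tempfile

def external_sort(docs, max_bytes, key=lambda d: d):
    """Sort an iterable under a memory budget by spilling sorted runs
    to temporary files, then k-way merging the runs."""
    runs, buf, buf_bytes = [], [], 0

    def spill():
        nonlocal buf, buf_bytes
        buf.sort(key=key)
        f = tempfile.TemporaryFile()
        pickle.dump(buf, f)
        f.seek(0)
        runs.append(f)
        buf, buf_bytes = [], 0

    for d in docs:
        buf.append(d)
        buf_bytes += sys.getsizeof(d)  # crude estimate of live memory
        if buf_bytes >= max_bytes:     # budget exceeded: spill a run
            spill()
    if buf:
        spill()
    # Merge the sorted runs (a real implementation would stream each run
    # from disk instead of loading it whole).
    return list(heapq.merge(*(pickle.load(f) for f in runs), key=key))
```

The number the snap tracks internally corresponds to `buf_bytes` here; surfacing it would tell you how close you are to the spill threshold.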

Yes, this is somewhat similar to Sort, except it keeps alive the documents that arrived on the right input view rather than all of the ones arriving on the left input view.

I think Join uses a different mechanism and might use the disk more aggressively.
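To illustrate what "keeping the right input alive" means, here's a minimal hash-join sketch (an assumption about the general technique, not the Join snap's implementation): the right input is materialized in a hash table while the left input streams through, so memory use is driven almost entirely by the right side.

```python
def hash_join(left, right, on):
    """Join two streams of dict documents on a shared field.
    The right stream is held in memory; the left stream is not."""
    table = {}
    for doc in right:                        # right side kept alive
        table.setdefault(doc[on], []).append(doc)
    for doc in left:                         # left side streams through
        for match in table.get(doc[on], []):
            yield {**doc, **match}
```

This is why putting the smaller input on the right view matters for memory: the hash table's size tracks the right input, regardless of how large the left input is.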

An Aggregate probably doesn’t keep much memory alive compared to the others.
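That matches how a streaming aggregation typically works: each document is folded into a small running state (one entry per group) and then discarded, so memory grows with the number of groups rather than the number of documents. A sketch (illustrative only, not the Aggregate snap's code):

```python
def streaming_aggregate(docs, group_field, value_field):
    """Fold each document into per-group running totals; the documents
    themselves are never retained, only (sum, count) per group."""
    state = {}
    for doc in docs:
        k = doc[group_field]
        total, count = state.get(k, (0, 0))
        state[k] = (total + doc[value_field], count + 1)
    return {k: {"sum": s, "avg": s / c} for k, (s, c) in state.items()}
```

Ten documents or ten million: if they fall into the same handful of groups, the live state stays the same size.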

There’s so much variability here, with the size of documents, the speed of the endpoints, and the design of the pipeline, that this is pretty hard to do generically. For the most part, I think you just have to try it out and check the resource graphs of the nodes in the dashboard.