Forum Discussion
Hi bojanvelevski,
Thanks for the reply. I tried the Group by N snap too, but I forgot that we can read the columns using the index path. Now, to get my output, I need to group by some column, but there is no column to group on. I tried the Aggregate snap instead (since I can't group by), but I only got the date in the output and I need the file name as well. Any advice here?
The output should be as below. Thanks
I have to ask, why do you think you need to know? Are you seeing memory-related issues right now when running pipelines? Ideally, this is not something that you should be worrying about.
Also, keep in mind that something like the Sort snap may not allocate that much memory since it is not creating documents. The memory for the documents is allocated earlier in the pipeline by Snaps like Parsers, DB Selects, and so on.
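As a loose illustration (plain Python, not SnapLogic internals; the sizes are made up), a downstream stage that buffers documents mostly holds references; the bulk of the memory was allocated when the documents were created upstream:

```python
import sys

# The "parser" allocates the document payloads up front...
docs = [{"row": i, "payload": "x" * 10_000} for i in range(1_000)]

# ...while a downstream collecting stage (say, a sort buffer) only holds
# references to them. The buffer's own overhead is tiny by comparison.
sort_buffer = list(docs)
print("buffer overhead:", sys.getsizeof(sort_buffer), "bytes")        # ~8 KB of pointers
print("one payload:    ", sys.getsizeof(docs[0]["payload"]), "bytes") # ~10 KB each
```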
That being said, it’s a bit hard to know the peak usage for a single snap in isolation. You can probably get an idea of how much the whole pipeline is consuming by running only that pipeline on a Snaplex and observing the overall memory usage for the node that the pipeline ran on.
Note that the amount of memory used by a pipeline can vary depending on how fast it runs. For example, if a pipeline is mostly or entirely streaming (i.e. it doesn't use collecting snaps like Sort), it will consume more memory when the data sources are fast than when they are slow. In other words, faster data sources mean more documents in flight, which means more memory consumption.
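To make the in-flight idea concrete, here is a minimal sketch (plain Python with a bounded buffer, not SnapLogic's actual mechanics; the queue size and document shape are assumptions). A fast producer keeps the buffer, and therefore memory, full, while a slow one leaves it nearly empty:

```python
import queue
import threading
import time

# Assumed capacity; the point is that in-flight documents are what consume memory.
buffer = queue.Queue(maxsize=1024)

def producer(delay_s):
    for i in range(10_000):
        buffer.put({"id": i, "payload": "x" * 1_000})  # doc stays alive while queued
        time.sleep(delay_s)  # delay_s ~ 0 simulates a fast source: the queue fills up
    buffer.put(None)  # sentinel to stop the consumer

def consumer():
    while (doc := buffer.get()) is not None:
        pass  # downstream work; the doc becomes collectable once consumed

threading.Thread(target=producer, args=(0.0,)).start()
consumer()
```

With delay_s near zero the queue sits at its 1024-document cap for most of the run; with a larger delay it rarely holds more than a handful of documents.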
The Sort snap has some idea of how much memory the documents it has ingested are keeping alive; that is how it knows when to spill to disk. It might be a good idea for the snap to surface this number so that you can tune the Max mem property appropriately.
Otherwise, yes, the snap should limit its memory usage based on the Max mem property.
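For intuition, here is a minimal external-sort sketch (plain Python, not the Sort snap's actual code; the memory budget, sort key, and spill format are assumptions). It shows how a collecting sort can cap its memory at roughly a configured budget by spilling sorted runs to disk and merging them at the end:

```python
import heapq
import json
import tempfile

def _spill(sorted_batch):
    """Write one sorted run to a temp file, one JSON document per line."""
    f = tempfile.NamedTemporaryFile("w", delete=False, suffix=".run")
    for doc in sorted_batch:
        f.write(json.dumps(doc) + "\n")
    f.close()
    return f.name

def external_sort(docs, key, max_mem_bytes=1 << 20):
    """Sort an iterable of dicts, spilling to disk when the (crudely
    estimated) in-memory batch size exceeds the budget."""
    runs, batch, size = [], [], 0
    for doc in docs:
        batch.append(doc)
        size += len(json.dumps(doc))          # rough size estimate, an assumption
        if size >= max_mem_bytes:             # budget hit: spill a sorted run
            runs.append(_spill(sorted(batch, key=key)))
            batch, size = [], 0
    if batch:
        runs.append(_spill(sorted(batch, key=key)))
    # k-way merge of sorted runs: only one document per run is in memory at a time
    readers = [(json.loads(line) for line in open(path)) for path in runs]
    return heapq.merge(*readers, key=key)
```

Peak memory stays near max_mem_bytes regardless of input size, which is the behavior the Max mem property is asking for.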
Yes, this is somewhat similar to Sort, except that it keeps alive the documents arriving on the right input view, rather than everything arriving on the left input view.
I think Join uses a different mechanism and might use the disk more aggressively.
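As a rough analogy (a generic hash-join sketch in Python, not necessarily the Join snap's actual mechanism; the field names are made up), the right input is buffered in a hash table while the left input streams through:

```python
from collections import defaultdict

def hash_join(left_docs, right_docs, key):
    """Inner join: the whole right side is held in memory (this is what the
    snap keeps alive), while the left side streams one document at a time."""
    table = defaultdict(list)
    for r in right_docs:                      # build phase: right input stays resident
        table[r[key]].append(r)
    for l in left_docs:                       # probe phase: left input is streaming
        for r in table.get(l[key], []):
            yield {**l, **r}

rows = hash_join(
    left_docs=[{"id": 1, "name": "a"}, {"id": 2, "name": "b"}],
    right_docs=[{"id": 1, "size": 10}],
    key="id",
)
print(list(rows))  # [{'id': 1, 'name': 'a', 'size': 10}]
```

This is also why it helps to put the smaller data set on the buffered input: memory scales with that side, not with the streaming side.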
An Aggregate probably doesn’t keep much memory alive compared to the others.
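The intuition (again a generic Python sketch, not the Aggregate snap's implementation; the field names are assumptions) is that an aggregation only needs one small accumulator per group, so documents can be discarded as soon as they are folded in:

```python
def sum_by_group(docs, group_field, value_field):
    """Streaming aggregation: memory grows with the number of distinct
    groups, not with the number of incoming documents."""
    totals = {}
    for doc in docs:                       # each doc is dropped right after
        g = doc[group_field]               # updating its group's accumulator
        totals[g] = totals.get(g, 0) + doc[value_field]
    return totals

print(sum_by_group(
    [{"file": "a.csv", "rows": 5}, {"file": "a.csv", "rows": 3},
     {"file": "b.csv", "rows": 7}],
    group_field="file", value_field="rows",
))  # {'a.csv': 8, 'b.csv': 7}
```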
There’s so much variability here, with the size of documents, the speed of the endpoints, and the design of the pipeline, that this is pretty hard to do generically. For the most part, I think you just have to try it out and check the resource graphs of the nodes in the dashboard.