Forum Discussion
7 Replies
- koryknick (Employee)
@winosky - I did the following write-up a while back on memory intensive snaps:
Memory Intensive Snaps
Though not exhaustive, the following list of snaps should be used with consideration and an understanding of their impact on memory. Any snap that aggregates the streaming documents can severely impact your Snaplex node's memory and may cause node failure if the JVM heap size is exceeded.
- Aggregate (if not using sorted input)
- Sort (if large “Maximum Memory %” is used)
- In-Memory Lookup
- Join (if not using sorted input)
- Gate
- Group By Fields
- Group By N
- Parquet Formatter
Snaps such as Gate and the Group By snaps, which move individual documents into arrays of documents, can easily consume gigabytes of memory very quickly. Consider a pipeline processing 5 million records with an average document size of 1KB. If a Gate snap is used, it will place all 5 million records into a single array, consuming 1KB * 5 million = 5GB.
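As a quick back-of-the-envelope check, the arithmetic looks like this (plain Python; the figures are the illustrative assumptions above, not measurements from a real Snaplex):

```python
# Rough heap estimate for a Gate snap that collects every input document into
# a single array. All numbers are the illustrative assumptions from the text.
avg_doc_size_kb = 1
record_count = 5_000_000

gate_array_gb = avg_doc_size_kb * record_count / 1024 / 1024
print(f"Gate output array: ~{gate_array_gb:.1f} GB")   # ~4.8 GiB, i.e. the ~5GB quoted above
```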
The Group By snaps, while they never hold more than a single group of documents at a time, can still use large amounts of memory. Consider the following common pipeline pattern (a Sort followed by Group By Fields, feeding a Pipeline Execute), typically used to split processing across different groups of data, or as an attempt to “batch” processing into smaller datasets:
Let’s take the same example dataset we had before: 5 million records of 1KB documents, and we’ll assume we have 50 groups that are evenly distributed across the 5 million records, or 100,000 records each. Remember that the Sort snap has a “Maximum Memory %” setting that ensures it will cache data to disk to avoid using too much memory during the sort process. So we are really only concerned with the Group By Fields snap in this pipeline, which uses the sorted input and emits each group as soon as the key values change, meaning the most memory the Group By snap holds at any one time is 1KB * 100,000 records, which is 100MB. That is still well below any concerning memory consumption for a single snap.

However, SnapLogic buffers up to 1024 documents between snaps, so in this case the buffer can hold all 50 groups of 100MB each, which is still the full 5GB. If the number of input documents is increased to 50 million records, the memory consumed by this pipeline could reach 50GB, assuming that the child pipeline called by Pipeline Execute takes longer to process the documents than it takes the Group By to produce them.
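To make that arithmetic concrete, here is a small sketch (plain Python; all figures are the illustrative assumptions from the example above, including the 1024-document inter-snap buffer):

```python
# Illustrative arithmetic for the Group By Fields + downstream buffering case.
# All figures are assumptions from the example above, not measured values.
avg_doc_size_kb = 1
records_per_group = 100_000
group_count = 50
inter_snap_buffer_docs = 1024          # documents buffered between snaps

group_doc_mb = avg_doc_size_kb * records_per_group / 1024     # one group document
buffered_groups = min(group_count, inter_snap_buffer_docs)    # all 50 groups fit in the buffer
buffered_gb = buffered_groups * group_doc_mb / 1024
print(f"One group document: ~{group_doc_mb:.0f} MB")          # ~98 MB
print(f"Worst case buffered between snaps: ~{buffered_gb:.1f} GB")  # ~4.8 GiB, i.e. the full ~5GB
```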
- winosky (New Contributor III)
This is awesome @koryknick, this sheds some light. I’m going to assume the JSON Formatter snap is also on that list, as that was the one that did it for me. I do also have a Gate snap, but the plex didn’t start falling over until I added the JSON Formatter, so it sounds like it was a combination of memory intensive snaps.
Now that I think about it, it might not even be just memory intensive snaps but enough snaps to cause the heap size to go over its limit.
Fortunately I found a better alternative, but I’m interested to know whether you developed a method for throttling read snaps, or is throwing RAM at the plex your conclusion?
Thanks
- koryknick (Employee)
It is possible that JSON Formatter would contribute, especially if following a Gate that consumes the entire input stream into memory. A JSON Splitter may do the same thing since it reads one document at a time to be able to split… if that document is particularly large, it would need enough memory to load it.
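Purely as a plain-Python illustration of that point (this is not SnapLogic code; the file name is hypothetical, and ijson is just one example of a streaming parser), materializing a whole JSON array costs memory proportional to its size, while streaming its elements keeps memory roughly flat:

```python
import json
import ijson              # third-party streaming JSON parser, one possible option

PATH = "big_array.json"   # hypothetical file holding one large JSON array

# Whole-document approach: the entire array is materialized before anything
# can be split, so a multi-GB document needs a comparable amount of memory.
with open(PATH) as f:
    records = json.load(f)
print(len(records))

# Streaming approach: elements are yielded one at a time, keeping memory flat.
count = 0
with open(PATH, "rb") as f:
    for _record in ijson.items(f, "item"):   # iterates the top-level array's elements
        count += 1
print(count)
```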
The number of snaps doesn’t necessarily contribute to the in-use memory unless you are passing one very large document and the snaps need to allocate and de-allocate the memory.
My recommended remediation would be pipeline redesign to limit the number and type of memory intensive snaps. As with most things, there are a number of ways to accomplish the same goal. The art of this science is to find the best solution for the given environment.
There are some enhancements to many of the snaps I listed that can help limit the amount of memory being used.
- winosky (New Contributor III)
Ok thanks for the insight Kory, appreciate the help.
- tstack (Former Employee)
Are you asking about the document in memory, or when it’s read from or written to disk?
In memory, the document is made up of Java objects, and I don’t think we add any limits beyond what is available in native Java and how much memory the node has. One limit that comes to mind is that the size of a byte array/string must be less than 2GB.
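For reference, that ceiling comes from Java arrays (including the byte[] behind large strings) being indexed by a signed 32-bit int. A quick way to sanity-check a serialized document size against it (plain Python; the document sizes below are made-up examples):

```python
# A single Java array can hold at most 2**31 - 1 elements, so one byte[] or
# String tops out just under 2 GiB. The sizes checked here are hypothetical.
JVM_SINGLE_ARRAY_LIMIT_BYTES = 2**31 - 1

for size_gb in (0.5, 1.9, 2.5):
    size_bytes = int(size_gb * 1024**3)
    fits = size_bytes <= JVM_SINGLE_ARRAY_LIMIT_BYTES
    print(f"{size_gb} GB serialized document fits in one byte array/string: {fits}")
```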
I don’t think there are any limits when serializing/deserializing.
Why are you asking?
- stephenknilans (Contributor)
One thing that I COULD add is that there IS a limit somewhere on preview mode. It could be on the read, or internal, but it is there, at least in the Snaplexes I have used on the cloud. I guess they figure there is no reason to have a test larger than that, so you have to make the test data sets relatively small. If you actually run the pipeline as a background task, or scheduled, there is no such small limit; I have run what I would consider large data sets, with some large JSON records. So if you are asking because it isn’t running in preview mode, there is your answer. BTW, the problem I described can also produce WEIRD errors that may send you chasing a red herring. In such a case, try running it outside of preview mode; if it runs, the dataset was probably too big for preview mode.
- winosky (New Contributor III)
I have the same question; specific snaps seem to be bottlenecks and are causing the plex to go down.