Forum Discussion
@winosky - I did the following write-up a while back on memory intensive snaps:
Memory Intensive Snaps
Though not exhaustive, the following snaps should be used with some consideration and an understanding of their impact on memory. Any snap that aggregates the streaming documents can severely impact your Snaplex node memory and may cause node failure if the JVM Heap Size is exceeded.
- Aggregate (if not using sorted input)
- Sort (if large “Maximum Memory %” is used)
- In-Memory Lookup
- Join (if not using sorted input)
- Gate
- Group By Fields
- Group By N
- Parquet Formatter
Snaps such as Gate and the Group By snaps, which move individual documents into arrays of documents, can consume gigabytes of memory very quickly. Consider processing 5 million records with an average document size of 1KB. If a Gate snap is used, it will place all 5 million records into a single array, consuming 1KB * 5 million = 5GB.
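As a rough sketch of that arithmetic (plain Python, not SnapLogic code; the 1KB document size and 5 million record count are just the illustrative figures from the example):

```python
# Back-of-the-envelope estimate of the array a Gate snap would build
# if it collects an entire stream into one output document.
avg_doc_size_kb = 1           # illustrative average document size (KB)
record_count = 5_000_000      # illustrative number of streaming documents

total_gb = avg_doc_size_kb * record_count / 1_000_000  # KB -> GB (decimal units)
print(f"Gate snap array size: ~{total_gb:.0f} GB")     # ~5 GB
```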
The Group By snaps, while they hold no more than a single group of documents in memory at a time, can still use large amounts of memory. Consider the following common pipeline pattern, typically used to split processing to handle different groups of data, or as an attempt to “batch” processing into smaller datasets:
Let’s take the same example dataset we had before: 5 million records of 1KB documents, and we’ll assume we have 50 groups evenly distributed across the 5 million records, or 100,000 records per group. Remember that the Sort snap has a “Maximum Memory %” setting that ensures it will cache data to disk to avoid using too much memory during the sort process. So we are really only concerned with the Group By Fields snap in this pipeline, which uses the sorted input and outputs each group as soon as the key values change, meaning the most memory the Group By snap uses will be 1KB * 100,000 records, or 100MB. That is still well below any concerning memory consumption for a snap.

However, SnapLogic buffers up to 1024 documents between snaps, and here each buffered document is an entire group, so it can buffer all 50 groups of 100MB each, which is still the full 5GB. If the number of input documents is increased to 50 million records, the memory consumed by this pipeline could be 50GB, assuming that the child pipeline called in Pipeline Execute takes longer to process the documents than it does for the Group By to produce them.
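Here is a similar back-of-the-envelope sketch for the Group By scenario (again plain Python with the illustrative figures above; it assumes evenly sized groups and that the downstream child pipeline is slower than the Group By, so the buffered groups pile up):

```python
# Rough estimate of per-group and worst-case buffered memory for the
# Group By Fields -> Pipeline Execute pattern described above.
avg_doc_size_kb = 1
record_count = 5_000_000
group_count = 50
buffer_slots = 1024            # documents buffered between snaps

records_per_group = record_count // group_count               # 100,000
group_size_mb = avg_doc_size_kb * records_per_group / 1_000   # ~100 MB

# Each buffered "document" after the Group By is a whole group, so up to
# min(group_count, buffer_slots) groups can sit in the buffer at once.
buffered_groups = min(group_count, buffer_slots)
worst_case_gb = buffered_groups * group_size_mb / 1_000

print(f"Per-group memory:          ~{group_size_mb:.0f} MB")   # ~100 MB
print(f"Worst-case buffered total: ~{worst_case_gb:.0f} GB")   # ~5 GB
```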
This is awesome @koryknick, this sheds some light. I’m going to assume the JSON Formatter snap is also in that list, as that was the one that did it for me. Although I do also have a Gate snap, the plex didn’t start falling over until I added the JSON Formatter, so it sounds like a combination of memory-intensive snaps.
Now that I think about it, it might not even be just memory-intensive snaps but enough snaps to cause the heap size to go over its limit.
Fortunately I found a better alternative, but I’m interested whether you developed a method for throttling read snaps, or is throwing RAM at the plex your conclusion?
Thanks