Ingest large number of small files as a sequence file and store file metadata relating to each file

Created by @pkona


This pipeline pattern ingests a large number of small files as a single sequence file and stores the metadata for each file alongside it.

If you have a number of small files on SFTP/S3/HDFS/Kafka, you may want to store these files and their metadata in a single larger file for efficient storage.

You can use this pattern if you have multiple small files that each relate to a certain event, such as a few hundred small files generated in an hour. The pattern retrieves the files, extracts their metadata, and ingests them as a single sequence file that contains the zipped source files and the file metadata. The output can then be written to any target location using the File Writer Snap, including S3, WASB, ADL, HDFS, SFTP, and other protocols.
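
For readers who want to see the idea outside of a pipeline, the following is a minimal Python sketch of the same flow: list a directory, read each file, compress it, collect its metadata, and append everything to one container file. It is an illustration only and not how the Snaps are implemented. The function names (`build_records`, `write_container`, `read_container`), the metadata fields, the use of gzip, and the length-prefixed layout are all assumptions made for this example; in particular, the layout is a simplified stand-in and is not the Hadoop SequenceFile binary format.

```python
import gzip
import json
from pathlib import Path


def build_records(source_dir):
    """Yield one (key, value) pair per file in source_dir.

    key   -> JSON string holding file metadata (hypothetical fields)
    value -> gzip-compressed file contents (gzip is an assumption)
    """
    for path in sorted(Path(source_dir).iterdir()):
        if not path.is_file():
            continue
        raw = path.read_bytes()
        metadata = {
            "name": path.name,
            "size": len(raw),
            "modified": int(path.stat().st_mtime),
        }
        yield json.dumps(metadata, sort_keys=True), gzip.compress(raw)


def write_container(records, target_path):
    """Write all records into one file and return the record count.

    Each record is stored as: 4-byte key length, key bytes,
    4-byte value length, value bytes. This layout is a simplified
    stand-in, NOT the Hadoop SequenceFile format.
    """
    count = 0
    with open(target_path, "wb") as out:
        for key, value in records:
            key_bytes = key.encode("utf-8")
            out.write(len(key_bytes).to_bytes(4, "big"))
            out.write(key_bytes)
            out.write(len(value).to_bytes(4, "big"))
            out.write(value)
            count += 1
    return count


def read_container(target_path):
    """Read the container back as a list of (metadata dict, original bytes)."""
    records = []
    with open(target_path, "rb") as src:
        while True:
            header = src.read(4)
            if not header:
                break  # end of file
            key = src.read(int.from_bytes(header, "big")).decode("utf-8")
            value_len = int.from_bytes(src.read(4), "big")
            value = gzip.decompress(src.read(value_len))
            records.append((json.loads(key), value))
    return records
```

Calling `write_container(build_records("some_dir"), "out.bin")` packs every file in the hypothetical directory `some_dir` into `out.bin`, and `read_container("out.bin")` recovers each file's metadata and original bytes. Keeping the metadata as the record key means a consumer can inspect names and sizes without decompressing the payloads.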

Configuration

Sources: file
Targets: sequence file
Snaps used: Directory Browser, File Reader, Compress, Binary to Document, Mapper, Sequence Formatter, HDFS Writer

Downloads

IngestPattern - Ingest multiple small files as SequenceFile.slp (11.4 KB)