Determine changes between two data files

Created by @pkona

This standard mode pipeline pattern compares two data files and writes each kind of difference to its own output file: inserts, updates, deletes, unchanged records, and the older versions of updated records.
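The five categories can be illustrated outside the pipeline with a short Python sketch. This is not the logic of the attached pipeline or its Script Snap, only a minimal illustration of the classification. It assumes both files fit in memory and share a unique key column. The column name `id`, the function `classify_changes`, and the sample data are all hypothetical.

```python
import csv
import io


def classify_changes(old_rows, new_rows, key="id"):
    """Split two record sets into the five change categories.

    Assumes `key` (hypothetical column name) is unique within each file.
    """
    old_by_key = {row[key]: row for row in old_rows}
    new_by_key = {row[key]: row for row in new_rows}

    result = {
        "inserts": [],
        "updates": [],
        "deletes": [],
        "no_changes": [],
        "older_records_for_updates": [],
    }

    for k, new_row in new_by_key.items():
        old_row = old_by_key.get(k)
        if old_row is None:
            # Key exists only in the new file.
            result["inserts"].append(new_row)
        elif old_row == new_row:
            # Same key, identical values.
            result["no_changes"].append(new_row)
        else:
            # Same key, different values: keep both versions.
            result["updates"].append(new_row)
            result["older_records_for_updates"].append(old_row)

    for k, old_row in old_by_key.items():
        if k not in new_by_key:
            # Key exists only in the old file.
            result["deletes"].append(old_row)

    return result


def read_csv(text):
    """Parse CSV text into a list of dicts, one per record."""
    return list(csv.DictReader(io.StringIO(text)))


# Illustrative sample data, not taken from the pipeline.
OLD_CSV = "id,name,city\n1,Ann,Oslo\n2,Bob,Rome\n3,Cy,Lima\n"
NEW_CSV = "id,name,city\n1,Ann,Oslo\n2,Bob,Paris\n4,Dee,Cairo\n"

changes = classify_changes(read_csv(OLD_CSV), read_csv(NEW_CSV))
for category, rows in changes.items():
    print(category, [row["id"] for row in rows])
```

Keying both files by a dictionary keeps the comparison to a single pass over each file, at the cost of holding both in memory, which is why an approach like this only fits the small to medium file sizes this pattern targets.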

This pattern is best suited to small and medium files (fewer than 50 million records). For larger files (more than 5 GB or 100 million records), consider using Spark mode pipelines for better performance.


Velocity templates help generate code, schemas, and text.

Sources: Files
Targets: Files
Snaps used: CSV Generator, JSON Generator, Head, CSV Formatter, CSV Parser, Mapper, Copy, Script, JSON Formatter, File Writer


Generate_csvSchema4Spark.slp (26.7 KB)