08-22-2018 01:25 PM
Created by @pkona
This pipeline pattern determines the changes between two data files: inserts, updates, deletes, no-changes, and older-records-for-updates.
This is a standard mode pipeline pattern that illustrates how to find each of these change types and write each one to its own file.
This pattern is best suited to small and medium files (fewer than 50 million records). For larger files (more than 5 GB or 100 million records), consider using Spark mode pipelines for optimal performance.
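For readers who want to see the comparison logic outside of SnapLogic, here is a minimal Python sketch of the same idea. It is not the pipeline's actual implementation. It assumes both files are CSV, share a unique key column, and fit in memory. The function names `read_csv` and `diff_records`, the key column `id`, and the sample data are all hypothetical.

```python
import csv
import io


def read_csv(text):
    """Parse CSV text into a list of dict rows (header row required)."""
    return list(csv.DictReader(io.StringIO(text)))


def diff_records(old_rows, new_rows, key):
    """Split rows into the five change categories named in this pattern.

    Assumption: `key` uniquely identifies a record in each file.
    """
    old_by_key = {row[key]: row for row in old_rows}
    new_by_key = {row[key]: row for row in new_rows}

    changes = {
        "inserts": [],
        "updates": [],
        "deletes": [],
        "no_changes": [],
        "older_records_for_updates": [],
    }

    for k, new_row in new_by_key.items():
        old_row = old_by_key.get(k)
        if old_row is None:
            changes["inserts"].append(new_row)       # key only in new file
        elif old_row == new_row:
            changes["no_changes"].append(new_row)    # identical in both files
        else:
            changes["updates"].append(new_row)       # new version of the record
            changes["older_records_for_updates"].append(old_row)  # prior version

    for k, old_row in old_by_key.items():
        if k not in new_by_key:
            changes["deletes"].append(old_row)       # key only in old file

    return changes


# Hypothetical sample data, for illustration only.
OLD_CSV = "id,name\n1,Ann\n2,Bob\n3,Cy\n"
NEW_CSV = "id,name\n1,Ann\n2,Bobby\n4,Dee\n"

if __name__ == "__main__":
    result = diff_records(read_csv(OLD_CSV), read_csv(NEW_CSV), key="id")
    for category, rows in result.items():
        print(category, rows)
```

Each list in the result corresponds to one of the separate output files the pattern produces. Building both lookups in memory is only reasonable at small sizes, which is one reason the larger volumes mentioned above call for a different execution mode.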
Velocity templates help generate code, schemas, and text.
Sources: Files
Targets: Files
Snaps used: CSV Generator, JSON Generator, Head, CSV Formatter, CSV Parser, Mapper, Copy, Script, JSON Formatter, File Writer
Generate_csvSchema4Spark.slp (26.7 KB)