I have a requirement where the first record with a given field value needs to pass through, and all subsequent records with that same field value must be stripped out and written to a different file.
In the example below, only the first row for each ID needs to pass through.
ID  Type
A   Alpha
B   Alpha
B   Beta
C   Gamma
C   Delta
So the Pipeline needs to output this:
ID  Type
A   Alpha
B   Alpha
C   Gamma
Then it needs to output the following to a file, so the admins know a duplicate row was found for these records:
ID  Type
B   Beta
C   Delta
I have a Pipeline that inserts row numbers with a Sequence Snap, copies the data stream, runs an Aggregate with a Group By, and finishes with a Join to bring the other fields back in.
Here is the Aggregate Snap (I changed the Group By field name to Type in order to match the example above, so please ignore the warning)
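To be clear about the behavior I'm after, here is a plain-Python sketch of the split (not SnapLogic code, just an illustration; the field name `ID` matches the example above):

```python
def split_first_and_duplicates(records, key="ID"):
    """Route the first record per key to 'first'; all later ones to 'duplicates'."""
    seen = set()
    first, duplicates = [], []
    for rec in records:
        if rec[key] in seen:
            duplicates.append(rec)  # repeat key: goes to the admins' file
        else:
            seen.add(rec[key])
            first.append(rec)       # first occurrence: passes through
    return first, duplicates

rows = [
    {"ID": "A", "Type": "Alpha"},
    {"ID": "B", "Type": "Alpha"},
    {"ID": "B", "Type": "Beta"},
    {"ID": "C", "Type": "Gamma"},
    {"ID": "C", "Type": "Delta"},
]
first, dupes = split_first_and_duplicates(rows)
# first holds A/Alpha, B/Alpha, C/Gamma; dupes holds B/Beta, C/Delta
```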
This works to retrieve the first row, but I have two questions:
Is this really the best way to do this?
How do I capture the duplicate rows that were rejected by the Join?