How to keep first row in a set?

I have a requirement where I need to capture the first record with a given field needs to pass through and all subsequent records with that field value are stripped and written to a different file.

In the example below, only the first row for each ID needs to pass through.

ID   Type
A    Alpha
B    Alpha
B    Beta
C    Gamma
C    Delta

So the Pipeline needs to output this:

ID   Type
A    Alpha
B    Alpha
C    Gamma

Then output this to a file to let the admins know that a duplicate row was found for these records:

ID   Type
B    Beta
C    Delta

I have a Pipeline that inserts row numbers with a Sequence, copies the data stream, does an Aggregate with a Group By, then finishes with a Join to bring the other fields back in.

Here is the Aggregate Snap (I changed the Group By field name to Type in order to match the example above, so please ignore the warning)

image

This works to retrieve the first row, but I have two questions:

  1. Is this really the best way to do this?

  2. How do I capture the duplicate rows that were rejected by the Join?

Hi, Please use the attached pipeline, if this resolves your problem.
The output of the mapper will show duplicates values, if more than 1 duplicate is there, it will shown separated by ‘|’
AN new pipeline 0_2019_12_20.slp (5.7 KB)

input

output

Regards
Anubhav