cancel
Showing results for 
Search instead for 
Did you mean: 

How to keep first row in a set?

graham-onpoint
New Contributor II

I have a requirement where I need to capture the first record with a given field needs to pass through and all subsequent records with that field value are stripped and written to a different file.

In the example below, only the first row for each ID needs to pass through.

ID   Type
A    Alpha
B    Alpha
B    Beta
C    Gamma
C    Delta

So the Pipeline needs to output this:

ID   Type
A    Alpha
B    Alpha
C    Gamma

Then output this to a file to let the admins know that a duplicate row was found for these records:

ID   Type
B    Beta
C    Delta

I have a Pipeline that inserts row numbers with a Sequence, copies the data stream, does an Aggregate with a Group By, then finishes with a Join to bring the other fields back in.

image

Here is the Aggregate Snap (I changed the Group By field name to Type in order to match the example above, so please ignore the warning)

image

This works to retrieve the first row, but I have two questions:

  1. Is this really the best way to do this?

  2. How do I capture the duplicate rows that were rejected by the Join?

1 REPLY 1

anubhav_nautiya
Contributor

Hi, Please use the attached pipeline, if this resolves your problem.
The output of the mapper will show duplicates values, if more than 1 duplicate is there, it will shown separated by ‘|’
AN new pipeline 0_2019_12_20.slp (5.7 KB)

image input

image output

Regards
Anubhav