Hi,
I am working on a migration project from DataStage to SnapLogic.
In DataStage, the Remove Duplicates stage lets me specify one or more key columns to determine uniqueness (with an upstream Sort on the same keys). It then keeps the first/last occurrence per key depending on sort order.
In SnapLogic, I tried using Deduplicate / Unique (and related approaches), but I’m not getting the same record counts or the same “kept” record per key. Notably:
The Unique snap (in my setup) seems to compare entire documents when determining duplicates, so rows that differ only in non-key columns are not dropped, unlike DataStage, which compares only the key columns.
The Deduplicate snap doesn’t give me deterministic control over which record is kept when duplicates exist, unless I pre-sort and shape the data first.
Minimal Examples
What DataStage does (Remove Duplicates on key columns)
Input (CSV/table):
id,name,city,updated_at
1,Alice,NY,2024-01-01
1,Alice,NY,2024-03-15
2,Bob,SF,2024-02-10
2,Bob,SF,2024-02-05
3,Carol,LA,2024-01-20
Goal (keep newest per id):
Sort upstream by id ASC, updated_at DESC
Remove Duplicates on key = id
Expected output:
id,name,city,updated_at
1,Alice,NY,2024-03-15
2,Bob,SF,2024-02-10
3,Carol,LA,2024-01-20
DataStage ignores non-key differences when deciding duplicates and keeps the first row per key after sort (hence deterministic).
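For reference, here is the behavior I am trying to reproduce, sketched outside SnapLogic in plain Python (the sample rows and sort keys are taken from the example above; this is only an illustration of the DataStage semantics, not a SnapLogic implementation):

```python
# DataStage-style "Sort + Remove Duplicates on key": sort by id ASC,
# updated_at DESC, then keep the first occurrence per id.
rows = [
    {"id": 1, "name": "Alice", "city": "NY", "updated_at": "2024-01-01"},
    {"id": 1, "name": "Alice", "city": "NY", "updated_at": "2024-03-15"},
    {"id": 2, "name": "Bob", "city": "SF", "updated_at": "2024-02-10"},
    {"id": 2, "name": "Bob", "city": "SF", "updated_at": "2024-02-05"},
    {"id": 3, "name": "Carol", "city": "LA", "updated_at": "2024-01-20"},
]

# Python's sort is stable, so two passes give id ASC / updated_at DESC.
# (ISO-8601 date strings sort lexicographically in chronological order.)
rows.sort(key=lambda r: r["updated_at"], reverse=True)
rows.sort(key=lambda r: r["id"])

seen = set()
kept = []
for r in rows:
    if r["id"] not in seen:  # keep only the first row per key
        seen.add(r["id"])
        kept.append(r)

for r in kept:
    print(r["id"], r["updated_at"])
# 1 2024-03-15
# 2 2024-02-10
# 3 2024-01-20
```

Because only `id` is consulted when deciding duplicates, the differing `updated_at` values never make two rows "distinct", and the upstream sort makes the kept row deterministic.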
What I’m seeing in SnapLogic (Deduplicate/Unique)
Input (as documents):
JSON
[
{"id":1,"name":"Alice","city":"NY","updated_at":"2024-01-01"},
{"id":1,"name":"Alice","city":"NY","updated_at":"2024-03-15"},
{"id":2,"name":"Bob","city":"SF","updated_at":"2024-02-10"},
{"id":2,"name":"Bob","city":"SF","updated_at":"2024-02-05"},
{"id":3,"name":"Carol","city":"LA","updated_at":"2024-01-20"}
]
Observed issues:
If Unique considers the entire document, both id=1 rows are treated as distinct because updated_at differs → wrong count vs. DataStage.
Using Deduplicate alone (without shaping) doesn’t guarantee keeping the newest row per id.
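To make the count discrepancy concrete, here is a small Python illustration of whole-document uniqueness versus key-only uniqueness on the sample documents above (this only demonstrates the counting difference I am seeing, not the snaps' actual internals):

```python
# Whole-document uniqueness vs. key-based uniqueness on the sample data.
docs = [
    {"id": 1, "name": "Alice", "city": "NY", "updated_at": "2024-01-01"},
    {"id": 1, "name": "Alice", "city": "NY", "updated_at": "2024-03-15"},
    {"id": 2, "name": "Bob", "city": "SF", "updated_at": "2024-02-10"},
    {"id": 2, "name": "Bob", "city": "SF", "updated_at": "2024-02-05"},
    {"id": 3, "name": "Carol", "city": "LA", "updated_at": "2024-01-20"},
]

# Whole-document comparison: every field participates, so the differing
# updated_at values make all five documents "unique".
whole_doc_unique = {tuple(sorted(d.items())) for d in docs}
print(len(whole_doc_unique))  # 5

# Key-based comparison (DataStage semantics): only id participates.
key_unique = {d["id"] for d in docs}
print(len(key_unique))  # 3
```

So the same input yields 5 records under whole-document dedup but 3 under key-based dedup, which matches the mismatch I observe.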
Could you please suggest how to remove duplicates in SnapLogic the same way DataStage does, i.e., deduplicate on key columns only, keeping a deterministic (newest) record per key?