Combine CSV document into a new file
I’m reaching out regarding a pipeline that I am building. Here are the details:
Requirement: Combine two csv files (one has a header and the other one does not have)
Steps Taken:
I have tried using Join type merge but the output looks like below which is not what I am trying to achieve.
What I expect the result to look like:
marenas - I would not recommend using Gate to combine the data files - it can cause excessive memory consumption for very large files since the data has to be stored completely in memory. I recommend the attached approach.
The trick here is in the Mapper on the bottom path and the second input view added to the CSV Parser. If you review the documentation, you will see that the second input view allows you to specify a header and also datatypes, if you choose. I simply added the header in the Mapper.
Then in the Union, it combines the data in the way you are looking for.
One thing to note is that Union will take documents as they come from each input view - meaning in this case that if both CSV Parsers are sending a large volume of records, you will see them intermixed - it does not wait for all of the documents on the first path before consuming the documents from the second path. There are easy fixes for this, but thought I would mention it in case it is a requirement that the data ordering be preserved between the input files.
Hope this helps!
marenas - Here is an updated version of the pipeline to preserve the order of the files. Basically, I've simply added a Mapper to each path and placed a file-number and record-number to the documents of both paths before the Union, then sorted the data to ensure proper record ordering, and finally removed the sorting fields from the document.
So a couple new concepts here:
- snap.in.totalCount is a built-in value that gives us the number of records seen by the input view of the snap.
- I enabled Pass-Through in the Mapper settings for the "Add fileN sortKey" snaps to allow the rest of the data to flow through the snap without having to specify it. In the "Remove sortKey" Mapper, I use Pass-Through again to allow all the fields to pass through except the fields that I have in the Expression without specifying the Target-Path - this effectively deletes just those 2 fields.
Hope this helps!