2 weeks ago
Hello,
Below is my requirement: I have to read and parse a CSV file and load each line into a table based on the value in the first column. The file will have many thousands of lines.
A sample of the CSV file is shown below. The first line is the header and has 7 columns, the main body lines have 13 columns, and there is a trailer line at the end with 3 columns.
"HEADER","20250219","053545","108","GRN","AX","4.01"
"SUMMARY","101","","A19","220","USD","19"," 3868","3868"," 102"," 1","0"," P"
"SUBMISSION","101","","A19","220","USD","124","217","217","170126","USD","0500167400","00000010"
"TRANSACTN","102","","A19","220","USD","124","217","217","170126","USD","0500167400","00000010"
"TXNPRICING","103","","A19","220","USD","124","217","217","170126","USD","0500167400","00000010"
"TRAILER","0000000108","0000058371"
I have to parse the CSV file above and load each line into a separate table based on the value of its first column.
For example, if the first column is "HEADER", the line has to be parsed and each comma-separated value loaded into the Header table.
If the first column is "SUMMARY", then each comma-separated value of that line has to be parsed and loaded into the Summary table.
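For reference, the routing logic described above can be sketched in plain Python (a hypothetical standalone script, not a SnapLogic pipeline; the per-table load step is omitted and the sample data is abbreviated):

```python
import csv
import io
from collections import defaultdict

# Abbreviated sample of the file described above.
SAMPLE = '''"HEADER","20250219","053545","108","GRN","AX","4.01"
"SUMMARY","101","","A19","220","USD","19"," 3868","3868"," 102"," 1","0"," P"
"TRAILER","0000000108","0000058371"'''

# Group rows by the record type in the first column; each group would
# then be loaded into its own table (Header, Summary, Trailer, ...).
rows_by_type = defaultdict(list)
for row in csv.reader(io.StringIO(SAMPLE)):
    rows_by_type[row[0]].append(row[1:])

print(sorted(rows_by_type))  # ['HEADER', 'SUMMARY', 'TRAILER']
```

Note that `csv.reader` has no notion of a fixed header, so it never complains about rows having different field counts, which is exactly the property this file needs.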
I am trying the CSV Parser but am getting the error below:
"Reason: The number of CSV data fields is greater than the number of header columns in the input data: number of CSV data fields = 19, number of header columns = 7, Resolution: Please make sure the number of CSV data fields is not greater than the number of header columns."
How can I parse this, and what would be the best way to achieve it? Thanks in advance.
Thanks
Aravind N
Monday
@arvindnsn - I've attached 2 sample pipelines that will both solve the request. The first uses the Fixed Width Parser snap to read the file into individual documents. Then the Router snap looks at the first set of characters in the returned record to determine which path to send it for further processing. Now it simply uses the Fixed Width Formatter snap to re-create each set of data as a binary stream. This is followed by the CSV Parser snap to get the individual fields for each type of data. This allows for simple breakdown of the file into the different data types that you want to process.
The second example pipeline (v2) accomplishes the same result with a slightly more advanced methodology. Note that I'm still using the Fixed Width Parser to return a "$record" value for each line of the input file. But then I use a Mapper snap and a few advanced expressions to break down the records, allowing me to bypass the reformatting and parsing of the data streams. Here is the expression in the Mapper:
$record.split('","').toObject((val,idx,arr)=> idx == 0 ? "recType" : "field" + "%03d".sprintf(idx), (val,idx,arr)=> val.replaceAll('"',''))
There are a few things going on here so I'll break it down.
$record.split('","') simply splits the string into an array of strings, using the double-quote, comma, double-quote sequence as the field separator. This leaves the first and last values with an extra double-quote, which we'll deal with at the end with the String.replaceAll() call. I could have done this with the String.match() method instead, so feel free to play with that if you wish.
Once I have an array, I want to re-create a JSON object that we can use on the output. So the next method is Array.toObject() which uses two callback functions to create the field name and value. For the field name, I want to use "recType" for the first field (HEADER, SUMMARY, etc.). For the value, I can simply return the value of the array element, removing any double-quotes (i.e. the replaceAll() method).
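Outside SnapLogic, the same split-then-build-object logic can be sketched in plain Python (a rough equivalent for illustration only, not the pipeline expression itself):

```python
# One raw line from the file, as the Fixed Width Parser would deliver it.
record = '"TRAILER","0000000108","0000058371"'

# Split on the '","' separator, then build a dict: the first element
# becomes "recType", the rest become "field001", "field002", ...,
# stripping the stray double-quotes left on the first and last values.
parts = record.split('","')
obj = {
    ("recType" if i == 0 else "field%03d" % i): val.replace('"', '')
    for i, val in enumerate(parts)
}

print(obj)
# {'recType': 'TRAILER', 'field001': '0000000108', 'field002': '0000058371'}
```

The dict comprehension plays the role of the two toObject() callbacks: the key expression mirrors the name callback, and `val.replace('"', '')` mirrors the value callback.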
Hope this helps!
a week ago
Check the Header size error policy configuration within the CSV Parser Snap:
With Trim record to fit header you should be able to parse the CSV data.
a week ago
@SpiroTaleski , There are 2 things I wanted to update: the CSV doesn't have a fixed length or a fixed number of columns, and each type (e.g. Header, Summary, Submission) has a different number of columns. I don't want to trim the number of columns; I want to parse each line even though each line will have a different number of columns.
Monday
@koryknick , Thank you for the multiple solutions. I tried both, and the 2nd one is more interesting because it eliminates the CSV Parser snaps after the Fixed Width Parser and uses the Mapper with expressions to derive the solution. Realistically I will be using this methodology, since the CSV file will have at least 15 to 20 different types of line items.
Thank you for both the solutions!!