CSV parser not parsing when CR is found in column content

Hi
We have a CSV file with around 40 columns. One of them is a notes column whose content contains carriage returns (CR). Because of this, the CSV Parser cannot break the lines properly: instead of loading the actual content of the record, it splits the record in two and fails.

We tried working with the binary document, replacing the value, converting the document back to binary, and then running the CSV Parser. No luck.

Are there any other alternatives for handling these records?

Thanks
Regards
smitha

Hi Smitha,

This should be possible. Try looking at this example and see if you can adapt it for your needs.

The mapper can be used to conveniently alter binary data before parsing. See the “views” section in the Mapper documentation.

In the attached pipeline, the mapper uses binary views to convert the input data to a string and then replaces \r\n with a space character.

cr-example.zip (3.1 KB)


Hi tlikarish,

I tried the approach you suggested. As you can see in the attached snapshot, the record spans multiple lines and there is only a CR inside one of the columns, so I changed the expression to $content.toString().replace('\r', ' '). With this, the records still did not load properly; the record was still not treated as a single record.
Can you let us know if we need to do something else?

Thanks

Regards
smitha

csvparser.zip (25.5 KB)

Are you seeing an error message? Since you’re using quotes and the column is quoted, the CSV Parser should treat the carriage return as part of the column’s value and not as a row delimiter. Is it possible the quoting is off?
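
One way to check whether the quoting is off is to scan the raw text and see where the carriage returns fall. The sketch below is plain JavaScript, not a SnapLogic expression. `scanQuoting` is a hypothetical helper, and it assumes RFC 4180-style quoting (fields wrapped in double quotes, embedded quotes doubled) and \r\n row delimiters, neither of which is confirmed for the file in this thread.

```javascript
// Hypothetical diagnostic helper, not part of any SnapLogic API.
// Assumes RFC 4180-style quoting: fields wrapped in ", embedded quotes doubled as "".
function scanQuoting(text) {
  let inQuotes = false;
  let crInsideQuotes = 0;      // carriage returns that sit inside a quoted field
  let bareCrOutsideQuotes = 0; // carriage returns outside quotes, not followed by \n

  for (let i = 0; i < text.length; i++) {
    const ch = text[i];
    if (ch === '"') {
      // A doubled quote ("") toggles twice, so the state is unchanged after it.
      inQuotes = !inQuotes;
    } else if (ch === "\r") {
      if (inQuotes) {
        crInsideQuotes++;
      } else if (text[i + 1] !== "\n") {
        bareCrOutsideQuotes++;
      }
    }
  }
  // If inQuotes is still true at the end, a quote was opened and never closed.
  return { crInsideQuotes, bareCrOutsideQuotes, unterminatedQuote: inQuotes };
}

// Made-up sample data: the first has the notes column quoted, the second does not.
const quotedSample = 'id,notes\r\n1,"line one\rline two"\r\n';
const unquotedSample = 'id,notes\r\n1,line one\rline two\r\n';

console.log(scanQuoting(quotedSample));
console.log(scanQuoting(unquotedSample));
```

A non-zero `bareCrOutsideQuotes` or a true `unterminatedQuote` would suggest the carriage return is not protected by quotes, which could explain a record being split.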

Doh, I also messed up the expression I gave you.

$content.toString().replace('\r', ' ')

This only replaces the first match. You should probably use replaceAll or change the pattern to the regular expression /\r/g. That replaces all carriage returns, so if the lines are delimited with \r\n, then you’d have to use something like /\r(?!\n)/g, which replaces only the carriage returns that are not followed by a line feed.
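
The difference between the three patterns can be seen in plain JavaScript. This is a sketch on made-up sample text, on the assumption that replace in the expression language behaves the way it is described above.

```javascript
// Made-up sample: rows end with \r\n, and the second column holds bare \r characters.
const content = "1,first\rnote\rmore\r\n2,second\r\n";

// A string pattern replaces only the first carriage return.
const firstOnly = content.replace("\r", " ");

// A global regex replaces every carriage return, including those in the \r\n delimiters.
const allCr = content.replace(/\r/g, " ");

// The negative lookahead (?!\n) skips any \r that is immediately followed by \n,
// so the \r\n row delimiters survive and only the bare \r characters become spaces.
const bareCrOnly = content.replace(/\r(?!\n)/g, " ");

// JSON.stringify makes the remaining \r and \n characters visible.
console.log(JSON.stringify(firstOnly));  // "1,first note\rmore\r\n2,second\r\n"
console.log(JSON.stringify(allCr));      // "1,first note more \n2,second \n"
console.log(JSON.stringify(bareCrOnly)); // "1,first note more\r\n2,second\r\n"
```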

What if we aren’t using quotes?

Although I haven’t tried it, I think this expression should work even if your data isn’t quoted.
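
A rough way to see why this could work for unquoted data, again in plain JavaScript on made-up sample text. It assumes rows end with \r\n and the embedded break is a bare \r; the naive split below is only valid because the sample contains no quoted fields.

```javascript
// Made-up unquoted sample: three columns, rows end with \r\n,
// and the notes value of the first data row contains a bare \r.
const raw = "id,name,notes\r\n1,Ann,called\rleft message\r\n2,Bob,ok\r\n";

// Replace only the carriage returns that are not part of a \r\n row delimiter.
const cleaned = raw.replace(/\r(?!\n)/g, " ");

// Naive row and column split (no quote handling), used here only for illustration.
const rows = cleaned
  .split("\r\n")
  .filter((line) => line.length > 0)
  .map((line) => line.split(","));

console.log(rows);
// [ [ 'id', 'name', 'notes' ],
//   [ '1', 'Ann', 'called left message' ],
//   [ '2', 'Bob', 'ok' ] ]
```

If the embedded line break uses the same sequence as the row delimiter, this pattern cannot tell the two apart, and quoting the column is the more reliable fix.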