
How to convert utf-8 file to DOS(Disk Operating System format-ANSI)

rnbabar
New Contributor II

Hi All

I need to convert the following input file, which is in UTF-8, to DOS format. I tried the Transcoder Snap, but it throws an error:

rnbabar_0-1718647586707.png

The Error Description is 

rnbabar_1-1718647647547.png

 

Any suggestion please?

 

 

3 REPLIES 3

endor_force
New Contributor III

"DOS format" is a bit vague; it does not say what the limitations of the target system are.

I get the same error when I send in a file containing Chinese characters and set the output to basically any format other than UTF-*. The error message suggests that something is wrong with the input file and that it may not be UTF-8, but some common tests confirm that the input file is UTF-8:

 

file -i sample.csv
sample.csv: text/plain; charset=utf-8

 

 

Using "chardet" in Python shows 99% confidence that the input file is UTF-8:

 

from chardet.universaldetector import UniversalDetector

files = ['sample.csv']

detector = UniversalDetector()
for filename in files:
    print(filename.ljust(20), end='')
    detector.reset()
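    # Feed the file line by line until chardet is confident enough;
    # the with-block makes sure the file handle is closed afterwards.
    with open(filename, 'rb') as fh:
        for line in fh:
            detector.feed(line)
            if detector.done:
                break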
    detector.close()
    print(detector.result)

 

 

Outputs:

 

sample.csv {'encoding': 'utf-8', 'confidence': 0.99, 'language': ''}

 

 


However, when the output format is set to UTF-16 or UTF-32, the transcode succeeds and creates a file in a different UTF format, as confirmed by chardet:

 

 

test_output.csv          {'encoding': 'utf-32be', 'confidence': 0.85, 'language': ''}

 

 


I have tried several non-UTF output formats, and every one I tried has failed so far.
The failure seems to be related to the Chinese characters: when I remove the lines containing them the transcoding works, but as soon as Chinese characters are present in the file it fails.
This is most likely because the other character sets cannot represent every character that is available in the UTF formats. The UnmappableCharacterException in the stack trace below is raised while writing, which points to a character that cannot be encoded in the target charset rather than to a problem with the input.
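To see which characters are responsible before running the pipeline, the file can be checked outside SnapLogic. The following is a minimal Python sketch, not part of the Snap; the helper name `find_unmappable` and the sample text are made up for illustration, and Python's codec names (`ascii`, `latin-1`) are assumed to correspond to the charsets offered in the Transcoder.

```python
def find_unmappable(text, target_encoding):
    """Return the distinct characters in `text` that `target_encoding`
    cannot represent, sorted by code point."""
    bad = set()
    for ch in text:
        try:
            ch.encode(target_encoding)
        except UnicodeEncodeError:
            bad.add(ch)
    return sorted(bad)


# Hypothetical sample content; replace with open('sample.csv', encoding='utf-8').read()
sample = "id,name\n1,Zürich\n2,北京\n"

for target in ("ascii", "latin-1", "utf-16"):
    print(target.ljust(10), find_unmappable(sample, target))
```

An empty list for a target charset means every character in the text can be encoded in it, so a strict transcode to that charset should not hit an unmappable character.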

I would recommend contacting the SnapLogic support team for further analysis or clarification.

Full error, using UTF-8 to US-ASCII as an example:

 

 

Transcoder[5c46bfd317f60c09d026aecf_6a02bd2b-c54d-4d54-bd03-66d0bd7a5e20 -- c927bdd5-dafb-4217-88d4-aba17d2291de]
com.snaplogic.snap.api.SnapDataException: Failed to transcode from UTF-8 to US-ASCII
	at com.snaplogic.snaps.transform.Transcoder.process(Transcoder.java:125)
	at com.snaplogic.snap.api.write.SimpleBinaryWriteSnap.doWork(SimpleBinaryWriteSnap.java:62)
	at com.snaplogic.snap.api.SimpleBinarySnap.execute(SimpleBinarySnap.java:57)
	at com.snaplogic.cc.snap.common.SnapRunnableImpl.executeSnap(SnapRunnableImpl.java:804)
	at com.snaplogic.cc.snap.common.SnapRunnableImpl.execute(SnapRunnableImpl.java:577)
	at com.snaplogic.cc.snap.common.SnapRunnableImpl.doRun(SnapRunnableImpl.java:869)
	at com.snaplogic.cc.snap.common.SnapRunnableImpl.call(SnapRunnableImpl.java:427)
	at com.snaplogic.cc.snap.common.SnapRunnableImpl.call(SnapRunnableImpl.java:116)
	at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
	at java.base/java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
	at java.base/java.util.concurrent.FutureTask.run(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
	at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
	at java.base/java.lang.Thread.run(Unknown Source)
Caused by: com.snaplogic.cc.snap.common.SnapStreamException: Exception while reading binary data from the stream
	at com.snaplogic.cc.snap.view.binary.BinaryOutputViewImpl.write(BinaryOutputViewImpl.java:279)
	at com.snaplogic.snap.api.OutBoundViewsImpl.write(OutBoundViewsImpl.java:287)
	at com.snaplogic.snaps.transform.Transcoder.process(Transcoder.java:102)
	... 13 more
Caused by: java.nio.charset.UnmappableCharacterException: Input length = 1
	at java.base/java.nio.charset.CoderResult.throwException(Unknown Source)
	at java.base/sun.nio.cs.StreamEncoder.implWrite(Unknown Source)
	at java.base/sun.nio.cs.StreamEncoder.implWrite(Unknown Source)
	at java.base/sun.nio.cs.StreamEncoder.write(Unknown Source)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1613)
	at org.apache.commons.io.IOUtils.copyLarge(IOUtils.java:1591)
	at com.snaplogic.snaps.transform.Transcoder$1.write(Transcoder.java:115)
	at com.snaplogic.cc.snap.view.binary.BinaryOutputViewImpl.write(BinaryOutputViewImpl.java:217)
	... 15 more
Reason: The character set in the input data may not be UTF-8
Resolution: Please select the correct input character set.

 

 

The input file I tested with is attached.

 

rnbabar
New Contributor II

Thanks @endor_force for the detailed analysis. I have tried another sample file that does not contain Chinese characters; it is still not accepted and throws the same error. Instead I used ISO-8859-1, which accepts the input. Now the problem is what the output character set should be. I cannot see any ANSI or MS-DOS character set in the list; it only shows Windows character sets:

rnbabar_0-1718734974830.png

I am trying to get the output in the MS-DOS ANSI character set. Any thoughts?

 

endor_force
New Contributor III

I assume IBM850 would be the Latin multilingual MS-DOS charset.

I tested on my PC: when I type out a file transcoded from UTF-8 to IBM850 in a Windows DOS prompt, it looks OK. I have not verified this on older DOS versions or in DOSBox.

The transcoding will fail with an error if the file contains any character that the target charset does not support; even the euro sign (€) causes a failure. It seems the transcoding relies on the old, non-euro version of IBM850.
Other multilingual Latin charsets with euro support, such as 858 or 912, are not available for selection.
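The euro behaviour can be reproduced outside the Snap. Below is a small Python sketch, assuming that Python's `cp850` and `cp858` codecs match the IBM850 and 858 code pages mentioned above. The helper `to_dos_bytes` is hypothetical, and converting line endings to CRLF is my own assumption about what "DOS format" might also imply; it is not something the Transcoder is known to do.

```python
def to_dos_bytes(text, encoding="cp850", errors="strict"):
    """Encode `text` in a DOS code page, using CRLF line endings.

    errors="strict" raises on unsupported characters (similar to the
    failure seen in the Transcoder); errors="replace" substitutes "?".
    """
    # Normalise every line ending to LF first, then expand to CRLF.
    normalised = text.replace("\r\n", "\n").replace("\r", "\n")
    return normalised.replace("\n", "\r\n").encode(encoding, errors=errors)


text = "Price: 10 €\n"

for codec in ("cp850", "cp858"):
    try:
        print(codec, "ok:", to_dos_bytes(text, codec))
    except UnicodeEncodeError as exc:
        # exc.object[exc.start:exc.end] is the character that could not be encoded
        print(codec, "failed on:", repr(exc.object[exc.start:exc.end]))

# Lossy fallback: unsupported characters become "?"
print(to_dos_bytes(text, "cp850", errors="replace"))
```

If losing characters is acceptable for the target system, a replace-style fallback like the last line avoids the hard failure; otherwise the unsupported characters have to be substituted before transcoding.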

For verification with the tools used previously: chardet identifies an IBM850-transcoded file as Windows-1252 with 73% confidence (which is not correct), and "file -i" reports it as unknown 8-bit.
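Since detection tools can only guess at 8-bit code pages, a round-trip comparison may be a more reliable check. This is a minimal Python sketch under the assumption that Python's `cp850` codec corresponds to IBM850; the function name `roundtrip_matches` is made up, and the byte strings stand in for the contents of the real source and output files.

```python
def roundtrip_matches(source_bytes, output_bytes,
                      source_encoding="utf-8", target_encoding="cp850"):
    """Return True if the transcoded bytes decode to the same text as the source."""
    return source_bytes.decode(source_encoding) == output_bytes.decode(target_encoding)


# In practice, read both files with open(path, 'rb').read()
source = "Zürich;Ærø\n".encode("utf-8")
good_output = "Zürich;Ærø\n".encode("cp850")
wrong_output = "Zürich;Ærø\n".encode("cp1252")

print(roundtrip_matches(source, good_output))
print(roundtrip_matches(source, wrong_output))
```

Note that the comparison is exact, so it would also report a mismatch if line endings were changed during transcoding.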