parsing - Handling strings with binary data in them using java.nio
I am having issues parsing text files that have illegal characters (binary markers) in them. An example follows:
test.csv
^000000^id1,text1,text2,text3
Here ^000000^ is a textual representation of the illegal characters in the source file.
I am thinking of using java.nio to validate each line before I process it. So I was thinking of introducing a Validator trait as follows:
    import java.nio.charset._

    trait Validator {
      // A fresh encoder per call, since CharsetEncoder is not thread-safe
      private def encoder = Charset.forName("UTF-8").newEncoder

      def isValidEncoding(line: String): Boolean =
        encoder.canEncode(line)
    }

Do you think this is the correct approach to handle the situation?
Thanks
It is too late once you already have a String: UTF-8 can always encode any string*. You need to go back to the point where you are decoding the file initially.
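To illustrate the point, here is a minimal sketch (the object and method names are made up for this example). It shows that `canEncode` accepts a string full of control characters, while a strict decoder applied to the raw bytes can reject input that is not valid UTF-8. It assumes the file is meant to be UTF-8. Note that bytes such as 0x00 are valid UTF-8, so strict decoding alone would not flag them.

```scala
import java.nio.ByteBuffer
import java.nio.charset.{CharacterCodingException, CodingErrorAction, StandardCharsets}

object DecodeCheck {
  // True for any String without unpaired surrogates, control characters included.
  def canEncodeUtf8(s: String): Boolean =
    StandardCharsets.UTF_8.newEncoder().canEncode(s)

  // Strict decoding of raw bytes: report malformed input instead of
  // silently replacing it with U+FFFD.
  def isValidUtf8(bytes: Array[Byte]): Boolean = {
    val decoder = StandardCharsets.UTF_8.newDecoder()
      .onMalformedInput(CodingErrorAction.REPORT)
      .onUnmappableCharacter(CodingErrorAction.REPORT)
    try { decoder.decode(ByteBuffer.wrap(bytes)); true }
    catch { case _: CharacterCodingException => false }
  }
}
```

So the check has to happen on the bytes, before or while they become a String, not afterwards.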
ISO-8859-1 is an encoding with interesting properties:
- Literally any byte sequence is valid ISO-8859-1
- The code point of each decoded character is exactly the same as the value of the byte it was decoded from
So you could decode the file as ISO-8859-1 and just strip the non-English characters:
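Both properties can be checked directly. A small sketch (`Latin1Demo` is just a name chosen for this example): it decodes all 256 byte values and compares each character with its source byte.

```scala
import java.nio.charset.StandardCharsets

object Latin1Demo {
  // All 256 possible byte values, 0x00 through 0xFF.
  val allBytes: Array[Byte] = Array.tabulate(256)(_.toByte)

  // Decoding never fails and yields exactly one char per byte.
  val decoded: String = new String(allBytes, StandardCharsets.ISO_8859_1)

  // Each char's code point equals the unsigned value of its source byte.
  val codePointsMatch: Boolean =
    decoded.indices.forall(i => decoded.charAt(i).toInt == (allBytes(i) & 0xFF))
}
```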
    // pseudo code
    str = file.decode("iso-8859-1");
    str = str.replace("[\u0000-\u0019\u007f-\u00ff]", "");
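In Scala the pseudo code could look roughly like the sketch below. The names `Latin1Strip`, `stripLines` and `stripFile` are hypothetical. It keeps the character class exactly as written above; note that this class also covers tab, CR and LF, so the text is split into lines before stripping. If you want every ASCII control character gone, you may want to widen the first range to \u001f, but that is an assumption on my part.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Paths}

object Latin1Strip {
  // Character class from the pseudo code: low control range, DEL, and everything above ASCII.
  private val Unwanted = "[\\u0000-\\u0019\\u007f-\\u00ff]"

  // Remove every unwanted character from a single line.
  def strip(line: String): String = line.replaceAll(Unwanted, "")

  // Decode as ISO-8859-1 (never fails), split into lines, then strip each line.
  def stripLines(bytes: Array[Byte]): Seq[String] =
    new String(bytes, StandardCharsets.ISO_8859_1).split("\r?\n").toSeq.map(strip)

  // Convenience wrapper that reads the whole file into memory first.
  def stripFile(path: String): Seq[String] =
    stripLines(Files.readAllBytes(Paths.get(path)))
}
```

For example, `Latin1Strip.stripFile("test.csv")` would return the lines of the sample file with the marker bytes removed, provided those bytes fall inside the character class.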
You can also iterate line by line and just ignore each line that contains a character in [\u0000-\u0019\u007f-\u00ff], if that's what you mean by validating a line before processing it.
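A sketch of that line-by-line variant, which drops bad lines instead of repairing them (`LineFilter` and its methods are names invented for this example):

```scala
object LineFilter {
  // Same character class as above, compiled once as a regex.
  private val Unwanted = "[\\u0000-\\u0019\\u007f-\\u00ff]".r

  // A line is clean when it contains none of the unwanted characters.
  def isClean(line: String): Boolean = Unwanted.findFirstIn(line).isEmpty

  // Lazily keep only the clean lines.
  def cleanLines(lines: Iterator[String]): Iterator[String] = lines.filter(isClean)
}
```

It could be fed from `scala.io.Source.fromFile("test.csv", "ISO-8859-1").getLines()`. Keep in mind that the class includes the tab character, so tab-separated lines would be rejected as well.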
It also occurred to me that the binary marker could be a BOM. You can use a hex editor to view the values.
*Except those with illegal surrogates, which is probably not the case here.
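If you would rather check from code than from a hex editor, something like the following sketch prints the first bytes and compares them with the standard Unicode byte order marks. `BomCheck` is a made-up name, and the result is only a hint: FF FE 00 00 could also be a UTF-16LE BOM followed by a NUL character.

```scala
object BomCheck {
  // Hex dump of the first n bytes, two lowercase hex digits per byte.
  def hexPrefix(bytes: Array[Byte], n: Int = 8): String =
    bytes.take(n).map(b => f"${b & 0xFF}%02x").mkString(" ")

  // Compare the leading bytes with the known BOM signatures.
  // The UTF-32 cases come first because UTF-32LE starts with the UTF-16LE BOM.
  def detectBom(bytes: Array[Byte]): Option[String] =
    bytes.take(4).map(_ & 0xFF).toList match {
      case 0xEF :: 0xBB :: 0xBF :: _         => Some("UTF-8")
      case 0xFF :: 0xFE :: 0x00 :: 0x00 :: _ => Some("UTF-32LE")
      case 0x00 :: 0x00 :: 0xFE :: 0xFF :: _ => Some("UTF-32BE")
      case 0xFF :: 0xFE :: _                 => Some("UTF-16LE")
      case 0xFE :: 0xFF :: _                 => Some("UTF-16BE")
      case _                                 => None
    }
}
```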