parsing - Handling strings with binary data in them using java.nio


I am having issues parsing text files that have illegal characters (binary markers) in them. An example follows:

test.csv

^000000^id1,text1,text2,text3 

Here ^000000^ is a textual representation of the illegal characters in the source file.

I am thinking of using java.nio to validate each line before I process it. So I was thinking of introducing a Validator trait as follows:

    import java.nio.charset._

    trait Validator {
      // A fresh encoder per call, since CharsetEncoder is not thread-safe
      private def encoder = Charset.forName("UTF-8").newEncoder

      def isValidEncoding(line: String): Boolean = {
        encoder.canEncode(line)
      }
    }

Do you think this is the correct approach to handle this situation?

Thanks

It is too late by the time you have a String: UTF-8 can encode any string*. You need to go back to the point where you decode the file initially.
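To make this concrete, here is a minimal sketch that contrasts the check from the question with a check done at decode time. The object and method names (`StrictDecoding`, `canEncodeUtf8`, `decodeUtf8Strict`) are made up for illustration; only the `java.nio.charset` calls are real API.

```scala
import java.nio.ByteBuffer
import java.nio.charset.{CharacterCodingException, CodingErrorAction, StandardCharsets}

// Hypothetical helper object, not part of the original question or answer.
object StrictDecoding {
  // The check from the question: asks whether UTF-8 can encode the String.
  def canEncodeUtf8(line: String): Boolean =
    StandardCharsets.UTF_8.newEncoder().canEncode(line)

  // Validation at the decoding step: Some(text) if the bytes are
  // well-formed UTF-8, None if the decoder reports malformed input.
  def decodeUtf8Strict(bytes: Array[Byte]): Option[String] = {
    val decoder = StandardCharsets.UTF_8.newDecoder()
      .onMalformedInput(CodingErrorAction.REPORT)
      .onUnmappableCharacter(CodingErrorAction.REPORT)
    try Some(decoder.decode(ByteBuffer.wrap(bytes)).toString)
    catch { case _: CharacterCodingException => None }
  }
}
```

With this, `canEncodeUtf8` accepts a line that starts with NUL characters, because NUL is a perfectly encodable character. Note also that strict decoding only rejects byte sequences that are malformed UTF-8 (for example a stray 0xFF); low control bytes such as 0x00 are valid UTF-8 and still have to be filtered out separately.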


The ISO-8859-1 encoding has some interesting properties:

  • Literally any byte sequence is valid ISO-8859-1
  • The code point of each decoded character is the same as the value of the byte it was decoded from
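Both properties can be checked with a few lines of Scala. This is only a sketch; the object name `Latin1RoundTrip` and its members are hypothetical.

```scala
import java.nio.charset.StandardCharsets

// Hypothetical demo object, not part of the original answer.
object Latin1RoundTrip {
  // Every possible byte value, 0x00 through 0xFF.
  val allBytes: Array[Byte] = (0 to 255).map(_.toByte).toArray

  // Decoding never fails: each byte becomes exactly one character.
  val decoded: String = new String(allBytes, StandardCharsets.ISO_8859_1)

  // True if the i-th character has code point i, i.e. the byte's unsigned value.
  def codePointsMatchBytes: Boolean =
    decoded.length == 256 &&
      decoded.zipWithIndex.forall { case (c, i) => c.toInt == i }

  // True if encoding back to ISO-8859-1 yields the original bytes.
  def roundTrips: Boolean =
    decoded.getBytes(StandardCharsets.ISO_8859_1).sameElements(allBytes)
}
```

Because the mapping is one-to-one, a character range such as \u0000-\u0019 in the decoded string corresponds directly to the byte range 0x00-0x19 in the file.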

So you could decode the file as ISO-8859-1 and just strip the non-English characters:

    // pseudo code
    str = file.decode("iso-8859-1");
    str = str.replace("[\u0000-\u0019\u007f-\u00ff]", "");
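A runnable Scala version of that pseudo code might look like the following. It is a sketch under assumptions: the object name `StripBinary` and its methods are hypothetical, and the character class is copied unchanged from the pseudo code.

```scala
import java.nio.charset.StandardCharsets

// Hypothetical helper object illustrating the pseudo code above.
object StripBinary {
  // Character class taken verbatim from the answer; the doubled
  // backslashes leave the \uXXXX escapes for the regex engine to interpret.
  private val Unwanted = "[\\u0000-\\u0019\\u007f-\\u00ff]"

  // Decoding as ISO-8859-1 cannot fail, whatever the bytes are.
  def decodeLatin1(bytes: Array[Byte]): String =
    new String(bytes, StandardCharsets.ISO_8859_1)

  // Decode, then drop every character matched by the class.
  def clean(bytes: Array[Byte]): String =
    decodeLatin1(bytes).replaceAll(Unwanted, "")
}
```

The bytes could come from something like `java.nio.file.Files.readAllBytes(Paths.get("test.csv"))`. One caveat: the class as written also covers tab (U+0009) and the line terminators (U+000A, U+000D), so applying `clean` to a whole file would join its lines; it is probably safer to apply it to one line at a time.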

You can also iterate line by line and ignore each line that contains a character in [\u0000-\u0019\u007f-\u00ff], if that's what you mean by validating a line before processing it.
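The line-by-line variant could be sketched like this. The names `LineFilter`, `isCleanLine` and `cleanLines` are hypothetical, and the sketch assumes lines are separated by LF or CRLF.

```scala
import java.nio.charset.StandardCharsets
import java.util.regex.Pattern

// Hypothetical helper object for validating lines before processing.
object LineFilter {
  // Same character class as in the answer, compiled once.
  private val Unwanted = Pattern.compile("[\\u0000-\\u0019\\u007f-\\u00ff]")

  // A line is clean if it contains no character from the class.
  def isCleanLine(line: String): Boolean =
    !Unwanted.matcher(line).find()

  // Decode as ISO-8859-1, split into lines, keep only the clean ones.
  def cleanLines(bytes: Array[Byte]): List[String] =
    new String(bytes, StandardCharsets.ISO_8859_1)
      .split("\r?\n")
      .toList
      .filter(isCleanLine)
}
```

Since the class includes U+0009, a line containing a tab is also rejected; narrow the range if tabs are legitimate in your data.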


It also occurred to me that the binary marker could be a BOM. You can use a hex editor to view the values.

*Except strings with illegal surrogates, which is probably not the case here.
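If a hex editor is not at hand, the leading bytes can also be inspected from code. The following is a rough sketch with hypothetical names (`BomSniffer`, `detectBom`, `hexPrefix`); it only knows the UTF-8 and UTF-16 byte order marks.

```scala
// Hypothetical helper object for inspecting the start of a file.
object BomSniffer {
  // Returns the encoding suggested by a leading BOM, if any.
  // Only UTF-8 and UTF-16 are checked; a UTF-32LE BOM (FF FE 00 00)
  // would be reported as UTF-16LE by this simple sketch.
  def detectBom(bytes: Array[Byte]): Option[String] =
    bytes.take(3).map(_ & 0xFF).toList match {
      case 0xEF :: 0xBB :: 0xBF :: _ => Some("UTF-8")
      case 0xFE :: 0xFF :: _         => Some("UTF-16BE")
      case 0xFF :: 0xFE :: _         => Some("UTF-16LE")
      case _                         => None
    }

  // Hex dump of the first n bytes, e.g. "ef bb bf 69".
  def hexPrefix(bytes: Array[Byte], n: Int = 16): String =
    bytes.take(n).map(b => f"${b & 0xFF}%02x").mkString(" ")
}
```

Printing `hexPrefix` for the first line of the file shows what the marker actually consists of, which tells you whether it is a BOM or some other binary prefix.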

