Parsing: reading badly formed CSV in R (mismatched quotes)
I have hundreds of large CSV files (sizes vary from 10k to 100k lines each), and some of them have badly formed descriptions with quotes within quotes. They might look like this:
id,description,x
3434,"abc"def",988
2344,"fred",3484
2345,"fr""ed",3485
2346,"joe,fred",3486
I need to be able to cleanly parse all of these lines in R as CSV. dput()'ing it, and reading ...
txt <- c("id,description,x",
         "3434,\"abc\"def\",988",
         "2344,\"fred\",3484",
         "2345,\"fr\"\"ed\",3485",
         "2346,\"joe,fred\",3486")
read.csv(text=txt[1:4], colClasses='character')
Error in read.table(file = file, header = header, sep = sep, quote = quote, :
  incomplete final line found by readTableHeader on 'text'
If I change the quoting and do not include the last line with the embedded comma, it works well:
read.csv(text=txt[1:4], colClasses='character', quote='')
However, if I change the quoting and include the last line with the embedded comma...
read.csv(text=txt[1:5], colClasses='character', quote='')
Error in scan(file, what, nmax, sep, dec, quote, skip, nlines, na.strings, :
  line 1 did not have 4 elements
Edit x2: I should have said that, unfortunately, some of the descriptions have commas in them. The code above has been edited accordingly.
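One way to see which lines break the quote='' approach is to count the comma-separated fields per line with quoting disabled. This is only a diagnostic sketch: it uses count.fields() from the utils package, and the variable names n and bad are made up for illustration.

```r
# Sample lines from the question; the first element is the header.
txt <- c("id,description,x",
         "3434,\"abc\"def\",988",
         "2344,\"fred\",3484",
         "2345,\"fr\"\"ed\",3485",
         "2346,\"joe,fred\",3486")

# Count fields per line, treating quote characters as ordinary text.
n <- count.fields(textConnection(txt), sep = ",", quote = "")

# Lines whose field count differs from the header's contain extra commas.
bad <- which(n != n[1])
txt[bad]
```

With the sample data only the last line is flagged, because its embedded comma yields four fields instead of three.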
Change the quote setting:
read.csv(text=txt, colClasses='character', quote = "")
    id description    x
1 3434   "abc"def"  988
2 2344      "fred" 3484
3 2345    "fr""ed" 3485
4 2346       "joe" 3486
Edit to deal with the errant commas:
txt <- c("id,description,x",
         "3434,\"abc\"def\",988",
         "2344,\"fred\",3484",
         "2345,\"fr\"\"ed\",3485",
         "2346,\"joe,fred\",3486")
# Split every line on commas, then glue the middle pieces back together
# so that only the first and last comma act as separators.
txt2 <- readLines(textConnection(txt))
txt2 <- strsplit(txt2, ",")
txt2 <- lapply(txt2, function(x)
  c(x[1], paste(x[2:(length(x)-1)], collapse=","), x[length(x)]))
m <- do.call("rbind", txt2)
df <- as.data.frame(m, stringsAsFactors = FALSE)
# The first row holds the header; promote it to column names and drop it.
names(df) <- df[1,]
df <- df[-1,]
#     id description    x
# 2 3434   "abc"def"  988
# 3 2344      "fred" 3484
# 4 2345    "fr""ed" 3485
# 5 2346  "joe,fred" 3486
I have no idea whether this is sufficiently efficient for your use case.
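If speed matters, a vectorised variant of the same idea might be worth trying. The sketch below is not from the answer above: it assumes every file has exactly three columns, that only the middle column can contain commas or quotes, and that no field spans several lines. The helper name parse_three_cols is hypothetical. Unlike the strsplit() version, it also strips the outer quotes and collapses doubled quotes.

```r
# Sample lines from the question; the first element is the header.
txt <- c("id,description,x",
         "3434,\"abc\"def\",988",
         "2344,\"fred\",3484",
         "2345,\"fr\"\"ed\",3485",
         "2346,\"joe,fred\",3486")

# Hypothetical helper: treat only the first and last comma as separators.
parse_three_cols <- function(lines) {
  header <- strsplit(lines[1], ",", fixed = TRUE)[[1]]
  body   <- lines[-1]
  first  <- sub(",.*$", "", body)                     # before the first comma
  last   <- sub("^.*,", "", body)                     # after the last comma
  middle <- sub("^[^,]*,(.*),[^,]*$", "\\1", body)    # everything in between
  middle <- sub("^\"(.*)\"$", "\\1", middle)          # drop one pair of outer quotes
  middle <- gsub("\"\"", "\"", middle, fixed = TRUE)  # collapse "" to "
  df <- data.frame(first, middle, last, stringsAsFactors = FALSE)
  names(df) <- header
  df
}

df <- parse_three_cols(txt)
df$description
```

For a real file, the lines could come from readLines() on the file path instead of the txt vector. All three columns come back as character, matching the colClasses='character' used in the question.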