Reading in an oddly shaped HTML table to R
I have an offline HTML file of journal supplementary data. Its format is one row per entry, with some columns split into several rows, for example:

organismid   geneid
org1         gene1
------------------
org2         gene1
             gene2
------------------
org3         gene2
             gene3
             gene4

So the organismid column has 3 rows, while the geneid column has 1 row corresponding to the first row of organismid, 2 rows corresponding to the second row of organismid, and 3 rows corresponding to the third row of organismid. It looks like what you get when you split cells in a table in a document. How can I read this into R, perhaps in a better format such as a traditional R data.frame?
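To make the goal concrete, here is a minimal sketch of what a "traditional" data.frame for the toy example could look like, with the organism repeated once per gene. The name `target` is only illustrative, and the values are the placeholder ones from the example, not real data.

```r
# Hypothetical target layout: one row per organism/gene pair.
target <- data.frame(
  organismid = rep(c("org1", "org2", "org3"), times = c(1, 2, 3)),
  geneid     = c("gene1", "gene1", "gene2", "gene2", "gene3", "gene4"),
  stringsAsFactors = FALSE
)
print(target)
```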
Edit:
I've included the HTML code for the first few entries to show how the columns of the table can have different numbers of rows. I'm not very current on HTML, but it seems to "make room" for multiple rows in the 4th, 5th, and 6th columns by stating a rowspan at the start of each row of column 1:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head> <title>overview per gene</title> </head>
<body>
<table border="1">
<tr> <th>species</th> <th>gene id</th> <th>length upstream</th> <th>motif id</th> <th>position</th> <th>strand</th> <th>match</th> </tr>
<tr> <td rowspan="2">p. infestans</td> <td rowspan="2">pitg_00002</td> <td rowspan="2">1000</td> <td>motif-0</td> <td>-574</td> <td>-</td> <td>tcagtcttacatctac</td> </tr>
<tr> <td>motif-1</td> <td>-430</td> <td>-</td> <td>gttacatgaag</td> </tr>
<tr> <td rowspan="1">p. infestans</td> <td rowspan="1">pitg_00004</td> <td rowspan="1">454</td> <td>motif-1</td> <td>-264</td> <td>+</td> <td>tacatgtaa</td> </tr>
<tr> <td rowspan="2">p. infestans</td> <td rowspan="2">pitg_00006</td> <td rowspan="2">1000</td> <td>motif-0</td> <td>-55</td> <td>+</td> <td>cattcctaatttcgcc</td> </tr>
<tr> <td>motif-1</td> <td>-326</td> <td>+</td> <td>catatatgtatgg</td> </tr>
<tr> <td rowspan="3">p. infestans</td> <td rowspan="3">pitg_00009</td> <td rowspan="3">1000</td> <td>motif-0</td> <td>-413</td> <td>-</td> <td>tcacttctctactttg</td> </tr>
<tr> <td>motif-1</td> <td>-31</td> <td>+</td> <td>tacatgtac</td> </tr>
<tr> <td>motif-3</td> <td>-271</td> <td>-</td> <td>tacttggaatttgtat</td> </tr>
<tr>
I made a few small corrections to your HTML example (I closed the <table>, <body>, and <html> tags) and used the XML package to read the table. I noticed that in some cases the columns are not in the right order, but that can be fixed after reading the table.
My proposal is below.
library(XML)

# The example HTML, with the table, body and html tags closed
a <- '<html>
<head> <title>overview per gene</title> </head>
<body>
<table border="1">
<tr> <th>species</th> <th>gene id</th> <th>length upstream</th> <th>motif id</th> <th>position</th> <th>strand</th> <th>match</th> </tr>
<tr> <td rowspan="2">p. infestans</td> <td rowspan="2">pitg_00002</td> <td rowspan="2">1000</td> <td>motif-0</td> <td>-574</td> <td>-</td> <td>tcagtcttacatctac</td> </tr>
<tr> <td>motif-1</td> <td>-430</td> <td>-</td> <td>gttacatgaag</td> </tr>
<tr> <td rowspan="1">p. infestans</td> <td rowspan="1">pitg_00004</td> <td rowspan="1">454</td> <td>motif-1</td> <td>-264</td> <td>+</td> <td>tacatgtaa</td> </tr>
<tr> <td rowspan="2">p. infestans</td> <td rowspan="2">pitg_00006</td> <td rowspan="2">1000</td> <td>motif-0</td> <td>-55</td> <td>+</td> <td>cattcctaatttcgcc</td> </tr>
<tr> <td>motif-1</td> <td>-326</td> <td>+</td> <td>catatatgtatgg</td> </tr>
<tr> <td rowspan="3">p. infestans</td> <td rowspan="3">pitg_00009</td> <td rowspan="3">1000</td> <td>motif-0</td> <td>-413</td> <td>-</td> <td>tcacttctctactttg</td> </tr>
<tr> <td>motif-1</td> <td>-31</td> <td>+</td> <td>tacatgtac</td> </tr>
<tr> <td>motif-3</td> <td>-271</td> <td>-</td> <td>tacttggaatttgtat</td> </tr>
</table>
</body>
</html>'

# Parse the string as HTML and read the first table
doc <- htmlParse(a, asText = TRUE)
tab <- readHTMLTable(doc, which = 1)

# Continuation rows have only 4 cells, so their last column is NA
idx <- which(is.na(tab$match))

# Inspect the column classes, then convert every column to character
lapply(tab, class)
for (i in seq_len(ncol(tab))) {
  tab[, i] <- as.character(tab[, i])
}

# Shift the 4 cells of each continuation row into columns 4 to 7,
# then blank out the first three columns of those rows
tab[idx, c(4:7)] <- tab[idx, c(1:4)]
tab[idx, c(1:3)] <- NA
And the result:
tab
       species    gene id length upstream motif id position strand            match
1 p. infestans pitg_00002            1000  motif-0     -574      - tcagtcttacatctac
2         <NA>       <NA>            <NA>  motif-1     -430      -      gttacatgaag
3 p. infestans pitg_00004             454  motif-1     -264      +        tacatgtaa
4 p. infestans pitg_00006            1000  motif-0      -55      + cattcctaatttcgcc
5         <NA>       <NA>            <NA>  motif-1     -326      +    catatatgtatgg
6 p. infestans pitg_00009            1000  motif-0     -413      - tcacttctctactttg
7         <NA>       <NA>            <NA>  motif-1      -31      +        tacatgtac
8         <NA>       <NA>            <NA>  motif-3     -271      - tacttggaatttgtat
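The NA cells in the first three columns are the rows that were covered by a rowspan cell. If a fully populated data.frame is wanted, one possible follow-up is to carry the last non-NA value downward. The sketch below uses base R only. `fill_down` is a hypothetical helper, not part of the XML package, and the small data frame is a stand-in copied from the first five rows of the printed result so that the snippet runs on its own.

```r
# Stand-in for the first five rows of `tab` (first two columns only),
# copied from the printed result so this runs without the XML package.
tab <- data.frame(
  species   = c("p. infestans", NA, "p. infestans", "p. infestans", NA),
  `gene id` = c("pitg_00002", NA, "pitg_00004", "pitg_00006", NA),
  check.names = FALSE,
  stringsAsFactors = FALSE
)

# Hypothetical helper: replace each NA with the closest non-NA value above it.
# Leading NAs (nothing above them to copy) are left as NA.
fill_down <- function(x) {
  filled <- !is.na(x)
  idx <- cumsum(filled)  # position of the last non-NA value seen so far
  c(rep(NA, sum(idx == 0)), x[filled][idx[idx > 0]])
}

# Fill only the columns that carried a rowspan
spanned <- c("species", "gene id")
tab[spanned] <- lapply(tab[spanned], fill_down)
print(tab)
```

On the full table the same call would presumably be applied to the first three columns, for example `tab[1:3] <- lapply(tab[1:3], fill_down)`.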