Reading in an oddly shaped HTML table to R


I have an offline HTML file of journal supplementary data. Its format is one row per entry, with some columns split, for example:

organismid  geneid
org1        gene1
____________
org2        gene1
            gene2
___
org3        gene2
            gene3
            gene4

So the organismid column has 3 rows, while the geneid column has 1 row corresponding to the first row of organismid, 2 rows corresponding to the second row of organismid, and 3 rows corresponding to the third row of organismid. It looks like what you get when you split cells in a table in a document. How can I read this into R, perhaps into a better format such as a traditional R data.frame?

Edit:

I've included the HTML code for the first few entries to display how the columns of the table can have different numbers of rows. I'm not very current on HTML, but the file seems to "make room" for multiple rows in the 4th, 5th, and 6th columns by stating a rowspan at the start of each row of column 1:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<title>overview per gene</title>
</head>
<body>
<table border="1">
<tr> <th>species</th> <th>gene id</th> <th>length upstream</th> <th>motif id</th> <th>position</th> <th>strand</th> <th>match</th> </tr>
<tr> <td rowspan="2">p. infestans</td> <td rowspan="2">pitg_00002</td> <td rowspan="2">1000</td> <td>motif-0</td> <td>-574</td> <td>-</td> <td>tcagtcttacatctac</td> </tr>
<tr> <td>motif-1</td> <td>-430</td> <td>-</td> <td>gttacatgaag</td> </tr>
<tr> <td rowspan="1">p. infestans</td> <td rowspan="1">pitg_00004</td> <td rowspan="1">454</td> <td>motif-1</td> <td>-264</td> <td>+</td> <td>tacatgtaa</td> </tr>
<tr> <td rowspan="2">p. infestans</td> <td rowspan="2">pitg_00006</td> <td rowspan="2">1000</td> <td>motif-0</td> <td>-55</td> <td>+</td> <td>cattcctaatttcgcc</td> </tr>
<tr> <td>motif-1</td> <td>-326</td> <td>+</td> <td>catatatgtatgg</td> </tr>
<tr> <td rowspan="3">p. infestans</td> <td rowspan="3">pitg_00009</td> <td rowspan="3">1000</td> <td>motif-0</td> <td>-413</td> <td>-</td> <td>tcacttctctactttg</td> </tr>
<tr> <td>motif-1</td> <td>-31</td> <td>+</td> <td>tacatgtac</td> </tr>
<tr> <td>motif-3</td> <td>-271</td> <td>-</td> <td>tacttggaatttgtat</td> </tr>
<tr>

I made small corrections to the HTML code in the example (I've closed <table>, <body> and <html>) and used the XML package to read the table. I noticed that in some rows the columns are not in the right order, but that can be fixed after reading the table.
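To see where the misordered columns come from, it can help to count the cells that are physically present in each row. The following is only a diagnostic sketch, assuming the XML package is installed; the `html` string is a trimmed excerpt of the table from the question (header row left out), and the names `rows`, `cells_per_row` and `first_rowspan` are illustrative.

```r
library(XML)

# Trimmed excerpt of the question's table: one gene spanning two rows,
# followed by one gene with a single row.
html <- '<table>
<tr><td rowspan="2">p. infestans</td><td rowspan="2">pitg_00002</td><td rowspan="2">1000</td><td>motif-0</td><td>-574</td><td>-</td><td>tcagtcttacatctac</td></tr>
<tr><td>motif-1</td><td>-430</td><td>-</td><td>gttacatgaag</td></tr>
<tr><td rowspan="1">p. infestans</td><td rowspan="1">pitg_00004</td><td rowspan="1">454</td><td>motif-1</td><td>-264</td><td>+</td><td>tacatgtaa</td></tr>
</table>'

doc  <- htmlParse(html, asText = TRUE)
rows <- getNodeSet(doc, "//table//tr")

# Number of <td> cells actually written in each <tr>.
cells_per_row <- sapply(rows, function(tr) length(getNodeSet(tr, "./td")))

# rowspan declared on the first cell of each row (NA when the attribute is absent).
first_rowspan <- sapply(rows, function(tr) {
  first_cell <- getNodeSet(tr, "./td")[[1]]
  as.integer(xmlGetAttr(first_cell, "rowspan", default = NA_character_))
})

print(data.frame(cells_per_row, first_rowspan))
```

Rows that continue a rowspan from the row above carry only the four motif cells, so a reader that fills each row from the left will place those values in the first four columns.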

My proposal is below.

library(XML)

# HTML from the question, with <table>, <body> and <html> closed
a <- '<html>
<head>
<title>overview per gene</title>
</head>
<body>
<table border="1">
<tr> <th>species</th> <th>gene id</th> <th>length upstream</th> <th>motif id</th> <th>position</th> <th>strand</th> <th>match</th> </tr>
<tr> <td rowspan="2">p. infestans</td> <td rowspan="2">pitg_00002</td> <td rowspan="2">1000</td> <td>motif-0</td> <td>-574</td> <td>-</td> <td>tcagtcttacatctac</td> </tr>
<tr> <td>motif-1</td> <td>-430</td> <td>-</td> <td>gttacatgaag</td> </tr>
<tr> <td rowspan="1">p. infestans</td> <td rowspan="1">pitg_00004</td> <td rowspan="1">454</td> <td>motif-1</td> <td>-264</td> <td>+</td> <td>tacatgtaa</td> </tr>
<tr> <td rowspan="2">p. infestans</td> <td rowspan="2">pitg_00006</td> <td rowspan="2">1000</td> <td>motif-0</td> <td>-55</td> <td>+</td> <td>cattcctaatttcgcc</td> </tr>
<tr> <td>motif-1</td> <td>-326</td> <td>+</td> <td>catatatgtatgg</td> </tr>
<tr> <td rowspan="3">p. infestans</td> <td rowspan="3">pitg_00009</td> <td rowspan="3">1000</td> <td>motif-0</td> <td>-413</td> <td>-</td> <td>tcacttctctactttg</td> </tr>
<tr> <td>motif-1</td> <td>-31</td> <td>+</td> <td>tacatgtac</td> </tr>
<tr> <td>motif-3</td> <td>-271</td> <td>-</td> <td>tacttggaatttgtat</td> </tr>
</table>
</body>
</html>'

# Parse the string and read the first table in the document
doc <- htmlParse(a, asText = TRUE)
tab <- readHTMLTable(doc, which = 1)

# Rows with an empty last column are the ones whose four values
# ended up in columns 1-4 instead of columns 4-7
idx <- which(is.na(tab$match))

# Inspect the column classes, then convert every column to character
lapply(tab, class)
for (i in 1:ncol(tab)) {
  tab[, i] <- as.character(tab[, i])
}

# Move the shifted values into columns 4-7 and blank out columns 1-3
tab[idx, c(4:7)] <- tab[idx, c(1:4)]
tab[idx, c(1:3)] <- NA

And the result:

tab
       species    gene id length upstream motif id position strand            match
1 p. infestans pitg_00002            1000  motif-0     -574      - tcagtcttacatctac
2         <NA>       <NA>            <NA>  motif-1     -430      -      gttacatgaag
3 p. infestans pitg_00004             454  motif-1     -264      +        tacatgtaa
4 p. infestans pitg_00006            1000  motif-0      -55      + cattcctaatttcgcc
5         <NA>       <NA>            <NA>  motif-1     -326      +    catatatgtatgg
6 p. infestans pitg_00009            1000  motif-0     -413      - tcacttctctactttg
7         <NA>       <NA>            <NA>  motif-1      -31      +        tacatgtac
8         <NA>       <NA>            <NA>  motif-3     -271      - tacttggaatttgtat
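If a fully populated data.frame is preferred, the NA cells in the first three columns can be filled with the value from the row above, since each NA row belongs to the gene listed before it. This is a possible follow-up step rather than part of the answer above: `fill_down` is a hypothetical helper written in base R (it is not a function of the XML package), and the small `tab` here is retyped from the first three rows of the result so the sketch runs on its own.

```r
# Hypothetical helper: carry the last non-NA value forward within a vector.
fill_down <- function(x) {
  filled <- !is.na(x)
  # For each position, the index (among the non-NA values) of the most
  # recent non-NA entry; 0 means nothing has been seen yet.
  idx <- cumsum(filled)
  idx[idx == 0] <- NA
  x[filled][idx]
}

# First three rows of the result, retyped by hand for illustration.
tab <- data.frame(
  species           = c("p. infestans", NA, "p. infestans"),
  `gene id`         = c("pitg_00002", NA, "pitg_00004"),
  `length upstream` = c("1000", NA, "454"),
  `motif id`        = c("motif-0", "motif-1", "motif-1"),
  position          = c("-574", "-430", "-264"),
  strand            = c("-", "-", "+"),
  match             = c("tcagtcttacatctac", "gttacatgaag", "tacatgtaa"),
  check.names = FALSE, stringsAsFactors = FALSE
)

# Fill the gene-level columns downwards.
key_cols <- c("species", "gene id", "length upstream")
tab[key_cols] <- lapply(tab[key_cols], fill_down)

# Convert the numeric columns from character.
tab[["length upstream"]] <- as.numeric(tab[["length upstream"]])
tab$position <- as.numeric(tab$position)

print(tab)
```

After this step every row carries its own species, gene id and upstream length, which is the long format that most R modelling and plotting functions expect.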
