r - Joining and grouping two data tables -


suppose i've got following 2 data tables :

dt1 <- data.table(id=1:3,val1=c("a","a","b"),key="id") #    id val1 # 1:  1    # 2:  2    # 3:  3    b   dt2 <- data.table(id=c(1:3,1:2),val2=10:14,key="id") #    id val2 # 1:  1   10 # 2:  1   13 # 3:  2   11 # 4:  2   14 # 5:  3   12 

let's dt1 list of people identified id, , dt2 list of observations on these same people, correspondent id.

now, i'd compute mean of val2 each group of val1. i've understood can following way :

dt1[dt2][,mean(val2),by=val1] #    val1 v1 # 1:    12 # 2:    b 12 

but i've read in faq (section 1.14) it's not efficient (at least large data tables).

so, there better, more efficient way ?

edit : related question : saw following 2 lines give same result :

dt1[dt2][,mean(val2),by=val1] dt2[dt1][,mean(val2),by=val1] 

are equivalent or there difference between 2 ?

in case, it's okay that. documentation explains, iiuc, example in scenario (where not grouping/aggregating on columns):

dt1 <- data.table(id=1:3,val1=c("a","a","b"),key="id") dt2 <- data.table(id=c(1:3,1:2),val2=10:14,key="id")  dt2[, val3 := rep(5:7, c(2,1,2))] #    id val2 val3 # 1:  1   10    5 # 2:  1   13    5 # 3:  2   11    6 # 4:  2   14    7 # 5:  3   12    7 

now, suppose want mean of val2 alone every val1, there's no point in joining columns. in case, can do:

dt1[dt2, list(val1, val2)][, mean(val2), by=val1] #    val1 v1 # 1:    12 # 2:    b 12 

instead of doing:

# gives same result performs join on columns dt1[dt2][, mean(val2), by=val1] 

for second question, guess it's essential understand difference dt1[dt2] , dt2[dt1]. this, data not best example. suppose that,

dt1 <- data.table(id=c(1,4,5), val1=c("a","a","b")) dt2 <- data.table(id=c(1,2,3,6,7,8), val2=c(6,5,3,4,2,1))  setkey(dt1, "id") setkey(dt2, "id") 

dt1[dt2] takes, every id in dt2 , fetches corresponding value of other columns in dt1 perform join:

dt1[dt2] #    id val1 val2 # 1:  1       6 # 2:  2   na    5 # 3:  3   na    3 # 4:  6   na    4 # 5:  7   na    2 # 6:  8   na    1 

dt2[dt1] takes every id in dt1 corresponding value other columns of dt2 perform join:

dt2[dt1] #    id val2 val1 # 1:  1    6    # 2:  4   na    # 3:  5   na    b 

note that, values in dt1[dt2] contains only id of dt2. dt2[dt1] contains in dt1. in case, since ids same (ignorning amount of times occur), both join give identical joins (except order of columns), iiuc.


just make part complete, if want "full" join, use merge all=true. merge.data.table method implemented.

merge(dt1, dt2, = true)  merge(dt1, dt2, all.x = true)  # equivalent dt2[dt1]  merge(dt1, dt2, all.y = true) # equivalent dt1[dt2] 

Comments

Popular posts from this blog

monitor web browser programmatically in Android? -

Shrink a YouTube video to responsive width -

wpf - PdfWriter.GetInstance throws System.NullReferenceException -