r - Joining and grouping two data tables -
suppose i've got following 2 data tables :
dt1 <- data.table(id=1:3,val1=c("a","a","b"),key="id") # id val1 # 1: 1 # 2: 2 # 3: 3 b dt2 <- data.table(id=c(1:3,1:2),val2=10:14,key="id") # id val2 # 1: 1 10 # 2: 1 13 # 3: 2 11 # 4: 2 14 # 5: 3 12
let's dt1
list of people identified id
, , dt2
list of observations on these same people, correspondent id
.
now, i'd compute mean of val2
each group of val1
. i've understood can following way :
dt1[dt2][,mean(val2),by=val1] # val1 v1 # 1: 12 # 2: b 12
but i've read in faq (section 1.14) it's not efficient (at least large data tables).
so, there better, more efficient way ?
edit : related question : saw following 2 lines give same result :
dt1[dt2][,mean(val2),by=val1] dt2[dt1][,mean(val2),by=val1]
are equivalent or there difference between 2 ?
in case, it's okay that. documentation explains, iiuc, example in scenario (where not grouping/aggregating on columns):
dt1 <- data.table(id=1:3,val1=c("a","a","b"),key="id") dt2 <- data.table(id=c(1:3,1:2),val2=10:14,key="id") dt2[, val3 := rep(5:7, c(2,1,2))] # id val2 val3 # 1: 1 10 5 # 2: 1 13 5 # 3: 2 11 6 # 4: 2 14 7 # 5: 3 12 7
now, suppose want mean of val2
alone every val1
, there's no point in joining columns. in case, can do:
dt1[dt2, list(val1, val2)][, mean(val2), by=val1] # val1 v1 # 1: 12 # 2: b 12
instead of doing:
# gives same result performs join on columns dt1[dt2][, mean(val2), by=val1]
for second question, guess it's essential understand difference dt1[dt2]
, dt2[dt1]
. this, data not best example. suppose that,
dt1 <- data.table(id=c(1,4,5), val1=c("a","a","b")) dt2 <- data.table(id=c(1,2,3,6,7,8), val2=c(6,5,3,4,2,1)) setkey(dt1, "id") setkey(dt2, "id")
dt1[dt2]
takes, every id
in dt2
, fetches corresponding value of other columns in dt1
perform join:
dt1[dt2] # id val1 val2 # 1: 1 6 # 2: 2 na 5 # 3: 3 na 3 # 4: 6 na 4 # 5: 7 na 2 # 6: 8 na 1
dt2[dt1]
takes every id
in dt1
corresponding value other columns of dt2
perform join:
dt2[dt1] # id val2 val1 # 1: 1 6 # 2: 4 na # 3: 5 na b
note that, values in dt1[dt2]
contains only id of dt2
. dt2[dt1]
contains in dt1
. in case, since id
s same (ignorning amount of times occur), both join give identical joins (except order of columns), iiuc.
just make part complete, if want "full" join, use merge
all=true
. merge.data.table
method implemented.
merge(dt1, dt2, = true) merge(dt1, dt2, all.x = true) # equivalent dt2[dt1] merge(dt1, dt2, all.y = true) # equivalent dt1[dt2]
Comments
Post a Comment