Multicriteria increment with reset in R -
i have large (nearly 6m rows) dataframe called df following structure:
codecontract relmonth amtpmt a0001 10 0.00 a0001 11 15.00 a0002 12 4.55 a0003 4 0.00 ... ... ...
relmonth
defined number of months since particular static event associated codecontract
.
this data sorted codecontract
, relmonth
. data frame holds continuous relmonth
; i.e. given codecontract
interim relmonth
are populated such given codecontract
if have min relmonth=5
, max relmonth=12
, data frame include relmonths 5:12
.
i calculate column called mths_since_last_pmt
count given codecontract
how many relmonths
has been since given codecontract
had amtpmt > amt_threshold
.
it work (assuming amt_threshold=5
)
codecontract relmonth amtpmt mths_since_last_pmt a0001 1 0.00 1 a0001 2 3.00 2 a0001 3 0.00 3 a0001 4 10.00 0 a0001 5 0.00 1 a0002 1 10.00 0 a0002 2 12.00 0 a0002 3 0.00 1 a0002 4 0.00 2
i have working solution uses for
loop can process around 5,000 rows/sec.
i looking way vectorize calculation , maybe without sorting data first, or having uninterrupted relmonths
.
all vectorized solutions have tried develop, typically ddply
call seq_along
, end maxing out ram (24gb). looking solution run in under 2gb of ram usage. maybe solution in form of custom function?
any idea how make work?
update @roland
@roland
i have found different dataset result in erroneous output code below. tweaked input is:
df <- read.table(text="codecontract relmonth amtpmt mths_since_last_pmt a0001 1 0.00 1 a0001 2 3.00 2 a0001 3 0.00 3 a0001 4 10.00 0 a0001 5 0.00 1 a0002 1 1.00 0 a0002 2 14.00 0 a0002 3 14.00 1 a0002 4 14.00 2",header=true)
the corresponding output is:
codecontract relmonth amtpmt mths_since_last_pmt mths_since_last_pmt2 1: a0001 1 0 1 1 2: a0001 2 3 2 2 3: a0001 3 0 3 3 4: a0001 4 10 0 0 5: a0001 5 0 1 1 6: a0002 1 1 0 1 7: a0002 2 14 0 0 8: a0002 3 14 1 -1 9: a0002 4 14 2 -2
the negative numbers -1
, -2
in mths_since_last_pmt2
in last rows incorrect; should both 0
threshold has been exceed. seems algorithm fails when first item sub-group (here codecontract
change) below threshold enough throw off.
is there tweak can apply make work?
try this:
df <- read.table(text="codecontract relmonth amtpmt mths_since_last_pmt a0001 1 0.00 1 a0001 2 3.00 2 a0001 3 0.00 3 a0001 4 10.00 0 a0001 5 0.00 1 a0002 1 10.00 0 a0002 2 12.00 0 a0002 3 0.00 1 a0002 4 0.00 2",header=true) library(data.table) dt <- data.table(df,key=c("codecontract","relmonth")) trsh <- 5 dt[,mths_since_last_pmt2 := cumsum(amtpmt<=trsh)-cumsum(cumsum(amtpmt<=trsh)*(amtpmt>trsh)), by=codecontract] # codecontract relmonth amtpmt mths_since_last_pmt mths_since_last_pmt2 # 1: a0001 1 0 1 1 # 2: a0001 2 3 2 2 # 3: a0001 3 0 3 3 # 4: a0001 4 10 0 0 # 5: a0001 5 0 1 1 # 6: a0002 1 10 0 0 # 7: a0002 2 12 0 0 # 8: a0002 3 0 1 1 # 9: a0002 4 0 2 2
hopefully data.table's assignment reference keep under ram limit.
Comments
Post a Comment