Multicriteria increment with reset in R -
i have large (nearly 6m rows) dataframe called df following structure:
codecontract relmonth amtpmt a0001 10 0.00 a0001 11 15.00 a0002 12 4.55 a0003 4 0.00 ... ... ... relmonth defined number of months since particular static event associated codecontract.
this data sorted codecontract , relmonth. data frame holds continuous relmonth; i.e. given codecontract interim relmonthare populated such given codecontract if have min relmonth=5 , max relmonth=12, data frame include relmonths 5:12.
i calculate column called mths_since_last_pmt count given codecontracthow many relmonths has been since given codecontract had amtpmt > amt_threshold.
it work (assuming amt_threshold=5)
codecontract relmonth amtpmt mths_since_last_pmt a0001 1 0.00 1 a0001 2 3.00 2 a0001 3 0.00 3 a0001 4 10.00 0 a0001 5 0.00 1 a0002 1 10.00 0 a0002 2 12.00 0 a0002 3 0.00 1 a0002 4 0.00 2 i have working solution uses for loop can process around 5,000 rows/sec.
i looking way vectorize calculation , maybe without sorting data first, or having uninterrupted relmonths.
all vectorized solutions have tried develop, typically ddply call seq_along, end maxing out ram (24gb). looking solution run in under 2gb of ram usage. maybe solution in form of custom function?
any idea how make work?
update @roland
@roland
i have found different dataset result in erroneous output code below. tweaked input is:
df <- read.table(text="codecontract relmonth amtpmt mths_since_last_pmt a0001 1 0.00 1 a0001 2 3.00 2 a0001 3 0.00 3 a0001 4 10.00 0 a0001 5 0.00 1 a0002 1 1.00 0 a0002 2 14.00 0 a0002 3 14.00 1 a0002 4 14.00 2",header=true) the corresponding output is:
codecontract relmonth amtpmt mths_since_last_pmt mths_since_last_pmt2 1: a0001 1 0 1 1 2: a0001 2 3 2 2 3: a0001 3 0 3 3 4: a0001 4 10 0 0 5: a0001 5 0 1 1 6: a0002 1 1 0 1 7: a0002 2 14 0 0 8: a0002 3 14 1 -1 9: a0002 4 14 2 -2 the negative numbers -1 , -2 in mths_since_last_pmt2 in last rows incorrect; should both 0 threshold has been exceed. seems algorithm fails when first item sub-group (here codecontract change) below threshold enough throw off.
is there tweak can apply make work?
try this:
df <- read.table(text="codecontract relmonth amtpmt mths_since_last_pmt a0001 1 0.00 1 a0001 2 3.00 2 a0001 3 0.00 3 a0001 4 10.00 0 a0001 5 0.00 1 a0002 1 10.00 0 a0002 2 12.00 0 a0002 3 0.00 1 a0002 4 0.00 2",header=true) library(data.table) dt <- data.table(df,key=c("codecontract","relmonth")) trsh <- 5 dt[,mths_since_last_pmt2 := cumsum(amtpmt<=trsh)-cumsum(cumsum(amtpmt<=trsh)*(amtpmt>trsh)), by=codecontract] # codecontract relmonth amtpmt mths_since_last_pmt mths_since_last_pmt2 # 1: a0001 1 0 1 1 # 2: a0001 2 3 2 2 # 3: a0001 3 0 3 3 # 4: a0001 4 10 0 0 # 5: a0001 5 0 1 1 # 6: a0002 1 10 0 0 # 7: a0002 2 12 0 0 # 8: a0002 3 0 1 1 # 9: a0002 4 0 2 2 hopefully data.table's assignment reference keep under ram limit.
Comments
Post a Comment