Multicriteria increment with reset in R -

i have large (nearly 6m rows) dataframe called df following structure:

codecontract    relmonth    amtpmt a0001           10          0.00 a0001           11          15.00 a0002           12          4.55 a0003           4           0.00 ...             ...         ...

relmonth defined number of months since particular static event associated codecontract.

this data sorted codecontract , relmonth. data frame holds continuous relmonth; i.e. given codecontract interim relmonthare populated such given codecontract if have min relmonth=5 , max relmonth=12, data frame include relmonths 5:12.

i calculate column called mths_since_last_pmt count given codecontracthow many relmonths has been since given codecontract had amtpmt > amt_threshold.

it work (assuming amt_threshold=5)

codecontract    relmonth    amtpmt  mths_since_last_pmt a0001           1           0.00    1 a0001           2           3.00    2 a0001           3           0.00    3 a0001           4           10.00   0 a0001           5           0.00    1 a0002           1           10.00   0 a0002           2           12.00   0 a0002           3           0.00    1 a0002           4           0.00    2

i have working solution uses for loop can process around 5,000 rows/sec.

i looking way vectorize calculation , maybe without sorting data first, or having uninterrupted relmonths.

all vectorized solutions have tried develop, typically ddply call seq_along, end maxing out ram (24gb). looking solution run in under 2gb of ram usage. maybe solution in form of custom function?

any idea how make work?

update @roland

@roland

i have found different dataset result in erroneous output code below. tweaked input is:

df <- read.table(text="codecontract    relmonth    amtpmt  mths_since_last_pmt a0001           1           0.00    1 a0001           2           3.00    2 a0001           3           0.00    3 a0001           4           10.00   0 a0001           5           0.00    1 a0002           1           1.00   0 a0002           2           14.00   0 a0002           3           14.00    1 a0002           4           14.00    2",header=true)

the corresponding output is:

codecontract relmonth amtpmt mths_since_last_pmt mths_since_last_pmt2 1:        a0001        1      0                   1                    1 2:        a0001        2      3                   2                    2 3:        a0001        3      0                   3                    3 4:        a0001        4     10                   0                    0 5:        a0001        5      0                   1                    1 6:        a0002        1      1                   0                    1 7:        a0002        2     14                   0                    0 8:        a0002        3     14                   1                   -1 9:        a0002        4     14                   2                   -2

the negative numbers -1 , -2 in mths_since_last_pmt2 in last rows incorrect; should both 0 threshold has been exceed. seems algorithm fails when first item sub-group (here codecontract change) below threshold enough throw off.

is there tweak can apply make work?

try this:

df <- read.table(text="codecontract    relmonth    amtpmt  mths_since_last_pmt a0001           1           0.00    1 a0001           2           3.00    2 a0001           3           0.00    3 a0001           4           10.00   0 a0001           5           0.00    1 a0002           1           10.00   0 a0002           2           12.00   0 a0002           3           0.00    1 a0002           4           0.00    2",header=true)  library(data.table)  dt <- data.table(df,key=c("codecontract","relmonth"))  trsh <- 5 dt[,mths_since_last_pmt2 :=         cumsum(amtpmt<=trsh)-cumsum(cumsum(amtpmt<=trsh)*(amtpmt>trsh)),             by=codecontract]  #    codecontract relmonth amtpmt mths_since_last_pmt mths_since_last_pmt2 # 1:        a0001        1      0                   1                    1 # 2:        a0001        2      3                   2                    2 # 3:        a0001        3      0                   3                    3 # 4:        a0001        4     10                   0                    0 # 5:        a0001        5      0                   1                    1 # 6:        a0002        1     10                   0                    0 # 7:        a0002        2     12                   0                    0 # 8:        a0002        3      0                   1                    1 # 9:        a0002        4      0                   2                    2

hopefully data.table's assignment reference keep under ram limit.

Search This Blog

Bready

Multicriteria increment with reset in R -

Comments

Post a Comment

Popular posts from this blog

monitor web browser programmatically in Android? -

Shrink a YouTube video to responsive width -

c# - Using multiple datasets in RDLC -