python - How to accumulate unique sum of columns across pandas index
I have a pandas DataFrame, df, created with:

df = pd.read_table('sorted_df_changes.txt', index_col=0, parse_dates=True, names=['date', 'rev_id', 'score'])
which is structured like this:
                     page_id      score
date
2001-05-23 19:50:14     2430   7.632989
2001-05-25 11:53:55  1814033  18.946234
2001-05-27 17:36:37     2115   3.398154
2001-08-04 21:00:51      311  19.386016
2001-08-04 21:07:42      314  14.886722
date is the index and is of type DatetimeIndex.
Every page_id may appear on one or more dates (it is not unique), and the data is large, around 1 million rows. Together the pages make up a document.
I need the score of the entire document at every time in date, counting only the latest score for each given page_id.
Example

Example data:
                     page_id  score
date
2001-05-23 19:50:14        1      3
2001-05-25 11:53:55        2      4
2001-05-27 17:36:37        1      5
2001-05-28 19:36:37        1      1
Example solution:

                     score
date
2001-05-23 19:50:14      3
2001-05-25 11:53:55      7  (3 + 4)
2001-05-27 17:36:37      9  (5 + 4)
2001-05-28 19:36:37      5  (1 + 4)
The entry for page_id 2 keeps being counted because it is never repeated; each time page_id 1 is repeated, its new score replaces its old score.
Edit:

I finally found a solution that doesn't need a loop:
df.score.groupby(df.page_id).transform(lambda s:s.diff().combine_first(s)).cumsum()
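As a sanity check, the one-liner can be run on the example data above (a minimal sketch; the DataFrame is rebuilt inline rather than read from a file). Within each page_id group, diff() gives the change from that page's previous score, combine_first(s) fills in the first occurrence with the raw score, and the running cumsum() is then the document total at each date:

```python
import pandas as pd

# Example data from the question
df = pd.DataFrame(
    {"page_id": [1, 2, 1, 1], "score": [3, 4, 5, 1]},
    index=pd.to_datetime([
        "2001-05-23 19:50:14",
        "2001-05-25 11:53:55",
        "2001-05-27 17:36:37",
        "2001-05-28 19:36:37",
    ]),
)

# Per-group diff, first value filled with the raw score, then a running sum
result = (
    df.score.groupby(df.page_id)
    .transform(lambda s: s.diff().combine_first(s))
    .cumsum()
)
print(result.tolist())  # [3.0, 7.0, 9.0, 5.0]
```

The result matches the example solution: 3, 7, 9, 5 (diff() promotes the values to float, hence the decimals).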
I think a loop is needed:
from io import StringIO

import numpy as np
import pandas as pd

txt = """date,page_id,score
2001-05-23 19:50:14, 1,3
2001-05-25 11:53:55, 2,4
2001-05-27 17:36:37, 1,5
2001-05-28 19:36:37, 1,1
2001-05-28 19:36:38, 3,6
2001-05-28 19:36:39, 3,9
"""

df = pd.read_csv(StringIO(txt), index_col=0)

def score_sum_py(page_id, scores):
    # page_id must be integers in the range 0..n-1 (see factorize below)
    score_sum = 0
    last_score = [0] * (np.max(page_id) + 1)
    result = np.empty_like(scores)
    for i, (pid, score) in enumerate(zip(page_id, scores)):
        # Replace this page's previous contribution with its new score
        score_sum = score_sum - last_score[pid] + score
        last_score[pid] = score
        result[i] = score_sum
    return pd.Series(result, index=scores.index, name="score_sum")

print(score_sum_py(pd.factorize(df.page_id)[0], df.score))
Output:

date
2001-05-23 19:50:14     3
2001-05-25 11:53:55     7
2001-05-27 17:36:37     9
2001-05-28 19:36:37     5
2001-05-28 19:36:38    11
2001-05-28 19:36:39    14
Name: score_sum
If the loop in Python is slow, you can try converting the two Series page_id and scores to Python lists first, then loop over the lists and do the calculation with Python's native integers; that may be faster.
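A sketch of that list-based variant (score_sum_list is a hypothetical name; it assumes page_id has already been factorized to integers 0..n-1, as in the code above):

```python
def score_sum_list(page_ids, scores):
    """Pure-Python loop over plain lists: page_ids are factorized ints
    (0..n-1), scores are native ints, so no NumPy scalar overhead."""
    score_sum = 0
    last_score = [0] * (max(page_ids) + 1)  # last score seen per page
    result = []
    for pid, score in zip(page_ids, scores):
        # Swap this page's old contribution for its new score
        score_sum = score_sum - last_score[pid] + score
        last_score[pid] = score
        result.append(score_sum)
    return result

# Factorized ids for pages 1, 2, 1, 1, 3, 3 -> 0, 1, 0, 0, 2, 2
print(score_sum_list([0, 1, 0, 0, 2, 2], [3, 4, 5, 1, 6, 9]))
# [3, 7, 9, 5, 11, 14]
```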
If speed is important, you can try Cython:
%%cython
cimport cython
cimport numpy as np
import numpy as np

@cython.wraparound(False)
@cython.boundscheck(False)
def score_sum(np.ndarray[int] page_id, np.ndarray[long long] scores):
    cdef int i
    cdef long long score_sum, pid, score
    cdef np.ndarray[long long] last_score, result
    score_sum = 0
    last_score = np.zeros(np.max(page_id) + 1, dtype=np.int64)
    result = np.empty_like(scores)
    for i in range(len(page_id)):
        pid = page_id[i]
        score = scores[i]
        score_sum = score_sum - last_score[pid] + score
        last_score[pid] = score
        result[i] = score_sum
    return result
Here I use pandas.factorize() to convert page_id into an array of integers in the range 0 to n, where n is the count of unique elements in page_id. Alternatively, you can use a dict to cache the last_score of every page_id without using pandas.factorize().
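A minimal sketch of that dict-based alternative (the score_sum_dict name is mine); since the dict keys on the raw page_id values, no factorize step is needed:

```python
def score_sum_dict(page_ids, scores):
    """Like score_sum_py, but keeps each page's last score in a dict,
    so raw (unfactorized) page_id values work directly as keys."""
    last_score = {}  # page_id -> most recent score
    score_sum = 0
    result = []
    for pid, score in zip(page_ids, scores):
        # .get(pid, 0): a page contributes 0 until first seen
        score_sum = score_sum - last_score.get(pid, 0) + score
        last_score[pid] = score
        result.append(score_sum)
    return result

# Raw page_ids from the answer's example data, no factorize needed
print(score_sum_dict([1, 2, 1, 1, 3, 3], [3, 4, 5, 1, 6, 9]))
# [3, 7, 9, 5, 11, 14]
```

The trade-off is a hash lookup per row instead of a list index, which is usually a bit slower but avoids the factorize pass.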