python - How to accumulate unique sum of columns across pandas index


I have a pandas DataFrame, df, created with:

df = pd.read_table('sorted_df_changes.txt', index_col=0, parse_dates=True, names=['date', 'rev_id', 'score'])

which is structured like this:

                     page_id     score
date
2001-05-23 19:50:14  2430        7.632989
2001-05-25 11:53:55  1814033     18.946234
2001-05-27 17:36:37  2115        3.398154
2001-08-04 21:00:51  311         19.386016
2001-08-04 21:07:42  314         14.886722

date is the index and is of type DatetimeIndex.

Each page_id may appear on one or more dates (it is not unique), and the data is large (~1 million rows). The set of pages makes up a document.

I need the score of the entire document at every time in date, counting only the latest score given for each page_id.

Example

Example data:

                     page_id     score
date
2001-05-23 19:50:14  1           3
2001-05-25 11:53:55  2           4
2001-05-27 17:36:37  1           5
2001-05-28 19:36:37  1           1

Example solution:

                     score
date
2001-05-23 19:50:14  3
2001-05-25 11:53:55  7   (3 + 4)
2001-05-27 17:36:37  9   (5 + 4)
2001-05-28 19:36:37  5   (1 + 4)

The entry for page 2 keeps being counted since it is never repeated, while each time page 1 is repeated, its new score replaces its old score.
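These semantics can be checked with a small, memory-hungry sketch (my own illustration, not from the question): pivot to one column per page, forward-fill each page's last known score, and sum across pages at every timestamp.

```python
import pandas as pd

# Example data from the question
df = pd.DataFrame(
    {"page_id": [1, 2, 1, 1], "score": [3, 4, 5, 1]},
    index=pd.to_datetime([
        "2001-05-23 19:50:14", "2001-05-25 11:53:55",
        "2001-05-27 17:36:37", "2001-05-28 19:36:37",
    ]),
)

# One column per page_id; forward-fill carries each page's latest score
wide = df.pivot_table(index=df.index, columns="page_id", values="score")
running = wide.ffill().sum(axis=1)
print(running.tolist())  # [3.0, 7.0, 9.0, 5.0]
```

This needs O(dates × pages) memory, so it is only useful for validating the expected output on small data, not for the ~1 million-row case.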

Edit:

Finally, I found a solution that doesn't need a loop:

df.score.groupby(df.page_id).transform(lambda s: s.diff().combine_first(s)).cumsum()
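A quick check of this one-liner on the example data (the setup code is mine): within each page_id group, diff() yields the change relative to that page's previous score, combine_first(s) keeps the raw score for a page's first appearance, and cumsum() turns those per-event deltas into the running document score.

```python
import pandas as pd

# Example data from the question
df = pd.DataFrame(
    {"page_id": [1, 2, 1, 1], "score": [3, 4, 5, 1]},
    index=pd.to_datetime([
        "2001-05-23 19:50:14", "2001-05-25 11:53:55",
        "2001-05-27 17:36:37", "2001-05-28 19:36:37",
    ]),
)

# Per-event delta: first appearance keeps its score, repeats contribute
# only the change from the page's previous score
deltas = df.score.groupby(df.page_id).transform(lambda s: s.diff().combine_first(s))
print(deltas.tolist())           # [3.0, 4.0, 2.0, -4.0]
print(deltas.cumsum().tolist())  # [3.0, 7.0, 9.0, 5.0]
```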

I think a loop is needed:

from io import StringIO  # Python 2: from StringIO import StringIO

import numpy as np
import pandas as pd

txt = """date,page_id,score
2001-05-23 19:50:14,1,3
2001-05-25 11:53:55,2,4
2001-05-27 17:36:37,1,5
2001-05-28 19:36:37,1,1
2001-05-28 19:36:38,3,6
2001-05-28 19:36:39,3,9
"""

df = pd.read_csv(StringIO(txt), index_col=0)

def score_sum_py(page_id, scores):
    score_sum = 0
    last_score = [0] * (np.max(page_id) + 1)
    result = np.empty_like(np.asarray(scores))
    for i, (pid, score) in enumerate(zip(page_id, scores)):
        score_sum = score_sum - last_score[pid] + score
        last_score[pid] = score
        result[i] = score_sum
    return pd.Series(result, index=scores.index, name="score_sum")

print(score_sum_py(pd.factorize(df.page_id)[0], df.score))

Output:

date
2001-05-23 19:50:14     3
2001-05-25 11:53:55     7
2001-05-27 17:36:37     9
2001-05-28 19:36:37     5
2001-05-28 19:36:38    11
2001-05-28 19:36:39    14
Name: score_sum

If the loop in Python is too slow, you can try converting the two Series page_id and scores to Python lists first, then loop over the lists and do the calculation with Python's native integers; that may be faster.

If speed is important, you can try Cython:

%%cython
cimport cython
cimport numpy as np
import numpy as np

@cython.wraparound(False)
@cython.boundscheck(False)
def score_sum(np.ndarray[int] page_id, np.ndarray[long long] scores):
    cdef int i
    cdef long long score_sum, pid, score
    cdef np.ndarray[long long] last_score, result

    score_sum = 0
    last_score = np.zeros(np.max(page_id) + 1, dtype=np.int64)
    result = np.empty_like(scores)

    for i in range(len(page_id)):
        pid = page_id[i]
        score = scores[i]
        score_sum = score_sum - last_score[pid] + score
        last_score[pid] = score
        result[i] = score_sum

    return result

Here I use pandas.factorize() to convert the page_id array to integers in the range 0 to n, where n is the count of unique elements in page_id. Alternatively, you can use a dict to cache the last_score of every page_id without pandas.factorize().
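A sketch of that dict-based alternative (the function name score_sum_dict is my own): instead of factorizing page_id into a dense integer range, each page's last score is cached in a dict keyed by the raw id.

```python
def score_sum_dict(page_ids, scores):
    # Hypothetical variant: a dict keyed by raw page_id replaces the
    # factorize() step and the dense last_score array
    last = {}   # page_id -> last score seen
    total = 0   # running document score
    out = []
    for pid, score in zip(page_ids, scores):
        total += score - last.get(pid, 0)  # new score replaces old one
        last[pid] = score
        out.append(total)
    return out

print(score_sum_dict([1, 2, 1, 1, 3, 3], [3, 4, 5, 1, 6, 9]))
# [3, 7, 9, 5, 11, 14]
```

This keeps the raw ids usable as dict keys (strings or large integers alike), at the cost of Python-level hashing on every row.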

