java - Algorithm for counting common group memberships with big data -
i need write program counts number of times 2 users in same group. users given username , groups id. example, input (stored in text-file):
john 32 john 21 jim 21 jim 32 bob 32
i want result:
john-jim 2 john-bob 1 jim-bob 1
this sounds trivial. problem is: have 1,8 million groups , 300,000 users. , lot of memberships (i'm expecting @ least average of 50 per user, possibly more). means huge amount of data , processing.
i've written 5 different programs doing this, none of has been able cut amount of data: slow postgresql query. memory consuming running in map in java working memory (first heap space, after optimization got rare "gc overhead limit exceeded"). slow write continuously database java (even when optimized using batch-queries). growing increasingly desperate, i've tried more exotic things, writing pairs array, sorting them (o(n log (n))) , counting them peu à peu. still way data store in memory.
any ideas on algorithm doing this? or impossible?
an rdbms specialized in operations sorting. doing outside db hardly ever come close in performance. sql!
this job (simplified in update):
select t1.usr || '-' || t2.usr, count(*) ct usr_grp t1 join usr_grp t2 using (grp_id) t2.usr > t1.usr -- prevent dupes , sorted pair group t1.usr, t2.usr;
depending on how many overlaps have, potentially produces huge amount of rows, said. never going fast.
raises question: what's purpose of producing millions of rows nobody can process? sure, operation makes sense begin with?
to make faster, ..
- upgrade! postgresql 8.4 rather outdated now. in particular, postgresql 9.2 had focus on big data. can expect much better performance job this.
, nobody should running 8.4.0. security reasons alone, missing out on lot of bug-fixes, too. current point-release 8.4.17. quote linked web-site:
we recommend users run latest available minor release whatever major version in use.
- use
integer
surrogate key users, deal integers inusr_grp
. makes table , indexes smaller , processing faster. if n:m table (usr_grp
) has bigger cardinality tableusr
, should faster, if means additional joins.
select u1.usr || '-' || u2.usr, count(*) ct usr_grp t1 join usr_grp t2 using (grp_id) join usr u1 on t1.usr_id = u1.usr_id join usr u2 on t2.usr_id = u2.usr_id t2.usr_id > t1.usr_id group u1.usr_id, u2.usr_id;
- create multi-column index (if don't have yet).
grp_id
must come first. why matter?
create index usr_grp_gu_idx on usr_grp(grp_id, usr_id);
- put a lot of ram machine , increase settings
work_mem
,shared_buffers
.
test case
i took numbers @oldcurmudgeon reported test case , created comparable test case in postgresql.
~ 250 ms in public test database.
result not ordered (no order by
) since hasn't been specified.
compared 2.5 minutes, reported below. factor 600.
Comments
Post a Comment