java - Algorithm for counting common group memberships with big data -


i need write program counts number of times 2 users in same group. users given username , groups id. example, input (stored in text-file):

john 32 john 21 jim 21 jim 32 bob 32 

i want result:

john-jim 2  john-bob 1 jim-bob 1 

this sounds trivial. problem is: have 1,8 million groups , 300,000 users. , lot of memberships (i'm expecting @ least average of 50 per user, possibly more). means huge amount of data , processing.

i've written 5 different programs doing this, none of has been able cut amount of data: slow postgresql query. memory consuming running in map in java working memory (first heap space, after optimization got rare "gc overhead limit exceeded"). slow write continuously database java (even when optimized using batch-queries). growing increasingly desperate, i've tried more exotic things, writing pairs array, sorting them (o(n log (n))) , counting them peu à peu. still way data store in memory.

any ideas on algorithm doing this? or impossible?

an rdbms specialized in operations sorting. doing outside db hardly ever come close in performance. sql!

this job (simplified in update):

select t1.usr || '-' || t2.usr, count(*) ct   usr_grp t1 join   usr_grp t2 using (grp_id)   t2.usr > t1.usr   -- prevent dupes , sorted pair group  t1.usr, t2.usr; 

depending on how many overlaps have, potentially produces huge amount of rows, said. never going fast.

raises question: what's purpose of producing millions of rows nobody can process? sure, operation makes sense begin with?

to make faster, ..

  • upgrade! postgresql 8.4 rather outdated now. in particular, postgresql 9.2 had focus on big data. can expect much better performance job this.
    , nobody should running 8.4.0. security reasons alone, missing out on lot of bug-fixes, too. current point-release 8.4.17. quote linked web-site:

we recommend users run latest available minor release whatever major version in use.

  • use integer surrogate key users, deal integers in usr_grp. makes table , indexes smaller , processing faster. if n:m table (usr_grp) has bigger cardinality table usr, should faster, if means additional joins.

select u1.usr  || '-' || u2.usr, count(*) ct   usr_grp t1 join   usr_grp t2 using (grp_id)  join   usr u1 on t1.usr_id = u1.usr_id join   usr u2 on t2.usr_id = u2.usr_id  t2.usr_id > t1.usr_id group  u1.usr_id, u2.usr_id; 
  • create multi-column index (if don't have yet).
    grp_id must come first. why matter?

    create index usr_grp_gu_idx on usr_grp(grp_id, usr_id); 

test case

i took numbers @oldcurmudgeon reported test case , created comparable test case in postgresql.

-> sqlfiddle demo.

~ 250 ms in public test database.
result not ordered (no order by) since hasn't been specified.
compared 2.5 minutes, reported below. factor 600.


Comments

Popular posts from this blog

monitor web browser programmatically in Android? -

Shrink a YouTube video to responsive width -

wpf - PdfWriter.GetInstance throws System.NullReferenceException -