algorithm - How to print frequency k words from more than 1TB of data? -


this question asked in interview. wanted know solution.

give text file contains 1 word in each line , size of file more 1tb. task print words frequency k in file.

i didnt answer question fully. guess, started in right way. have use hashing technique , code take atleast o(n) time (since has read through file)

can answer me can done efficiently.

in general class of problems topic of "top k" or "selection" algorithms. here's wikipedia article on general topic: wikipedia: selection algorithm. seems have come vogue "big data" systems generally, , perhaps past previous generation of interviews focused on sorting algorithms long past time when every serious candidate had memorized quicksort , heapsort code.

as practical matter textbook problem "big data" (hadoop, , other map/reduce systems) built. if data distributed on n nodes each can compute separate partial histograms (mapping histogram function on entire data set) , merge results (reducing subtotals grand totals).

for interview scenario popular question because there no simple trick. enumerate on several published approaches in academic literature or tackle problem de novo.

if "vocabulary" relatively small (for example there few tens of thousands of words in typical english lexicon --- quarter million word vocabulary pretty extensive). in case expect count fit in ram typical modern hardware. if "words" in data set more extensive --- more tens or hundreds of millions --- such approach no longer feasible.

conceivably 1 try adaptive or statistical approach. if know there no major clusters of single "word" ... statistically significant sample of data set similar other ... build our histograms , throw away "words" (and counts) more rare others. if data presented stream , not given hard guarantee distribution of terms not feasible approach. if have data in random access filesystem possibly sample data set sparsely , randomly build very likely set of top k * m (where m arbitrary multiple of our desired k elements such fit in ram).

hashing might locate our counters each word, have consider chances of collisions if try keep counts of hashes without keeping "words" in data structure. in general i'd think heap better (possibly including dropping things bottom of in-memory heap out storage heap or trie).

i said "adaptive" earlier because 1 use caching (ultimately statistical modeling) keep frequent "words" in ram , shuffle least frequent out storage (to protect against degenerate data set frequent "words" give way rare word becomes more frequent 1 digs deeper through data set).

while conversational exploration of these considerations might work in interviews, i'd suggest being familiar various sections of wikipedia article i've cited can sketch out psuedo-code @ least 1 or 2 of them , show have academic background in material.

absolutely not neglect discuss distributed processing in interview "top k" class of questions being posed. if clarify constraints of question being posed , acknowledge such problems have been driving force modern "big data" distributed processing systems.

also here's question on same general topic: stackoverflow: efficient way find top k frequent words in big word sequence.


Comments

Popular posts from this blog

ios - iPhone/iPad different view orientations in different views , and apple approval process -

java Extracting Zip file -

C# WinForm - loading screen -