database - Creating index on sorted data -


i have text file sorted data, splitted newline. example:

...
abc123
abc124
abd123
abd124
abd125
...

now want create index dataset, should (at least) support:

  1. getstringbyindex(n): return nth item of sorted list;

  2. getindexbystring(s): find s in items, return index (or -1 if not found);

i've read indexing algorithms hashing , b-trees. b-tree field of children size should well. since dateset sorted, wonder if there more effecient solution building b-tree inserting items it?

since data sorted, can locate content , efficiently keeping small, sparse subset of data in memory. example, let's decide store every nth element in memory. efficient initialization of api, you'd want compile sparse list in separate file on disk, don't have stream through 100gb of data it.

for each of these terms, need save disk offset relative head of file term starts. then, you've got load sparse list / offset pairs memory, , implementations of 2 requests become straightforward:

    getstringbyindex(n):         floor(n/n)-th string/offset pair list         seek offset position in index         read/skip n mod n strings, return next 1      getindexbystring(s):         binary search on sparse list in memory             locate lower , upper bound string/offset pairs         if string/offset pair in i-th position in our sparse list,             string (n x i)-th string in our index.             can use information compute return value         if string want isn't in memory:             seek lower-bound offset in index             read strings until we:                 a) find match                 b) reach high-bound offset                 c) reach string lexicographically greater 1 looking         else             return index matching string in our sparse list 

if strings in index fixed-width, can make greater optimizations.

you want careful choice of 'n' algorithm, if implement it. remember, cost of reading 10 bytes position on disk isn't lower cost of reading 10,000 bytes same position: it's overhead of disk seek, , getting in , out of i/o call hurts most.


Comments

Popular posts from this blog

monitor web browser programmatically in Android? -

Shrink a YouTube video to responsive width -

wpf - PdfWriter.GetInstance throws System.NullReferenceException -