database - Creating an index on sorted data -
I have a text file of sorted data, with items separated by newlines. Example:
...
abc123
abc124
abd123
abd124
abd125
...
Now I want to create an index for this data set. It should (at least) support:
getStringByIndex(n): return the n-th item of the sorted list;
getIndexByString(s): find s among the items and return its index (or -1 if not found);
I've read about indexing algorithms such as hashing and B-trees. A B-tree with a large number of children per node should work well here. But since the data set is already sorted, I wonder if there is a more efficient solution than building a B-tree by inserting the items into it one by one?
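To pin down the semantics of the two operations, here is a naive in-memory reference (my own sketch, not from any library; class and method names are illustrative). It only works when the data fits in RAM, which the question explicitly rules out, but it defines what the disk-backed index must return:

```python
import bisect

class InMemoryIndex:
    """Naive reference implementation: holds the whole sorted list in memory.
    Useful only to define the expected behaviour of the two lookups."""

    def __init__(self, sorted_items):
        self.items = list(sorted_items)

    def get_string_by_index(self, n):
        # n-th item of the sorted list
        return self.items[n]

    def get_index_by_string(self, s):
        # Binary search; return the index of s, or -1 if absent.
        i = bisect.bisect_left(self.items, s)
        if i < len(self.items) and self.items[i] == s:
            return i
        return -1
```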
Since the data is sorted, you can locate content efficiently while keeping only a small, sparse subset of the data in memory. For example, let's decide to store every N-th element in memory. For efficient initialization of the API, you'd want to compile this sparse list into a separate file on disk, so you don't have to stream through 100 GB of data to rebuild it.
For each of these terms, you also need to save the disk offset, relative to the head of the file, at which the term starts. Then, once you've loaded the sparse list of string/offset pairs into memory, the implementations of the two requests become straightforward:
getStringByIndex(n):
take the floor(n/N)-th string/offset pair from the list
seek to that offset position in the data file
read/skip n mod N strings, then return the next one
getIndexByString(s):
binary search on the sparse list in memory to locate the lower- and upper-bound string/offset pairs
if the matching string/offset pair is in the i-th position of our sparse list, the string is the (N × i)-th string in our index, and we can use that information to compute the return value
if the string we want isn't in memory:
seek to the lower-bound offset in the data file
read strings until we either a) find a match, b) reach the upper-bound offset, or c) reach a string lexicographically greater than the one we're looking for
else return the index of the matching string from our sparse list
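The two lookups above might be sketched as follows (my own illustration, assuming a newline-delimited data file and a `sparse` list of (string, offset) pairs sampled every N-th line; names are hypothetical):

```python
import bisect

class SparseIndex:
    """Disk-backed index over a sorted, newline-delimited file, keeping only
    every `every_n`-th (string, byte_offset) pair in memory."""

    def __init__(self, data_path, sparse, every_n):
        self.data_path = data_path
        self.sparse = sparse      # [(string, byte_offset), ...]
        self.every_n = every_n

    def get_string_by_index(self, n):
        # Seek to the floor(n/N)-th anchor, then skip n mod N lines.
        anchor = n // self.every_n
        if anchor >= len(self.sparse):
            raise IndexError(n)
        _, offset = self.sparse[anchor]
        with open(self.data_path, "rb") as f:
            f.seek(offset)
            for _ in range(n % self.every_n):
                f.readline()
            line = f.readline()
        if not line:
            raise IndexError(n)
        return line.rstrip(b"\n").decode()

    def get_index_by_string(self, s):
        # Binary search the in-memory sparse list for the anchor at or below s.
        keys = [k for k, _ in self.sparse]
        i = bisect.bisect_right(keys, s) - 1
        if i < 0:
            return -1                        # s sorts before everything
        if keys[i] == s:
            return i * self.every_n          # exact hit on an anchor string
        # Otherwise scan forward from the anchor until we find a match,
        # pass s lexicographically, or reach the next anchor / EOF.
        with open(self.data_path, "rb") as f:
            f.seek(self.sparse[i][1])
            f.readline()                     # skip the anchor line itself
            for j in range(1, self.every_n):
                line = f.readline()
                if not line:
                    return -1
                cur = line.rstrip(b"\n").decode()
                if cur == s:
                    return i * self.every_n + j
                if cur > s:
                    return -1
        return -1
```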
If the strings in the index are fixed-width, you can make even greater optimizations.
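For instance, with fixed-width records the sparse list isn't needed at all: the n-th record starts at byte n × W (where W includes the trailing newline), so getStringByIndex is one seek, and getIndexByString becomes a binary search directly on the file. A sketch under that assumption:

```python
import os

def get_string_by_index_fixed(data_path, n, record_width):
    """Fixed-width records: the n-th record starts at byte n * record_width
    (width includes the trailing newline), so one seek + one read suffices."""
    with open(data_path, "rb") as f:
        f.seek(n * record_width)
        rec = f.read(record_width)
    return rec.rstrip(b"\n").decode()

def get_index_by_string_fixed(data_path, s, record_width):
    """Binary search directly on disk: O(log n) seeks, no in-memory list."""
    total = os.path.getsize(data_path) // record_width
    lo, hi = 0, total - 1
    with open(data_path, "rb") as f:
        while lo <= hi:
            mid = (lo + hi) // 2
            f.seek(mid * record_width)
            cur = f.read(record_width).rstrip(b"\n").decode()
            if cur == s:
                return mid
            if cur < s:
                lo = mid + 1
            else:
                hi = mid - 1
    return -1
```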
You'll want to be careful about the choice of N if you implement this algorithm. Remember, the cost of reading 10 bytes from a position on disk isn't much lower than the cost of reading 10,000 bytes from the same position: it's the overhead of the disk seek, and of getting in and out of the I/O call, that hurts most.