Roberto Konow, Gonzalo Navarro, Charles Clarke, and Alex
We introduce a new representation of the inverted index
that performs faster ranked unions and intersections while
using similar space. Our index is based on the treap data structure,
which allows us to intersect/merge the document identifiers
while simultaneously thresholding by frequency, instead
of the costlier two-step classical processing methods.
To achieve compression we represent the treap topology using different
compact data structures. Further, the treap invariants
allow us to elegantly encode differentially both document
identifiers and frequencies. We also show how to extend this representation
to support incremental updates over the index.
Results show that, under the tf-idf scoring scheme, our index uses about
the same space as state-of-the-art compact representations, while performing
up to 2-20 times faster on ranked single-word, union, or intersection
Under the BM25 scoring scheme, our index may use up to 40% more space than
others and outperforms them less frequently, but still reaches improvement
factors of 2-20 in the best cases.
The index supporting incremental updates poses an overhead of 50%-100% over
the static variants in terms of space, construction and query time.