New Space/Time Tradeoffs for Top-k Document Retrieval on Sequences

Gonzalo Navarro and Sharma V. Thankachan

We address the problem of indexing a collection DD = {T_1,T_2,...T_D} of D string documents of total length n, so that we can efficiently answer {\em top-k queries}: retrieve k documents most relevant to a pattern P of length p given at query time. There exist linear-space data structures, that is, using O(n) words, that answer such queries in optimal O(p+k) time for an ample set of notions of relevance. However, using linear space is not sufficiently good for large text collections. In this paper we explore how far the space/time tradeoff for this problem can be pushed. We obtain three results: (1) When relevance is measured as term frequency (number of times P appears in a document T_i), an index occupying |CSA|+o(n) bits answers the query in time O(tsearch(p)+k lg^2 k lg^e n), where CSA is a compressed suffix array indexing DD, tsearch is its time to find the suffix array interval of P, and e>0 is any constant. (2) With the same measure of relevance, an index occupying |CSA|+n lg D+ o(n lg s+n lg D) bits answers the query in time O(tsearch(p)+k lg* k), where lg* k is the iterated logarithm of k. (3) When the relevance depends only on the documents, an index occupying |CSA|+O(n lg lg n) bits answers the query in O(tsearch(p) + k tSA) time, where tSA is the time the CSA needs to retrieve a suffix array cell. On our way, we obtain some other results of independent interest.