###
New Space/Time Tradeoffs for Top-k Document Retrieval on Sequences

####
Gonzalo Navarro and Sharma V. Thankachan

We address the problem of indexing a collection *DD = {T_1,T_2,...T_D}*
of *D* string documents of total length *n*, so that we can efficiently answer
{\em top-*k* queries}: retrieve *k* documents most relevant to a pattern
*P*
of length *p* given at query time. There exist linear-space data structures,
that is, using *O(n)* words, that answer such queries in optimal
*O(p+k)* time
for an ample set of notions of relevance. However, using linear space is not
sufficiently good for large text collections. In this paper we explore how far
the space/time tradeoff for this problem can be pushed. We obtain three
results: (1) When relevance is measured as term frequency (number of times
*P*
appears in a document *T_i*), an index occupying *|CSA|+o(n)* bits answers
the query in time *O(tsearch(p)+k lg^2 k lg^e n)*, where
*CSA* is a
compressed suffix array indexing *DD*, *tsearch* is its time to find the
suffix array interval of *P*, and *e>0* is any constant. (2) With the same
measure of relevance, an index occupying *|CSA|+n lg D+ o(n lg
s+n lg D)* bits answers the query in time *O(tsearch(p)+k lg*
k)*,
where *lg* k* is the iterated logarithm of *k*. (3) When the relevance
depends only on the documents, an index occupying *|CSA|+O(n lg lg n)* bits
answers the query in *O(tsearch(p) + k tSA)* time, where *tSA* is the time
the *CSA* needs to retrieve a suffix array cell. On our way, we obtain some
other results of independent interest.