###
Time-Optimal Top-*k* Document Retrieval

####
Gonzalo Navarro and Yakov Nekrich

Let *DD* be a collection of *D* documents, which are
strings over an alphabet of size *s*, of total length *n*.
We describe a data structure that uses linear space and
and reports *k* most relevant documents that contain a query
pattern *P*, which is a string of length *p* packed in
*p / log_s n*
words, in time *O(p / log_s n + k)*.
This is optimal in the RAM model in the general case where *log D =
Theta(log n)*, and involves a novel RAM-optimal suffix tree search.
Our construction supports an ample set of important relevance
measures, such as the number of times *P* appears in a document (called term
frequency), a fixed document importance, and the minimal distance
between two occurrences of *P* in a document.
\n\n
When *log D = o(log n)*, we show how to reduce the space
of the data structure from *O(n log n)* to *O(n( log s + log D + log log
n))*
bits, and to *O(n(log s + log D))* bits in the case of the popular
term frequency measure of relevance, at the price of an additive term
*O(log^e_s n)* in the query time, for any constant
*e>0*.
\n\n
We also consider the dynamic scenario, where documents
can be inserted and deleted from the collection. We obtain linear space and
query time *O(p(log log n)^2 / log_s n + log n + k log log k)*,
whereas insertions and deletions require *O(log^(1+e) n)* time per symbol,
for any constant *e>0*.
\n\n
Finally, we consider an extended static scenario where an extra parameter
*par(P,d)* is defined, and the query must retrieve only documents
*d*
such that *par(P,d)\in [t_1,t_2]*, where this range is specified at
query time. We solve these queries using linear space and
*O(p / log_s n + \log^(1+e) n + k log^e n)* time, for any constant
*e>0*.
\n\n
Our technique is to translate these top-*k* problems into multidimensional
geometric search problems. As a bonus, we describe some
improvements to those problems.