Time-Optimal Top-k Document Retrieval
Gonzalo Navarro and Yakov Nekrich
Let DD be a collection of D documents, which are
strings over an alphabet of size s, of total length n.
We describe a data structure that uses linear space and
and reports k most relevant documents that contain a query
pattern P, which is a string of length p packed in
p / log_s n
words, in time O(p / log_s n + k).
This is optimal in the RAM model in the general case where log D =
Theta(log n), and involves a novel RAM-optimal suffix tree search.
Our construction supports an ample set of important relevance
measures, such as the number of times P appears in a document (called term
frequency), a fixed document importance, and the minimal distance
between two occurrences of P in a document.
When log D = o(log n), we show how to reduce the space
of the data structure from O(n log n) to O(n( log s + log D + log log
bits, and to O(n(log s + log D)) bits in the case of the popular
term frequency measure of relevance, at the price of an additive term
O(log^e_s n) in the query time, for any constant
We also consider the dynamic scenario, where documents
can be inserted and deleted from the collection. We obtain linear space and
query time O(p(log log n)^2 / log_s n + log n + k log log k),
whereas insertions and deletions require O(log^(1+e) n) time per symbol,
for any constant e>0.
Finally, we consider an extended static scenario where an extra parameter
par(P,d) is defined, and the query must retrieve only documents
such that par(P,d)\in [t_1,t_2], where this range is specified at
query time. We solve these queries using linear space and
O(p / log_s n + \log^(1+e) n + k log^e n) time, for any constant
Our technique is to translate these top-k problems into multidimensional
geometric search problems. As a bonus, we describe some
improvements to those problems.