###
Colored Range Queries and Document Retrieval

####
Travis Gagie, Juha Kärkkäinen, Gonzalo Navarro, and Simon Puglisi

Colored range queries are a well-studied topic in computational geometry and
database research that, in the past decade, have found exciting applications
in information retrieval. In this paper, we give improved time and space
bounds for three important one-dimensional colored range queries --- colored
range listing, colored range top-*k* queries and colored range counting ---
and, as a consequence, new bounds for various document retrieval problems on
general
collections of sequences.
Colored range listing is the problem of preprocessing a sequence
*S[1,n]* of
colors so that, later, given an interval [i,i+l-1], we list
the different colors in *S[i,i+l-1]*. Colored range top-*k* queries ask
instead for *k* most frequent colors in the interval. Colored range counting
asks for the number of different colors in the interval.
We first describe a framework
including almost all recent results on colored range listing and document
listing, which suggests new combinations of data structures for these
problems. For example, we give the first compressed data structure
(using *nH_k(S)+o(n log s)* bits, for any *k=o(log_s n)*,
where *H_k(S)* is the *k*-th order empirical
entropy of *S* and *s* the number of different colors in
*S*) that
answers
colored range listing queries in constant time per returned result.
We also give an efficient data structure for document listing whose size is
bounded in terms of the *k*-th order entropy of the library of documents.
We then show how
(approximate) colored top-*k* queries can be reduced to (approximate)
range-mode queries on subsequences, yielding the first efficient data
structure for this problem. Finally, we show how modified wavelet trees can
support colored range counting
using *n H_0(S) + O(n) + o(n H_0(S))* bits,
and answer queries in *O(log l)* time.
As far as we know, this is
the first data structure in which the query time depends only on *l* and
not
on *n*. We also show how our data structure can be made dynamic.