###
Top-*k* Term-Proximity in Succinct Space.

####
Ian Munro, Gonzalo Navarro, Jesper Sindahl Nielsen, Rahul Shah, and
Sharma Thankachan.

Let *D = { T_1, T_2, ..., T_D }* be a collection of *D*
string documents of *n* characters in total, that are drawn from an
alphabet set *S=[s]*. The *top-k document retrieval
problem* is to preprocess *D* into a data structure that, given a
query *(P[1..p],k)*, can return the *k* documents of *D* most
relevant to the pattern *P*. The relevance is captured using a
predefined ranking function, which depends on the set of occurrences
of *P* in *T_d*. For example, it can be the term frequency (i.e.,
the number of occurrences of *P* in *T_d*), or it can be the term
proximity (i.e., the distance between the closest pair of occurrences
of *P* in *T_d*), or a pattern-independent importance score of
*T_d*
such as PageRank. Linear space and optimal query time solutions
already exist for the general top-*k* document retrieval problem. Compressed
and compact space solutions are also known, but only for a few ranking
functions such as
term frequency and importance. However, space efficient data
structures for term proximity based retrieval have been evasive. In
this paper we present the first sub-linear space data structure for
this relevance function, which uses only *o(n)* bits on top of any
compressed suffix array of *D* and solves queries in *O((p+k)
polylog n)* time. We also show that scores that consist of a weighted
combination of term proximity, term frequency, and document importance,
can be handled using twice the space required to represent the text
collection.