###
Improved Compressed Indexes for Full-Text Document Retrieval

####
Djamal Belazzougui, Gonzalo Navarro, and Daniel Valenzuela

We give new space/time tradeoffs for compressed indexes that answer document
retrieval queries on general sequences. On a collection of *D* documents of
total length *n*, current approaches require at least *|CSA| +
O(n lg D / lg lg D)* or *2|CSA|+o(n)* bits of space, where *CSA* is
a full-text index. Using monotone minimal perfect hash functions (mmphfs), we
give new
algorithms for document listing with frequencies and top-*k* document
retrieval
using just *|CSA|+O(n lg lg lg D)* bits. We also improve current solutions
that use *2|CSA|+o(n)* bits, and consider other problems such as colored
range
listing, top-*k* most important documents, and computing arbitrary
frequencies.
We give proof-of-concept experimental results that show that using mmphfs
may provide relevant practical tradeoffs for document listing with
frequencies.