### Efficient Fully-Compressed Sequence Representations

#### Jérémy Barbay, Francisco Claude, Travis Gagie, Gonzalo Navarro, and Yakov Nekrich

We present a data structure that stores a sequence *S[1..n]* over
alphabet *[1..s]* in *n H0(S) + o(n)(H0(S)+1)* bits, where
*H0(S)* is the zero-order entropy of *S*.
This structure supports the queries *access*, *rank* and *select*,
which are fundamental building blocks for many other compressed data
structures, in worst-case time *O(lg lg s)*
and average time *O(lg H0(S))*.
The worst-case complexity matches the best previous results, yet these had
been achieved with data structures using *n H0(S) + o(n lg s)* bits.
On highly compressible sequences the *o(n lg s)* bits of the redundancy
may be significant compared to the *n H0(S)* bits that encode the data.
Our representation, instead, compresses the redundancy as well. Moreover,
our
average-case complexity is unprecedented.
Our technique is based on partitioning the alphabet into groups of characters
of similar frequency. The subsequence corresponding to each group can then be
encoded using fast uncompressed representations without harming the overall
compression ratios, even in the redundancy.
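
To make the reduction concrete, the following Python sketch (illustrative only:
the names and the exact class-assignment rule are ours, and plain list scans
stand in for the compressed class sequence and the fast per-group rank/select
structures of the real data structure) shows how *access*, *rank* and *select*
on *S* decompose into the same queries on a class sequence *K* and on one
subsequence per group.

```python
from collections import defaultdict
from math import log2


class AlphabetPartitionedSequence:
    """Illustrative sketch of alphabet partitioning.  Symbols of similar
    frequency share a class; access/rank/select on S decompose into the
    same queries on the class sequence K and on one subsequence per class.
    Naive list scans stand in for the compressed and fast uncompressed
    structures used by the real representation."""

    def __init__(self, S):
        n = len(S)
        freq = defaultdict(int)
        for a in S:
            freq[a] += 1
        # Class of a symbol: roughly lg(n / frequency), so that symbols of
        # similar frequency land in the same class (one possible rule).
        self.cls = {a: int(round(log2(n / f))) for a, f in freq.items()}
        # Local code of each symbol within its class, and the reverse map.
        self.local, counters = {}, defaultdict(int)
        for a in sorted(freq):
            c = self.cls[a]
            self.local[a] = counters[c]
            counters[c] += 1
        self.symbol = {(self.cls[a], l): a for a, l in self.local.items()}
        # K[i] = class of S[i]; sub[c] = local codes of class c, in text order.
        self.K = [self.cls[a] for a in S]
        self.sub = defaultdict(list)
        for a in S:
            self.sub[self.cls[a]].append(self.local[a])

    def access(self, i):                 # S[i], positions are 0-based
        c = self.K[i]
        j = self.K[:i + 1].count(c) - 1  # rank_c(K, i), naive scan here
        return self.symbol[(c, self.sub[c][j])]

    def rank(self, a, i):                # occurrences of a in S[0..i]
        c, l = self.cls[a], self.local[a]
        j = self.K[:i + 1].count(c)      # rank_c(K, i)
        return self.sub[c][:j].count(l)  # rank of the local code in sub[c]

    def select(self, a, j):              # position of the j-th a, j >= 1
        c, l = self.cls[a], self.local[a]
        occ = [p for p, x in enumerate(self.sub[c]) if x == l]
        p = occ[j - 1]                   # select of the local code in sub[c]
        seen = 0                         # map back with select_c(K, p + 1)
        for i, k in enumerate(self.K):
            if k == c:
                seen += 1
                if seen == p + 1:
                    return i
```

The point of the grouping is that each per-class subsequence has a small local
alphabet whose symbols have nearly uniform frequencies, so it can be stored in
a fast uncompressed form without hurting the overall space bound.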

The result also improves upon the best current compressed
representations of several other data structures.
For example, we achieve
*(i)* compressed redundancy, retaining the best time complexities, for the
smallest existing full-text self-indexes;
*(ii)* compressed permutations *p* with times for
*p()* and
*p^{-1}()* improved to loglogarithmic (see the sketch after this paragraph); and
*(iii)* the first compressed representation of dynamic collections of
disjoint sets.
We also point out various applications to inverted indexes,
suffix arrays, binary relations, and data compressors.
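
To make item *(ii)* concrete, the sketch below (illustrative; it follows the
known run-based encoding of permutations, and all names are ours) shows how a
permutation *p* becomes a sequence query problem: *p* is split into maximal
ascending runs, *T[v]* records the run that produced value *v*, and then *p()*
is a *select* on *T* while *p^{-1}()* is an *access* plus a *rank*.

```python
from bisect import bisect_right


class RunEncodedPermutation:
    """Sketch of a run-based reduction from permutations to sequences.
    Maximal ascending runs cover the positions; T[v] is the run that
    produced value v.  Naive list scans stand in for the compressed
    sequence representation of T."""

    def __init__(self, p):                # p: a 0-based permutation of 0..n-1
        n = len(p)
        self.run_start = [0]              # first position of each ascending run
        for i in range(1, n):
            if p[i] < p[i - 1]:
                self.run_start.append(i)
        self.T = [0] * n                  # T[v] = run containing value v
        for i, v in enumerate(p):
            self.T[v] = bisect_right(self.run_start, i) - 1

    def apply(self, i):                   # p(i)
        r = bisect_right(self.run_start, i) - 1
        k = i - self.run_start[r] + 1     # i is the k-th position of run r
        seen = 0                          # select_r(T, k): k-th value of run r
        for v, run in enumerate(self.T):
            if run == r:
                seen += 1
                if seen == k:
                    return v

    def inverse(self, v):                 # p^{-1}(v)
        r = self.T[v]                     # access(T, v)
        k = self.T[:v + 1].count(r)       # rank_r(T, v)
        return self.run_start[r] + k - 1
```

With a compressed sequence representation in place of the list scans, both
operations cost a constant number of *access*, *rank* and *select* calls,
consistent with the loglogarithmic times claimed. In the same spirit, when *S*
is a text viewed as a sequence of term identifiers, *select* directly retrieves
the entries of a term's posting list, which is the flavor of reduction behind
the inverted-index applications.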

Our structure is practical on large alphabets. Our experiments show that,
as predicted by theory, it dominates the space/time tradeoff map
of all the alternative sequence representations, in both synthetic and
application scenarios.