###
Self-Indexed Grammar-Based Compression

####
Francisco Claude and Gonzalo Navarro

Self-indexes aim at representing text collections in a compressed format
that allows extracting arbitrary portions and also offers indexed searching
on the collection. Current self-indexes are unable of fully exploiting the
redundancy of highly repetitive text collections that arise in several
applications. Grammar-based compression is well suited to exploit
such repetitiveness.
We introduce the first grammar-based self-index. It builds on Straight-Line
Programs (SLPs), a rather general kind of context-free grammars. If an SLP
of *n* rules represents a text *T[1,u]*, then an SLP-compressed
representation of *T* requires *2n log_2 n* bits. For that same SLP, our
self-index takes *O(n log n) + n log_2 u* bits. It extracts any text
substring
of length *m* in time *O((m+h) log n)*, and finds *occ* occurrences of a
pattern string of length *m* in time *O((m(m+h)+h occ) log n)*, where
*h*
is the height of the parse tree of the SLP. No previous grammar
representation had achieved *o(n)* search time. \\
%
As byproducts we introduce (i) a representation of SLPs that takes
*2n log_2 n (1+o(1))* bits and efficiently supports more operations
than a plain array of rules; (ii) a representation for binary relations
with labels supporting various extended queries; (iii) a generalization of
our self-index to grammar compressors that reduce *T* to a sequence of
terminals and nonterminals, such as Re-Pair and LZ78.