Improved Grammar-Based Compressed Indexes

Francisco Claude and Gonzalo Navarro

We introduce the first grammar-compressed representation of a sequence that supports searches in time that depends only logarithmically on the size of the grammar. Given a text T[1..u] that is represented by a (context-free) grammar of n (terminal and nonterminal) symbols and size N (measured as the sum of the lengths of the right hands of the rules), a basic grammar-based representation of T takes N lg n bits of space. Our representation requires 2N lg n + N lg u + eps n lg n + o(N lg n) bits of space, for any 0. It can find the positions of the occ occurrences of a pattern of length m in T in O((m^2/eps) lg (lg u / lg n) + (m+occ) lg n) time, and extract any substring of length l of T in time O(l+h lg(N/h)), where h is the height of the grammar tree.