Locally Compressed Suffix Arrays

Rodrigo González, Gonzalo Navarro, and Héctor Ferrada

We introduce a compression technique for suffix arrays. It is sensitive to the compressibility of the text and local, meaning that arbitrary portions of the suffix array can be decompressed by accessing mostly contiguous memory areas. This makes decompression very fast, especially when several contiguous cells must be accessed.

Our main technical contributions are the following. First, we show that the runs of consecutive values that are known to appear in the function Psi(i) = A^(-1)[A[i]+1] of suffix arrays A of compressible texts also show up as repetitions in the differential suffix array A'[i] = A[i] - A[i-1]. Second, we use Re-Pair, a grammar-based compressor, to compress the differential suffix array, and we upper-bound its compression ratio in terms of the number of runs. Third, we show how to compact the space used by the grammar rules by up to 50%, while still permitting direct access to the rules. Fourth, we develop specific variants of Re-Pair that work using knowledge of Psi, and that use much less space than the general Re-Pair compressor while achieving almost the same compression ratios. Fifth, we implement the scheme and compare it exhaustively with previous work, including the first implementations of previous theoretical proposals.
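
To make the first two contributions concrete, the following is a minimal Python sketch, not the implementation evaluated in this work. It builds a toy suffix array, derives Psi and the differential array A', checks that cells inside a run of Psi repeat a difference found elsewhere in A', and then applies a naive, quadratic-time Re-Pair to A'. All function names are hypothetical. The sketch also assumes 0-based indices, a text ending in a unique smallest terminator such as "$", A'[0] = A[0], and that Psi wraps from the last text position to the first; the cell affected by that wrap is excluded from the check.

```python
from collections import Counter

def suffix_array(text):
    # Naive construction by sorting suffixes; suitable for toy inputs only.
    return sorted(range(len(text)), key=lambda i: text[i:])

def psi_array(A):
    # Psi[i] = A^(-1)[A[i] + 1]; the last text position wraps to position 0
    # (an assumption of this sketch).
    n = len(A)
    inv = [0] * n
    for i, p in enumerate(A):
        inv[p] = i
    return [inv[(A[i] + 1) % n] for i in range(n)]

def differential(A):
    # A'[i] = A[i] - A[i-1]; A'[0] = A[0] is a convention assumed here.
    return [A[0]] + [A[i] - A[i - 1] for i in range(1, len(A))]

def run_repetitions_hold(A):
    # If Psi[i] = Psi[i-1] + 1 (cell i continues a run), then
    # A'[i] = A'[Psi[i]], unless the wrap-around cell is involved.
    n, psi, D = len(A), psi_array(A), differential(A)
    for i in range(1, n):
        in_run = psi[i] == psi[i - 1] + 1
        wraps = A[i] == n - 1 or A[i - 1] == n - 1
        if in_run and not wraps and D[i] != D[psi[i]]:
            return False
    return True

def repair(seq):
    # Naive Re-Pair: replace the most frequent pair by a new symbol until
    # no pair occurs at least twice (counting non-overlapping occurrences).
    seq, rules = list(seq), {}
    while True:
        counts, last = Counter(), {}
        for i in range(len(seq) - 1):
            pair = (seq[i], seq[i + 1])
            if last.get(pair) != i - 1:  # skip overlaps such as x x x
                counts[pair] += 1
                last[pair] = i
        if not counts:
            break
        pair, freq = counts.most_common(1)[0]
        if freq < 2:
            break
        name = f"R{len(rules)}"  # fresh nonterminal
        rules[name] = pair
        out, i = [], 0
        while i < len(seq):
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
                out.append(name)
                i += 2
            else:
                out.append(seq[i])
                i += 1
        seq = out
    return seq, rules

def expand(symbol, rules):
    # Recursively expand a symbol back into the differences it stands for.
    if symbol in rules:
        left, right = rules[symbol]
        return expand(left, rules) + expand(right, rules)
    return [symbol]
```

For example, calling repair(differential(suffix_array("abababababababab$"))) yields a sequence shorter than the original differential array, and expanding every symbol of that sequence with expand recovers the array exactly. A practical scheme additionally needs sampled absolute values of A to support random access, which this sketch omits.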