Succinct Suffix Arrays based on Run-Length Encoding.

Veli Mäkinen and Gonzalo Navarro

A succinct full-text self-index is a data structure built on a text T=t_1 t_2 ... t_n, which takes little space (ideally close to that of the compressed text), permits efficient search for the occurrences of a pattern P=p_1 p_2 ... p_m in T, and is able to reproduce any text substring, so the self-index replaces the text. Several remarkable self-indexes have been developed in recent years. They usually take space proportional to n H_0 or n H_k bits, where H_k is the kth order empirical entropy of T. The time to count how many times does P occur in T ranges from O(m) to O(m log n).

In this paper we present a new self-index, called RLFM index for "run-length FM-index", that counts the occurrences of P in T in O(m) time when the alphabet size is sigma=O(polylog(n)). The RLFM index requires n H_k log_2(sigma)+O(n) bits of space for small k. We then show how to implement the RLFM index in practice, and obtain in passing another implementation with different space-time tradeoffs. We empirically compare ours against the best existing implementations of other indexes and show that ours are fastest among indexes taking less space than the text.