Advantages of Backward Searching -- Efficient Secondary Memory and Distributed Implementation of Compressed Suffix Arrays

Veli Mäkinen, Gonzalo Navarro and Kunihiko Sadakane

One of the most relevant succinct suffix array proposals in the literature is the Compressed Suffix Array (CSA) of Sadakane [ISAAC 2000]. The CSA needs n(H_0+O(log log s)) bits of space, where n is the text size, s is the alphabet size, and H_0 the zero-order entropy of the text. The number of occurrences of a pattern of length m can be computed in O(m log n) time. Most notably, the CSA does not need the text separately available to operate. The CSA simulates a binary search over the suffix array, where the query is compared against text substrings. These are extracted from the same CSA by following irregular access patterns over the structure. Sadakane [SODA 2002] has proposed using backward searching on the CSA in similar fashion as the FM-index of Ferragina and Manzini [FOCS 2000]. He has shown that the CSA can be searched in O(m) time whenever s = O(polylog(n)).

In this paper we consider some other consequences of backward searching applied to CSA. The most remarkable one is that we do not need, unlike all previous proposals, any complicated sub-linear structures based on the four-Russians technique (such as constant time rank and select queries on bit arrays). We show that sampling and compression are enough to achieve O(m log n) query time using less space than the original structure. It is also possible to trade structure space for search time. Furthermore, the regular access pattern of backward searching permits an efficient secondary memory implementation, so that the search can be done with O(m log_B n) disk accesses, being B the disk block size. Finally, it permits a distributed implementation with optimal speedup and negligible communication effort.