A Lempel-Ziv Text Index on Secondary Storage

Diego Arroyuelo and Gonzalo Navarro

Full-text searching consists in locating the occurrences of a given pattern P[1..m] in a text T[1..u], both sequences over an alphabet of size s. In this paper we define a new index for full-text searching on secondary storage, based on the Lempel-Ziv compression algorithm and requiring 8uH_k(T) + o(u log s) bits of space, where H_k(T) denotes the k-th order empirical entropy of T, for any k=o(log_s u). Our experimental results show that our index is significantly smaller than any other practical secondary-memory data structure: 1.4-2.3 times the text size including the text, which means 39%-65% the size of traditional indexes like String B-trees [Ferragina and Grossi, JACM 1999]. In exchange, our index requires more disk access to locate the pattern occurrences. Our index is able to report up to 600 occurrences per disk access. If we only need to count pattern occurrences, the space can be reduced to about 1.04-1.68 times the text size, requiring about 20-60 disk accesses, depending on the pattern length.