Space-efficient Construction of LZ-index

Diego Arroyuelo and Gonzalo Navarro

A compressed full-text self-index is a data structure that replaces a text and in addition gives indexed access to it, while taking space proportional to the compressed text size. The LZ-index, in particular, requires 4uH_k(1+o(1)) bits of space, where u is the text length in characters and H_k is its k-th order empirical entropy. Although in practice the LZ-index needs 1.0-1.5 times the text size, its construction requires much more main memory (around 5 times the text size), which limits its applicability to large texts. In this paper we present a practical space-efficient algorithm to construct LZ-index, requiring (4+e)uH_k+o(u) bits of space, for any constant 0, and O(su) time, being $s$ the alphabet size. Our experimental results show that our method is efficient in practice, needing an amount of memory close to that of the final index.