### Practical Approaches to Reduce the Space Requirement of Lempel-Ziv-Based Compressed Text Indices

#### Diego Arroyuelo and Gonzalo Navarro

Given a text *T[1..n]* over an alphabet of size *s*,
the *full-text search* problem consists in locating
the *occ* occurrences of a given pattern *P[1..m]* in *T*.
*Compressed full-text self-indices* are
space-efficient representations of the text that
provide direct access to and indexed search on it.
The LZ-index of Navarro is a compressed full-text self-index
based on the LZ78 compression algorithm.
This index requires about 5 times the size of the compressed text
(in theory, *4 n Hk(T) + o(n log s)* bits of space, where
*Hk(T)* is the
*k*-th order empirical entropy of *T*).
In practice, the average locating complexity of the LZ-index is
*O(s m log_s n + occ s^(m/2))*. It can extract text substrings of length
*l* in *O(l)* time. This index outperforms competing schemes both in
locating short patterns and in extracting text snippets.
However, the LZ-index can be up to 4 times larger than the smallest
existing indices (which use *n Hk(T) + o(n log s)* bits in theory),
and it does not offer space/time tuning options.
This limits its applicability.
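
For readers unfamiliar with LZ78, the parsing that the LZ-index builds on can be sketched as follows. This is only a minimal illustration of LZ78 phrase parsing, not of the LZ-index data structures themselves; the function names `lz78_parse` and `lz78_decode` and the dictionary-based trie are assumptions made for this example.

```python
def lz78_parse(text):
    """Split text into LZ78 phrases.

    Each phrase is the longest previously seen phrase plus one new character.
    Returns a list of (prefix_phrase_id, char) pairs; id 0 is the empty phrase.
    """
    trie = {}      # maps (parent_phrase_id, char) -> phrase id
    phrases = []
    node = 0       # current position in the phrase trie (0 = root)
    for ch in text:
        child = trie.get((node, ch))
        if child is not None:
            node = child                      # keep extending the current match
        else:
            phrases.append((node, ch))        # new phrase = known phrase + ch
            trie[(node, ch)] = len(phrases)   # phrase ids start at 1
            node = 0
    if node != 0:
        # The text ended inside an already known phrase; emit it without a new char.
        phrases.append((node, ""))
    return phrases


def lz78_decode(phrases):
    """Rebuild the text from the (prefix_phrase_id, char) pairs."""
    strings = []   # strings[i] is the text of phrase i + 1
    for prefix_id, ch in phrases:
        prefix = strings[prefix_id - 1] if prefix_id > 0 else ""
        strings.append(prefix + ch)
    return "".join(strings)
```

For instance, `lz78_parse("abababa")` yields the phrases `a`, `b`, `ab`, `aba`, each encoded as a reference to an earlier phrase plus one character.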

In this paper we study practical ways to reduce the space of the LZ-index.
We obtain new LZ-index variants that require
*2(1+e) n Hk(T) + o(n log s)* bits of space, for any *0 < e < 1*.
They have an average locating time of
*O((1/e)(m log n + occ s^(m/2)))*, while extracting takes *O(l)* time.
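
The space bounds above are stated in terms of *Hk(T)*. As a rough illustration, the sketch below computes the *k*-th order empirical entropy using its standard definition: the zero-order entropy of the characters that follow each length-*k* context, weighted by how often that context occurs and divided by *n*. The function names `h0` and `empirical_entropy` are hypothetical, and the code is meant for small examples rather than large texts.

```python
from collections import Counter, defaultdict
from math import log2


def h0(counts):
    """Zero-order empirical entropy (bits per symbol) of a symbol histogram."""
    total = sum(counts.values())
    return -sum(c / total * log2(c / total) for c in counts.values())


def empirical_entropy(text, k):
    """k-th order empirical entropy Hk(text), in bits per symbol."""
    n = len(text)
    if n == 0:
        return 0.0
    if k == 0:
        return h0(Counter(text))
    # For every length-k context, count the characters that follow it.
    followers = defaultdict(Counter)
    for i in range(k, n):
        followers[text[i - k:i]][text[i]] += 1
    # Weight each context's zero-order entropy by its number of occurrences.
    return sum(sum(c.values()) * h0(c) for c in followers.values()) / n
```

On a highly repetitive text such as `"abababab"`, `empirical_entropy(text, 0)` is 1 bit per symbol while `empirical_entropy(text, 1)` drops to 0, which is why bounds proportional to *n Hk(T)* can be far below *n log s* bits.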

We perform extensive experimentation and conclude that our schemes are able
to reduce the space of the original LZ-index to about 2/3 of its size, that is,
to around 3 times the compressed text size.
Our schemes are able to extract about 1-2 megabytes of the text
per second, twice as fast as the most competitive alternatives.
Pattern occurrences are located at a rate of up to 1-4 million per second.
This constitutes the best space/time trade-off
when indices are allowed to use 4 times the size of the compressed text or
more.