Smaller Self-Indexes for Natural Language

Alberto Ordóñez, Nieves Brisaboa, and Gonzalo Navarro

Self-indexes for natural-language texts, where these are regarded as token (word or separator) sequences, achieve very attractive space and search time. However, they suffer from a space penalty due to their large vocabulary. In this paper we show that by replacing the Huffman encoding they implicitly use by the slightly weaker Hu-Tucker encoding, which respects the lexical order of the vocabulary, both their space and time are improved.