Self-Indexing Natural Language

Nieves Brisaboa, Antonio Fariña, Gonzalo Navarro, Ángeles Places, and Eduardo Rodríguez.

Self-indexing is a concept developed for indexing arbitrary strings. It has been enormously successful to reduce the size ofthe large indexes typically used on strings, namely suffix trees and arrays. Self-indexes represent a string in a space close to its compressed size and provide indexed searching on it. On natural language, a compressed inverted index over the compressed text already provides a reasonable alternative, in space and time, for indexed searching of words and phrases. In this paper we explore the possibility of regarding natural language text as a string of words and applying a self-index to it. There are several challenges involved, such as dealing with a very large alphabet and detaching searchable content from non-searchable presentation aspects in the text. As a result, we show that the self-index requires space very close to that of the best word-based compressors, and that it obtains better search time than inverted indexes (using the same overall space) when searching phrases.