Block Addressing Indices for Approximate Text Retrieval

Ricardo Baeza-Yates and Gonzalo Navarro

The issue of reducing the space overhead when indexing large text databases is becoming more and more important, as the text collections grow in size. Another subject, which is gaining importance as text databases grow and get more heterogeneous and error prone, is that of flexible string matching. One of the best tools to make the search more flexible is to allow a limited number of differences between the words found and those sought. This is called "approximate text searching", which is becoming more and more popular.

In recent years some indexing schemes with very low space overhead have appeared, some of them dealing with approximate searching. These low overhead indices (whose most notorious exponent is Glimpse) are modified inverted files, where space is saved by making the lists of occurrences point to text blocks instead of exact word positions. Despite their existence, little is known about the expected behavior of these "block addressing" indices, and even less is known when it comes to cope with approximate search.

Our main contribution is an analytical study of the space-time trade-offs for indexed text searching. We study the space overhead and retrieval times as functions of the block size. We find that, under reasonable assumptions, it is possible to build an index which is simultaneously sublinear in space overhead and in query time. This surprising analytical conclusion is validated with extensive experiments, obtaining typical performance figures. These results are valid for classical exact queries as well as for approximate searching.

We apply our analysis to the Web, using recent statistics on the distribution of the document sizes. We show that pointing to documents instead of to fixed size blocks reduces space requirements but increases search times.