A Practical Index for Text Retrieval Allowing Errors

Ricardo Baeza-Yates and Gonzalo Navarro

We propose a text indexing technique for approximate pattern matching, which is practical and especially aimed at Information Retrieval (IR). Unlike other indices of this kind, it is able to retrieve any string that approximately matches a given search pattern. Every sequence of a fixed length appearing in the text is stored in the index, together with pointers to all the positions where it appears. The search pattern is cut into pieces so that at least one must match exactly. All the pieces are searched in the index and the union of candidate positions is verified. To reduce space requirements, pointers to blocks instead of exact positions can be used, which increases querying costs. We design an algorithm to optimize the pattern partition into pieces so that the total number of verifications is minimized. This also allows to know in advance the expected cost of the search and the expected relevance of the query to the user. We show experimentally the build time, space requirements and query times of our index, finding that it is a practical alternative for IR.