Indexing Text with Approximate q-grams

Gonzalo Navarro, Erkki Sutinen and Jorma Tarhio

We present a new index for approximate string matching. The index collects text q-samples, that is, disjoint text substrings of length q, at fixed intervals and stores their positions. At search time, part of the text is filtered out by noticing that any occurrence of the pattern must be reflected in the presence of some text q-samples that match approximately inside the pattern. Hence the index points out the text areas that could contain occurrences and must be verified. The index parameters permit load balancing between filtering and verification work, and provide a compromise between the space requirement of the index and the error level for which the filtration is still efficient. We show experimentally that the index is competitive against others that take more space, being in fact the fastest choice for intermediate error levels, an area where no current index is useful.