Multiple Approximate String Matching

Ricardo Baeza-Yates and Gonzalo Navarro

We present two new algorithms for on-line multiple approximate string matching. These are extensions of previous algorithms that search for a single pattern. The single-pattern version of the first one is based on the simulation with bits of a non-deterministic finite automaton built from the pattern and using the text as input. To search for multiple patterns, we superimpose their automata, using the result as a filter. The second algorithm partitions the pattern in sub-patterns that are searched with no errors, with a fast exact multipattern search algorithm. To handle multiple patterns, we search the sub-patterns of all of them together. The running time achieved is in both cases O (n) for moderate error level, pattern length and number of patterns. They adapt (with higher costs) to the other cases. However, the algorithms differ in speed and thresholds of usefulness. We analyze theoretically when the algorithms can be used, and show experimentally that they are faster than previous solutions in a wide range of cases.