New Techniques for Regular Expression Searching

Gonzalo Navarro and Mathieu Raffinot

We present two new techniques for regular expression searching and use them to derive faster practical algorithms.

Based on the specific properties of Glushkov's nondeterministic finite automaton construction algorithm, we show how to encode a deterministic finite automaton (DFA) using O(m*2^m) bits, where m is the number of characters, excluding operator symbols, in the regular expression. This compares favorably against the worst case of O(m*2^m*|S|)$ bits needed by a classical DFA representation (where S is the alphabet) and O(m*2^(2m)) bits needed by the Wu and Manber approach implemented in Agrep.

We also present a new way to search for regular expressions, which is able to skip text characters. The idea is to determine the minimum length l of a string matching the regular expression, manipulate the original automaton so that it recognizes all the reverse prefixes of length up to l of the strings originally accepted, and use it to skip text characters as done for exact string matching in previous work.

We combine these techniques into two algorithms, one able and one unable to skip text characters. The algorithms are simple to implement, and our experiments show that they permit fast searching for regular expressions, normally faster than any existing algorithm.