FIX to Average Complexity of Exact and Approximate Multiple String Matching

Level: Serious

Unfortunately there is an error in the lower bound for multiple exact string matching, more specifically in Lemma 6. Although the symbols of a random pattern are chosen independently, the candidates cannot be regarded independently. Given a random pattern, one can choose the sequence of tested positions specifically for the pattern. We assume (without noticing it) that the way the window is sampled is independent of the pattern. Under this assumption our lower bound proof is correct. However, it is possible to break it by adapting the sampling of the window to the search pattern at hand. Of course the result of the paper does hold, just our proof is wrong. The right proof can be derived by extending Yao's original paper. Our goal in this paper was to provide a simpler proof, as Yao proved stronger results. Unfortunately we failed to do so.

This issue was communicated to us by Ralph Stiebe, Dept. of Computer Science, Otto von Guericke University (stiebe@iws.cs.uni-magdeburg.de). He showed that the formula for the probability c_t is incorrect for sigma=2, r=1, t=2, m=4, ... ,8. His detailed example follows:

I give the sequences of comparisons for the possible patterns of length 4:

Pattern 1st accessed position 2nd accessed position Probability to exclude all candidates
0000 4 arbitrary 0.5 (if first access gives symbol 1)
0001 4 1 0.25 (first access 1, second access 1)
0010 4 2 0.25 (first access 1, second access 1)
0011 4 6 (if first access 0) 0.25 (second access 0)
2 (if first access 1) 0.25 (second access 1)
0100 4 3 0.25 (first access 1, second access 1)
0101 4 5 (if first access 0) 0.25 (second access 0)
3 (if first access 1) 0.25 (second access 1)
0110 3 5 0.25 (first access 0, second access 0)
0111 4 5 0.25 (first access 0, second access 0)

Analogous probabilities hold for patterns starting with symbol 1. Hence, the probability to have excluded all candidates after 2 accesses is 22/64 or 88/256, which is greater than (3/4)^4=81/256. For larger values of m, the ratio between the probability to exclude all candidates with 2 accesses and (3/4)^m increases. For m=6, the probability to exclude all candidates with 2 accesses is 58/256=928/4096, while (3/4)^6=729/4096. The relevance of patterns, that can be handled with 2 accesses with probability 0.5, vanishes. Basically, one counts the number of patterns with a (2m-1,2)-certificate.