Approximate String Matching over Ziv-Lempel Compressed Text

Juha Kärkkäinen, Gonzalo Navarro and Esko Ukkonen

We present the first solution to the open problem of performing approximate pattern matching on compressed text. The format we choose is the Ziv-Lempel family, specifically the LZ78 and LZW variants. Given a text of length u compressed into length n, and a pattern of length m, we report all the R occurrences of the pattern in the text allowing up to k insertions, deletions and substitutions, in O(mkn + R) time. The existence problem needs O(mkn) time. We also show that the algorithm can be adapted to run in O(k^2n + min(mkn,m^2(ms)^k) + R) average time, where s is the alphabet size. The experimental results show a speedup over the basic approach for moderate m and small k.