FIX to Compressed Full-Text Indexes

Level: Small

Not an error, but still making reader's life difficult, is the vagueness about Sadakane's CSA space. Theorem 11 gives the space 1/e n H_0 + O(n log log s), where e stands for epsilon, and is linked to the time to locate, O(log^e n), and to display. This is the original space/time Sadakane reported in his papers.

Next we say that, by using run-length compression on Psi, one can get nH_k (log s + loglog n) + O(n) bits, "while retaining the search times". These refer to counting times. Where did the epsilon go? What happens to the locating and extraction time? In fact, this epsilon does not apply anymore. To find the new locate times, you have to consider the next paragraph, "In practice", where a sampling mechanism is described. Yet, it is not well analyzed until it is used again for the FM-index (where it was first introduced indeed), two paragraphs before Theorem 14. There it is shown that, via sampling, one can get O(log^(1+e) n) time and O(n/log^e n) extra bits of space.

In the "In practice" part of Theorem 13 we get back to Sadakane's CSA, and say that a delta-encoding of the differences in Psi actually get nH_k+O(n loglog s) bits of space, instead of nH_0+O(n loglog s). Note the missing 1/e in the last formula. It appears when one adds higher-order Psi arrays to implement the inverse suffix array. Again, we do not use the inverse suffix array here (as the higher-order Psis are not compressible to H_k as far as we know), but the sampling method.

Finally, it is a mistake to say that German is agglutinating (thanks to Zsuzsanna Lipták for noting this). We referred to languages where many particles are concatenated to form long words that would be phrases in other languages, so cutting the words to find query terms becomes nontrivial. Besides, the right term is "agglutinative".