Distribution-aware compressed full-text indexes

Jouni Sirén, University of Helsinki, Finland
10 Octubre, 2012 - 14:30
Sala 33 – DII, República 701

with Paolo Ferragina and Rossano Venturini University of Pisa, Italy


Compressed suffix arrays allow searching for arbitrary substrings of a text efficiently, while storing the text in a similar space as when compressed with common compressors such as gzip or bzip2. They are usually based on the Burrows-Wheeler transform (BWT), a permutation of the text closely related to the suffix array. The standard solution to locat the occurrences of a pattern is to sample a number of text positions explicitly, and to derive the rest of the positions by using the BWT. This takes time proportional to the distance between the located position and the nearest sampled position.

The sampled positions are usually selected at regular intervals, optimizing the worst-case performance. Yet if the query distribution is skewed, we should be able to get better expected case performance by adapting the samples to the distribution. In a recent paper, we accomplished this for a known query distribution. We described an efficient algorithm for selecting the optimal samples by reducing the problem to finding a minimum-weight k-link path in a directed acyclic graph. In our experiments with a real query distribution, using the optimal samples was up to 30 times faster than using the same number of samples at regular intervals. While there are some heuristics that improve the performance with an unknown query distribution, no satisfactory solution for dynamically adapting the samples exists so far.

Jouni Siren - - http://iki.fi/jouni.siren/


Secretaría de Investigación