Compressing Dynamic Text Collections via Phrase-Based Coding

Nieves Brisaboa, Antonio Fariña, Gonzalo Navarro and José Paramá

We present a new statistical compression method, which we call Phrase Based Dense Code (PBDC), aimed at compressing large digital libraries. PBDC compresses the text collection to 30-32% of its original size, keeps the text compressed at all times, and supports efficient on-line information retrieval. The novelty of PBDC is that it supports continuous growth of the compressed text collection by automatically adapting the vocabulary both to new words and to changes in the word frequency distribution, without degrading the compression ratio. Text compressed with PBDC can be searched directly, without decompression, using fast Boyer-Moore-type algorithms. Arbitrary portions of the collection can also be decompressed. Alternative compression methods oriented to information retrieval focus on static collections and are thus less well suited to digital libraries.