Improving Semistatic Compression via Pair-Based Coding

Nieves Brisaboa, Antonio Fariña, Gonzalo Navarro and José Paramá

In the last years, new semistatic word-based byte-oriented compressors, such as Plain and Tagged Huffman and the Dense Codes, have been used to improve the efficiency of text retrieval systems, while reducing the compressed collections to 30-35% of their original size.

In this paper, we present a new semistatic compressor, called Pair-Based End-Tagged Dense Code (PETDC). PETDC compresses English texts to 27-28%, overcoming the optimal 0-order prefix-free semistatic compressor (Plain Huffman) in more than 3 percentage points. Moreover, PETDC permits also random decompression, and direct searches using fast Boyer-Moore algorithms.

PETDC builds a vocabulary with both words and pairs of words. Since each symbol in the vocabulary is given a codeword, compression is improved by replacing two words of the source text by a unique codeword.