Indexes for Highly Repetitive Document Collections

Francisco Claude, Antonio Fariña, Miguel Martínez-Prieto, and Gonzalo Navarro

Indexing highly repetitive collections is becoming a relevant problem due to the emergence of large repositories of versioned documents, among other applications. These collections may reach huge sizes but be formed mostly of documents that are near-copies of each other. Traditional techniques for indexing them fail at exploiting properly their regularities to reduce space.

We introduce new techniques for compressing inverted indexes that are able to exploit this near-copy regularity. They are based on run-length, Lempel-Ziv, or grammar-based compression of the differential inverted lists, instead of gap-encoding them as is the usual practice. We show that, in this highly repetitive setting, our compression methods significantly reduce the space achieved by classical compression, at the price of moderate slowdowns. Moreover, many of our methods are universal, that is, they do not need to know the versioning structure of the collection.

We also introduce compressed self-indexes in the comparison. These are designed for general strings (not only natural language collections) and represent the text collection and an indexing structure (not an inverted index) in integrated form. We show that techniques can compress much further, using a small fraction of the space required by our new inverted indexes, yet they are orders of magnitude slower.