Document Counting in Compressed Space

Travis Gagie, Aleksi Hartikainen, Juha Kärkkäinen, Gonzalo Navarro, Simon Puglisi, and Jouni Sirén

We address the problem of counting the number of strings in a collection where a given pattern appears, which has applications in information retrieval and data mining. Existing solutions are in a theoretical stage. In this paper we implement these solutions and explore compressed variants, aiming to reduce data structure size. Our main result is to uncover some unexpected compressibility properties of the fastest known data structure for the problem. By taking advantage of these properties, we can reduce the size of the structure by a factor of 5-400, depending on the dataset.