Practical Compressed String Dictionaries
Miguel A. Martínez-Prieto, Nieves Brisaboa, Rodrigo Cánovas,
Francisco Claude, and Gonzalo Navarro
The need to store and query a set of strings -- a string dictionary --
arises in many
kinds of applications. While classically these string dictionaries have
accounted for a small
share of the total space budget (e.g., in Natural Language Processing or when
collections), recent applications in Web engines, Semantic Web (RDF) graphs,
many others, handle very large string dictionaries, whose size is a
significant fraction of the
whole data. In these cases, string dictionary management is a scalability
issue by itself.
This paper focuses on the problem of managing large static string dictionaries
main memory space.
We revisit classical solutions for string dictionaries like hashing, tries,
and front-coding, and improve them by using compression techniques.
We also introduce some novel string dictionary representations built on top
recent advances in succinct data structures and full-text indexes.
All these structures are empirically compared on a heterogeneous testbed
real-world string dictionaries. We show that the compressed representations
may use as little as 5\% of the original dictionary size, while supporting
operations within a few microseconds. These numbers outperform the
tradeoffs in many cases. Furthermore, we enhance some
representations to provide prefix- and substring-based searches, which also
competitively. The results show that compressed string dictionaries are a
block for various data-intensive applications in different domains.