### Compressing Huffman Models on Large Alphabets

#### Gonzalo Navarro and Alberto Ordóñez

A naive representation of a Huffman model for a text of length *n* over an
alphabet of size *s* requires *O(s log n)* bits. This can be reduced to
*s log s + O(s)* bits using canonical codes. This space overhead over the
entropy can be significant when *s* is comparable to *n*, and it also
dictates the amount of main memory required to compress or decompress.
We design an encoding scheme that requires *O(s log log n)* bits in the
worst case, and typically less, while supporting encoding and decoding of
symbols in *O(log log n)* time. We show that, on various real-life sequences
over large alphabets, our technique reduces the model storage size to around
15% of that of current techniques, while still offering reasonable
compression/decompression times.
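To make the role of canonical codes concrete, the sketch below (a minimal illustration, not the authors' scheme) shows why they let the model be stored in *s log s + O(s)* bits: once codeword lengths are fixed, canonical assignment derives all codewords deterministically from the lengths alone, so only the symbol-to-length mapping must be kept. Function names and the tie-breaking choices are our own assumptions.

```python
import heapq

def huffman_code_lengths(freqs):
    """Compute Huffman codeword lengths from symbol frequencies.

    Heap entries are (weight, tiebreak, symbols); the tiebreak integer
    keeps tuple comparison away from the symbol lists.
    """
    if len(freqs) == 1:
        return [1]
    heap = [(w, i, [i]) for i, w in enumerate(freqs)]
    heapq.heapify(heap)
    lengths = [0] * len(freqs)
    tick = len(freqs)
    while len(heap) > 1:
        w1, _, s1 = heapq.heappop(heap)
        w2, _, s2 = heapq.heappop(heap)
        for sym in s1 + s2:       # every merge deepens the merged leaves
            lengths[sym] += 1
        heapq.heappush(heap, (w1 + w2, tick, s1 + s2))
        tick += 1
    return lengths

def canonical_codes(lengths):
    """Assign canonical codewords: symbols sorted by (length, symbol id)
    receive consecutive binary values, so the lengths alone determine
    every codeword -- the model need not store the codewords themselves."""
    order = sorted(range(len(lengths)), key=lambda s: (lengths[s], s))
    codes = {}
    code = 0
    prev_len = lengths[order[0]]
    for sym in order:
        code <<= lengths[sym] - prev_len   # standard canonical step
        codes[sym] = format(code, '0{}b'.format(lengths[sym]))
        prev_len = lengths[sym]
        code += 1
    return codes
```

For example, frequencies `[5, 1, 1, 2]` yield lengths `[1, 3, 3, 2]`, from which the canonical assignment produces a prefix-free code without ever storing a tree.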