
Fabrice Bellard, the creator of FFmpeg and QEMU, introduces a lossless data compressor which uses neural networks

  • 3 min read
  • 08 Apr 2019

Last month, Fabrice Bellard and his team published a paper titled Lossless Data Compression with Neural Networks. The paper describes lossless data compressors built from pure neural network models based on Long Short-Term Memory (LSTM) and Transformer architectures.

How does this model work?


This lossless data compressor uses the traditional predictive approach:

  • At each time step, the encoder computes the probability vector of the next symbol value using the neural network model, which has seen all the preceding symbols.
  • The actual symbol value is then encoded using an arithmetic encoder, and the model is updated with the knowledge of that symbol value.
  • The decoder works symmetrically: both the encoder and decoder update their models identically, so there is no need to transmit the model parameters.
  • To improve the compression ratio and speed, a preprocessing stage was added. For the small LSTM models, the team reused the text preprocessor of CMIX and lstm-compress to allow a meaningful comparison. The larger models used a subword-based preprocessor in which each symbol represents a sequence of bytes.
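The predictive loop above can be sketched in a few lines. In this sketch a simple adaptive frequency count stands in for the LSTM/Transformer predictor, and the arithmetic coder is reduced to an ideal code-length tally; the names (`AdaptiveModel`, `compressed_size_bits`) are illustrative and not from the paper or LibNC:

```python
import math

class AdaptiveModel:
    """Stands in for the neural network predictor: returns a probability
    for each symbol and is updated after every symbol is coded."""
    def __init__(self, num_symbols):
        self.counts = [1] * num_symbols  # Laplace-smoothed counts

    def predict(self, symbol):
        return self.counts[symbol] / sum(self.counts)

    def update(self, symbol):
        self.counts[symbol] += 1

def compressed_size_bits(data, num_symbols=256):
    """Total code length if each symbol is arithmetic-coded with the
    model's current probability estimate (ideal coder, no overhead)."""
    model = AdaptiveModel(num_symbols)
    bits = 0.0
    for symbol in data:
        p = model.predict(symbol)
        bits += -math.log2(p)  # arithmetic coding approaches this bound
        model.update(symbol)   # the decoder performs the identical update
    return bits

print(compressed_size_bits(b"abababababab"))
```

Because the decoder runs the same `update` calls in the same order, it reconstructs the same probabilities from the symbols it has already decoded, which is why no model parameters need to be transmitted.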


The model relies on arithmetic coding, a standard compression technique, and makes it adaptive: the symbol probabilities fed to the coder come from the neural network rather than from fixed frequencies. Here's an example by a Reddit user that explains how this works:

“The rough frequency of 'e' in English is about 50%. But if you just saw this partial sentence "I am going to th", the probability/frequency of 'e' skyrockets to, say, 98%. In a standard arithmetic coding scheme, you would still parametrize your encoder with 50% to encode the next 'e' despite it being very likely (~98%) that 'e' is the next character (you are using more bits than you need in this case), while with the help of a neural network, the frequency becomes adaptive.”
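The gain in that example can be quantified: an ideal arithmetic coder spends about -log2(p) bits on a symbol it assigns probability p, so the context-aware estimate is far cheaper (a quick check; the 50% and 98% figures are the Reddit user's, not the paper's):

```python
import math

# Ideal code length in bits for one symbol coded with probability p
static_cost = -math.log2(0.50)    # static 50% estimate: 1.00 bit
adaptive_cost = -math.log2(0.98)  # context-aware 98% estimate: ~0.03 bits

print(f"{static_cost:.2f} bits vs {adaptive_cost:.2f} bits")
```

The better the model's prediction, the closer the coded symbol gets to costing nothing, which is why a stronger predictor directly translates into a better compression ratio.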

To ensure that both the encoder and decoder use exactly the same model, the authors developed a custom C library called LibNC, which implements the various operations needed by the models. It has no external dependencies and exposes a C API.

Results of the experiment


The model's performance was evaluated on the enwik8 Hutter Prize benchmark. The models show slower decompression speeds: 1.5x slower for the LSTM model and 3x slower for the Transformer model. However, their description is simple and their memory consumption is reasonable compared to other compressors achieving a similar compression ratio.

As for the compression ratio, the models have yet to reach the performance of CMIX, a lossless data compressor that achieves an optimized compression ratio at the cost of high CPU and memory usage. In all the experiments, the Transformer model performs worse than the LSTM model, even though Transformers give the best results in language modeling benchmarks.

To know more in detail, check out the paper, Lossless Data Compression with Neural Networks.
