Efficiently compressing a byte array
Compressing a byte array is a matter of recognizing repeating patterns within a byte sequence and devising a method that can represent the same underlying information to take advantage of these discovered repetitions.
To get a rough idea of how this works, imagine having a sequence as:
["a" "a" "a" "b" "b" "b" "b" "b" "b" "b" "c" "c"]
It is intuitively more efficient to represent this as:
[3 times "a", 7 times "b", 2 times "c"]
Now, we are going to use a methodology based on a well-known algorithm, that is, the LZ77 compression method. LZ77 is, despite being quite old, the basis of most of all the well-known and currently used compression methods, especially the Deflate algorithm.
Note
Deflate is at the heart of the ZIP family of compression algorithms. It uses a slightly modified version of LZ77 plus a special encoding, that is, the Huffman encoding.
The point of LZ77 is to walk a sequence and recognize a pattern in the past elements that will occur in the upcoming elements, replacing those with a couple of values: how many elements should go backwards in order to locate the recurring pattern, which is called "distance"; and how long is the recurring pattern, which is labeled as "length".
The iteration of the LZ77 compression would look as follows:
- At any point of time, the algorithm is processing a particular element, which is located at the current position. Consider a window of the size n, as a set of n elements preceding the one that is occupying current position, and consider lookahead as the rest of the elements up until the input's end.
- Begin with the first element of the input.
- Move on to the next element.
- Find in the window (that is, past n elements), the longest pattern that can be found in lookahead.
- If such a sequence is found, consider distance as the location where, the matching sequence was found, expressed in regards to the current position, consider length as the length of the matching pattern, and proceed with the two following actions:
- Replace the match in lookahead by "distance" and "length".
- Move forward using the "length" elements and resume algorithm execution at step 4.
- Otherwise, resume at step 3.
The procedure to uncompress is as follows:
- Walk the compressed sequence.
- If the "distance" and "length" are found, go back to the "distance" elements and replace this couple with the "length" elements.
- If not, lay out the element that you've found.
Let's see this in action in Clojure!
How to do it...
- First of all, here is the
ns
declaration containing the Clojure facilities that we are going to use:(ns recipe1.core (:require [clojure.set :as cset])) ;; => we'll need set operations later on.
- Let's begin by working on the uncompressing part. First of all, we need an
expand
function that takes the source array as a vector of the elementsdistance
andlength
and generates a repetition of a sub-vector of the lastdistance
characters from the source array until the length is reached:(defn expand [the-vector distance length] (let [end (count the-vector) start (- end distance) ;;=> Here we go backwards 'distance' elements. pattern (subvec the-vector start end)] ;=> We have our pattern. (into [] (take length ;=> We exactly take "length" from (cycle pattern))))) ;; an infinite repetition of our pattern.
- Now, let's define
un-LZ77
usingexpand
function while walking through a sequence of bytes:(defn un-LZ77 [bytes] (loop [result [] remaining bytes] ;;=> We recur over the contents of the array. (if (seq remaining) (let [current (first remaining) the-rest (rest remaining)] ;=> Current element under scrutiny; (if-not (vector? Current) ;=> If it is not a vector, add to result (recur (conj result ;; the very element, and move on. current) the-rest) (recur (into result (expand result ;;=> This is a vector, then we'll expand here and move on (current 0) (current 1))) the-rest))) result))) ;;=> end of recursion, return result.
- Now let's address the topic of compressing. First of all, we need to grab all sub-vectors, as we'll have to find matches between window and lookahead and then pick the longest one among them:
(defn all-subvecs-from-beginning ;;=> this function will generate a set of all sub-vectors starting ;; from begin [v] (set (map #(subvec v 0 %) ;;=> we apply subvec from 0 to all indices from 1 up to the size ;; of the array + 1. (range 1 (inc (count v)))))) (defn all-subvecs ;;=> this function will generate all [v] ; sub-vectors, applying (loop [result #{} ;; all-subvecs from beginning to ;; all possible beginnings. remaining v] (if (seq remaining) (recur (into result (all-subvecs-from-beginning remaining)) (into[] (rest remaining))) ;;=> We recur fetching all sub-vectors for next beginning. result))) ;;=> end of recursion, I return result.
- Now we define a function to grab the longest match in
left array
with the beginning ofright array
:(defn longest-match-w-beginning [left-array right-array] (let [all-left-chunks (all-subvecs left-array) all-right-chunks-from-beginning ;;=> I take all sub-vectors from left-array (all-subvecs-from-beginning right-array) ;;=> I take all sub-vectors from right-array all-matches (cset/intersection all-right-chunks-from-beginning all-left-chunks)] ;;=> I get all the matchings using intersection on sets (->> all-matches (sort-by count >) first))) ;=> Then I return the longest match, sorting them ;; by decreasing order and taking the first element.
- With the longest match function in hand, we need a function to tell us where where is this match exactly located inside the window:
(defn pos-of-subvec [sv v] {:pre [(<= (count sv) (count v))]} ;;=> I verify that sv elements are less than v's. (loop [cursor 0] (if (or (empty? v) ;;=> If on of the vectors is empty (empty? sv) (= cursor (count v))) ;; or the cursor ended-up exiting v, nil ;; we return nil (if (= (subvec v cursor ;; => If we found that the v sub-vector (+ (count sv) ;; beginning with cursor up to sv count cursor)) sv) ;; is equal to sv cursor ;; we return cursor, this is where the match is. (recur (inc cursor)))))) ;=> We recur incrementing the cursor
- Armed with the toolbox we've built so far, let's devise an LZ77 step:
(defn LZ77-STEP [window look-ahead] (let [longest (longest-match-w-beginning window look-ahead)] ;;=> We find the Longest match, (if-let [pos-subv-w (pos-of-subvec longest window)] ;;=> If there is a match we find its position in window. (let [distance (- (count window) pos-subv-w) ;;=> the distance, pos-subv-l (pos-of-subvec longest look-ahead) ;;=> the position of the match in look-ahead the-char (first (subvec look-ahead (+ pos-subv-l (count longest))))] ;;=> the first element occuring after the match {:distance distance :length (count longest) :char the-char}) ;;=> and we return information about match {:distance 0 :length 0 :char (first look-ahead)}))) ;;=> We did not find a match, we emit zeros for "distance" ;; and "length", and first element of lookahead as first char ;; occurring after the (non-)match.
- Finally, we will write the main LZ77 compression function as follows:
(defn LZ77 [bytes-array window-size] (->> (loop [result [] cursor 0 window [] look-ahead bytes-array] ;;=> we begin with position 0; and everything as look-ahead. (if (empty? look-ahead) result ;;=> end of recursion, I emit result. (let [this-step-output (LZ77-STEP window look-ahead) distance (:distance this-step-output) length (:length this-step-output) literal (:char this-step-output) ;;=> We grab informations about this step output raw-new-cursor (+ cursor length 1) new-cursor (min raw-new-cursor (count bytes-array)) ;;=> We compute the new-cursor, that is, where to go in the next ;; step ;;which is capped by count of bytes-array new-window (subvec bytes-array (max 0 (inc (- new-cursor window-size))) new-cursor) ;;=> new window is window-size elements back from new cursor. new-look-ahead (subvec bytes-array new-cursor )] ;;=> new look-ahead is everything from new cursor on. (recur (conj result [distance length] literal) new-cursor new-window new-look-ahead)))) ;; and we recur with the new elements. (filter (partial not= [0 0])) ;;=> We eliminate the entries related to non-matches (filter (comp not nil?)) ;;=> and any nils (into []))) ;;=> and make a vector out of the output.
That's it! Now, let's see our code in action. Input into your REPL as follows:
(LZ77 ["a" "b" "c" "f" "a" "b" "c" "d"] 5) ;; => ["a" "b" "c" "f" [4 3] "d"] (un-LZ77 ["a" "b" "c" "f" [4 3] "d"]) ;; => ["a" "b" "c" "f" "a" "b" "c" "d"]