
Data Compression

In the last section, we defined information mathematically based on a set of desirable properties. From an engineering standpoint, such a definition is useful if it allows us to do some relevant task better. In this section, we study data compression and show how it is related to entropy.

Data compression is possible because real signals do not have a uniform distribution: they contain a lot of redundancy and predictability. This means that data compression can reduce the memory required to store or transmit the information with little or no loss. The departure from complete randomness comes in two forms:

  • some symbols are more probable than others (nonuniform probabilities), and
  • symbols depend on one another, creating predictable patterns.

We will answer two important questions about data compression:

  1. What is the best we can do?
  2. How can we do it?

The answer to the first question is one of the most elegant and useful results in information theory. As we will see, regardless of the computational resources available to us and the type of compression algorithm we use, for a given source of information, there is a strict mathematical limit for the compression ratio that we can achieve.

To answer the second question, we’ll learn about Huffman codes (well-suited for nonuniform probabilities) and Lempel-Ziv compression (great for exploiting dependence and predictable patterns). Both are in widespread everyday use (e.g., the JPEG, MP3, and ZIP formats rely on them).

Fixed- and variable-length codes

First, a bit of terminology:

In a fixed-length code, the same number of bits is used to represent each symbol. A variable-length code uses different numbers of bits for different symbols.

Fixed-length code:

Genome symbol Codeword
A 000
C 001
G 010
T 011
N 100

Variable-length code:

Genome symbol Codeword
A 0
C 10
G 110
T 1111
N 1110

Fixed-Length Codes

Fixed-length codes have a simple decoding process since the boundaries of each codeword are immediately clear. But fixed-length codes can be very wasteful: the same number of bits is used to represent symbols whether they occur frequently or infrequently.

Despite being potentially wasteful, fixed-length codes are common because of their simplicity:


Variable-Length Codes

Variable-length coding is an essential step in efficiently encoding text, audio, and images. Morse code is the longest-lasting electrical signaling code in existence. MP3 uses a variable-length coding method called Huffman coding, which we will discuss later.

The fundamental idea is to use short codewords for common (high probability) symbols and use longer codewords to represent less-probable symbols. Advantages: less storage space, faster transmission. Disadvantage: more difficult to decode (computation/circuit cost).

ASCII and Morse code representations are given for some characters in the table below. The Morse code alphabet consists of ‘.’ (dot, short beep) and ‘-’ (dash, long beep).

Symbol ASCII Binary ASCII Decimal Morse Code

Rank Word   Rank Word
1 the   901 shot
2 of   902 poet
3 and   903 sever
4 a   904 moving
5 to   905 mass
6 in   906 plane
7 is   907 proper
8 be   908 propose
9 that   909 drink
10 was   910 obviously

Other examples of variable-length codes:

Prefix codes – dealing with variable length

One difficulty with variable-length codes is determining when one codeword ends and another begins. There are two simple ways to solve this problem:

  1. Using a separator symbol: a space in English, a comma in CSV. This is generally wasteful, and especially so in binary, since we have to use either 0 or 1 as the separator.
    Design a binary variable-length code for representing capital English letters using method 1 above. If all letters are equally likely, what is the expected number of bits needed to encode one letter?

  2. Prefix(-free) codes: A prefix code assigns a unique bit string to each possible symbol in the message such that no codeword is a prefix of another codeword. Here is a prefix code mapping the digits 0-9 to binary strings (a decoding sketch in code is given after this list):
    $$0 \to 0000, 1 \to 0001, 2 \to 001, 3 \to 0100, 4 \to 0101, 5 \to 0110, 6 \to 0111, 7 \to 100, 8 \to 101, 9 \to 11$$
    Decode 0010011100000100 (from left to right).
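
To see why prefix-freeness makes decoding unambiguous, here is a minimal Python sketch that decodes a bit string against the digit code above, reading bits from left to right:

```python
# A minimal sketch: decoding a prefix code by reading bits left to right.
# The codebook is the digit code given above.
code = {
    "0": "0000", "1": "0001", "2": "001", "3": "0100", "4": "0101",
    "5": "0110", "6": "0111", "7": "100", "8": "101", "9": "11",
}
decode_table = {cw: sym for sym, cw in code.items()}  # invert the codebook

def decode(bits: str) -> str:
    """Greedy decoding; unambiguous because no codeword is a prefix of another."""
    out, buf = [], ""
    for b in bits:
        buf += b
        if buf in decode_table:           # a complete codeword has been read
            out.append(decode_table[buf])
            buf = ""
    assert buf == "", "leftover bits: not a valid encoding"
    return "".join(out)

print(decode("0010011100000100"))  # the string from the exercise above
```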


Prove that UTF-8 is a prefix code.

Entropy and expected code length for iid sequences

What is the smallest number of bits per symbol needed to store data produced by a source?

Let’s make this question more precise. First, we assume the data is produced by a source according to a distribution. As an example, consider a DNA sequence with symbols {A,C,G,T,N} where

\begin{equation}\label{eq:probs} P(A)=P(T) = 1/4, \quad P(C)=P(G) = P(N) = 1/6 \end{equation}

and also consider the following code

Genome symbol Codeword
A 00
T 01
C 100
G 101
N 111

The lengths of the binary codewords used for different symbols are different, but we can find the expected code length. For $x\in\{A,C,G,T,N\}$, let $e(x)$ be the codeword representing $x$ and let $l(x)$ be the length of $e(x)$. Then the expected code length is

\[L = \sum_x l(x)P(X=x).\]
For the code in the table above and the probabilities given in \eqref{eq:probs}, find the expected code length.


For the fixed- and variable-length codes given in the tables above and the probabilities given in \eqref{eq:probs}, find the expected code lengths.


If we encode an iid sequence of length $N$ using a code with expected length $L$, then the expected length of the binary representation of this sequence is $NL$. By the frequentist interpretation of probability, the actual length will also be close to $NL$. So, approximately, we use $L$ bits per symbol, and we are after codes whose expected code length is as small as possible.
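
As a quick sanity check of the formula $L = \sum_x l(x)P(X=x)$, here is a small Python sketch that computes the expected code length of the code in the table above under the probabilities in \eqref{eq:probs}:

```python
# Expected code length L = sum_x l(x) P(X = x) for the DNA code above.
probs = {"A": 1/4, "T": 1/4, "C": 1/6, "G": 1/6, "N": 1/6}   # from \eqref{eq:probs}
code  = {"A": "00", "T": "01", "C": "100", "G": "101", "N": "111"}

L = sum(probs[x] * len(code[x]) for x in probs)
print(f"L = {L:.3f} bits/symbol")
```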

Consider a loaded die with the following probabilities: $$P(1) = P(2) = 1/4,\quad P(3)=P(4)=P(5) = 3/20,\quad P(6)=1/20.$$
  1. Consider the code: $$1\to 00,\quad 2\to 01,\quad 3\to 111, \quad 4\to 100, \quad 5\to 101, \quad 6\to 110$$ What is the expected length of this code?
  2. What is the expected length of encoding an iid sequence of 10 die rolls?
  3. Find the length of the encoding of the following sequence of die rolls: 5144225213. Compare with previous part.
  4. If we use a fixed-length code, how many bits would we need to represent a sequence of 10 die rolls?


For a source with a given distribution, what is the smallest possible code length?

A fundamental lower bound

Let us start with a lower bound that is always satisfied, regardless of how good our code is:

The fundamental lower bound on expected length: The expected code length is not less than the entropy, i.e., $$L \geq H.$$

We will provide an informal proof of this statement:

This result is both deep and useful. It holds regardless of what method we use to find the code or what resources we use. The lower bound is mathematical and cannot be violated. In practice, if we get close to it, we know that we are achieving the theoretical limit, making a substantially better code impossible (so looking for it would be pointless).

Find the entropy of the source of information given in \eqref{eq:probs} and compare it with the expected code length for the code in the table above, as well as for the fixed- and variable-length codes given earlier.


Find the entropy of the die roll given in the example above. How does the entropy compare with the expected code length?


A close upper bound

We now know that the expected code length cannot be smaller than the entropy. But how close to the entropy can we get? What is the expected length of the best code?

As an example, let us consider storing DNA data, where the symbols are {A, C, G, T}. The length of the fixed-length code is 2 bits/symbol. This is actually equal to the entropy when all symbols are equally likely, since the entropy is $\log_2 4 = 2$ bits per symbol. So for equally likely outcomes, our code matches the entropy of the signal and this is the best possible.

But what if some symbols are more probable than others? Then the entropy will be less than 2 bits/symbol, and our fixed-length code is no longer necessarily optimal. For example, suppose

\[P(A)=1/2,\quad P(C)=1/4,\quad P(G)=1/8,\quad P(T)=1/8.\]

Then, the entropy is $H=1.75$ bits,

\begin{align}\label{eq:entGen} H &= P(A) I(A) + P(C) I(C) + P(G) I(G) + P(T) I(T) \\ & =\frac12\times 1 + \frac14\times 2 + \frac18\times 3 + \frac18\times 3 = 1.75\ bits\nonumber \end{align}

which is less than 2. Can we encode this source so that we use 1.75 bits/symbol? Yes, we can use a variable-length code:

Symbol (x) A C G T
Probability $P(x)$ 1/2 1/4 1/8 1/8
Self-information $I(x)$ $\log_2 2=1$ $\log_2 4=2$ $\log_2 8=3$ $\log_2 8=3$
Codeword $e(x)$ 0 10 110 111
Codeword length $l(x)$ 1 2 3 3

What is the expected code length?

\begin{align}\label{eq:entGen2} L &= P(A) l(A) + P(C) l(C) + P(G) l(G) + P(T) l(T) \\ & =\frac12\times 1 + \frac14\times 2 + \frac18\times 3 + \frac18\times 3 = 1.75\ bits\nonumber \end{align}

The value is exactly equal to the entropy. Not only that, equation \eqref{eq:entGen} for computing the entropy and equation \eqref{eq:entGen2} for computing the expected code length look almost identical. This is because we have managed to choose a codeword for each symbol whose length equals the self-information of that symbol. In other words, $l(x) = I(x)$ for all outcomes.
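
The same comparison can be reproduced in a few lines of Python; since every codeword length equals the self-information of its symbol, the entropy and the expected code length coincide:

```python
from math import log2

# The DNA example above: P(A)=1/2, P(C)=1/4, P(G)=P(T)=1/8.
probs = {"A": 1/2, "C": 1/4, "G": 1/8, "T": 1/8}
code  = {"A": "0", "C": "10", "G": "110", "T": "111"}

H = sum(p * log2(1 / p) for p in probs.values())      # entropy
L = sum(probs[x] * len(code[x]) for x in probs)       # expected code length
print(H, L)                                           # both equal 1.75 bits/symbol
```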

But of course, this is not possible if $I(x)$ is not always an integer. As an example, let us revisit our die roll example above:

Symbol 1 2 3 4 5 6
Probability, $P(x)$ 1/4 1/4 3/20 3/20 3/20 1/20
Self-information, $I(x)$ 2 2 2.74 2.74 2.74 4.32
Codeword, $e(x)$            
Codeword length $l(x)$            

What is the best way to fill in the rest of this table? We can’t have codewords with length $I(x)$ for all symbols $x$ but we can always find codewords whose length is $\lceil I(x)\rceil$, where $\lceil y\rceil$ is the smallest integer that is equal to or larger than $y$, e.g., $\lceil 2.7\rceil=3, \lceil 2\rceil=2$. So we can fill in the table as

Symbol 1 2 3 4 5 6
Probability, $P(x)$ 1/4 1/4 3/20 3/20 3/20 1/20
Self-information, $I(x)$ 2 2 2.74 2.74 2.74 4.32
$\lceil I(x)\rceil$ 2 2 3 3 3 5
Codeword, $e(x)$ 00 01 100 101 110 11110
Codeword length $l(x)$ 2 2 3 3 3 5


Find the expected length of the code for the code in the table above.


Note that $\lceil y\rceil < y+1$. Hence,

\[L = \sum_x P(x) l(x) = \sum_x P(x) \lceil I(x)\rceil < \sum_x P(x) (I(x)+1) = \sum_x P(x) I(x) + \sum_x P(x) = H + 1.\]
Shannon Coding: We can always find a code such that the length of each codeword satisfies $l(x) = \lceil I(x)\rceil$. Such a code is called a Shannon code. The expected length of the Shannon code satisfies $$L<H+1.$$

Note that we didn’t provide a way to find a Shannon code since we will discuss a better method called the Huffman code later.
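
Even without constructing the codewords, the Shannon code lengths $\lceil I(x)\rceil$ and the bound $L < H+1$ are easy to check numerically; here is a small sketch for the loaded die above:

```python
from math import ceil, log2

# The loaded die from the example above.
probs = {1: 1/4, 2: 1/4, 3: 3/20, 4: 3/20, 5: 3/20, 6: 1/20}

lengths = {x: ceil(log2(1 / p)) for x, p in probs.items()}   # Shannon lengths ceil(I(x))
L = sum(p * lengths[x] for x, p in probs.items())            # expected Shannon-code length
H = sum(p * log2(1 / p) for p in probs.values())             # entropy

print(lengths)             # {1: 2, 2: 2, 3: 3, 4: 3, 5: 3, 6: 5}
print(H, L, L < H + 1)     # the bound L < H + 1 holds
```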

Putting the upper and lower bounds together, we arrive at Shannon’s source coding theorem:

Mathematics of Information Tenet II (Shannon's source coding theorem): The expected length of an optimal code satisfies $$H\le L < H+1.$$

Optimal codes

While Shannon codes can get close to the entropy, they are not necessarily optimal. For example, in the table above, we could simply choose 111 to represent 6. Even after such obvious improvements are applied, this coding method still may not be optimal.

Binary search: toward optimal codes

Suppose you are participating in a 20-question game with the goal of identifying one of the following foods, which your friend has chosen at random:

What is your strategy?

The optimal strategy is binary search:

For the food items in the image above, this strategy can be implemented by dividing the possible items into the left half and the right half. Then, after identifying the half that contains the chosen food, we divide it again into a left half and a right half. At this point we will have arrived at a column that contains the selected food. We cannot divide three items into equal halves, so we divide them into {top item, middle item} and {bottom item}. If it turns out to be in {top item, middle item}, we will ask one more question to determine which one it is.

For example, if the item is 🍕 then your questions and your friend’s answers will be:

Write the questions/answers for an item other than pizza.


An informal proof of the optimality of this method is as follows: Binary search works well because you get the maximum amount of information for every question. Since the two sets are as close to half the items as possible, the entropy of each answer is either 1 or as close to 1 as possible. So on average, you will ask the smallest number of questions.

What is the expected number of questions, assuming your friend chooses the food at random with all foods having equal probability?


Consider an alternative strategy in which you ask about each item in order. That is, you first ask "Is it a burger?". If the answer is no, you ask "Is it pancakes?". If the answer is no, you ask "Is it cheese?" and so on. What is the expected number of questions in this case? Is it smaller or larger than binary search?


20 questions and codes

We can represent our strategy for 20 questions graphically, as shown below. Specifically, our first question is “Is the food in the left half?” If the answer is “yes”, we follow the branch labeled with 0, which gives us the new set of options. If “no,” we follow the branch labeled with 1. We continue in the same manner. The structure shown below is called a tree. A tree consists of nodes and edges; the edges connect the nodes. Each node in the tree indicates a set of possible choices, and the two edges going out of each node divide the possible choices into two sets.

We can then turn our 20-questions game into a prefix code for the food items. The bits represent answers to the questions, “0” for “yes” and “1” for “no”. Since the questions are predetermined, we don’t actually have to ask them, and the bit sequence uniquely identifies the selected food. For example, pizza is 0100.

The expected number of questions is equal to the average length of the code and very close to the entropy:

\[L = \frac4{12}\times 3 + \frac8{12}\times 4 \approx 3.67\ bits/symbol\] \[H = \log_2 12 \approx 3.58\ bits/symbol\]

The above discussion shows a bijection between trees and prefix codes.

Suppose your friend rolls a die and you want to find out what it is using the minimum number of questions. Draw a tree for this problem and find the corresponding codeword for each value in the set {1,2,3,4,5,6}.


Unequal probabilities

Now let us consider a problem in which the probabilities are unequal: Suppose your friend has chosen a random point in the rectangle below and you want to find out its color with a minimum number of questions. The probability of each color is proportional to the area and is given in each rectangle.


What is the entropy of the color of the random point? Design a strategy for finding the color of the point and draw the corresponding tree. Two examples are given below. Your solution must be different from these but doesn't have to be optimal. What is the codeword corresponding to each color and what is the expected code length?


The entropy of the color is $0.4\log (10/4)+0.1\log 10+0.2\log(10/2)+0.3\log(1/0.15) \approx 2.15$ bits. One possibility is the following tree. The numbers in parentheses show the probabilities of the colors. R stands for Red, B for Blue, and so on.

               RGBPY (1)
            _______|_______________
           |                       |
          RG (.5)                BPY (0.5)
         __|____                ___|____
        |       |              |        |
       R (.4)   G (0.1)       B (0.2)   PY (0.3)
                                      __|_________
                                     |            |
                                    P (0.15)     Y (0.15)

If we assign 0 to the left branch and 1 to the right branch, then the codeword for P is 110. The expected code length is 2.3 bits/symbol.

In the tree above we have seemingly divided the probability in half at each step, to the extent possible. It turns out we still haven’t obtained the best possible code: greedily splitting the probability as evenly as possible at each step does not always produce an optimal code. The optimal answer is given below, along with the probabilities and the codewords:

                RGBPY (1)
             ______|___________
            |                  |
           R (.4)            GBPY (.6)
             0            _____|__________
                         |                |
                      GY(.25)           BP (.35)
                   ___|___              __|______
                  |       |            |         |
                G (.1)   Y (0.15)    B (.2)      P(.15)
                 100      101         110         111
Find the expected code length for the optimal code given above.


Huffman codes

It may seem that to find the optimal code, we must divide the probability into halves at each step. But as we saw in the color example above, seemingly optimal answers may not actually be optimal. So is there a way to find the optimal code? Yes, the answer is Huffman codes, which are extensively used in various file formats for storage and communications. For a long time, many people searched for this optimal algorithm, including Shannon and Fano, who was Huffman’s professor (more about the story later). The problem was that everyone was trying to divide the probability mass into halves using a top-down approach. That is, they thought the first step should be to divide the whole set of outcomes into two sets with nearly equal probabilities. Huffman realized that the solution is a bottom-up approach, first considering the outcomes with the smallest probabilities.

The following observations are critical to realizing a bottom-up approach is optimal.

In an optimal code:

  • less probable symbols never have shorter codewords than more probable ones;
  • the two least probable symbols have codewords of the same (maximal) length;
  • the codewords of the two least probable symbols can be chosen to differ only in their last bit, i.e., to be siblings in the code tree.

A Huffman code is a prefix code with a binary tree structure, generated using the following algorithm (a code sketch is given after the list):

  1. Initialize: each symbol gets its own node.
  2. Iterate until a single tree is created:
    • Choose the two nodes with the lowest probabilities.
    • Create a new node and assign to it the sum of the two nodes’ probabilities.
    • Connect the two original nodes to the new node with two edges.
  3. Assign 0 to each left branch and 1 to each right branch (or vice versa).
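
Here is a minimal Python sketch of this procedure using a heap. Ties between equal probabilities may be broken differently than in the hand-worked examples below, so individual codewords can differ, but the resulting expected code length is still optimal.

```python
import heapq

def huffman_code(probs: dict) -> dict:
    """Build a Huffman code following the bottom-up procedure described above.

    `probs` maps symbols to probabilities; the result maps symbols to bit strings.
    """
    # Each heap entry is (probability, tie-breaker, {symbol: codeword-so-far}).
    heap = [(p, i, {sym: ""}) for i, (sym, p) in enumerate(probs.items())]
    heapq.heapify(heap)
    counter = len(heap)
    while len(heap) > 1:
        p1, _, c1 = heapq.heappop(heap)       # the two nodes with the lowest probabilities
        p2, _, c2 = heapq.heappop(heap)
        merged = {s: "0" + w for s, w in c1.items()}        # 0 on one branch...
        merged.update({s: "1" + w for s, w in c2.items()})  # ...1 on the other
        heapq.heappush(heap, (p1 + p2, counter, merged))    # the new combined node
        counter += 1
    return heap[0][2]

# Example: the DNA distribution from the worked example below.
probs = {"A": 0.3, "C": 0.24, "G": 0.2, "T": 0.16, "N": 0.1}
code = huffman_code(probs)
print(code)
print("expected length:", sum(probs[s] * len(code[s]) for s in probs))
```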

Let’s see an example. Suppose the following DNA symbols have the given probabilities:

Genome symbol Probability
A 0.3
C 0.24
G 0.2
T 0.16
N 0.1

We start by setting up a node for each outcome, also writing their probabilities in parentheses.

	A (0.3)        C (0.24)         G (0.2)         T (0.16)         N (0.1)


We then identify the two nodes with the smallest probabilities, which are N and T, and combine them:

	A (0.3)        C (0.24)         G (0.2)         T (0.16)         N (0.1)
	                                                |________________|
	                                                NT (0.26)


We now have 4 nodes with probabilities 0.3, 0.24, 0.2, 0.26. The smallest pair are C, G, which we combine:

	A (0.3)        C (0.24)         G (0.2)         T (0.16)         N (0.1)
	               |________________|               |________________|
	               CG (0.44)                        NT (0.26)


We now have 3 nodes with probabilities 0.3, 0.44, 0.26. The smallest pair are those with probabilities 0.26 and 0.3. We rearrange the nodes to be able to combine them. If you are doing this with pen and paper, you don’t need to rearrange them.

	C (0.24)         G (0.2)         A (0.3)        T (0.16)         N (0.1)
	|________________|               |              |________________|
	CG (0.44)                        |______________NT (0.26)
	                                ANT (0.56)


Finally, we combine the last two remaining nodes to obtain a tree:

	C (0.24)         G (0.2)         A (0.3)        T (0.16)         N (0.1)
	|________________|               |              |________________|
	CG (0.44)                        |______________NT (0.26)
	|                                ANT (0.56)
	|________________________________|
	ACGTN (1)

We assign 0 to the left branches and 1 to the right branches,

	 C (0.24)         G (0.2)         A (0.3)        T (0.16)         N (0.1)
	 |                |               |              |                |
	0|                |1             0|             0|                |1
	 |________________|               |              |________________|
	 CG (0.44)                        |___________1__NT (0.26)
	 |                                ANT (0.56)
	0|                                |
	 |________________________________|1
	 ACGTN (1)

leading to the following code:

Genome symbol Probability Codeword
A 0.3 10
C 0.24 00
G 0.2 01
T 0.16 110
N 0.1 111


Find the expected code length and entropy.


Let’s see another example. Consider a loaded die with the following probabilities:

\[P(1)=1/4, \quad P(2)=1/4, \quad P(3)=1/4,\quad P(4)=1/8, \quad P(5)=1/16, \quad P(6)=1/16\]

We start by setting up nodes for each outcome. We also write their probabilities. Then we proceed as in the Huffman algorithm.

1(1/4)     2(1/4)  3(1/4)     4(1/8)   5(1/16)    6(1/16)   Initialization: nodes (p)
 |         |       |          |           |_________|                                              
 |         |       |          |__________56 (1/8)           Combine 5 and 6      =>   56                          
 |         |       |_________456 (1/4)                      Combine 4 and 56     =>   456              
 |_________|     3456 (1/2)                                 Combine 3 and 456    =>   3456   
12(1/2)            |                                        Combine 1 and 2      =>   12
 |_____123456 (1)__|                                        Combine 12 and 3456  =>   123456

In this case, the entropy and the expected length of the Huffman code are both equal to 2.375 bits/symbol as the codeword lengths turn out to be equal to the self-informations.

Symbol $x$ Codeword $e(x)$ $P(x)$ $I(x)$ $l(x)$
1 00 1/4 2 2
2 01 1/4 2 2
3 10 1/4 2 2
4 110 1/8 3 3
5 1110 1/16 4 4
6 1111 1/16 4 4

Note: The average code length of a Huffman code won’t exactly match $H$ unless the probabilities are all powers of $1/2$ (i.e., of the form $2^{-k}$). But it will still be much closer to $H$ for nonuniform data than a fixed-length code.

The story of Huffman codes:

❝In 1951 David A. Huffman and his classmates in an electrical engineering graduate course on information theory were given the choice of a term paper or a final exam. For the term paper, Huffman’s professor, Robert M. Fano, had assigned what at first appeared to be a simple problem. Students were asked to find the most efficient method of representing numbers, letters or other symbols using a binary code. […] Huffman worked on the problem for months, developing a number of approaches, but none that he could prove to be the most efficient. Finally, he despaired of ever reaching a solution and decided to start studying for the final. Just as he was throwing his notes in the garbage, the solution came to him. “It was the most singular moment of my life,” Huffman says. “There was the absolute lightning of sudden realization.”[…] Huffman says he might never have tried his hand at the problem — much less solved it at the age of 25 — if he had known that Fano, his professor, and Claude E. Shannon, the creator of information theory, had struggled with it. “It was my luck to be there at the right time and also not have my professor discourage me by telling me that other good people had struggled with this problem,” he says. ❞

Scientific American, Sept. 1991

Huffman codes for binary data

For the binary alphabet, if each bit is equally likely to be 0 or 1, then the entropy is 1 and the expected code length of any code will also be 1, so nothing can be done to reduce the size of the data.

Now, suppose we have an information source that produces an iid binary sequence where $P(0)=0.1$ and $P(1)=0.9$. What is the entropy?

\[H = 0.1\log 10 + 0.9\log \frac1{0.9} = 0.469\ bits\ per\ symbol\]

What is the Huffman code? Given that there are already only two symbols, the Huffman code is the same as the original symbols: $0\to 0, 1\to 1$. The expected length of the Huffman code is $L=1$. So the condition

\[H\le L < H+1\]

is satisfied. Yet the Huffman code does not provide any advantage, even though the entropy is well below 1.

Block encoding to improve code length: The solution is to combine bits to create more than two symbols. This is called block encoding. Let’s group every pair of symbols and, using the iid property, find their probabilities:

\[P(00) = 0.01,\quad P(01)= P(10) = 0.09,\quad P(11) = 0.81.\]

The Huffman code is given by

  00 (0.01)       01 (0.09)      10 (0.09)     11(0.81)
    |_________________|             |             |   
                  00,01(0.1)        |             |  
                      |_____________|             |  
                   00,01,10 (0.19)                | 
                         |________________________|
                       00,01,10,11 (1)

So the expected code length is

\[0.01\cdot 3 + 0.09\cdot 3 + 0.09\cdot 2 + 0.81\cdot 1 = 1.29\ bits\]

But this is the expected length for representing two symbols. The expected number of bits used to encode each input bit is 1.29/2 = 0.645, which is less than 1 and only 0.176 bits away from entropy.
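
Here is a small sketch of block encoding, reusing the `huffman_code` helper from the Huffman sketch above: build the block probabilities from the iid assumption, construct a Huffman code for the blocks, and measure the expected number of code bits per input bit.

```python
from itertools import product
from math import prod

# Reuses huffman_code() from the sketch in the Huffman codes section above.
p = {"0": 0.1, "1": 0.9}   # iid bits with P(0) = 0.1, P(1) = 0.9

def bits_per_input_bit(n: int) -> float:
    """Expected Huffman-code length per input bit when encoding blocks of n bits."""
    block_probs = {"".join(blk): prod(p[b] for b in blk)
                   for blk in product("01", repeat=n)}
    code = huffman_code(block_probs)
    L_block = sum(block_probs[b] * len(code[b]) for b in block_probs)   # bits per block
    return L_block / n

print(bits_per_input_bit(2))   # 0.645, matching the calculation above
```

Increasing `n` brings the per-bit length closer to the entropy of 0.469 bits, at the cost of a codebook that grows exponentially in `n`.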

Find the Huffman code and its expected length per binary symbol if we group each three input binary bits into a block.


If we group $n$ symbols, then the Huffman code will be at most $1/n$ bits away from entropy. Why?

Compression ratio

Find the code produced by the Huffman coding above and use it to encode the binary sequence 11110111110111111111. Find and compare the lengths of the original and encoded binary sequences.


When the input sequence and the encoded sequence are both binary, we define the compression ratio (CR) as

\[\text{CR}=\frac{\text{Length of original sequence}}{\text{Length of encoded sequence}}, \qquad (\text{binary input})\]

For the preceding exercise, the compression ratio (CR) is

\[\text{CR} = \frac{20}{14}=1.43.\]

If the input sequence is non-binary, but rather the symbols are from an alphabet with $M$ symbols, then the compression ratio is defined in reference to the shortest fixed-length binary code:

\[\text{CR}=\frac{\text{Length of original sequence}\times \lceil\log_2 M\rceil}{\text{Length of encoded sequence}}, \qquad (\text{non-binary input with $M$ symbols}).\]

For example, if the sequence \(AACGTAN\) is encoded into binary as \(0,0,100,101,110,0,111\) (commas added for clarity), then the compression ratio is \(7\times 3/15=1.4\).

Clearly, the higher the compression ratio the better.

If we are not dealing with a file and are instead discussing a probabilistic source of information and a code without actually using it for compression, we can define the compression ratio for the code as

\[\text{CR of a Code}=\frac{1}{\text{Expected Code Length Per Symbol}}\ .\]

For example, for the code of the previous exercise, the compression ratio is

\[\text{CR} = \frac{1}{0.645}=1.55,\]

which is reasonably close to the empirical CR of 1.43.

Similarly, if the input is non-binary with $M$ distinct symbols,

\[\text{CR of a Code}=\frac{\lceil\log_2 M\rceil}{\text{Expected Code Length Per Symbol}}, \qquad (\text{non-binary input with $M$ symbols}).\]
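
Both definitions reduce to a single computation, since for a binary input $\lceil\log_2 2\rceil = 1$; here is a short sketch reproducing the numbers above:

```python
from math import ceil, log2

def compression_ratio(num_symbols: int, encoded_bits: int, alphabet_size: int = 2) -> float:
    """CR relative to the shortest fixed-length binary code for the input alphabet."""
    return num_symbols * ceil(log2(alphabet_size)) / encoded_bits

print(compression_ratio(20, 14))                   # the binary exercise above: about 1.43
print(compression_ratio(7, 15, alphabet_size=5))   # the AACGTAN example: 1.4
```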

The codebook

When encoding data using Huffman codes or any other code, we need to also store the code, so that we can later use it for decoding. This information is referred to as the codebook or the dictionary. In practical problems, the number of bits used for the dictionary is typically much smaller than the length of the sequence, so we usually ignore it when computing the size of the compressed sequence.

Other coding methods

For a given block size (the number of symbols grouped together), the Huffman code is optimal. But to achieve a good code length, we may have to use large blocks. The issue is that the number of possible blocks quickly grows with the size of the block. For example, for binary, there are 4 possible blocks of length 2 bits, 8 blocks of length 3 bits, and so on. This leads to complex encoding and decoding tasks. In such situations, there are simpler alternatives:

Compression of Markov sequences and other non-iid data

Shannon’s result also holds for non-iid data, including Markov chains. In this case, it says that if the data sequence is long, the best expected code length that can be achieved is essentially equal to the entropy per symbol of the random sequence. We will discuss two approaches for achieving this: using Huffman codes on blocks of symbols, and the Lempel-Ziv algorithm.

Huffman codes, again

Huffman codes can also be used when there is dependence between the sequence’s symbols, such as an image, where adjacent pixels are likely to be similar. But in this case, we have to create blocks larger than one symbol to be able to take advantage of the dependence between symbols. Let us see two examples:

A Markov chain

Let us consider the Markov chain with transition matrix:

  🌞 ☁️ 🌧️
🌞 .7 .2 .1
☁️ .1 .7 .2
🌧️ .2 .1 .7

The following sequence of length around 300 is generated by this Markov chain:

🌞🌞🌞🌞☁️☁️☁️☁️☁️☁️☁️☁️☁️🌞🌞🌞🌞🌞☁️🌧️🌧️🌧️🌧️🌧️🌞🌞🌞☁️🌧️🌧️🌞🌞🌞🌞🌞🌧️☁️🌞☁️🌧️🌧️🌧️🌞☁️🌧️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️🌧️🌞🌞🌞🌞🌞🌞🌞🌞🌞🌞☁️🌧️🌧️🌧️🌧️🌧️🌞🌞🌞🌞🌞🌞🌧️🌧️🌧️🌧️🌧️🌞🌞🌞🌞🌞🌧️🌧️🌧️🌧️🌞🌞🌞🌞🌞🌞🌞🌞🌞🌞🌞🌞🌞🌧️🌞🌞🌞🌞🌞🌞🌞🌧️☁️☁️☁️☁️☁️🌧️☁️☁️🌧️🌧️🌧️🌧️🌧️🌧️☁️🌞🌧️🌧️🌧️🌧️🌧️🌧️🌧️🌧️🌞🌞🌞🌞🌞☁️☁️☁️☁️☁️🌞☁️🌞☁️☁️☁️☁️☁️🌧️🌧️🌞🌞🌞☁️🌞☁️☁️☁️☁️🌧️🌞🌞🌞🌞🌞🌧️🌧️☁️🌧️🌧️🌧️🌧️🌧️🌧️🌞🌞🌞🌧️🌧️🌞🌞☁️🌞🌞🌧️🌧️☁️☁️☁️🌧️🌧️🌧️🌞☁️☁️☁️☁️🌧️🌧️🌞🌞🌞🌞🌞🌞🌞🌞🌞🌞🌞🌧️🌧️🌧️🌧️🌧️🌧️🌧️🌧️🌞🌞☁️🌧️🌧️☁️☁️☁️☁️☁️🌧️🌞☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️☁️🌞🌧️☁️☁️☁️☁️☁️🌧️🌧️🌞☁️☁️🌞🌞☁️🌞☁️🌧️🌧️🌧️🌧️🌧️🌧️🌧️🌧️🌧️☁️☁️☁️☁️☁️☁️☁️☁️🌧️🌧️🌧️🌞🌞🌞☁️🌞🌞🌞☁️🌧️🌧️🌞🌞🌞🌞🌞🌞🌞🌞☁️☁️☁️☁️🌧️🌞🌞🌞🌧️☁️🌞☁️☁️☁️🌧️🌧️🌧️🌞☁️☁️🌧️🌧️🌧️☁️☁️🌞🌧️🌧️🌞🌞🌞🌞🌞🌞🌞🌞🌞🌞🌞🌞🌧️🌧️🌧️🌧️🌞🌞🌞🌞🌞☁️☁️☁️☁️☁️☁️☁️🌧️🌧️🌧️

How should we represent this data? The three weather types appear roughly the same number of times, so we cannot assign a short codeword to the most common one. We shouldn’t focus on individual symbols but rather on patterns, since Markov chains are designed to represent relationships and patterns. More specifically, we should assign short codewords to common patterns. For the above example, we can form blocks by grouping each pair of symbols. Then we can assign short codewords to 🌞🌞, ☁️☁️, and 🌧️🌧️, and longer codewords to the other possibilities.

As a simpler example, consider the transition matrix,

\[A= \left(\begin{array}{cc} 0.9 & 0.1\\ 0.1 & 0.9\end{array}\right).\]

State 1 may for example correspond to cloudy weather and state 2 to sunny weather (it does not rain, ever). To keep our alphabet binary, we use 0 instead of 2. The following sequence of length 50 is generated by this Markov chain:

\[00000000001000000000000111110000000011111111111111\]

We compress it using Huffman coding. We use the data itself to estimate the probabilities rather than the transition matrix of the Markov chain. We group the symbols into blocks of size 2:

\[00, 00, 00, 00, 00, 10, 00, 00, 00, 00, 00, 01, 11, 11, 00, 00, 00, 00, 11, 11, 11, 11, 11, 11, 11\]

Then

\[P(00)=\frac{14}{25},\quad P(11)=\frac{9}{25}, \quad P(01)=\frac1{25},\quad P(10) = \frac1{25},\]

leading to the Huffman code

\[00\to 0, \quad 11\to 10,\quad 01\to 110,\quad 10\to 111.\]

Then the sequence can be encoded as

\[0 0 0 0 0 111 0 0 0 0 0 110 10 10 0 0 0 0 10 10 10 10 10 10 10\]

which has length 38, or 0.76 bits per symbol, and a compression ratio of $50/38=1.32$. The entropy of the Markov chain is 0.469 bits per symbol. If we had a longer sequence and used larger blocks, the Huffman code would be able to get closer to the entropy.
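
A sketch of this calculation, again reusing the `huffman_code` helper from the Huffman sketch above: estimate the block probabilities from the sequence itself, build the code, and count the encoded bits.

```python
from collections import Counter

# Reuses huffman_code() from the sketch in the Huffman codes section above.
seq = "00000000001000000000000111110000000011111111111111"   # the length-50 sample above

blocks = [seq[i:i + 2] for i in range(0, len(seq), 2)]             # blocks of size 2
probs = {b: c / len(blocks) for b, c in Counter(blocks).items()}   # empirical probabilities

code = huffman_code(probs)
encoded = "".join(code[b] for b in blocks)
print(len(encoded), len(encoded) / len(seq))                       # 38 bits, 0.76 bits/symbol
```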

Consider a Markov chain with the transition matrix $$B= \left(\begin{array}{cc} 0.1 & 0.9\\ 0.9 & 0.1\end{array}\right),$$ which has produced the following sequence of length 50 $$01011011010101010101010101010110101010101011011010$$ Use a Huffman code with an appropriate block size to compress the sequence and compare the length of the resulting sequence with the entropy of the Markov chain and the length of the original sequence.


An image

Consider the eye image we saw before:

We can group $2\times 2$ blocks of pixels as blocks for Huffman code. There will be many completely black blocks and some completely white blocks. There will be a few blocks with mixed colors too. The Huffman code will assign short codewords to black and to white blocks and long codewords to mixed blocks, leading to a smaller number of bits than the 744 original bits.

Lempel-Ziv, the universal compression method

Suppose we are given a text file to compress. What should we take as symbols? Letters? Words? Sentences?

Lempel-Ziv (LZ) compression does not use or track probabilities. It also operates “on the fly”, reading the data one symbol at a time. In fact, Lempel-Ziv is a universal compression method as it does not need to know the probabilities beforehand and so can be applied to any type of data. It is used in all-purpose compression tools such as gzip.

LZ vs. Huffman coding:

The Lempel-Ziv algorithm

LZ works by identifying sequences of symbols that have appeared before and instead of repeating them, it points to their previous occurrence. The algorithm is given below in detail, but it’s optional.

  1. The sequence is divided into “phrases”, i.e., the shortest sequences that have not appeared before.
    • For example, if the sequence is aaabababaaabba..., we divide it as a,aa,b,ab,aba,aab,ba,....
    • The first symbol is a, which obviously has not been seen before, so this becomes a phrase.
    • The second symbol is a, but this is already a phrase, so we read the next symbol. This gives us aa, which has not been seen before, so aa becomes the second phrase.
    • The next phrase is b, as b has not appeared before. And so on.
  2. Assign a number to each phrase, from 1 to $C$, in order, where $C$ is the number of phrases.
    • a $\to$ 1, aa $\to$ 2, b $\to$ 3, ab $\to$ 4, aba $\to$ 5, aab $\to$ 6, ba $\to$ 7
  3. If we discard the last symbol of each phrase, what remains has occurred before (as an earlier phrase or as the empty string). So we can represent each phrase as (n,s), where n is the index of that previous occurrence and s is the last symbol. n is referred to as a pointer. We do this for all the phrases.
    • For example, for the phrase aab, aa has occurred before as phrase 2. So we can represent aab as the pair (2,b).
    • We represent phrases that contain a single symbol s as (0,s).
    • The whole sequence can then be represented as (0,a),(1,a),(0,b),(1,b),(4,a),(2,b),(3,a).
  4. Choose a binary sequence for each possible symbol s.
    • For example, we may choose a $\to$ 0 and b $\to$ 1.
  5. Represent each pair (n,s) by representing n as a binary number and representing s as its binary encoding.
    • Since the largest pointer is 4, we need 3 bits to represent the pointers.
    • Our sequence then becomes (000,0),(001,0),(000,1),(001,1),(100,0),(010,1),(011,0).
  6. We can remove the parentheses and commas since the length of each part is fixed.
    • Our encoded sequence becomes: 0000001000010011100001010110.
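
Here is a minimal Python sketch of this parsing (often called LZ78), applied to the toy sequence above; the pointer width and the symbol encoding follow the steps in the list.

```python
from math import ceil, log2

def lz_encode(seq: str, symbol_bits: dict) -> str:
    """Sketch of the Lempel-Ziv (LZ78-style) parsing described above."""
    phrases = {"": 0}      # phrase -> index; the empty phrase has index 0
    pairs = []             # (pointer to previous phrase, last symbol)
    current = ""
    for s in seq:
        if current + s in phrases:   # still a previously seen phrase: keep reading
            current += s
        else:                        # a new phrase: record it and start over
            pairs.append((phrases[current], s))
            phrases[current + s] = len(phrases)
            current = ""
    # (For simplicity this sketch assumes the sequence ends exactly at a phrase boundary.)
    ptr_bits = max(1, ceil(log2(max(ptr for ptr, _ in pairs) + 1)))   # bits per pointer
    return "".join(format(ptr, f"0{ptr_bits}b") + symbol_bits[s] for ptr, s in pairs)

encoded = lz_encode("aaabababaaabba", {"a": "0", "b": "1"})
print(encoded, len(encoded))   # 0000001000010011100001010110, 28 bits
```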

In our toy example, the sequence actually became longer. This is because the example is too small. As mentioned before, Lempel-Ziv is optimal when the size of the data is large.


Write a question that you still have about this section.