
Codes, graphs, and algebra

Error-correcting codes are used for reliable communication and storage in the presence of noise and failures. In this section, we will learn about error detection and correction, graphs and their relationship to error-correcting codes, and distances on graphs, and we will become familiar with some elementary codes based on parity bits and checksums.

Errors and reliability

Let’s start with two examples that again demonstrate the need for error correction.

Reliable data storage

From a 2013 study of a Facebook datacenter:

Amazon, Facebook, Google, Microsoft, etc. continue to expand their datacenters, and these numbers are likely now off by an order of magnitude. How do we deal with this situation? Losing user data is probably the surest way to destroy your company.

Reliable communication

It’s even worse when communicating wirelessly. There is a tremendous amount of noise and interference making communication difficult. How can we watch an HD movie when the wireless link is so unreliable?

Two types of errors

We have already noted that real-world measurements of signals are inherently noisy. If we threshold a received signal to recover binary digits, the noise can cause random errors in the recovered bits. This leads to a substitution, or bit-flip, error.

But there is a second type of error, usually resulting from device failure. Bits stored on flash memory may become inaccessible due to failing cells. If a hard drive fails, the data on it is lost. These are erasure errors, similar to someone erasing your notes.

The difference between substitutions and erasures is that when a substitution occurs, we don’t know which bits are affected, whereas with an erasure we know which positions have been lost. Correcting substitutions is therefore more difficult.

The basics of error control

Error detection and correction

To deal with errors, we have multiple tools at our disposal. These tools can be used to detect, and sometimes correct, bit errors.

Together, error detection and correction are sometimes referred to as error control.

Since noise is random, we can only provide probabilistic guarantees for these methods, instead of strict, deterministic ones. More about that later…

The basic idea

We have seen before that English, like any other language, has error-correction capability, enabling spell check and autocorrect. But how do these work? We can find the answer by asking a related question: why isn’t it possible to spell check or autocorrect email addresses? An email address can be essentially any string of characters, but not every sequence of letters is an English word.

When autocorrect sees a sequence of letters that is not a word, it has two possible choices: it can flag the word as an error (detection), or it can replace it with the closest valid word (correction).

The same idea can be used for correcting bit errors.

Fundamental Tenet V:

Correcting and detecting transmission and storage errors requires restricting the set of possible words (e.g., binary sequences).


Even parity. Suppose we can send 3 bits. We decide that only 3-bit words that have an even number of ones are allowed: {000, 011, 101, 110}

Which errors can be detected?

Which errors can be corrected accurately?
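To make the parity check concrete, here is a minimal Python sketch (the function names are ours, purely for illustration). It tests whether a received 3-bit word still has even parity: a single bit flip always changes the parity and is therefore caught, while two flips produce another valid codeword and go unnoticed.

```python
# Even-parity check for 3-bit words: a word is valid iff it has an even number of 1s.
CODEWORDS = {"000", "011", "101", "110"}

def parity_ok(word: str) -> bool:
    """Return True if the word has an even number of 1s, i.e., is a codeword."""
    return word.count("1") % 2 == 0

def flip(word: str, i: int) -> str:
    """Flip the bit in position i."""
    bits = list(word)
    bits[i] = "1" if bits[i] == "0" else "0"
    return "".join(bits)

# Every single bit flip is detected: the flipped word fails the parity check.
assert all(not parity_ok(flip(cw, i)) for cw in CODEWORDS for i in range(3))

# But two bit flips land on another codeword and therefore go undetected.
print(parity_ok(flip(flip("000", 0), 1)))  # True: 110 is a valid codeword
```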


Suppose only 4-bit sequences with an even number of 1s are allowed. List all permitted binary sequences.

If $1011$ is received, and assuming there is at most 1 error, what could have been transmitted?


Suppose that the set of permitted binary sequences is $\{000,001,010,011\}$. Which errors can be corrected? Which errors can be detected? (Hints: the capability may depend on the position of the error. Also, this isn't a well-designed code.)


Codes, encoding, and decoding

We said that to correct errors, we need to restrict the set of allowed words (English words, binary sequences, …), so that when an error happens, we are likely to find out. This restricted set is called a code. The message is what we need to transmit or store. But instead of sending the message itself, we send a codeword corresponding to that message. While the message can be anything, codewords come from the restricted set, allowing us to detect or correct errors that occur in transmission.

Let’s make our terminology a bit more precise. For each definition, we provide an instance from the 3-bit even-parity code from above and also from English.

Encoding Table
Message Codeword
00 000
01 011
10 101
11 110
Decoding Table
Received word Decoded message
000 00
001 Error
010 Error
011 01
100 Error
101 10
110 11
111 Error

Let’s compare using the even-parity code with using no coding at all. With no coding, if we send 00 and 01 is received due to a bit flip, there is no way for the receiver to detect this. But if we use the even-parity code with the encoding table above and transmit 000, a bit flip may lead to 010. As shown in the decoding table, this error is detected.

The length of a code is the number of bits in each codeword. The number of information bits, or message bits, is the number of bits in each message. For our 3-bit even-parity code, the code length is 3 and the number of message bits is 2. The remaining bits are redundancy bits. We have the following equality:

\[\text{Number of codewords} = 2^{\text{Number of message bits}}.\]
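As a small illustration of the tables above, here is a Python sketch (the names are ours) that implements the encoder and decoder as lookup tables and checks the counting identity.

```python
# The 3-bit even-parity code as lookup tables.
ENCODE = {"00": "000", "01": "011", "10": "101", "11": "110"}
DECODE = {cw: msg for msg, cw in ENCODE.items()}  # invert the encoding table

def decode(received: str) -> str:
    # Any word that is not a codeword is reported as a detected error.
    return DECODE.get(received, "Error")

print(decode("011"))  # '01'    -- a valid codeword
print(decode("010"))  # 'Error' -- e.g., 000 was sent and one bit flipped

# Number of codewords = 2^(number of message bits)
assert len(ENCODE) == 2 ** 2
```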

Graphs and distance

Graphs are a good way to visualize ideas about error control, especially “similarity” between codewords.

A graph $G=(V,E)$ consists of a set of vertices $V$ and a set of edges $E$, where each element of $E$ represents a specific relationship between two vertices. Two vertices with an edge between them are called adjacent.

Examples:

Distance

The distance between two nodes in a graph is the number of edges on the shortest path between them.

Distance $a$ $b$ $c$ $d$ $e$
$a$ 0 1   2  
$b$ 1        
$c$     0    
$d$          
$e$          


Complete the table above.


Hamming distance

The Hamming distance between two words is the number of positions in which they are different. In other words, it is the number of bits that you need to flip to convert one binary sequence to the other.

\[d_H(000,101)=2\]

This coincides with the distance over the graph. This is because flipping one bit allows you to move along one edge, forming a path. For example, to convert 000 to 101 you can first flip the first bit, getting to 100 and then flip the last bit, getting 101.
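The agreement between Hamming distance and graph distance can be checked with a short Python sketch (written by us for illustration): it computes the Hamming distance directly and, separately, the shortest-path distance on the cube graph whose vertices are binary strings and whose edges join strings differing in exactly one bit.

```python
from collections import deque
from itertools import product

def hamming(u: str, v: str) -> int:
    """Number of positions in which u and v differ."""
    return sum(a != b for a, b in zip(u, v))

def cube_distance(u: str, v: str) -> int:
    """Shortest-path distance (via BFS) on the graph whose vertices are all
    binary strings of length len(u) and whose edges join strings that differ
    in exactly one bit."""
    vertices = ["".join(bits) for bits in product("01", repeat=len(u))]
    dist, queue = {u: 0}, deque([u])
    while queue:
        w = queue.popleft()
        if w == v:
            return dist[w]
        for x in vertices:
            if hamming(w, x) == 1 and x not in dist:
                dist[x] = dist[w] + 1
                queue.append(x)

print(hamming("000", "101"))        # 2
print(cube_distance("000", "101"))  # 2, e.g., along the path 000 -> 100 -> 101
```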

Find the Hamming distance for each pair of sequences below:


Two binary sequences of length $5$ are chosen at random. What is the expected Hamming distance between them?





A few simple codes

We now discuss a few simple binary codes. We visualize these codes as graphs, which will help us better understand their error-detection and/or error-correction capabilities. When representing a binary code as a graph, we should indicate which of the vertices/sequences are codewords. The graph for the 3-bit even-parity code is given on the right. The codewords of this code are

\[\{000,011,101,110\},\]

which are shown in blue.

Double redundancy code

Let’s consider first the simple case of detecting an error via double redundancy.
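As a sketch of the idea (our own illustration, with hypothetical function names): each bit is sent twice, and the receiver flags an error whenever the two copies disagree.

```python
# Double redundancy: send each bit twice; declare an error if the copies disagree.

def encode(bits: str) -> str:
    return "".join(b + b for b in bits)

def decode(received: str) -> str:
    out = []
    for i in range(0, len(received), 2):
        if received[i] != received[i + 1]:
            return "Error detected"
        out.append(received[i])
    return "".join(out)

print(encode("10"))    # '1100'
print(decode("1100"))  # '10'
print(decode("0100"))  # 'Error detected' -- the two copies of the first bit disagree
```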

The even-parity code

We are only allowed to send 3-bit words that have an even number of 1s.


The triple redundancy code

Send each bit three times! For decoding, detect errors but do not try to correct them!


The same code can behave differently depending on the decoding strategy. The example below illustrates this fact.

The triple redundancy code

Send each bit three times! For decoding, find the closest valid codeword!
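For this repetition code, finding the closest valid codeword amounts to a majority vote over the three received copies. A minimal sketch (function names are ours):

```python
# Minimum distance decoding for the triple redundancy code {000, 111}
# is simply a majority vote over the three received bits.

def encode(bit: str) -> str:
    return bit * 3

def decode_min_distance(received: str) -> str:
    return "1" if received.count("1") >= 2 else "0"

print(decode_min_distance("101"))  # '1' -- one flip of 111 is corrected
print(decode_min_distance("010"))  # '0' -- two flips of 111 are "corrected" the wrong way
```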

Decoding and errors

Decoding strategy

As we saw with the triple redundancy code, when an invalid word is received, we have a couple of choices: decode to the closest valid codeword (minimum distance decoding), or simply declare that an error has occurred.

The first method allows correcting errors, but it increases the probability that an error is “corrected” in the wrong way. For example, suppose in the triple redundancy code we receive 010 when 111 was sent. Minimum distance decoding will output 000, which is incorrect.

The graphs below show the possibilities when different strategies are used for different received words:

Which strategy is more appropriate for communications and which one for storage?


Minimum distance and errors

The minimum distance of a code is the smallest distance between codewords. The minimum distances of the even-parity code, the double redundancy code, and the triple redundancy code are 2, 2, and 3, respectively.

To detect $s$ errors, what should the minimum distance be?

To correct $t$ errors, what should the minimum distance be?

Fundamental Tenet VI:

A Hamming distance (between codewords) of $s+1$ is needed to detect $s$ errors per word, and a Hamming distance of $2t+1$ is needed to correct $t$ errors per word.
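These claims can be checked by brute force. The sketch below (our own helper functions) computes the minimum distance $d$ of each code seen so far and the resulting guarantees from Tenet VI: $s = d - 1$ detectable errors, or $t = \lfloor (d-1)/2 \rfloor$ correctable errors.

```python
from itertools import combinations

def hamming(u: str, v: str) -> int:
    return sum(a != b for a, b in zip(u, v))

def min_distance(code) -> int:
    return min(hamming(u, v) for u, v in combinations(code, 2))

codes = {
    "double redundancy": ["00", "11"],
    "even parity": ["000", "011", "101", "110"],
    "triple redundancy": ["000", "111"],
}
for name, code in codes.items():
    d = min_distance(code)
    print(f"{name}: d = {d}, detects {d - 1}, corrects {(d - 1) // 2}")
# double redundancy: d = 2, detects 1, corrects 0
# even parity: d = 2, detects 1, corrects 0
# triple redundancy: d = 3, detects 2, corrects 1
```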

Erasure errors

So far we have mostly discussed bit-flip errors, where it is not known which bits may be erroneous. But sometimes we know which bits are unreliable or inaccessible: 01010001 → 010?0?01

Fundamental Tenet VII:

A Hamming distance (between codewords) of $u+1$ is needed to correct $u$ erasure errors.


Let $c_1=10111000$ and $c_2=11011011$ be codewords. For each $z_i$ below, determine whether it may result from erasing symbols from $c_1$ or $c_2$.


A simple strategy can be used to recover data when a drive fails: we store the bits of each codeword on different drives. An example using the even-parity code is shown below.

Disk 1 $x_i$ 01011101111100100001…
Disk 2 $y_i$ 01111110010001110010…
Disk 3 $z_i = x_i+y_i$ 00100011101101010011…
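The recovery step can be seen in a short Python sketch using the bit streams from the table (the variable names are ours): since $z_i = x_i \oplus y_i$, any one stream is the XOR of the other two, so the contents of a failed drive can be rebuilt from the remaining drives.

```python
disk1 = "01011101111100100001"  # x_i
disk2 = "01111110010001110010"  # y_i

# The parity drive stores z_i = x_i XOR y_i.
disk3 = "".join(str(int(a) ^ int(b)) for a, b in zip(disk1, disk2))
print(disk3)  # 00100011101101010011 -- matches the table above

# Suppose disk 2 fails: its contents are recovered as x_i XOR z_i.
recovered = "".join(str(int(a) ^ int(c)) for a, c in zip(disk1, disk3))
assert recovered == disk2
```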

The codes we have seen so far

Let us review all the codes that we have seen so far. Before doing so, let us also define the code rate: the number of message bits divided by the code length (the number of transmitted bits).

Example: the 3-bit even-parity code has 2 message bits and length 3, so its rate is 2/3.

Typically, when the code rate is high, the ability to correct errors is low. (But there are also poorly designed codes that have both low rate and low error-correction ability.)

Code Min distance Rate Error detection Error correction
Double redundancy 2 1/2 1 0
Even parity 2 2/3 1 0
Triple redundancy (declare error for invalid words) 3 1/3 2 0
Triple redundancy (minimum distance decoding) 3 1/3 1 1


These codes are not very impressive ... 😒. We'll learn how to do a better job!

Probability of decoding failure

Let the transmitted codeword be denoted by $x$ and the result of decoding be denoted by $y$. Three cases are possible: correct decoding, where $y = x$; a detected error, where the decoder announces that an error has occurred; and an undetected error, where $y \neq x$ but no error is announced.

When there is no coding, all errors will be undetected and the second event above does not occur.

We should also note that given that things can go wrong in multiple ways, the term ‘probability of error’ is ambiguous. It may refer to the raw bit error rate, i.e., the probability of each bit being flipped by the channel. It may also refer to the probability of the ‘undetected error’ event described above.

From a probabilistic point of view, the error-correcting capability of a code is determined by the probabilities of the three events above. We want the probability of correct decoding to be close to 1.

The double redundancy code

Let the bit error probability be denoted by $e$, and suppose bit errors are symmetric and independent of each other (the BSC model). Let’s analyze the error probabilities for the double redundancy code. We consider the number of errors and then determine the decoding event they correspond to. The binomial distribution will be of use here.

So the probabilities are: correct decoding with probability $(1-e)^2$ (no bit errors), a detected error with probability $2e(1-e)$ (exactly one bit error), and an undetected error with probability $e^2$ (both bits flipped, which produces the other codeword).

How does this compare to no coding?

With no coding, the probability of an undetected error is $e$, which is (much) larger than $e^2$. For example, if $e=0.1$, then $e^2 = 0.01$.
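These probabilities are easy to evaluate and to sanity-check by simulation; the short sketch below (our own, under the BSC model described above) does both for $e = 0.1$.

```python
import random

e = 0.1  # bit error probability

p_correct    = (1 - e) ** 2     # neither bit flipped
p_detected   = 2 * e * (1 - e)  # exactly one bit flipped
p_undetected = e ** 2           # both bits flipped -> the other codeword is received

print(p_correct, p_detected, p_undetected)  # ~0.81, ~0.18, ~0.01 (they sum to 1)

# Monte Carlo check of the undetected-error probability.
random.seed(0)
trials = 200_000
undetected = sum((random.random() < e) and (random.random() < e) for _ in range(trials))
print(undetected / trials)  # close to e**2 = 0.01
```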

Suppose we need to send a message with 10 bits. Assume the channel error probability is 0.1. What is the probability of at least one undetected error if we (a) use no coding, or (b) use the double redundancy code?


The cost of the transmission (time/energy) is doubled but undetected errors become much more unlikely.

The triple redundancy code

Consider the triple-redundancy code with minimum distance decoding: decoding is correct when at most one of the three copies is flipped, which happens with probability $(1-e)^3 + 3e(1-e)^2$, and it produces the wrong codeword when two or three copies are flipped, which happens with probability $3e^2(1-e) + e^3$.

So the probability that triple redundancy fails to produce the transmitted sequence is

\[3e^2 - 2e^3\]

How does this compare to double redundancy and to no coding? For the double-redundancy code, the probability of an undetected error is $e^2$, and with no coding, the probability of error is $e$.

We can use the same scheme to detect two-bit errors if we don’t bother correcting. With the decoding rule that rejects any sequence other than 000 or 111, the probability of an undetected error for the triple-redundancy code is $e^3$.
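A quick numerical check of these expressions for $e = 0.1$ (a sketch of our own):

```python
e = 0.1

# Minimum distance decoding fails when 2 or 3 of the three copies are flipped.
p_fail_md = 3 * e**2 * (1 - e) + e**3
print(p_fail_md)            # ~0.028
print(3 * e**2 - 2 * e**3)  # ~0.028 -- the same expression, simplified
print(1 - p_fail_md)        # ~0.972 -- per-bit success probability (the ~0.97 below)

# Detection-only decoding: an error goes unnoticed only if all three copies flip.
print(e**3)                 # 0.001
```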

A 10-bit message is to be sent over a channel with bit error probability $e=0.1$. What is the probability of at least one undetected error if we use a triple redundancy code with minimum distance decoding?


In the previous example, if we send $n$ bits, the probability of decoding all of them correctly is approximately $0.97^n$. Observe that as the message size grows or the error probability increases, more redundancy may be needed to cope with the additional errors.

\[0.97^n \to 0 \text{ as } n\to \infty\]

Better codes

The triple redundancy code can only guarantee correcting 1 error. What if we need to correct more errors? Can we do better?

Yes! We need to find codes in which words are farther apart.

But then the codes will have to be longer, making them more difficult to handle with the tools that we have. For example, the number of nodes in the graph becomes very large.

Need more powerful mathematics:

You may be familiar with linear algebra over real numbers. Here we will deal with linear algebra over finite fields.

Finite fields

Recall that a field is a set in which we can do elementary computations: addition, subtraction, multiplication, and division (by nonzero elements).

We have seen examples of infinite fields:

For error correction, we deal with finite fields.

The simplest example of a finite field is $\mathbb F_p$, where $p$ is a prime number: its elements are $0, 1, \dots, p-1$, and addition, subtraction, and multiplication are performed modulo $p$.

Fields with $p^m$ elements, where $p$ is a prime number and $m$ is a natural number, also exist. But no other finite fields exist. For example, there is no field of size 6.
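To get a feel for $\mathbb F_p$, here is a small Python sketch of arithmetic modulo $p = 7$ (our own example; the inverse is computed using Fermat's little theorem).

```python
p = 7  # a prime, so {0, 1, ..., 6} with arithmetic mod p is the field F_7
a, b = 3, 5

print((a + b) % p)  # 1, since 3 + 5 = 8 = 7 + 1
print((a - b) % p)  # 5, since 3 - 5 = -2 = -7 + 5
print((a * b) % p)  # 1, since 3 * 5 = 15 = 2*7 + 1

# Division is multiplication by the inverse; for nonzero b, the inverse is
# b^(p-2) mod p (Fermat's little theorem).
inv_b = pow(b, p - 2, p)
print(inv_b)            # 3, since 5 * 3 = 15 = 1 (mod 7)
print((a * inv_b) % p)  # 2, i.e., 3 / 5 = 2 in F_7 (check: 2 * 5 = 10 = 3 mod 7)
```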

Finite fields and error correction

Computers work with binary so the natural field is \(\mathbb{F}_2\). But working with \(\mathbb{F}_2\) is somewhat restrictive, given that it has only two elements. This makes designing binary codes directly difficult. Fortunately, fields of size $2^m$ exist for any $m$. These are used in practice to design codes and then map them to binary. Unfortunately, they are more complex than prime fields like \(\mathbb{F}_2, \mathbb{F}_3, \cdots\), so we’ll avoid them. To get an idea of how \(\mathbb{F}_{2^m}\) can be used for ECC, we’ll occasionally see examples with \(\mathbb{F}_p, p>2\), with the understanding that these are not common in practice.

Parity checks

Message Codeword
00 000
01 011
10 101
11 110

Recall the 3-bit even parity code: {000,011,101,110}

We can say that codewords $x_1 x_2 x_3$ satisfy:

\[x_1 \oplus x_2 \oplus \color{blue}{x_3} = 0 \quad \text{or equivalently} \quad \color{blue}{x_3} = x_1 \oplus x_2\]

The first two bits are message bits and the third bit is a redundant bit, called the parity check bit, since it ensures a specific parity for the whole sequence. This code has minimum distance 2, so it can detect 1 error.
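Instead of a lookup table, the encoder can compute the parity bit directly from the equation above; a minimal sketch (function names are ours):

```python
def encode(msg: str) -> str:
    """Append the parity bit x3 = x1 XOR x2 to a 2-bit message."""
    x1, x2 = int(msg[0]), int(msg[1])
    return f"{x1}{x2}{x1 ^ x2}"

def is_codeword(word: str) -> bool:
    """A word is a codeword iff x1 XOR x2 XOR x3 = 0."""
    return int(word[0]) ^ int(word[1]) ^ int(word[2]) == 0

print([encode(m) for m in ["00", "01", "10", "11"]])  # ['000', '011', '101', '110']
print(is_codeword("110"), is_codeword("010"))         # True False
```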

Message Codeword
000 000000
100 100110
010 010___
001 001___
110 110___
101 101___
011 011___
111 111___

We can have more than one parity bit to make codes more reliable:

This is essentially a system of equations and, as we will see, that’s where linear algebra will come in.



How many errors can this code correct? What is the rate of the code?


The relationships enforced by the parity equations ensure that the codewords are at a certain distance from each other!

ISBN-10



$\mathbb F_{11}$ Multiplication table

An example of a code over a finite field is ISBN-10, which is defined over $\mathbb F_{11}$.

An ISBN is a 10-digit string $\color{green}{x_{10} x_9 x_8 \cdots}\color{blue}{x_1}$ where $x_1$ is chosen such that, in $\mathbb F_{11}$,

\[\sum_{i=1}^{10} i x_i = 0.\]
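The check can be carried out with a few lines of Python (a sketch; the function name is ours). Note that the digit value 10, which can occur for the check digit $x_1$, is written as 'X' in printed ISBNs, and that reading the string from left to right the weights run from 10 down to 1.

```python
def isbn10_ok(isbn: str) -> bool:
    """Check sum_{i=1}^{10} i * x_i = 0 in F_11 for a 10-digit ISBN."""
    digits = [10 if ch == "X" else int(ch) for ch in isbn.replace("-", "")]
    # Left to right the string is x_10 x_9 ... x_1, so the weights are 10, 9, ..., 1.
    return sum(w * d for w, d in zip(range(10, 0, -1), digits)) % 11 == 0

print(isbn10_ok("0-306-40615-2"))  # True:  the weighted sum is 132 = 12 * 11
print(isbn10_ok("0-306-40615-3"))  # False: the altered check digit breaks the equation
```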


A smudged ISBN reads as 0-471-2?195-4. What is the missing digit? What is the book?


Parity bits and redundancy

With parity bits, we are adding redundant information.

More redundancy $\to$ lower data rate, reduced transmission efficiency

A code is said to approach capacity if its rate is close to the capacity of the channel $C$ and the probability of uncorrected errors can be made as small as we want.

More sophisticated codes can achieve better efficiency with the same error-correction performance (next time: block codes / Hamming codes).

Linear algebra is the key to formulating these block codes and showing how many errors they can detect or correct. To do that, we need to extend what we’ve learned about vectors and matrices to those involving binary variables.