Error-correcting codes


tl;dr: Too many FRI1 conjectures that need to be understood, so here we are…

$ \def\a{\vec{a}} \def\b{\vec{b}} \def\L{\mathcal{L}} \def\m{\vec{m}} % \def\dist#1{\Delta\left(#1\right)} \def\RS{\mathsf{RS}} \def\CRS{\mathsf{CRS}} \def\alphabet{\Sigma} $

Notation

  • We use $\a \bydef (a_1, \ldots, a_n)$ to denote vectors, typically codewords, as you’ll learn later.

Codes

An error-correcting code (a.k.a. a code) is a function that converts a message into a codeword with enough “redundancy” in it such that the original message can be recovered from a potentially-corrupted codeword.

Formally, we will restrict ourselves to (block?) codes of the form: \begin{align} C : \alphabet^k \rightarrow \alphabet^n \end{align} Here, $\alphabet$ denotes the alphabet made up of symbols.

So, a message is made up of $k$ symbols while a codeword is made up of $n \ge k$ symbols.

Note that we restrict ourselves to codes where messages and codewords share the same alphabet of symbols because this makes it natural to define a notion of rate.

We interchangeably use “code” (and $C$) to refer to either the encoding function above and its image $\text{Im}(C) \bydef \{C(m)\ \vert\ m \in \Sigma^k\}$.

As an example of a code, consider Reed-Solomon (RS) codes $C : \F^k \rightarrow \F^n$, where $\Sigma = \F$ is a finite field2: \begin{align} C(\m) = (f(1), f(2),\ldots,f(n)),\ \text{where}\ f(X) = \sum_{i=0}^{k-1} m_i X^i \end{align} We will likely study RS codes more carefully later but, for now, the key idea behind them is that a message in $\F^k$ can be viewed as a polynomial $f(X)$ of degree $\le k-1$. As a result, the message can be encoded in a redundant fashion by evaluating the $f(X)$ polynomial at $n>k$ (distinct) points.
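To make the encoding concrete, here is a toy sketch in Python. The parameters (the prime field $\F_{97}$, $k = 3$, $n = 7$) are illustrative choices, not from the text:

```python
# Toy Reed-Solomon encoder over F_97 with k = 3, n = 7 (assumed parameters).
P = 97      # field modulus (a prime)
K, N = 3, 7

def rs_encode(m, p=P, n=N):
    """Encode a length-k message m as (f(1), ..., f(n)), where
    f(X) = m_0 + m_1*X + ... + m_{k-1}*X^{k-1}, evaluated over F_p."""
    def f(x):
        return sum(c * pow(x, i, p) for i, c in enumerate(m)) % p
    return [f(x) for x in range(1, n + 1)]

# Message (m_0, m_1, m_2) = (5, 1, 2), i.e., f(X) = 5 + X + 2X^2:
codeword = rs_encode([5, 1, 2])  # → [8, 15, 26, 41, 60, 83, 13]
```

Note that the codeword has $n = 7$ symbols but the polynomial only has $k = 3$ coefficients: the extra $n - k$ evaluations are the redundancy.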

Clearly: Using Lagrange interpolation, we can recover the polynomial $f(X)$ from any set of $\ge k$ evaluations (i.e., any set of $\ge k$ symbols in the codeword). So, in this sense, Reed-Solomon codes can recover messages in the presence of lost codeword symbols, a.k.a. erasures.
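The erasure-recovery claim can be checked directly: the sketch below interpolates through any $k$ surviving evaluations and re-derives the rest. Again, the field $\F_{97}$ and $k = 3$ are assumed toy parameters:

```python
# Recover a Reed-Solomon codeword from any k = 3 of its symbols via
# Lagrange interpolation over F_97 (toy parameters, not from the text).
P, K = 97, 3

def lagrange_eval(points, x, p=P):
    """Evaluate at x the unique polynomial of degree < len(points)
    passing through the given (x_i, y_i) points, over F_p."""
    total = 0
    for i, (xi, yi) in enumerate(points):
        num, den = 1, 1
        for j, (xj, _) in enumerate(points):
            if i != j:
                num = num * (x - xj) % p
                den = den * (xi - xj) % p
        total = (total + yi * num * pow(den, -1, p)) % p
    return total

# Suppose only symbols at positions 2, 5, 7 of f(X) = 5 + X + 2X^2 survive.
# Any k = 3 evaluations determine f, so the full codeword comes back:
points = [(2, 15), (5, 60), (7, 13)]
recovered = [lagrange_eval(points, x) for x in range(1, 8)]
# → [8, 15, 26, 41, 60, 83, 13]
```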

Not so obvious: Reed-Solomon can recover messages even if codeword symbols are altered or corrupted rather than fully missing. In this sense, a corrupted symbol in the size-$n$ codeword is referred to as an error. We will explain later how the rate and the distance of a code dictate how many such errors can be corrected.

Beautifully, RS admits an algorithm3$^,$4 that can recover the original $f(X)$ as long as there are at most $\floor{\frac{n-k}{2}}$ errors in the size-$n$ codeword. (We’ll see why this is the best we can hope for when we talk about the Singleton bound.)
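This error-correction radius can be demonstrated without any clever algorithm: a brute-force decoder that returns the nearest codeword in Hamming distance already corrects $\floor{\frac{n-k}{2}}$ errors. (This is purely illustrative; it runs in time $|\F|^k$, whereas Berlekamp–Welch and Gao's algorithm are polynomial-time. The parameters over $\F_7$ below are assumed, not from the text.)

```python
from itertools import product

# Brute-force minimum-distance decoding of a toy RS code over F_7 with
# k = 2, n = 6, so up to floor((6-2)/2) = 2 errors are correctable.
P, K, N = 7, 2, 6

def encode(m):
    """RS-encode message m as (f(1), ..., f(n)) over F_P."""
    return [sum(c * pow(x, i, P) for i, c in enumerate(m)) % P
            for x in range(1, N + 1)]

def decode(received):
    """Return the message whose codeword is closest in Hamming distance,
    by exhaustively trying all P^K messages."""
    return min(product(range(P), repeat=K),
               key=lambda m: sum(a != b for a, b in zip(encode(m), received)))

msg = (3, 4)              # f(X) = 3 + 4X
cw = encode(msg)          # [0, 4, 1, 5, 2, 6]
cw[0] = (cw[0] + 1) % P   # corrupt 2 symbols -- exactly the decoding radius
cw[3] = (cw[3] + 1) % P
decoded = decode(cw)      # recovers (3, 4)
```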

Rate

The rate of a code $C : \alphabet^k \rightarrow \alphabet^n$ is: \begin{align} \rho\bydef k/n \in (0,1] \end{align}

We want the rate to be as close to 1 as possible, since this minimizes codeword size, keeping the overhead of encoding the message low while still guaranteeing that the message can be recovered in the presence of errors!

Distance

The Hamming distance $\dist{\a,\b}$ between two codewords $\a$ and $\b$ is the # of positions $i$ where $a_i\ne b_i$: \begin{align} \dist{\a,\b}\bydef \left|\{i\in[n]\ \vert\ a_i \ne b_i \}\right| \end{align}
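The definition translates into a one-line sketch:

```python
def hamming_distance(a, b):
    """Number of positions i where a_i != b_i (vectors of equal length)."""
    assert len(a) == len(b)
    return sum(x != y for x, y in zip(a, b))

hamming_distance([0, 4, 1, 5], [0, 4, 2, 6])  # → 2
```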

Continue. Relative distance. Hamming weight?

Glossary

Basics

  • code
  • message
  • codeword
  • symbols
  • [code] alphabet
  • message alphabet
  • rate
  • [] distance
    • (relative) Hamming distance between two codewords
    • distance $\delta$ of a codeword from the code
      • $f$ is $\delta$-far from $\RS[\F,\L,m]$
    • distance of a code
  • linear code
    • generator matrix
    • Reed-Solomon (RS)
      • $\RS[\F,\L, m]$
    • Reed-Muller
  • Singleton bound: min. dist. of an $[n,k]$ linear block code $C$ is $\le n - k + 1$ ($\Rightarrow$ can only correct up to $\floor{(n-k)/2}$ errors)

Advanced

  • subcode (e.g., $\CRS$ vs $\RS$)
  • unique decoding regime
  • list decoding
  • capacity bound
  • Johnson bound
  • proximity
  • $(\delta,\varepsilon)$-correlated agreement for $\RS[\F,\L, m]$
  • $(\delta, \varepsilon)$ mutual correlated agreement for $\RS[\F,\L, m]$
  • constrained Reed-Solomon codes $\CRS[\F,\L,m, \hat{w}, \sigma]$

References

For cited works, see below 👇👇

  1. Fast Reed-Solomon Interactive Oracle Proofs of Proximity, by Eli Ben-Sasson and Iddo Bentov and Yinon Horesh and Michael Riabzev, in 45th International Colloquium on Automata, Languages, and Programming (ICALP 2018), 2018, [URL] 

  2. Reed-Solomon works over all finite fields (a.k.a., any $\F_q$ where $q$ is a prime power). 

  3. e.g., Berlekamp–Welch for unique decoding, or Guruswami–Sudan for list-decoding beyond the unique decoding radius 

  4. A New Algorithm for Decoding Reed-Solomon Codes, by Gao, Shuhong, in Communications, Information and Network Security, 2003, [URL]