I often find myself looking for organizing principles that explain why things work. Recently, I've started to realize that equivariance and invariance explain a lot of what we do in deep learning.
Convolution and pooling
Consider a classifier that detects cats in images. The network should output the same classification whether the cat appears on the left or right side of the image. Translation should not affect the decision.
Convolution slides a small filter across an image. Here's a simple 3x3 image with a 2x2 filter: $$ \text{Image} = \begin{bmatrix} 1 & 1 & 0 \\ 1 & 1 & 0 \\ 0 & 0 & 0 \end{bmatrix}, \quad \text{Filter} = \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} $$
At position (0,0), we place the filter over the top-left corner: $$ \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} \odot \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} = 1 + 1 + 1 + 1 = 4 $$ At position (0,1), we shift the filter one position to the right: $$ \begin{bmatrix} 1 & 0 \\ 1 & 0 \end{bmatrix} \odot \begin{bmatrix} 1 & 1 \\ 1 & 1 \end{bmatrix} = 1 + 0 + 1 + 0 = 2 $$ Continuing for all positions gives: $$ \text{Output} = \begin{bmatrix} 4 & 2 \\ 2 & 1 \end{bmatrix} $$
Now shift the entire image one position to the right: $$ \text{Shifted Image} = \begin{bmatrix} 0 & 1 & 1 \\ 0 & 1 & 1 \\ 0 & 0 & 0 \end{bmatrix} $$ Applying the same convolution: $$ \text{Shifted Output} = \begin{bmatrix} 2 & 4 \\ 1 & 2 \end{bmatrix} $$ The values in the output shifted right, just like the input. This is equivariance.
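To make this concrete, here's a minimal NumPy sketch of the sliding-window computation above (like most deep learning libraries, it doesn't flip the filter, so it's technically cross-correlation; with this symmetric filter the distinction doesn't matter). The helper name `conv2d_valid` is just for illustration.

```python
import numpy as np

def conv2d_valid(image, kernel):
    """Slide the kernel across the image with no padding ('valid' mode)."""
    ih, iw = image.shape
    kh, kw = kernel.shape
    out = np.zeros((ih - kh + 1, iw - kw + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            # Elementwise product of the filter with the patch under it, then sum.
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

image = np.array([[1, 1, 0],
                  [1, 1, 0],
                  [0, 0, 0]])
kernel = np.ones((2, 2))

print(conv2d_valid(image, kernel))
# [[4. 2.]
#  [2. 1.]]

shifted = np.roll(image, 1, axis=1)  # shift every row one pixel to the right
print(conv2d_valid(shifted, kernel))
# [[2. 4.]
#  [1. 2.]]
```

The second output is the first one shifted right, which is exactly the equivariance described above.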
Pooling creates invariance. Max pooling with a 2x2 window on a 4x4 feature map: $$ \begin{bmatrix} 8 & 7 & 5 & 3 \\ 7 & 8 & 6 & 4 \\ 5 & 6 & 8 & 7 \\ 3 & 4 & 7 & 8 \end{bmatrix} \rightarrow \begin{bmatrix} 8 & 6 \\ 6 & 8 \end{bmatrix} $$ Small shifts within each 2x2 region leave the maximum unchanged. This creates invariance to small translations.
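Here is a similar sketch for 2x2 max pooling. The `jittered` example is my own illustration: it moves values around inside two of the windows, and as long as each window's maximum stays inside its window, the pooled output is unchanged.

```python
import numpy as np

def max_pool_2x2(x):
    """Non-overlapping 2x2 max pooling on a 2D array with even height and width."""
    h, w = x.shape
    return x.reshape(h // 2, 2, w // 2, 2).max(axis=(1, 3))

feature_map = np.array([[8, 7, 5, 3],
                        [7, 8, 6, 4],
                        [5, 6, 8, 7],
                        [3, 4, 7, 8]])

pooled = max_pool_2x2(feature_map)
print(pooled)
# [[8 6]
#  [6 8]]

# Move activations around inside the top-left and bottom-right windows.
jittered = feature_map.copy()
jittered[:2, :2] = np.rot90(feature_map[:2, :2])
jittered[2:, 2:] = np.rot90(feature_map[2:, 2:])

# The maxima never left their windows, so the pooled output is identical.
assert np.array_equal(max_pool_2x2(jittered), pooled)
```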
Learning is about discarding nuisances
Now, let's generalize this idea beyond convolution and pooling. In supervised learning, we want to find a function \( f \) such that \( y = f(x) \). Nature gives us data \( x \) (images, sentences, audio) and targets \( y \) (labels, translations, transcriptions). The challenge is that we cannot train on all possible examples that nature could produce. Instead, we draw a finite set of noisy training examples and hope our model generalizes.
The input \( x \) often contains variations that don't matter for predicting \( y \). We can call these nuisances. A cat shifted three pixels to the right is still a cat. A sentence with a small typo still conveys the same meaning. We want our models to be invariant to these irrelevant transformations.
For images, spatial translation is a common nuisance, and we can use a network-side solution: we build translational equivariance into the architecture through convolution (to be precise, convolution was introduced for parameter efficiency, but translational equivariance emerges naturally as a side effect). For language, literal translation between languages might preserve meaning, and we can use a data-side solution: training on parallel corpora.
But many transformations, unlike translation in images, are harder to encode with clean mathematical operations. For example, we might want language models to be invariant to small typos, or want alignment finetuning to generalize across different phrasings of the same request. For these cases, we use data augmentation to explicitly show the model examples of these transformations during training. We essentially force the model to generalize to a larger test set.
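As a sketch of what the typo case might look like (the function and the swap-adjacent-characters scheme are illustrative, not a standard recipe), augmentation just pairs a perturbed input with the unchanged target:

```python
import random

def add_typos(text, rate=0.05):
    """Randomly swap adjacent characters to simulate typos."""
    chars = list(text)
    i = 0
    while i < len(chars) - 1:
        if random.random() < rate:
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            i += 2  # skip past the swapped pair
        else:
            i += 1
    return "".join(chars)

# The perturbed input keeps the original label, which is what tells the model
# that this transformation should not change its output.
clean = ("the cat sat on the mat", "cat")      # (input, label), made-up example
augmented = (add_typos(clean[0]), clean[1])
```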
In a sense, compression in deep learning is the natural consequence of the network optimizing for invariance. When a model becomes invariant to irrelevant transformations, it can collapse multiple configurations of the same underlying phenomenon into a single representation. Equivariance still remains desirable earlier in the hierarchy, where we want transformations in input space to map predictably onto representation space. But deeper layers can optimize for invariance, yielding compressed, semantically meaningful representations (epistemic status: intuition). Often, these representations occupy far fewer dimensions than the network's capacity provides, a redundancy that limits efficiency but also enables cool techniques like activation steering.
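As a rough illustration of that last point (a hypothetical NumPy sketch with made-up shapes, not a description of any specific method): one simple form of activation steering takes the mean difference between activations collected under two contrasting conditions and adds that direction to hidden states at inference time, which only works because the relevant concept occupies a low-dimensional subspace of the activations.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden_dim = 512

# Hypothetical activations collected at some layer for two contrasting sets of prompts.
acts_a = rng.standard_normal((100, hidden_dim))
acts_b = rng.standard_normal((100, hidden_dim))

# A single direction summarizing the difference between the two conditions.
steering_vector = acts_a.mean(axis=0) - acts_b.mean(axis=0)

def steer(hidden_state, alpha=1.0):
    """Nudge a hidden state toward condition A along the mean-difference direction."""
    return hidden_state + alpha * steering_vector
```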