Reasoning about Neural Network Training with Bias-Variance Tradeoff

Blurb published in November 2025

Bias-Variance Decomposition

First, we start with data generated by some true function plus noise: \(y = f(x) + \epsilon\). The noise \(\epsilon\) has mean zero and variance \(\sigma^2\). We train a model \(\hat{f}(x)\) on a dataset \(D\), and the model depends on which dataset we happened to sample. Notice that the error of a single prediction, \((y - \hat{f}(x))^2\), involves two sources of randomness: the noise in the data and the particular training set we used (taking the expectation of this squared error gives the MSE). Let's expand what we're trying to predict: $$(y - \hat{f}(x))^2 = (f(x) + \epsilon - \hat{f}(x))^2$$

Taking the expectation over this, we want to understand \(\mathbb{E}[(f(x) + \epsilon - \hat{f}(x))^2]\). Here, let's use a trick. We add and subtract the average prediction \(\bar{f}(x) = \mathbb{E}_D[\hat{f}(x)]\). Starting with: $$f(x) + \epsilon - \hat{f}(x)$$ We add zero in the form of \(+\bar{f}(x) - \bar{f}(x)\): $$f(x) + \epsilon - \hat{f}(x) + (\bar{f}(x) - \bar{f}(x))$$ Rearranging the terms: $$(f(x) - \bar{f}(x)) + \epsilon + (\bar{f}(x) - \hat{f}(x))$$ So when we square this: $$(f(x) + \epsilon - \hat{f}(x))^2 = ((f(x) - \bar{f}(x)) + \epsilon + (\bar{f}(x) - \hat{f}(x)))^2$$

When we expand this square and take expectations, the cross terms involving \(\epsilon\) vanish because \(\mathbb{E}[\epsilon] = 0\). The cross term between bias and variance also vanishes because \(\mathbb{E}[\bar{f}(x) - \hat{f}(x)] = 0\) by definition. What remains are three terms: $$(f(x) - \bar{f}(x))^2 + \mathbb{E}[\epsilon^2] + \mathbb{E}[(\hat{f}(x) - \bar{f}(x))^2]$$
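For completeness, write \(b = f(x) - \bar{f}(x)\) (a fixed quantity for a given \(x\)) and \(v = \bar{f}(x) - \hat{f}(x)\). The expansion is: $$\mathbb{E}\left[(b + \epsilon + v)^2\right] = b^2 + \mathbb{E}[\epsilon^2] + \mathbb{E}[v^2] + 2b\,\mathbb{E}[\epsilon] + 2b\,\mathbb{E}[v] + 2\,\mathbb{E}[\epsilon v]$$ The last three terms are zero: \(\mathbb{E}[\epsilon] = 0\) by assumption, \(\mathbb{E}[v] = 0\) by the definition of \(\bar{f}(x)\), and \(\mathbb{E}[\epsilon v] = \mathbb{E}[\epsilon]\,\mathbb{E}[v] = 0\) because the test noise is independent of the training set.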

Now \(\mathbb{E}[\epsilon^2] = \sigma^2\) because the noise has mean zero. When the mean is zero, the variance equals the expected squared value: \(\text{Var}(\epsilon) = \mathbb{E}[\epsilon^2] - (\mathbb{E}[\epsilon])^2 = \mathbb{E}[\epsilon^2] - 0 = \mathbb{E}[\epsilon^2]\). So we get: $$\mathbb{E}[(y - \hat{f}(x))^2] = \underbrace{(f(x) - \bar{f}(x))^2}_{\text{Bias}^2} + \underbrace{\mathbb{E}[(\hat{f}(x) - \bar{f}(x))^2]}_{\text{Variance}} + \underbrace{\sigma^2}_{\text{Noise}}$$

This is a useful result. The expected prediction error decomposes into three sources: bias (systematic mistakes our model class tends to make), variance (sensitivity to which particular training set we use), and irreducible noise. To arrive at this decomposition, we imagine training our model on many different datasets drawn from the same distribution and averaging the squared errors.
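To make this concrete, here is a minimal simulation sketch (the true function, noise level, and polynomial model class are all assumptions chosen for illustration): it trains a degree-3 polynomial on many freshly drawn datasets and checks that bias² + variance + noise matches the empirical MSE at a single test point.

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return np.sin(2 * np.pi * x)           # assumed true function

sigma = 0.3                                 # assumed noise standard deviation
x0 = 0.35                                   # test point
n, trials, degree = 30, 2000, 3

preds = np.empty(trials)
for t in range(trials):
    x = rng.uniform(0, 1, n)
    y = f(x) + rng.normal(0, sigma, n)      # a fresh dataset D each trial
    coefs = np.polyfit(x, y, degree)        # train \hat{f} on D
    preds[t] = np.polyval(coefs, x0)        # predict at x0

bias_sq = (f(x0) - preds.mean()) ** 2       # (f(x0) - \bar{f}(x0))^2
variance = preds.var()                      # E_D[(\hat{f}(x0) - \bar{f}(x0))^2]
noise = sigma ** 2                          # irreducible sigma^2

y0 = f(x0) + rng.normal(0, sigma, trials)   # fresh noisy targets at x0
mse = ((y0 - preds) ** 2).mean()            # empirical E[(y - \hat{f}(x0))^2]

print(f"bias^2 + variance + noise = {bias_sq + variance + noise:.4f}")
print(f"empirical MSE             = {mse:.4f}")  # should closely match
```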

Can Bias-Variance Tradeoff Help Us Reason about NN Training?

To a certain extent, yes. But modern deep learning often violates the assumptions of classical bias-variance theory. Neural networks can simultaneously have low bias and low variance in overparameterized regimes, something classical theory would deem impossible.

Nonetheless, the framework provides useful intuition for debugging training problems. Several heuristic diagrams attempt to visualize the tradeoff, but this one is my favorite:

This diagram avoids falsely prescribing an "optimal" point. Instead, it helps identify which regime we occupy and what interventions might help.

Regime 1 represents high variance. Training error sits well below our acceptable threshold \(\epsilon\), but test error remains unacceptably high. The model memorizes the training data without generalizing. To escape this regime, we can add more training data, reduce model complexity, or apply variance reduction techniques like bagging or dropout.

Regime 2 represents high bias. Even training error fails to reach acceptable levels. The model lacks capacity to capture the underlying patterns. Escaping requires increasing model complexity, adding features, or reducing regularization strength.
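As a toy illustration, here is a hypothetical diagnostic in Python (the function name and threshold logic are assumptions, not from any library) that maps train/test error against the acceptable threshold \(\epsilon\) onto the two regimes:

```python
def diagnose_regime(train_err, test_err, eps):
    """Map train/test error against an acceptable threshold eps to a regime."""
    if train_err > eps:
        return "Regime 2 (high bias): increase capacity, add features, reduce regularization"
    if test_err > eps:
        return "Regime 1 (high variance): more data, simpler model, bagging/dropout"
    return "Both errors below eps: no intervention needed"

print(diagnose_regime(train_err=0.02, test_err=0.31, eps=0.10))   # -> Regime 1
```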

Regularization

The regime framework helps explain why certain regularization techniques work. We examine dropout as a variance reduction method.

During training, dropout randomly zeros out each activation with probability \(p\). The standard formulation is: $$\tilde{h}_i = \begin{cases} h_i & \text{with probability } 1-p \\ 0 & \text{with probability } p \end{cases}$$ At test time, we must scale by \((1-p)\) to maintain expected values since we're using all neurons.

In practice, most implementations use inverted dropout, which scales during training instead: $$\tilde{h}_i = \begin{cases} \frac{h_i}{1-p} & \text{with probability } 1-p \\ 0 & \text{with probability } p \end{cases}$$ This maintains the expected value during training (\(\mathbb{E}[\tilde{h}_i] = h_i\)), so no adjustment is needed at test time.
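A minimal NumPy sketch of inverted dropout (the function name and interface are just for illustration):

```python
import numpy as np

def inverted_dropout(h, p, training, rng=np.random.default_rng()):
    """Apply inverted dropout to an array of activations h with drop probability p."""
    if not training or p == 0.0:
        return h                         # test time: use all units, no rescaling needed
    mask = rng.random(h.shape) >= p      # keep each unit with probability 1 - p
    return (h * mask) / (1.0 - p)        # rescale so that E[output] equals h

h = np.ones(8)
print(inverted_dropout(h, p=0.5, training=True))   # surviving units are scaled to 2.0
print(inverted_dropout(h, p=0.5, training=False))  # unchanged at test time
```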

To understand why this reduces variance, consider bagging. Bagging trains K models on different bootstrapped datasets and averages their predictions. Let's assume each model \(f_k(x)\) has the same variance \(\sigma^2\) (a generic per-model variance, not the label noise from before; each model varies around its own mean prediction, not necessarily around zero). The key assumption is that the models are independent.

For the average of K independent models: $$\text{Average} = \frac{1}{K}(f_1(x) + f_2(x) + ... + f_K(x)) = \frac{1}{K}\sum_{k=1}^K f_k(x)$$

Now let's find the variance of this average. First, we pull out the constant 1/K: $$\text{Var}\left[\frac{1}{K}\sum_{k=1}^K f_k(x)\right] = \left(\frac{1}{K}\right)^2 \cdot \text{Var}\left[\sum_{k=1}^K f_k(x)\right]$$ This uses the rule Var(cX) = c²Var(X).

Next, because the models are independent, the variance of the sum equals the sum of variances: $$\text{Var}\left[\sum_{k=1}^K f_k(x)\right] = \text{Var}[f_1(x)] + \text{Var}[f_2(x)] + ... + \text{Var}[f_K(x)] = K\sigma^2$$

Putting it together: $$\text{Var}[\text{Average}] = \frac{1}{K^2} \cdot K\sigma^2 = \frac{K\sigma^2}{K^2} = \frac{\sigma^2}{K}$$

So averaging K models reduces the variance by a factor of K, provided the models are roughly independent; in practice bagged models are correlated, so the reduction is smaller, but the direction still holds.
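A quick numerical check of the \(\sigma^2/K\) result, using independent Gaussian "predictions" as a stand-in for the models (an assumption, since real bagged models are correlated):

```python
import numpy as np

rng = np.random.default_rng(1)
K, trials = 10, 100_000
# each column is one "model": an independent prediction with variance sigma^2 = 1
models = rng.normal(loc=0.0, scale=1.0, size=(trials, K))
averaged = models.mean(axis=1)             # the bagged prediction

print(models[:, 0].var())   # ~1.0, a single model's variance
print(averaged.var())       # ~0.1, i.e. sigma^2 / K
```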

Dropout lacks explicit data resampling. Instead, it creates implicit model averaging by randomly masking neurons. Each forward pass uses a different subnetwork. With \(n\) neurons and dropout probability \(p\), we sample from \(2^n\) possible subnetworks. At test time, using all neurons with scaling by \((1-p)\) (or inverted dropout) approximates averaging over this exponentially large ensemble.

Monte Carlo dropout extends this idea by keeping dropout active at test time. Instead of using all neurons, we run the same input through the network multiple times with different dropout masks. Each run gives a slightly different prediction. If these predictions vary wildly, it suggests the model hasn't learned robust features (different subnetworks disagree about the answer). If predictions are consistent, the model has learned something stable.
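A sketch of Monte Carlo dropout, assuming a PyTorch model containing nn.Dropout layers (the toy architecture, number of passes T, and input sizes are all illustrative assumptions):

```python
import torch
import torch.nn as nn

# toy model; any network containing nn.Dropout layers works the same way
model = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Dropout(p=0.5), nn.Linear(64, 1))

@torch.no_grad()
def mc_dropout_predict(model: nn.Module, x: torch.Tensor, T: int = 50):
    model.eval()                            # freeze everything else (e.g. batch norm)
    for m in model.modules():
        if isinstance(m, nn.Dropout):
            m.train()                       # ...but keep dropout active at test time
    preds = torch.stack([model(x) for _ in range(T)])   # shape (T, batch, out)
    return preds.mean(dim=0), preds.var(dim=0)          # predictive mean and spread

x = torch.randn(4, 16)
mean, var = mc_dropout_predict(model, x)
print(mean.shape, var.shape)   # high var => the subnetworks disagree
```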

What does this variance actually measure? Each dropout mask tests whether the model still works when certain neurons are removed. High variance means the model relies heavily on specific neurons and lacks redundancy. With more data, different subnetworks would likely converge to similar solutions, reducing this variance. So Monte Carlo dropout roughly estimates how much the model might improve with more training data, though this interpretation has limitations.

In some sense, this measure of epistemic uncertainty can be far more informative than naively looking at log probs. A single pass's log probability measures the model's confidence, but a model can be confidently wrong (e.g., when given an out-of-distribution sample). The MC dropout variance, in contrast, measures the stability of that belief. High variance signals that the model's belief is fragile and not trustworthy, even if the average log prob is high, providing a check against misleading overconfidence.

Linear Probing?

Here's a concrete example from the modern LLM literature of how bias-variance reasoning remains useful. Consider linear probes trained on LLM activations to detect lying. These probes often achieve near-perfect training accuracy but poor test performance. How can a "simple" linear model overfit?

Consider that LLM activations typically have thousands of dimensions. With \(d\) features and \(n\) training samples, when \(d \gg n\) (which is the case in much of the activation-probing literature), the model becomes extremely flexible; a model's "simplicity" isn't just about its functional form but about the relationship between its degrees of freedom and the amount of data. In high-dimensional space, random points are almost always linearly separable. The probe finds some hyperplane that perfectly splits the training data, but this hyperplane can exploit spurious correlations rather than meaningful structure. Despite being functionally simple (just a linear boundary), the model has too many degrees of freedom. That is, it's in Regime 1, suffering from high variance. Here, strong regularization may be needed, even for seemingly "simple" models.
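A small sketch of this failure mode (the sizes and the scikit-learn probe are assumptions for illustration): with \(d \gg n\) and labels that are pure noise, a linear probe still fits the training set perfectly while generalizing at chance.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n, d = 100, 4096                      # far more features than samples
X_train = rng.normal(size=(n, d))
X_test = rng.normal(size=(n, d))
y_train = rng.integers(0, 2, n)       # labels are pure noise: nothing to learn
y_test = rng.integers(0, 2, n)

probe = LogisticRegression(C=100.0, max_iter=1000).fit(X_train, y_train)
print(probe.score(X_train, y_train))  # typically 1.0: the noise is perfectly separated
print(probe.score(X_test, y_test))    # ~0.5: chance-level generalization
```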

Suggested Further Readings