Cross-entropy can be derived in two ways. One perspective starts from information theory and the problem of measuring differences between probability distributions. The other, more predictive perspective starts from logistic regression and the problem of finding parameters that best explain the data. Both arrive at the same loss function.
Approaching from information theory
Route: Surprise -> Entropy -> Cross-Entropy

We start with this perspective because it is more intuitive. Let's first discuss surprise: (1) we are surprised when an unlikely event happens, but (2) we are not surprised when a likely event happens. A good model of reality should minimize our (average) surprise when we observe actual events. But how can we quantify surprise?
We start from unlikeliness. The most straightforward way of quantifying unlikeliness is probability \(p\): an unlikely event has small \(p\), a likely event has \(p\) close to 1. But notice how probabilities shrink through multiplication when multiple rare events occur together (if winning the lottery once has probability 1/10, winning it twice has probability 1/10 \(\times\) 1/10 = 1/100). Yet intuitively, surprise should add up: winning the lottery twice in a row should feel twice as surprising as winning once. Therefore we use \(\log p\), which converts multiplication into addition: $$ \log(p_1 \times p_2) = \log p_1 + \log p_2 $$ Since \(\log p\) is non-positive for probabilities (because \(0 < p \leq 1\)), we use \(-\log p\) to make surprise a non-negative quantity. So our measure of surprise for an event with probability \(p\) is: $$\text{Surprise} = -\log p$$ Now rare events yield high surprise, while near-certain events have surprise near zero.
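To make this concrete, here is a minimal Python sketch (the `surprise` helper and the 1/10 win probability are just for illustration) checking that the surprise of two independent events is the sum of their individual surprises:

```python
import math

def surprise(p: float) -> float:
    """Surprise, in bits, of an event with probability p."""
    return -math.log2(p)

p_win = 1 / 10  # assumed probability of winning the lottery once

# Independent events multiply in probability...
joint = surprise(p_win * p_win)

# ...but their surprises simply add up.
separate = surprise(p_win) + surprise(p_win)

print(joint, separate)  # both ≈ 6.64 bits
assert math.isclose(joint, separate)
```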
Now that we can measure surprise for individual events, what about multiple events? Consider a biased coin that lands heads 70% of the time and tails 30% of the time. When we see heads, our surprise is \(-\log_2(0.7) \approx 0.51\). When we see tails, our surprise is \(-\log_2(0.3) \approx 1.74\). Over many flips, we see heads 70% of the time and tails 30% of the time. So our average surprise is: $$0.7 \times 0.51 + 0.3 \times 1.74 \approx 0.88$$ More generally, if we have a probability distribution \(P\) over possible outcomes \(x\), our average surprise is: $$H(P) = \sum_x P(x) \times \text{Surprise}(x) = \sum_x P(x) \times (-\log P(x)) = -\sum_x P(x) \log P(x)$$ This is called entropy. It represents the inherent uncertainty in a probability distribution.
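The coin calculation is easy to verify in a few lines of Python (a from-scratch sketch rather than a library call):

```python
import math

def entropy(P: list[float]) -> float:
    """Average surprise, in bits, of a distribution P."""
    return -sum(p * math.log2(p) for p in P if p > 0)

P = [0.7, 0.3]     # biased coin: heads 70%, tails 30%
print(entropy(P))  # ≈ 0.881 bits, matching the 0.88 above
```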
But what if our model of the world is wrong? Suppose we think the coin lands heads 50% of the time, but it actually lands heads 70% of the time. Then this happens:
- When heads appears (70% of the time): we compute surprise as \(-\log_2(0.5) = 1\) instead of \(-\log_2(0.7) \approx 0.51\).
- When tails appears (30% of the time): we compute surprise as \(-\log_2(0.5) = 1\) instead of \(-\log_2(0.3) \approx 1.74\).
Our average surprise using the wrong model becomes: $$0.7 \times (-\log_2 0.5) + 0.3 \times (-\log_2 0.5) = 0.7 \times 1 + 0.3 \times 1 = 1$$ Notice that this is higher than the 0.88 we got with the correct model. Being wrong about the distribution increases our average surprise.
Formally, when events come from distribution \(P\) but we model them with distribution \(Q\), our average surprise becomes: $$H(P, Q) = -\sum_x P(x) \log Q(x)$$ This is cross-entropy. We observe events with frequency \(P(x)\), but we compute their surprise using our model probabilities \(Q(x)\).
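The same sketch extends naturally to cross-entropy; applying it to the mismatched coin model reproduces the numbers above:

```python
import math

def cross_entropy(P: list[float], Q: list[float]) -> float:
    """Average surprise when events follow P but surprise is computed under Q."""
    return -sum(p * math.log2(q) for p, q in zip(P, Q) if p > 0)

P = [0.7, 0.3]   # how the coin actually behaves
Q = [0.5, 0.5]   # our (wrong) model of the coin

print(cross_entropy(P, Q))  # 1.0 bit: the wrong model inflates our average surprise
print(cross_entropy(P, P))  # ≈ 0.881 bits: with Q = P we recover the entropy
```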
Cross-entropy is always at least as large as entropy. That is, \(H(P, Q) \geq H(P)\), with equality only when \(Q = P\). In machine learning, we minimize cross-entropy to make our model distribution \(Q\) match the true data distribution \(P\) as closely as possible.
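This inequality is easy to probe numerically; the sketch below (NumPy assumed available) samples random models \(Q\) and confirms that the cross-entropy never drops below the entropy:

```python
import numpy as np

def H(P, Q):
    """Cross-entropy in bits; H(P, P) is just the entropy of P."""
    return -np.sum(P * np.log2(Q))

rng = np.random.default_rng(0)
P = np.array([0.7, 0.3])

for _ in range(5):
    Q = rng.dirichlet([1.0, 1.0])   # a random candidate model
    assert H(P, Q) >= H(P, P)       # cross-entropy >= entropy
    print(f"H(P, Q) = {H(P, Q):.3f}  >=  H(P) = {H(P, P):.3f}")
```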
Approaching from logistic regression
Route: Log-Odds -> Maximum Likelihood -> Binary Cross-Entropy -> Softmax -> Cross-Entropy

Surprisingly, we can also derive the same equation from a completely different angle. Instead of thinking about surprise, let's think about classification. We want to predict whether something belongs to class 0 or class 1 based on features \(x\). In logistic regression, we interpret the output of a linear model as log-odds (the logarithm of the ratio of probabilities): $$\log \frac{p(y=1 \mid x)}{p(y=0 \mid x)} = w^T x$$ If you're not convinced this gives us logistic regression, let's derive the familiar sigmoid form. Exponentiate both sides: $$\frac{p(y=1 \mid x)}{p(y=0 \mid x)} = e^{w^T x}$$ Since \(p(y=0 \mid x) = 1 - p(y=1 \mid x)\), we can substitute: $$\frac{p(y=1 \mid x)}{1 - p(y=1 \mid x)} = e^{w^T x}$$ Let's solve for \(p(y=1 \mid x)\) step by step. First, multiply both sides by the denominator: $$p(y=1 \mid x) = e^{w^T x} \cdot (1 - p(y=1 \mid x))$$ Expand the right side: $$p(y=1 \mid x) = e^{w^T x} - e^{w^T x} \cdot p(y=1 \mid x)$$ Move all terms with \(p(y=1 \mid x)\) to the left: $$p(y=1 \mid x) + e^{w^T x} \cdot p(y=1 \mid x) = e^{w^T x}$$ Factor out \(p(y=1 \mid x)\): $$p(y=1 \mid x) \cdot (1 + e^{w^T x}) = e^{w^T x}$$ Finally, divide both sides by \((1 + e^{w^T x})\): $$p(y=1 \mid x) = \frac{e^{w^T x}}{1 + e^{w^T x}}$$ We can rewrite this in the more common form by multiplying numerator and denominator by \(e^{-w^T x}\): $$p(y=1 \mid x) = \frac{e^{w^T x}}{1 + e^{w^T x}} \cdot \frac{e^{-w^T x}}{e^{-w^T x}} = \frac{1}{e^{-w^T x} + 1} = \frac{1}{1 + e^{-w^T x}}$$ This is the sigmoid function \(\sigma(z) = \frac{1}{1 + e^{-z}}\), so our model is: $$p(y=1 \mid x) = \sigma(w^T x)$$
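As a quick sanity check on this algebra, the sketch below (with made-up weights and features) verifies that taking the log-odds of the sigmoid output recovers \(w^T x\):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

w = np.array([0.5, -1.2, 2.0])   # arbitrary weights, for illustration only
x = np.array([1.0, 0.3, -0.7])   # arbitrary features

p = sigmoid(w @ x)               # p(y=1 | x) under the model
log_odds = np.log(p / (1 - p))   # should equal w^T x

print(w @ x, log_odds)           # both ≈ -1.26
assert np.isclose(w @ x, log_odds)
```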
Now, we want to find parameters \(w\) that make our observed data most probable under the model. Suppose we have \(n\) independent data points \((x_i, y_i)\), where each label \(y_i \in \{0,1\}\). The probability that the model assigns to observing this specific dataset is the product of all individual probabilities: $$ \text{Likelihood}(w) = \prod_{i} p(y_i \mid x_i) $$
For one sample \((x_i, y_i)\), the model predicts \(p(y_i=1 \mid x_i) = \sigma(w^T x_i)\) and \(p(y_i=0 \mid x_i) = 1 - \sigma(w^T x_i)\), where \(\sigma(z) = \frac{1}{1 + e^{-z}}\) is the sigmoid function. We can express both cases compactly as: $$ p(y_i \mid x_i) = \sigma(w^T x_i)^{y_i} \, (1 - \sigma(w^T x_i))^{1 - y_i} $$
Taking the logarithm converts the product to a sum (and avoids numerical underflow): $$ \begin{align} \log \text{Likelihood}(w) &= \sum_i \log p(y_i \mid x_i) \\ &= \sum_i \left[ y_i \log \sigma(w^T x_i) + (1 - y_i) \log (1 - \sigma(w^T x_i)) \right] \end{align} $$ Since training seeks to maximize the likelihood, minimizing the negative log-likelihood, averaged over the \(n\) samples, gives our loss function: $$ \text{Loss}_{\text{BCE}} = -\frac{1}{n}\sum_i \left[ y_i \log \sigma(w^T x_i) + (1 - y_i) \log (1 - \sigma(w^T x_i)) \right] $$
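Written as code, the loss is a direct transcription of this formula. The sketch below is a from-scratch version on toy data (variable names and numbers are illustrative; a real implementation would work with logits or clip probabilities to avoid \(\log 0\)):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(w, X, y):
    """Negative log-likelihood of logistic regression, averaged over samples."""
    p = sigmoid(X @ w)   # predicted p(y=1 | x_i) for each sample
    return -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p))

# Toy data, for illustration only
X = np.array([[1.0, 0.5], [0.2, -1.0], [-0.7, 1.5]])
y = np.array([1, 0, 1])
w = np.array([0.3, 0.8])

print(bce_loss(w, X, y))
```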
This is the binary cross-entropy loss. It measures how surprised the model is by the true labels. If the model assigns low probability to the correct class, \(-\log p\) becomes large, increasing the loss. In essence, it penalizes confident wrong predictions much more than uncertain ones.
Mathematically, this loss is identical to the cross-entropy that we derived from information theory: $$ H(P, Q) = -\sum_x P(x) \log Q(x) $$ Here, the true labels \(y_i\) form the empirical distribution \(P\) (a one-hot vector that assigns probability 1 to the correct class), and the model's predicted probabilities \(\sigma(w^T x_i)\) form \(Q\). The loss $$ -\big[y_i \log \sigma(w^T x_i) + (1-y_i) \log (1-\sigma(w^T x_i))\big] $$ is simply the two-class version of that general formula. Minimizing this loss therefore makes the model's predicted distribution \(Q\) match the true data distribution \(P\), which achieves the same goal as minimizing expected surprise in information theory.
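The correspondence can be made explicit for a single sample: with a one-hot \(P\) and the model's two class probabilities as \(Q\), the general formula reduces to the binary term (a small illustrative check, not library code):

```python
import math

q = 0.8  # model's predicted p(y=1 | x) for this sample
y = 1    # true label

# General cross-entropy with one-hot P = [1 - y, y] and model Q = [1 - q, q]
P, Q = [1 - y, y], [1 - q, q]
general = -sum(p_k * math.log(q_k) for p_k, q_k in zip(P, Q) if p_k > 0)

# Binary cross-entropy term for the same sample
binary = -(y * math.log(q) + (1 - y) * math.log(1 - q))

assert math.isclose(general, binary)
print(general, binary)  # both ≈ 0.223
```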
For multiple classes (\(K > 2\)), the sigmoid generalizes to the softmax function: $$ p(y=k \mid x) = \frac{e^{w_k^T x}}{\sum_{j=1}^K e^{w_j^T x}} $$ The softmax guarantees that all class probabilities are positive and sum to one. In fact, the sigmoid is simply a special case of the softmax when \(K = 2\), with the second class's score fixed at zero so that \(z\) is the score difference between the two classes: $$ \sigma(z) = \frac{e^{z}}{e^{z} + e^{0}} $$
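Here is a sketch of the softmax, together with a check that the two-class case collapses to the sigmoid when the second score is fixed at zero (NumPy assumed):

```python
import numpy as np

def softmax(scores):
    """Turn raw scores into probabilities that are positive and sum to 1."""
    e = np.exp(scores - np.max(scores))  # subtract the max for numerical stability
    return e / e.sum()

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

z = 1.7                                  # arbitrary score difference
two_class = softmax(np.array([z, 0.0]))  # scores [z, 0]

print(two_class[0], sigmoid(z))          # both ≈ 0.846
assert np.isclose(two_class[0], sigmoid(z))
```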
Using the softmax, the multi-class cross-entropy loss becomes: $$ \text{Loss}_{\text{Softmax-CE}} = -\frac{1}{n}\sum_i \sum_k y_{ik} \log p(y=k \mid x_i) $$ where \(y_{ik} = 1\) if sample \(i\) belongs to class \(k\), and 0 otherwise. This loss has exactly the same interpretation as in the binary case. It measures how close the predicted distribution \(p(y \mid x_i)\) is to the true one-hot distribution \(y_i\).
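Putting the pieces together, here is a minimal sketch of the multi-class loss on a toy batch (one-hot labels \(Y\), linear scores \(W x_i\); all names and numbers are illustrative):

```python
import numpy as np

def softmax(scores):
    e = np.exp(scores - scores.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

def softmax_ce_loss(W, X, Y):
    """Multi-class cross-entropy: average -log probability of the true class."""
    probs = softmax(X @ W.T)  # shape (n, K)
    return -np.mean(np.sum(Y * np.log(probs), axis=1))

# Toy example: n = 3 samples, 2 features, K = 3 classes
X = np.array([[1.0, 0.5], [0.2, -1.0], [-0.7, 1.5]])
Y = np.array([[1, 0, 0], [0, 1, 0], [0, 0, 1]])       # one-hot labels
W = np.array([[0.3, 0.8], [-0.5, 0.1], [0.2, -0.4]])  # one weight row per class

print(softmax_ce_loss(W, X, Y))
```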