Cross-entropy is widely used in deep learning. It is what neural networks minimise when learning to model an underlying data distribution, especially in classification tasks like language modelling, where the model predicts the probability of the next token. Understanding and deriving this loss requires walking through a few ideas from information theory: entropy, divergence, and modelling probability distributions.
This article walks that path, starting from first principles. It lays the groundwork for the next article on Apple's recent cross-entropy optimization. If you are already familiar with cross-entropy loss, you can skip this derivation and jump straight to the article breaking down how Apple did it.
In information theory, entropy is a measurement of "uncertainty"—or perhaps more intuitively, surprise. We can derive Shannon's entropy by thinking of how we might "model surprise" as a function.
Modelling Surprise
The smaller the probability of an event, the more information it conveys when it occurs.
A more intuitive way of understanding this is that an outcome is "more surprising" when it was "less likely" to occur. It wasn't expected. Put another way, the information conveyed by an event we didn't expect is higher! We can put this in math terms:
$$I(x) = -\log p(x)$$

where $I(x)$, the information conveyed by an event $x$, is large when $p(x)$ is small.
Small note on sign
Since $0 \le p(x) \le 1$, $\log p(x)$ is non-positive, so the negative sign ensures that $I(x)$ is positive.
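To make this concrete, here is a tiny sketch in Python (the probabilities are made up for illustration) of how surprise grows as an event becomes less likely:

```python
import math

# Information ("surprise") of an event with probability p: I(p) = -log p
def information(p: float) -> float:
    return -math.log(p)

for p in [0.99, 0.5, 0.1, 0.001]:
    print(f"p = {p:<6} -> I(p) = {information(p):.3f} nats")
# A near-certain event (p = 0.99) carries almost no information,
# while a rare event (p = 0.001) is highly surprising.
```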
From here, entropy emerges as the expected information content over a distribution:
Defining Entropy
Entropy is the "average surprise" across all possible outcomes.
So deriving Shannon's entropy $H(p)$ involves computing the expected information from $I(x)$. Each event $x$ carries information $I(x) = -\log p(x)$, and the probability of each event is $p(x)$. So by weighting each event by its probability, we get the expected information content:

$$H(p) = \mathbb{E}_{x \sim p}[I(x)] = -\sum_x p(x)\log p(x)$$
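As a quick sanity check, here is a minimal NumPy sketch of this formula; the two distributions are made up for illustration:

```python
import numpy as np

def entropy(p: np.ndarray) -> float:
    """Shannon entropy H(p) = -sum_x p(x) log p(x), in nats."""
    p = p[p > 0]  # by convention, 0 * log 0 contributes nothing
    return float(-np.sum(p * np.log(p)))

uniform = np.array([0.25, 0.25, 0.25, 0.25])  # maximally uncertain
peaked  = np.array([0.97, 0.01, 0.01, 0.01])  # nearly deterministic

print(entropy(uniform))  # ~1.386 nats (= log 4): high average surprise
print(entropy(peaked))   # ~0.168 nats: low average surprise
```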
In machine learning and data science, we are often trying to approximate a true distribution with a model. Hence, the problem now becomes: how well does our model's distribution approximate the true distribution?
Cross-entropy is hence best understood as a measure of alignment between two probability distributions: one produced by the trained model, and one belonging to the data it is trying to learn from.
To answer our new question, we use Kullback-Leibler (KL) divergence, which quantifies the "extra information" between two distributions. Another way of thinking about it is the "information cost" of using one distribution to model another (how much extra information you would need to throw in to fill in the gap).
KL Divergence
How inefficient is it to encode data from $p$ using a code optimized for $q$?
KL-divergence can hence be framed as a problem: suppose you need to use $q$ as a base to model $p$; how much extra information would you need? Intuitively, if $q$ is a poor approximation of $p$, then a lot of "extra explaining" is needed and the KL divergence is high. Likewise, if $q$ is a good approximation of $p$, then "no modifications are required" and the KL divergence is low.
Mathematically:

$$D_{KL}(p \,\|\, q) = \sum_x p(x)\log\frac{p(x)}{q(x)}$$

Expressed in terms of expectations ($p$ is the true distribution, and we use its probabilities to compute the expectation):

$$D_{KL}(p \,\|\, q) = \mathbb{E}_{x \sim p}\!\left[\log\frac{p(x)}{q(x)}\right] = \mathbb{E}_{x \sim p}\left[\log p(x) - \log q(x)\right]$$
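Here is a small sketch of that expectation, again with made-up distributions $p$ and $q$:

```python
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """D_KL(p || q) = sum_x p(x) * (log p(x) - log q(x)), in nats."""
    mask = p > 0  # terms with p(x) = 0 contribute nothing
    return float(np.sum(p[mask] * (np.log(p[mask]) - np.log(q[mask]))))

p      = np.array([0.7, 0.2, 0.1])    # "true" distribution
q_good = np.array([0.6, 0.25, 0.15])  # close approximation
q_bad  = np.array([0.1, 0.1, 0.8])    # poor approximation

print(kl_divergence(p, q_good))  # ~0.02: little "extra explaining" needed
print(kl_divergence(p, q_bad))   # ~1.29: q_bad badly mismodels p
print(kl_divergence(p, p))       # exactly 0: no gap to fill
```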
KL-divergence hence tells us how far apart two distributions are: it is the expected difference in information between the two. To anchor this back to cross-entropy, we can express KL-divergence in terms of entropy:

$$D_{KL}(p \,\|\, q) = H(p, q) - H(p)$$

where $H(p, q) = -\sum_x p(x)\log q(x)$ is the cross-entropy, and $H(p)$ is the true entropy.
Thus, cross-entropy consists of two parts:

$$H(p, q) = H(p) + D_{KL}(p \,\|\, q)$$

Put another way, cross-entropy equals the true entropy plus the divergence.
Summary
Cross-entropy $H(p, q)$ is the total cost of using $q$ to represent $p$.
KL-divergence $D_{KL}(p \,\|\, q)$ is the excess cost on top of the true entropy $H(p)$.
To get its final form, we expand the KL divergence term:

$$H(p, q) = -\sum_x p(x)\log p(x) + \sum_x p(x)\log\frac{p(x)}{q(x)} = -\sum_x p(x)\log p(x) + \sum_x p(x)\log p(x) - \sum_x p(x)\log q(x)$$

Notice that the two entropy terms cancel. Hence we are left with just the final term:

$$H(p, q) = -\sum_x p(x)\log q(x)$$
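As a short numerical check (same kind of made-up distributions as before), the decomposition holds: $-\sum_x p(x)\log q(x)$ equals $H(p) + D_{KL}(p \,\|\, q)$.

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.6, 0.25, 0.15])

cross_entropy = -np.sum(p * np.log(q))               # H(p, q)
entropy_p     = -np.sum(p * np.log(p))               # H(p)
kl            = np.sum(p * (np.log(p) - np.log(q)))  # D_KL(p || q)

print(cross_entropy, entropy_p + kl)  # identical up to floating-point error
assert np.isclose(cross_entropy, entropy_p + kl)
```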
In the case of classification where we deal with one-hot labels, the true class has probability 1 and all others are 0. This simplifies the cross-entropy expression further.
Only the log-probability of the correct class matters:

$$H(p, q) = -\log q(y)$$

where $y$ is the correct class: $p(y) = 1$ and $p(x) = 0$ for every other class, so all other terms of the sum disappear. This is the negative log-likelihood: the log loss associated with the correct label. It is minimal when the model assigns high probability to the right answer. This looks familiar, but it is not quite the form we encounter in deep learning.
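With a one-hot label, the whole sum collapses to a single term. A minimal sketch (the predicted probabilities are made up):

```python
import numpy as np

q = np.array([0.1, 0.7, 0.2])  # model's predicted probabilities
p = np.array([0.0, 1.0, 0.0])  # one-hot label: class 1 is correct

full_sum = -np.sum(p * np.log(q))  # general cross-entropy
nll      = -np.log(q[1])           # only the correct class survives

print(full_sum, nll)  # both ~0.357
```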
Neural networks output logits—raw, unnormalized scores for each class. To interpret these as probabilities, we apply the softmax function:

$$q(i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$$

where $z_i$ is the raw pre-activation value for class $i$.
If we plug this into the negative log-likelihood, we get:

$$\mathcal{L} = -\log q(y) = -\log\frac{e^{z_y}}{\sum_j e^{z_j}} = -z_y + \log\sum_j e^{z_j}$$
This is what’s minimized during training: penalizing low scores for the true class, and encouraging the model to assign it higher probability.
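Putting it together, here is a sketch of the loss computed directly from raw logits. Subtracting the maximum logit before exponentiating is the usual numerical-stability trick and does not change the value of the loss:

```python
import numpy as np

def cross_entropy_from_logits(z: np.ndarray, y: int) -> float:
    """-log softmax(z)[y] = -z[y] + log(sum_j exp(z[j]))."""
    z = z - z.max()  # stabilise exp() without changing the loss
    return float(-z[y] + np.log(np.sum(np.exp(z))))

logits = np.array([2.0, -1.0, 0.5])  # raw, unnormalized scores
print(cross_entropy_from_logits(logits, y=0))  # ~0.24: true class already scores highest
print(cross_entropy_from_logits(logits, y=1))  # ~3.24: true class scores lowest
```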
During model training, we compute gradients of the loss with respect to its input parameters—specifically, the embeddings and biases. This section is heavier on the math, and is only necessary for understanding the mechanics of the backpropagation step.
The loss function is:

$$\mathcal{L} = -z_y + \log\sum_j e^{z_j}, \qquad z_j = C_j^\top E$$

where $E$ is the input embedding and $C_j$ is the candidate embedding for class $j$.

Differentiating w.r.t. $C_j$ (candidate embeddings):

$$\frac{\partial \mathcal{L}}{\partial C_j} = \left(q(j) - \mathbb{1}[j = y]\right) E$$

This means: the gradient for the correct class pulls $C_y$ towards $E$ (since $q(y) - 1 < 0$), while every other candidate embedding is pushed away in proportion to the probability $q(j)$ the model assigned it.

Thus, computing $\frac{\partial \mathcal{L}}{\partial C}$ involves computing every logit $z_j$, the softmax $q(j)$ over all of them, and the outer product of the resulting error term with $E$.

By symmetry:

$$\frac{\partial \mathcal{L}}{\partial E} = \sum_j \left(q(j) - \mathbb{1}[j = y]\right) C_j$$

Again, this requires the softmax over every logit and a weighted sum over all candidate embeddings.
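As a sketch of these gradients (the names `C` and `E` follow the notation above, but the shapes and random values here are made up for illustration), the "softmax minus one-hot" form can be checked against finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 4                   # vocabulary size, embedding dimension
C = rng.normal(size=(V, d))   # candidate embeddings, one row per class
E = rng.normal(size=d)        # input embedding
y = 2                         # index of the correct class

def loss(C, E):
    z = C @ E                 # logits z_j = C_j . E
    z = z - z.max()           # numerical stability; does not change the loss
    return -z[y] + np.log(np.sum(np.exp(z)))

# Analytic gradients: "softmax minus one-hot".
z = C @ E
q = np.exp(z - z.max())
q /= q.sum()                  # softmax probabilities
delta = q.copy()
delta[y] -= 1.0               # q(j) - 1[j = y]
grad_C = np.outer(delta, E)   # dL/dC_j = (q(j) - 1[j = y]) * E
grad_E = C.T @ delta          # dL/dE   = sum_j (q(j) - 1[j = y]) * C_j

# Finite-difference check on one entry of each gradient.
eps = 1e-6
C_pert = C.copy(); C_pert[0, 0] += eps
print(grad_C[0, 0], (loss(C_pert, E) - loss(C, E)) / eps)
E_pert = E.copy(); E_pert[0] += eps
print(grad_E[0], (loss(C, E_pert) - loss(C, E)) / eps)
```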
If you've been following along, you'll find that cross-entropy emerges quite naturally from first principles: from measuring information content, to comparing distributions, to finally minimizing the loss in a prediction model.
The loss function's structure reflects the task: assign high scores to the correct label(s), and do so with a distribution that resembles the data we are asking the model to learn.