Cross-entropy is widely used in deep learning. It is what neural networks minimise when learning to model an underlying data distribution, especially in classification tasks like language modelling, where the model predicts the probability of the next token. Understanding and deriving this loss requires walking through a few ideas from information theory: entropy, divergence, and modelling probability distributions.
This article walks that path, starting from first principles. It lays the groundwork for the next article on Apple's recent cross-entropy optimization. If you are already familiar with cross-entropy loss, you can skip this derivation and jump straight to the article breaking down how Apple did it.
In information theory, entropy is a measurement of "uncertainty"—or perhaps more intuitively, surprise. We can derive Shannon's entropy by thinking of how we might "model surprise" as a function.
Modelling Surprise
The smaller the probability of an event, the more information it conveys when it occurs.
A more intuitive way of understanding this is that an outcome is "more surprising" when it was "less likely" to occur. It wasn't expected. Put another way, the information conveyed by an event we didn't expect is higher! We can put this in math terms:
$$I(x) = -\log p(x)$$

where $I(x)$, the information conveyed by an event $x$, is large when $p(x)$ is small.
Small note on sign
Since $0 \le p(x) \le 1$, $\log p(x)$ is non-positive, so the negative sign ensures that $I(x)$ is positive.
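To make this concrete, here is a tiny sketch in Python (the probabilities are made up for illustration) of how surprise grows as an event becomes less likely:

```python
import math

# Information ("surprise") of an event with probability p: I(p) = -log p
def information(p: float) -> float:
    return -math.log(p)

for p in [0.99, 0.5, 0.1, 0.001]:
    print(f"p = {p:<6} -> I(p) = {information(p):.3f} nats")
# A near-certain event (p = 0.99) carries almost no information,
# while a rare event (p = 0.001) is highly surprising.
```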
From here, entropy emerges as the expected information content over a distribution:
Defining Entropy
Entropy is the "average surprise" across all possible outcomes.
So deriving Shannon's entropy $H(p)$ involves computing the expected information from $I(x)$. Each event $x$ carries information $I(x) = -\log p(x)$, and the probability of each event is $p(x)$. So by weighting each event by its probability, we get the expected information content:

$$H(p) = \mathbb{E}_{x \sim p}[I(x)] = -\sum_x p(x)\log p(x)$$
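As a quick sanity check, here is a minimal NumPy sketch of this formula; the two distributions are made up for illustration:

```python
import numpy as np

def entropy(p: np.ndarray) -> float:
    """Shannon entropy H(p) = -sum_x p(x) log p(x), in nats."""
    p = p[p > 0]  # by convention, 0 * log 0 contributes nothing
    return float(-np.sum(p * np.log(p)))

uniform = np.array([0.25, 0.25, 0.25, 0.25])  # maximally uncertain
peaked  = np.array([0.97, 0.01, 0.01, 0.01])  # nearly deterministic

print(entropy(uniform))  # ~1.386 nats (= log 4): high average surprise
print(entropy(peaked))   # ~0.168 nats: low average surprise
```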
In machine learning and data science, we are often trying to approximate a true distribution with a model. Hence, the problem now becomes: how well does our model's distribution approximate the true distribution?
Cross-entropy is hence best understood as a measure of alignment between two probability distributions: one produced by the trained model, and one belonging to the data it is trying to learn from.
To answer our new question, we use Kullback-Leibler (KL) divergence, which quantifies the "extra information" between two distributions. Another way of thinking about it is the "information cost" of using one distribution to model another (how much extra information you would need to throw in to fill in the gap).
KL Divergence
How inefficient is it to encode data from $p$ using a code optimized for $q$?
KL-divergence can hence be framed as a problem: suppose you need to use $q$ as a base to model $p$; how much extra information would you need? Intuitively, if $q$ is a poor approximation of $p$, then a lot of "extra explaining" is needed and the KL divergence is high. Likewise, if $q$ is a good approximation of $p$, then "no modifications are required" and the KL divergence is low.
Mathematically:

$$D_{KL}(p \,\|\, q) = \sum_x p(x)\log\frac{p(x)}{q(x)}$$

Expressed in terms of expectations ($p$ is the true distribution, and we use its probabilities to compute the expectation):

$$D_{KL}(p \,\|\, q) = \mathbb{E}_{x \sim p}\!\left[\log\frac{p(x)}{q(x)}\right] = \mathbb{E}_{x \sim p}\left[\log p(x) - \log q(x)\right]$$
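Here is a small sketch of that expectation, again with made-up distributions $p$ and $q$:

```python
import numpy as np

def kl_divergence(p: np.ndarray, q: np.ndarray) -> float:
    """D_KL(p || q) = sum_x p(x) * (log p(x) - log q(x)), in nats."""
    mask = p > 0  # terms with p(x) = 0 contribute nothing
    return float(np.sum(p[mask] * (np.log(p[mask]) - np.log(q[mask]))))

p      = np.array([0.7, 0.2, 0.1])    # "true" distribution
q_good = np.array([0.6, 0.25, 0.15])  # close approximation
q_bad  = np.array([0.1, 0.1, 0.8])    # poor approximation

print(kl_divergence(p, q_good))  # ~0.02: little "extra explaining" needed
print(kl_divergence(p, q_bad))   # ~1.29: q_bad badly mismodels p
print(kl_divergence(p, p))       # exactly 0: no gap to fill
```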
KL-divergence hence tells us how far apart two distributions are: it is the expected difference in information between the two. To anchor this back to cross-entropy, we can express KL-divergence in terms of entropy:

$$D_{KL}(p \,\|\, q) = H(p, q) - H(p)$$

where $H(p, q) = -\sum_x p(x)\log q(x)$ is the cross-entropy, and $H(p)$ is the true entropy.
Thus, cross-entropy consists of two parts:

$$H(p, q) = H(p) + D_{KL}(p \,\|\, q)$$

Put another way, cross-entropy equals the true entropy plus the divergence.
Summary
Cross-entropy $H(p, q)$ is the total cost of using $q$ to represent $p$.
KL-divergence $D_{KL}(p \,\|\, q)$ is the excess cost on top of the true entropy $H(p)$.
To get its final form, we expand the KL divergence term:

$$H(p, q) = -\sum_x p(x)\log p(x) + \sum_x p(x)\log\frac{p(x)}{q(x)} = -\sum_x p(x)\log p(x) + \sum_x p(x)\log p(x) - \sum_x p(x)\log q(x)$$

Notice that the two entropy terms cancel. Hence we are left with just the final term:

$$H(p, q) = -\sum_x p(x)\log q(x)$$
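As a short numerical check (same kind of made-up distributions as before), the decomposition holds: $-\sum_x p(x)\log q(x)$ equals $H(p) + D_{KL}(p \,\|\, q)$.

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])
q = np.array([0.6, 0.25, 0.15])

cross_entropy = -np.sum(p * np.log(q))               # H(p, q)
entropy_p     = -np.sum(p * np.log(p))               # H(p)
kl            = np.sum(p * (np.log(p) - np.log(q)))  # D_KL(p || q)

print(cross_entropy, entropy_p + kl)  # identical up to floating-point error
assert np.isclose(cross_entropy, entropy_p + kl)
```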
In the case of classification where we deal with one-hot labels, the true class has probability 1 and all others are 0. This simplifies the cross-entropy expression further.
Only the log-probability of the correct class matters:

$$H(p, q) = -\log q(y)$$

where $y$ is the correct class: $p(y) = 1$ and $p(x) = 0$ for every other class, so all other terms of the sum disappear. This is the negative log-likelihood: the log loss associated with the correct label. It is minimal when the model assigns high probability to the right answer. This looks familiar, but it is not quite the form we encounter in deep learning.
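With a one-hot label, the whole sum collapses to a single term. A minimal sketch (the predicted probabilities are made up):

```python
import numpy as np

q = np.array([0.1, 0.7, 0.2])  # model's predicted probabilities
p = np.array([0.0, 1.0, 0.0])  # one-hot label: class 1 is correct

full_sum = -np.sum(p * np.log(q))  # general cross-entropy
nll      = -np.log(q[1])           # only the correct class survives

print(full_sum, nll)  # both ~0.357
```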
Neural networks output logits—raw, unnormalized scores for each class. To interpret these as probabilities, we apply the softmax function:

$$q(i) = \frac{e^{z_i}}{\sum_j e^{z_j}}$$

where $z_i$ is the raw pre-activation value for class $i$.
If we plug this into the negative log-likelihood, we get:

$$\mathcal{L} = -\log q(y) = -\log\frac{e^{z_y}}{\sum_j e^{z_j}} = -z_y + \log\sum_j e^{z_j}$$
This is what’s minimized during training: penalizing low scores for the true class, and encouraging the model to assign it higher probability.
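Putting it together, here is a sketch of the loss computed directly from raw logits. Subtracting the maximum logit before exponentiating is the usual numerical-stability trick and does not change the value of the loss:

```python
import numpy as np

def cross_entropy_from_logits(z: np.ndarray, y: int) -> float:
    """-log softmax(z)[y] = -z[y] + log(sum_j exp(z[j]))."""
    z = z - z.max()  # stabilise exp() without changing the loss
    return float(-z[y] + np.log(np.sum(np.exp(z))))

logits = np.array([2.0, -1.0, 0.5])  # raw, unnormalized scores
print(cross_entropy_from_logits(logits, y=0))  # ~0.24: true class already scores highest
print(cross_entropy_from_logits(logits, y=1))  # ~3.24: true class scores lowest
```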
During model training, we compute gradients of the loss with respect to its input parameters—specifically, the embeddings and biases. This section is heavier on the math, and is only necessary for understanding the mechanics of the backpropagation step.
The loss function is:

$$\mathcal{L} = -z_y + \log\sum_j e^{z_j}, \qquad z_j = C_j^\top E$$

where $E$ is the input embedding and $C_j$ is the candidate embedding for class $j$.

Differentiating w.r.t. $C_j$ (candidate embeddings):

$$\frac{\partial \mathcal{L}}{\partial C_j} = \left(q(j) - \mathbb{1}[j = y]\right) E$$

This means: the gradient for the correct class pulls $C_y$ towards $E$ (since $q(y) - 1 < 0$), while every other candidate embedding is pushed away in proportion to the probability $q(j)$ the model assigned it.

Thus, computing $\frac{\partial \mathcal{L}}{\partial C}$ involves computing every logit $z_j$, the softmax $q(j)$ over all of them, and the outer product of the resulting error term with $E$.

By symmetry:

$$\frac{\partial \mathcal{L}}{\partial E} = \sum_j \left(q(j) - \mathbb{1}[j = y]\right) C_j$$

Again, this requires the softmax over every logit and a weighted sum over all candidate embeddings.
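As a sketch of these gradients (the names `C` and `E` follow the notation above, but the shapes and random values here are made up for illustration), the "softmax minus one-hot" form can be checked against finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
V, d = 5, 4                   # vocabulary size, embedding dimension
C = rng.normal(size=(V, d))   # candidate embeddings, one row per class
E = rng.normal(size=d)        # input embedding
y = 2                         # index of the correct class

def loss(C, E):
    z = C @ E                 # logits z_j = C_j . E
    z = z - z.max()           # numerical stability; does not change the loss
    return -z[y] + np.log(np.sum(np.exp(z)))

# Analytic gradients: "softmax minus one-hot".
z = C @ E
q = np.exp(z - z.max())
q /= q.sum()                  # softmax probabilities
delta = q.copy()
delta[y] -= 1.0               # q(j) - 1[j = y]
grad_C = np.outer(delta, E)   # dL/dC_j = (q(j) - 1[j = y]) * E
grad_E = C.T @ delta          # dL/dE   = sum_j (q(j) - 1[j = y]) * C_j

# Finite-difference check on one entry of each gradient.
eps = 1e-6
C_pert = C.copy(); C_pert[0, 0] += eps
print(grad_C[0, 0], (loss(C_pert, E) - loss(C, E)) / eps)
E_pert = E.copy(); E_pert[0] += eps
print(grad_E[0], (loss(C, E_pert) - loss(C, E)) / eps)
```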
If you've been following along, you'll find that cross-entropy emerges quite naturally from first principles: from measuring information content, to comparing distributions, to finally minimizing the loss in a prediction model.
The loss function's structure reflects the task: assign high scores to the correct label(s), and do so with a distribution that resembles the data we are asking the model to learn.