
Derivation: Cross-Entropy

deep learning
math

February 17, 2025 · 5 min read

Derivation

Let's revisit the first principles of entropy: "modelling surprise" in information theory. An outcome is more surprising when it has a lower probability of occurring, so the information conveyed by a low-probability event is higher. Hence, we have the model

$$ I(y) = -\log p(y) $$

where $I(y)$, the information content of an event $y$, is large when $p(y)$ is small

  • Small note on sign: since $p(y) \in [0, 1]$, $\log p(y) \leq 0$, so the negative sign makes $I(y)$ positive
  • To then derive Shannon's entropy from this, we compute the expected information of $Y$, which represents the average information we would get from observing the random variable $Y$.
$$ H(Y) = \mathbb{E}[I(Y)] = \sum_{y} p(y) I(y) $$
$$ H(Y) = - \sum_{y} p(y) \log p(y) $$
  • With Shannon's entropy out of the way, we now explore cross-entropy, which is best understood as a measure of alignment between two different probability distributions.
  • Let's use the Kullback-Leibler (KL) divergence to understand cross-entropy. The KL-divergence is the measure of "extra information" between the distributions $p(y)$ and $q(y)$
  • KL-divergence can be framed as a problem: suppose you need to use $q(y)$ as a base to model $p(y)$; how much extra information would you need?
$$ D_{KL}(p \parallel q) = \sum_{y} p(y) \log \frac{p(y)}{q(y)} $$

Put another way, this is actually:

$$ \mathbb{E}[\log(p) - \log(q)] $$

And since $p(y)$ is the true distribution, we use its probabilities to compute the expectation. Now let's go back to cross-entropy, which is very similar to the KL-divergence. Cross-entropy is:

$$ H(p, q) = H(p) + D_{KL}(p \parallel q) $$

Thus, cross-entropy consists of two parts:

  1. The true entropy $H(p)$, which is the best we can achieve.
  2. The KL divergence, which measures how much extra cost is incurred by using $q$ instead of $p$. To get its final form, we expand the KL divergence term:
$$ H(p, q) = H(p) + \sum_{y} p(y) \log p(y) - \sum_{y} p(y) \log q(y) $$

Notice that

$$ H(p) = -\sum_{y} p(y) \log p(y) $$

Hence the first two terms cancel and we are left with just the final term:

$$ H(p, q) = - \sum_{y} p(y) \log q(y) $$
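As a quick sanity check of the decomposition above, here is a minimal NumPy sketch (the distributions `p` and `q` are made up for illustration, not from the derivation) that verifies $H(p, q) = H(p) + D_{KL}(p \parallel q)$ numerically:

```python
import numpy as np

# Made-up example distributions over 3 outcomes.
p = np.array([0.7, 0.2, 0.1])   # "true" distribution
q = np.array([0.5, 0.3, 0.2])   # model distribution

entropy = -np.sum(p * np.log(p))        # H(p)
kl      =  np.sum(p * np.log(p / q))    # D_KL(p || q)
cross   = -np.sum(p * np.log(q))        # H(p, q)

# Cross-entropy = entropy + the KL "extra cost".
assert np.isclose(cross, entropy + kl)
```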

Negative log likelihood:

  • In the case of classification, where there is only one correct class/label (one-hot labels), $p(y_j) = 1 \iff y_j = y_i$ (the true class), and $p(y_j) = 0$ otherwise.
  • Hence, we end up with:
$$ H(p, q) = - \log q(y_i) $$

since $p(y)$ is $1$ for the true class and $0$ for every other class, all other terms disappear. This looks familiar, but not quite the form we encounter in deep learning.
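To see the collapse concretely, here is a tiny made-up example (3 classes, true class at index 1; none of these numbers come from the post):

```python
import numpy as np

p = np.array([0.0, 1.0, 0.0])   # one-hot "true" distribution, true class = 1
q = np.array([0.1, 0.7, 0.2])   # model's predicted distribution

full_sum = -np.sum(p * np.log(q))   # full cross-entropy sum over all classes
nll      = -np.log(q[1])            # just -log q(y_i) for the true class

assert np.isclose(full_sum, nll)    # the sum collapses to a single term
```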

  • In deep learning, we get an output of logits from the neural network (shape: classes × batch size). The logits are mapped with a SoftMax function to become a valid probability distribution.
  • SoftMax:
$$ q(y_j \mid x) = \frac{e^{z_j}}{\sum_{k} e^{z_k}} $$

where $z_j$ is the raw pre-activation value for class $j$ (notice the shape of the logit)

  • Substituting the softmax into the negative log-likelihood form of this loss function, for the true class $y_i$:
$$ \mathcal{L} = - \log \frac{e^{z_{y_i}}}{\sum_{k} e^{z_k}} = - z_{y_i} + \log \sum_{k} e^{z_k} $$
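Below is a small NumPy sketch of this form; the helper name `cross_entropy_from_logits` and the example logits are hypothetical, but the max-shift used inside it is the usual trick to keep $\log \sum_k e^{z_k}$ numerically stable:

```python
import numpy as np

def cross_entropy_from_logits(z, y_i):
    """Softmax cross-entropy for a single example.

    z   : 1-D array of raw logits, one entry per class
    y_i : index of the true class
    """
    z = z - z.max()   # shifting the logits leaves the loss unchanged
    return -z[y_i] + np.log(np.sum(np.exp(z)))

# Made-up example: 4 classes, true class index 2.
logits = np.array([1.0, -0.5, 3.0, 0.2])
loss = cross_entropy_from_logits(logits, 2)

# Same result as building the softmax distribution explicitly first.
q = np.exp(logits - logits.max())
q /= q.sum()
assert np.isclose(loss, -np.log(q[2]))
```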

Calculating gradients:

  1. Gradient w.r.t. $e$ (Query Embeddings)

Here each logit is $z_j = e \cdot c_j + b_j$, where $e$ is the query embedding, $c_j$ is the candidate (class) embedding for class $j$, and $b_j$ is a bias. The loss function is:
$$ \mathcal{L} = - e \cdot c_{y_i} + \log \sum_{j} e^{z_j} $$

Differentiating w.r.t. $e$ (the query embedding):

$$ \frac{\partial \mathcal{L}}{\partial e} = - c_{y_i} + \sum_{j} \frac{e^{z_j}}{\sum_k e^{z_k}} c_j $$

This means:

  • We subtract the correct class embedding $c_{y_i}$.
  • We add a weighted sum of all class embeddings $c_j$, weighted by the softmax probabilities.

Thus, computing $\frac{\partial \mathcal{L}}{\partial e}$ involves:
  • Selecting the correct class embedding.
  • Computing the softmax over all logits.
  • Computing the weighted sum of embeddings.
  2. Gradient w.r.t. $c$ (Candidate Embeddings)

By symmetry:
$$ \frac{\partial \mathcal{L}}{\partial c_j} = \left(p(y_j) - \mathbb{1}_{j=y_i}\right) e $$

Again:

  • We subtract the query embedding $e$ for the correct class.
  • We add the query embedding $e$ scaled by the softmax probability $p(y_j)$ for each class.
  3. Gradient w.r.t. $b$ (Bias)
$$ \frac{\partial \mathcal{L}}{\partial b_j} = p(y_j) - \mathbb{1}_{j=y_i} $$
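Putting the three gradients together, here is a hedged NumPy sketch; the dimensions, random values, and names (`e`, `C`, `b`) are assumptions for illustration, with logits computed as $z_j = e \cdot c_j + b_j$:

```python
import numpy as np

rng = np.random.default_rng(0)
d, K, y_i = 4, 5, 2            # embedding dim, number of classes, true class
e = rng.normal(size=d)         # query embedding
C = rng.normal(size=(K, d))    # candidate (class) embeddings, one row per class
b = rng.normal(size=K)         # per-class bias

z = C @ e + b                           # logits z_j = e . c_j + b_j
p = np.exp(z - z.max()); p /= p.sum()   # softmax probabilities
one_hot = np.eye(K)[y_i]

# Closed-form gradients from the derivation above.
grad_e = -C[y_i] + p @ C            # -c_{y_i} + sum_j p(y_j) c_j
grad_C = np.outer(p - one_hot, e)   # row j: (p(y_j) - 1[j == y_i]) * e
grad_b = p - one_hot                # p(y_j) - 1[j == y_i]

# Finite-difference spot check on the first coordinate of e.
def loss(e_):
    z_ = C @ e_ + b
    z_ = z_ - z_.max()
    return -z_[y_i] + np.log(np.sum(np.exp(z_)))

eps = 1e-6
e_pert = e.copy(); e_pert[0] += eps
assert np.isclose((loss(e_pert) - loss(e)) / eps, grad_e[0], atol=1e-4)
```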
