Derivation
Let's revisit first principles of entropy: "modelling surprise" in information theory:
An outcome is "more surprising" when it has a lower probability of occurring.
The information conveyed by a low probability event is higher!
Hence, for the information content of an event y, we have the model
$$I(y) = -\log p(y)$$
where $I(y)$ is large when $p(y)$ is small.
- Small note on sign: since $p(y) \in (0, 1]$, we have $\log p(y) \le 0$, so the negative sign makes $I(y)$ non-negative.
- To then derive Shannon's entropy from this, we compute the expected
information from y, which represents the average information we would get
from observing the random event y.
$$H(Y) = \mathbb{E}[I(Y)] = \sum_y p(y)\, I(y)$$
$$H(Y) = -\sum_y p(y) \log p(y)$$
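As a quick numerical sanity check, here is a minimal sketch (assuming NumPy and a made-up four-outcome distribution) that computes the information content of each outcome and averages it to get the entropy:

```python
import numpy as np

# A made-up distribution over four outcomes (illustrative only).
p = np.array([0.5, 0.25, 0.125, 0.125])

# Information content of each outcome: I(y) = -log p(y)
# (natural log, so the result is in nats; log base 2 would give bits).
info = -np.log(p)

# Shannon entropy: H(Y) = sum_y p(y) * I(y)
H = np.sum(p * info)
print(H)  # ~1.21 nats for this particular distribution
```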
- With Shannon's entropy out of the way, we now explore cross-entropy, which
is best understood as a measure of alignment between two different
probability distributions.
- Let's use the Kullback-Leibler (KL) divergence to understand
cross-entropy. The KL divergence measures the "extra information" incurred
when the distribution q(y) is used in place of p(y).
- KL divergence can be framed as a question: suppose you need to use q(y)
as a base to model p(y); how much extra information would you need?
$$D_{KL}(p \parallel q) = \sum_y p(y) \log \frac{p(y)}{q(y)}$$
Put another way, this is:
$$\mathbb{E}_{y \sim p}\left[\log p(y) - \log q(y)\right]$$
Since $p(y)$ is the true distribution, the expectation is taken with
respect to $p$.
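A small sketch (again assuming NumPy, with hypothetical distributions p and q) showing that the sum form and the expectation form of the KL divergence agree:

```python
import numpy as np

# Hypothetical "true" distribution p and model distribution q over 3 outcomes.
p = np.array([0.7, 0.2, 0.1])
q = np.array([0.5, 0.3, 0.2])

# D_KL(p || q) = sum_y p(y) * log(p(y) / q(y))
kl = np.sum(p * np.log(p / q))

# Same quantity written as E_{y~p}[log p(y) - log q(y)].
kl_expectation = np.sum(p * (np.log(p) - np.log(q)))

print(kl, np.isclose(kl, kl_expectation))  # identical up to floating point
```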
Now let's go back to cross entropy, which is very similar to
KL-divergence. Cross entropy is:
$$H(p, q) = H(p) + D_{KL}(p \parallel q)$$
Thus, cross-entropy consists of two parts:
- The true entropy H(p), which is the best we can achieve.
- The KL divergence, which measures how much extra cost is incurred
by using q instead of p.
To get its final form, we expand the KL divergence term:
$$H(p, q) = H(p) + \sum_y p(y) \log p(y) - \sum_y p(y) \log q(y)$$
Notice that
$$H(p) = -\sum_y p(y) \log p(y)$$
Hence we are left with just the final term:
$$H(p, q) = -\sum_y p(y) \log q(y)$$
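The decomposition can be checked numerically. A minimal sketch, assuming NumPy and the same kind of toy distributions as above:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])   # assumed true distribution
q = np.array([0.5, 0.3, 0.2])   # assumed model distribution

H_p = -np.sum(p * np.log(p))            # entropy of p (the irreducible part)
kl = np.sum(p * np.log(p / q))          # D_KL(p || q) (the extra cost)
cross_entropy = -np.sum(p * np.log(q))  # direct definition of H(p, q)

# H(p, q) = H(p) + D_KL(p || q)
print(np.isclose(cross_entropy, H_p + kl))  # True
```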
Negative log likelihood:
- In the case of classification, where there is only one correct class/label
(one-hot labels), $p(y_j) = 1$ if and only if $y_j$ is the true label $y_i$,
and $p(y_j) = 0$ otherwise.
- Hence, we end up with:
$$H(p, q) = -\log q(y_i)$$
since $p(y)$ is 1 for the true class and 0 for every other class, only the
true-class term of the sum survives.
This looks familiar, but not quite the form we encounter in deep learning.
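For completeness, a tiny sketch (assuming NumPy, with made-up probabilities) showing that with a one-hot p the full cross-entropy sum collapses to the negative log probability of the true class:

```python
import numpy as np

q = np.array([0.1, 0.7, 0.2])     # made-up model probabilities
true_class = 1                     # index of the correct label
p = np.eye(len(q))[true_class]     # one-hot true distribution

full = -np.sum(p * np.log(q))      # full cross-entropy sum
nll = -np.log(q[true_class])       # negative log likelihood of the true class
print(np.isclose(full, nll))       # True: the sum collapses to a single term
```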
- In deep learning, the neural network outputs logits
(shape: classes x batch size). The logits are passed through a softmax
function to become a valid probability distribution.
- Softmax:
$$q(y_j \mid x) = \frac{e^{z_j}}{\sum_k e^{z_k}}$$
where $z_j$ is the raw pre-activation value (logit) for class $j$ (notice
the shape of the logits).
- Substituting the softmax into the negative log likelihood form of this
loss function, for a single example with true class $y_i$:
$$L = -\log \frac{e^{z_{y_i}}}{\sum_k e^{z_k}} = -z_{y_i} + \log \sum_k e^{z_k}$$
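A short sketch of this loss computed directly from logits (assuming NumPy; the logits and class index are made up). The max-shift is a standard numerical-stability trick, not something required by the math above:

```python
import numpy as np

def softmax_nll(logits, target):
    """Cross-entropy for a single example, computed directly from raw logits.

    Uses the form L = -z_target + logsumexp(z), shifted by max(z) for stability.
    """
    z = logits - np.max(logits)
    return -z[target] + np.log(np.sum(np.exp(z)))

logits = np.array([2.0, 0.5, -1.0])   # made-up logits for 3 classes
target = 0                            # index of the true class

print(softmax_nll(logits, target))

# Equivalent "softmax first, then -log" route: gives the same number.
probs = np.exp(logits) / np.sum(np.exp(logits))
print(-np.log(probs[target]))
```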
Calculating gradients:
- Gradient w.r.t. e (Query Embeddings)
Here the logit for class $j$ is the dot product $z_j = e \cdot c_j$, where
$e$ is the query embedding and $c_j$ the candidate embedding for class $j$.
The loss function is:
$$L = -e \cdot c_{y_i} + \log \sum_j e^{z_j}$$
Differentiating w.r.t. $e$ (the query embedding):
$$\frac{\partial L}{\partial e} = -c_{y_i} + \sum_j \frac{e^{z_j}}{\sum_k e^{z_k}}\, c_j$$
This means:
- We subtract the correct class embedding $c_{y_i}$.
- We add a weighted sum of all class embeddings $c_j$, weighted by the
softmax probabilities.
Thus, computing $\frac{\partial L}{\partial e}$ involves:
- Selecting correct class embedding.
- Computing softmax over all logits.
- Computing the weighted sum of embeddings.
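To make these steps concrete, here is a sketch (assuming NumPy; the dimensions, embeddings, and class index are made up) that builds the analytic gradient $-c_{y_i} + \sum_j p(y_j)\, c_j$ and checks it against central finite differences:

```python
import numpy as np

rng = np.random.default_rng(0)
d, n_classes = 8, 5                   # assumed embedding dim / class count
e = rng.normal(size=d)                # query embedding
C = rng.normal(size=(n_classes, d))   # candidate embeddings c_j as rows
y = 2                                 # true class index

def loss(e):
    z = C @ e                         # logits z_j = e . c_j
    z = z - np.max(z)                 # stable log-sum-exp (loss is shift-invariant)
    return -z[y] + np.log(np.sum(np.exp(z)))

# Analytic gradient: dL/de = -c_y + sum_j p(y_j) c_j
z = C @ e
p = np.exp(z - np.max(z)) / np.sum(np.exp(z - np.max(z)))
grad_analytic = -C[y] + p @ C

# Central finite differences as a sanity check.
eps = 1e-6
grad_numeric = np.array([
    (loss(e + eps * np.eye(d)[i]) - loss(e - eps * np.eye(d)[i])) / (2 * eps)
    for i in range(d)
])
print(np.allclose(grad_analytic, grad_numeric, atol=1e-5))  # True
```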
- Gradient w.r.t. c (Candidate Embeddings)
By a symmetric argument, each candidate embedding $c_j$ receives:
$$\frac{\partial L}{\partial c_{y_i}} = -e + p(y_i)\, e, \qquad \frac{\partial L}{\partial c_j} = p(y_j)\, e \quad (j \neq y_i)$$
Again:
- For the correct class, we subtract the query embedding $e$.
- Every class (the correct one included) also receives $e$ scaled by its
softmax probability $p(y_j)$.
- Gradient w.r.t. b (Bias)
If a per-class bias is added to the logits, i.e. $z_j = e \cdot c_j + b_j$:
$$\frac{\partial L}{\partial b_j} = p(y_j) - \mathbb{1}[j = y_i]$$
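Finally, a sketch (same assumptions as above: NumPy, made-up dimensions and values) that forms both $\partial L/\partial c_j = (p(y_j) - \mathbb{1}[j = y_i])\, e$ and $\partial L/\partial b_j = p(y_j) - \mathbb{1}[j = y_i]$ and checks them with finite differences:

```python
import numpy as np

rng = np.random.default_rng(1)
d, n_classes = 8, 5                   # assumed embedding dim / class count
e = rng.normal(size=d)                # query embedding
C = rng.normal(size=(n_classes, d))   # candidate embeddings c_j as rows
b = rng.normal(size=n_classes)        # per-class biases
y = 3                                 # true class index

def loss(C, b):
    z = C @ e + b                     # logits z_j = e . c_j + b_j
    z = z - np.max(z)                 # stable log-sum-exp (loss is shift-invariant)
    return -z[y] + np.log(np.sum(np.exp(z)))

z = C @ e + b
p = np.exp(z - np.max(z)) / np.sum(np.exp(z - np.max(z)))
one_hot = np.eye(n_classes)[y]

grad_C = np.outer(p - one_hot, e)     # dL/dc_j = (p_j - 1[j == y]) * e
grad_b = p - one_hot                  # dL/db_j =  p_j - 1[j == y]

# Finite-difference checks.
eps = 1e-6
num_grad_b = np.array([
    (loss(C, b + eps * np.eye(n_classes)[j]) - loss(C, b - eps * np.eye(n_classes)[j])) / (2 * eps)
    for j in range(n_classes)
])
num_grad_C = np.zeros_like(C)
for j in range(n_classes):
    for k in range(d):
        E = np.zeros_like(C)
        E[j, k] = eps
        num_grad_C[j, k] = (loss(C + E, b) - loss(C - E, b)) / (2 * eps)

print(np.allclose(grad_b, num_grad_b, atol=1e-5))  # True
print(np.allclose(grad_C, num_grad_C, atol=1e-5))  # True
```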