January 15, 2026 · 2 min read
It's 2021, and I'm a fresh graduate at MindPointEye, working on object detection models (YOLOX).
At the time, semi-supervised methods were all the rage. That same year, I had also just discovered distill.pub (which I miss dearly) and was absolutely possessed by this beautiful article called "Why Momentum Really Works".
As a result, the two biggest training optimisations I pursued were:
And it worked, almost like magic. It felt crazy to make such an impact as a junior in a huge team of research engineers and scientists.
So I took both of these to their natural conclusions:
The latter wasn't working as well as I thought it should with respect to learning rate and momentum, so I did some digging. And that's when I found that PyTorch's SGD implementation wasn't what I thought it was.
As outlined in PyTorch docs:
The implementation of SGD with Momentum/Nesterov subtly differs from Sutskever et. al. and implementations in some other frameworks.
Describing it as "subtle" feels a little misleading to me. This was, in fact, the "bug" I had been trying to hunt down.
The original (Sutskever et al.) implementation looks like this:

$$
\begin{aligned}
v_{t+1} &= \mu \, v_t + \mathrm{lr} \cdot g_{t+1} \\
p_{t+1} &= p_t - v_{t+1}
\end{aligned}
$$

Where $p$, $g$, $v$, and $\mu$ denote the parameters, gradient, velocity, and momentum respectively. Note how the learning rate, $\mathrm{lr}$, is applied to the gradient term directly here.
Whereas in the case of PyTorch's implementation:

$$
\begin{aligned}
v_{t+1} &= \mu \, v_t + g_{t+1} \\
p_{t+1} &= p_t - \mathrm{lr} \cdot v_{t+1}
\end{aligned}
$$

Here the learning rate, $\mathrm{lr}$, is applied to the entire velocity term! Both the gradient and the momentum terms are affected!
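To make the difference concrete, here's a minimal sketch of the two update rules as plain Python functions (scalar parameters; `sutskever_step` and `pytorch_step` are just illustrative names, not anything from the PyTorch API):

```python
def sutskever_step(p, v, g, lr, mu):
    """Sutskever-style update: lr scales only the fresh gradient,
    so the velocity carries the full momentum history."""
    v = mu * v + lr * g
    return p - v, v


def pytorch_step(p, v, g, lr, mu):
    """PyTorch-style update: raw gradients accumulate in the velocity,
    and lr rescales the whole thing (history included) at step time."""
    v = mu * v + g
    return p - lr * v, v
```

With a constant learning rate the two are equivalent up to a rescaling of the velocity buffer; the difference only shows up once the learning rate changes over time.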
This means that if we employ learning rate scheduling and gradually decrease the learning rate, the momentum contribution gets scaled down along with it.
If we interpret the "learning rate" as a scale factor for how much the parameters can change per update (a "step size"), then PyTorch's implementation makes sense.
But if we're trying to recreate the momentum and learning rate mechanics from the "Why Momentum Really Works" article, it fails, because reducing the learning rate also reduces the effect of momentum.
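A toy loop makes this visible. The numbers below are purely illustrative (a constant gradient of 1.0, momentum 0.9, and a 10x learning-rate drop at step 5); the quantity to watch is the size of the parameter step each rule takes right after the drop:

```python
mu, g = 0.9, 1.0    # momentum and a constant "gradient", purely illustrative
v_sut = v_pt = 0.0  # velocity buffers for the two update rules

for step in range(10):
    lr = 0.1 if step < 5 else 0.01   # 10x learning-rate drop at step 5

    v_sut = mu * v_sut + lr * g      # Sutskever: lr scales the gradient only
    step_sut = v_sut                 # parameter step is the velocity itself

    v_pt = mu * v_pt + g             # PyTorch: raw gradients accumulate
    step_pt = lr * v_pt              # parameter step is lr * whole velocity

    print(step, round(step_sut, 4), round(step_pt, 4))
```

Right after the drop, the PyTorch-style step collapses by roughly 10x in a single iteration, because the accumulated momentum is rescaled along with the gradient. The Sutskever-style step only shrinks gradually, as the old velocity decays by a factor of $\mu$ each iteration.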
When I was first trying to "debug" this, I wondered whether I needed a momentum scheduler, or whether my idea of a "large momentum" wasn't quite right.
Turns out, a one-line change to make the momentum term invariant to the learning rate was all we needed.
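For reference, here's a minimal sketch of what such a change could look like as a custom optimizer. `SutskeverSGD` is a hypothetical name for illustration, not the actual patch I shipped, and it only handles plain momentum (no weight decay, dampening, or Nesterov):

```python
import torch


class SutskeverSGD(torch.optim.Optimizer):
    """SGD with momentum where the learning rate scales only the incoming
    gradient, so the accumulated velocity is invariant to LR scheduling."""

    def __init__(self, params, lr, momentum=0.9):
        super().__init__(params, dict(lr=lr, momentum=momentum))

    @torch.no_grad()
    def step(self):
        for group in self.param_groups:
            lr, mu = group["lr"], group["momentum"]
            for p in group["params"]:
                if p.grad is None:
                    continue
                buf = self.state[p].setdefault("velocity", torch.zeros_like(p))
                # The one-line difference: lr multiplies the gradient as it
                # enters the buffer, instead of multiplying the whole buffer
                # at parameter-update time.
                buf.mul_(mu).add_(p.grad, alpha=lr)
                p.sub_(buf)
```

Used as a drop-in for `torch.optim.SGD`: construct it with `SutskeverSGD(model.parameters(), lr=0.1, momentum=0.9)` and call `opt.step()` after `loss.backward()` as usual; any LR scheduler now only affects how much of each new gradient enters the velocity.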