AdaBelief Optimizer: fast as Adam, generalizes as well as SGD | by Kaustubh Mhaisekar | Dec, 2020
Now that you know how Adam works, let’s look at AdaBelief.
As you can see, the AdaBelief optimizer is extremely similar to the Adam optimizer, with one slight difference. Here, instead of using v-t, the EMA of the squared gradient, we have a new quantity, s-t:
And this s-t replaces v-t to form this update direction:
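Written out in symbols, the two optimizers can be put side by side as follows. This is a sketch in the usual Adam notation, which is assumed here rather than taken from the text above: g_t is the gradient, α the learning rate, β1 and β2 the decay rates, ε a small constant, and m_t, v_t, s_t correspond to m-t, v-t, s-t.

```latex
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1)\, g_t \\
\text{Adam:}\quad v_t &= \beta_2 v_{t-1} + (1-\beta_2)\, g_t^2,
  & \theta_t &= \theta_{t-1} - \alpha\, \frac{\hat m_t}{\sqrt{\hat v_t} + \epsilon} \\
\text{AdaBelief:}\quad s_t &= \beta_2 s_{t-1} + (1-\beta_2)\, (g_t - m_t)^2 + \epsilon,
  & \theta_t &= \theta_{t-1} - \alpha\, \frac{\hat m_t}{\sqrt{\hat s_t} + \epsilon} \\
\hat m_t &= \frac{m_t}{1-\beta_1^t}, \qquad
\hat v_t = \frac{v_t}{1-\beta_2^t}, \qquad
\hat s_t = \frac{s_t}{1-\beta_2^t}
\end{aligned}
```

The only change is the quantity under the square root: g_t² for Adam, (g_t – m_t)² for AdaBelief.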
Let us now see what difference this one change makes and how it affects the performance of the optimizer.
s-t is defined as the EMA of (g-t – m-t)², that is, the squared difference between the gradient and the EMA of the gradient (m-t). Since s-t sits in the denominator of the update, AdaBelief takes a large step when the gradient is close to its EMA, and a small step when the two differ widely.
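To make the definition concrete, here is a minimal sketch of a single AdaBelief update on one scalar parameter. The function name `adabelief_step`, its argument layout, and the default hyperparameters (the usual Adam-style values) are assumptions made for illustration, not code from any particular library.

```python
import math


def adabelief_step(param, grad, m, s, t, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """Apply one AdaBelief update to a scalar parameter (illustrative sketch).

    `m` and `s` are the running averages from the previous step and
    `t` is the 1-based step count. Returns the new (param, m, s).
    """
    m = beta1 * m + (1 - beta1) * grad                   # EMA of the gradient (m-t)
    s = beta2 * s + (1 - beta2) * (grad - m) ** 2 + eps  # EMA of (g-t - m-t)^2 (s-t)
    m_hat = m / (1 - beta1 ** t)                         # bias correction
    s_hat = s / (1 - beta2 ** t)
    param = param - lr * m_hat / (math.sqrt(s_hat) + eps)
    return param, m, s


# First step from theta = 1.0 with gradient 2.0 (values chosen only for illustration).
theta, m, s = adabelief_step(1.0, 2.0, m=0.0, s=0.0, t=1, lr=0.1)
print(theta, m, s)
```

Replacing `(grad - m) ** 2` with `grad ** 2` (and dropping the `+ eps` inside the average) would turn this sketch into the corresponding Adam step.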
Let’s look at this graph to better understand AdaBelief’s advantage over Adam:
In the given graph, look at region 3:
In region 3, the value of g-t is going to be big as the curve is really steep in that area. The value of v-t is going to be big as well, and thus if we used Adam here, the step size in this region is going to be really small as v-t is in the denominator.
But in AdaBelief, we calculate s-t as the moving average of the squared difference between the gradient and its moving average. The gradient changes slowly in region 3, so the gradient and its moving average are really close and s-t is going to be really small. Since s-t is in the denominator, AdaBelief ends up taking big steps in this region, as an ideal optimizer should.
AdaBelief therefore handles the “large gradient, small curvature” case well, while Adam does not.
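A quick numerical check of this argument, as a sketch: we feed both optimizers the same large, constant gradient (a stand-in for region 3, chosen here for illustration) and compare the update size per unit learning rate. The helper name `update_per_unit_lr` and the hyperparameter values are assumptions.

```python
import math


def update_per_unit_lr(grads, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return (adam, adabelief) update sizes per unit learning rate
    after consuming the gradient sequence `grads` (illustrative sketch)."""
    m = v = s = 0.0
    t = 0
    for g in grads:
        t += 1
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g                # Adam: EMA of g^2
        s = beta2 * s + (1 - beta2) * (g - m) ** 2 + eps   # AdaBelief: EMA of (g - m)^2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    s_hat = s / (1 - beta2 ** t)
    adam = m_hat / (math.sqrt(v_hat) + eps)
    adabelief = m_hat / (math.sqrt(s_hat) + eps)
    return adam, adabelief


# Large gradient that does not change from step to step.
adam_step, adabelief_step_size = update_per_unit_lr([10.0] * 500)
print(adam_step, adabelief_step_size)
```

Under these assumptions Adam's update stays at about one learning rate per step no matter how large the gradient is, while AdaBelief's update is several times larger because s-t keeps shrinking as m-t catches up with g-t.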
Also note that in the graph regions 1 and 2 can be used to demonstrate the advantage of AdaBelief and Adam over optimizers such as momentum or SGD in the following way:
- Region 1: The curve is very flat and the gradient is almost 0, so we would ideally want large steps here. With momentum or SGD, where the learning rate is multiplied by the gradient (or its moving average), the steps come out small, while Adam and AdaBelief take large steps because they divide by the square root of a moving average that is also small.
- Region 2: The curve is very steep and the gradient is large, so we would ideally want small steps here. With momentum or SGD, multiplying by the large gradient (or its moving average) gives large update steps, while Adam and AdaBelief divide by the square root of a large moving average, which results in smaller steps.
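The contrast between the two families can be sketched numerically. The example below compares plain SGD with Adam only (AdaBelief is left out to keep it short), using a tiny constant gradient as a stand-in for region 1 and a huge one for region 2. The gradient values, the learning rate, and the function names are all assumptions chosen for illustration.

```python
import math


def sgd_step(g, lr=0.01):
    """Plain SGD: the step is the learning rate times the gradient."""
    return lr * g


def adam_step_constant_gradient(g, steps=100, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8):
    """Adam step size after seeing the same gradient `g` for `steps` steps."""
    m = v = 0.0
    for _ in range(steps):
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
    m_hat = m / (1 - beta1 ** steps)
    v_hat = v / (1 - beta2 ** steps)
    return lr * m_hat / (math.sqrt(v_hat) + eps)


flat, steep = 0.001, 100.0  # hypothetical gradients for regions 1 and 2
print("region 1:", sgd_step(flat), adam_step_constant_gradient(flat))
print("region 2:", sgd_step(steep), adam_step_constant_gradient(steep))
```

SGD's step scales directly with the gradient, so it is tiny in the flat region and very large in the steep one, whereas the division by the square root of v-t keeps Adam's step close to the learning rate in both.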
Let us gain some more intuition with a 2D example. Consider the loss function f(x,y) = |x| + |y|:
Here the blue arrows represent the gradients and the x on the right side is the optimal point. As you can see, the gradient in the x direction is always 1, while in the y direction it keeps oscillating between 1 and -1.
So in Adam, v-t for x and y directions will always be equal to 1 as it considers only the amplitude of the gradient and not the sign. Hence Adam will take the same sized steps in both x and y directions.
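But in AdaBelief, both the amplitude and the sign of the gradient are considered. In the x direction the gradient always equals its EMA, so s-t shrinks toward 0; in the y direction the gradient keeps flipping sign, so its EMA stays near 0 and s-t stays close to 1. AdaBelief therefore takes much larger steps in the x direction than in the y direction.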
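We can reproduce this 2D example with a few lines of code. The sketch below feeds a constant gradient of 1 (the x direction) and a gradient alternating between 1 and -1 (the y direction) into both running averages and compares the bias-corrected results. The helper name `second_moments`, the number of steps, and the hyperparameters are assumptions.

```python
def second_moments(grads, beta1=0.9, beta2=0.999, eps=1e-8):
    """Return bias-corrected (v_hat, s_hat) after consuming `grads`.

    v_hat is Adam's EMA of g^2, s_hat is AdaBelief's EMA of (g - m)^2.
    """
    m = v = s = 0.0
    steps = 0
    for g in grads:
        steps += 1
        m = beta1 * m + (1 - beta1) * g
        v = beta2 * v + (1 - beta2) * g * g
        s = beta2 * s + (1 - beta2) * (g - m) ** 2 + eps
    correction = 1 - beta2 ** steps
    return v / correction, s / correction


n = 2000
grads_x = [1.0] * n                                         # always +1
grads_y = [1.0 if t % 2 == 0 else -1.0 for t in range(n)]   # +1, -1, +1, ...

v_x, s_x = second_moments(grads_x)
v_y, s_y = second_moments(grads_y)
print("Adam      v-t:", v_x, v_y)
print("AdaBelief s-t:", s_x, s_y)
```

Adam's v-t comes out essentially equal to 1 in both directions, while AdaBelief's s-t is close to 0 in x and close to 1 in y, which is exactly the asymmetry described above.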