Gradient Descent Unraveled. Understanding how gradient descent… | by Manpreet Singh Minhas | Nov, 2020
Understanding how gradient descent optimization works right from the basics
In the current era of Deep Learning, you might have heard the term gradient descent before. If you didn’t understand what it is and how it works, this post is for you. In this post, I’ll be explaining what is it and how it works.
Maxima vs Minima and Global vs Local
First, let us begin with the concepts of maxima, minima, global and local.
I’ll explain these concepts for functions of a single variable because they are easy to visualize. However, they extend to multivariate cases.
Let us start with a few definitions. 
- Global Maximum: A real-valued function f defined on a domain X has a global (or absolute) maximum point at x∗ if f(x∗) ≥ f(x) for all x in X.
- Global Minimum: A real-valued function f defined on a domain X has a global (or absolute) maximum point at x∗ if f(x∗) ≤ f(x) for all x in X.
- Local Maximum: If the domain X is a metric space then f is said to have a local (or relative) maximum point at the point x∗ if there exists some ε > 0 such that f(x∗) ≥ f(x) for all x in X within distanceε of x∗.
- Local Minimum: If the domain X is a metric space then f is said to have a local (or relative) maximum point at the point x∗ if there exists some ε > 0 such that f(x∗) ≤ f(x) for all x in X within distanceε of x∗.
Graphics tend to make the concepts easier to understand. I’ve summarized these four type of points in the following figure.
As the name suggests minimum is the lowest value in a set and maximum is the highest value. Global means it is true for the entire set and local means it is true in some vicinity. A function can have multiple local maxima and minima. However there can be only one global maximum as well as minimum. Note that for Figures (a) and (b) the function domain is restricted to the values you are seeing. If it were to be infinite then there is no global minimum for the graph in Figure (a).
Now that we understand these concepts, the next step is how to find these extremum points.
Turns out derivatives in calculus are useful for finding these points. I won’t be going into the details of derivatives. However, I’ll explain enough to understand the following discussion.
Derivative gives you a rate of change of something with respect to something. For example, how quickly a medicine would be absorbed by your system can be modeled and analysed using calculus.
Now, let us understand what is a critical point.
So we know that at these critical points there will be a either a local or global maximum or minimum. The next step is to identify which category it belongs to.
You can use either of the two tests i.e. the first and second derivative test to classify the maximum and minimum values. When I was in my high school I used to find the second derivative test faster since I’d calculate only one value (without using a calculator). I’ll show you one example of how it is actually done.
For finding whether the point is global you’ll have to evaluate the function at all the critical points and see which point is the lowest. In our examples, we have seen a polynomial function. It is smooth and differentiable. There were limited points to test for and evaluating the function is easy if you have the equation.
However, now let us move to the real world. We never know the actual equation of the real life processes that we deal with. Additionally, there are several variables involved in the equation. These tests won’t be useful in those cases. For training a neural network you need to minimize the loss with respect to the network parameters. This is a multi-dimensional surface and multiple factors come into play. And the tests I discussed above won’t be effective. So we turn to optimization for this task.
Optimization for finding minima/maxima of a function
What is optimization?
Maximizing or minimizing some function relative to some set, often representing a range of choices available in a certain situation. The function allows a comparison of the different choices for determining which might be “best.”
Common applications: Minimal cost, maximal profit, minimal error, optimal design, optimal management, variational principles.
In mathematics, computer science and operations research, mathematical optimization or mathematical programming is the selection of the best element (with regard to some criterion) from some set of available alternatives.
Optimization is a vast ocean in itself and is extremely interesting. In the context of deep learning the optimization objective is to minimize the cost function with respect to the model parameters i.e. the weight matrices.
What is a Gradient?
Gradient: In vector calculus, the gradient is the multi-variable generalization of the derivative. The gradient of a scalar function f(x₁, x₂, x₃, …., xₙ) [hereafter referred to as f] is denoted by ∇ f, where ∇ (the nabla symbol) is known as the del operator. It packages all the partial derivatives information into a vector.
∇ f is a vector valued function and points in the direction of steepest ascent. Imagine if you are standing a some point in the input space of f, the gradient vector tells you which direction you should travel, to increase the value of f rapidly.
If you are interested in learning more. Khan academy is a wonderful place to learn about gradients and mathematics in general.
This is a type of optimization technique where you move towards the minimum in an iterative manner.
The steps involved are as follows:
- Guess a point (randomly or through some technique).
- Choose a step size which in deep learning is called the learning rate ρ and is also one of the most important hyper-parameter in deep learning.
- Calculate the gradient of the function
- Move in the opposite direction by subtracting this from the initial guess. This is because we want to descend and gradient gives you steepest ascent direction.
- Repeat steps 3 and 4 for n times or until a stop criterion is reached
In equation form for deep learning applications it can be written as:
ρ is the learning rate and governs how much you dampen the gradient value while taking the step.
β is a network parameter.
One more point to mention here is that unless the function is convex the algorithm can get stuck at a local minimum instead of converging to the global minimum.
In mathematics, a real-valued function defined on an n-dimensional interval is called convex (or convex downward or concave upward) if the line segment between any two points on the graph of the function lies above or on the graph. This means that the function itself has one minimum for the strictly convex case.
In summary, the gradient descent is an optimization method that finds the minimum of an objective function by incrementally updating its parameters in the negative direction of the gradient of the function which is the direction of steepest descent.
Gradient Descent 1D Example
Let us look at few examples. We will start with a 1D case, since it is easiest to visualize. The equation used is the same as the one I used earlier in the maxima and minima section.
I’ll show how to manually implement the code in python using numpy and matplotlib.
The graphs of various combinations of the start point and learning rates are shown in the following figure.
The equation under consideration has two minima. We want to find the global minimum using the gradient descent algorithm. I’ll be explaining the figures filled by row. The number of iterations is kept as 10 for all the examples. I’ve varied the learning rate and the initial start point to show you how it affects the algorithm. First five plots show that if start point and the number of iterations is kept the same, larger the learning rate faster the algorithm descends. But in those cases, the algorithm got stuck at a local minimum. In plot seven we see that due to the large learning rate coupled with the steep slope at the start point the algorithm took a large jump to the global minimum.
Even though large learning rates can cause faster convergence, there are chances of your algorithm moving away from the minimum because of the large effective step size. Take a look at the last figure, the starting point enabled the algorithm to go to the global minimum but the large learning rate caused the algorithm to diverge. One more case is shown in the second last figure. The point was initialized at a critical point which has a zero derivative at that point. No matter how large your learning rate or how many steps you take the algorithm won’t move from that point.
From the above examples, we see that learning rate has to be chosen such that it is not too low so that the gradient is over-damped or too high for the gradient to blow up the step size. The initialization of the start point is crucial can get you the global minimum or the local. The number of steps to be taken also matter.
Gradient Descent 1D Example
Now let us walk through a few 2D examples.
The first examples are of a paraboloid and its equation is as follows.
I’ll be using sympy for this tutorial. This allows you to do symbolic computations on equations. Even if you don’t know calculus, you can use the code to try out different equations and experiment around.
Following code creates a function f1 using the symbolic representation of sympy and calculates the partial derivatives w.r.t. x and y to calculate the gradient. Which is then used to perform gradient descent by updating the x and y values using the appropriate components.
The following figure shows two views of the 3D plot.
One thing immediately evident from the plot is as the algorithm reaches near the minimum the gradient value decreases and therefore the step size decreases. As I mentioned at the start of the post, the technique remains the same for n dimensional case. The observations that we draw in lower dimensions can be almost always extended to the higher dimensional case with little or no modifications.
To give you a flavor for different surfaces, following are two more examples of 2D gradient descent. As discussed earlier the initialization matters and will ultimately decide where you end up at the end.
Python utility to try out different gradient descent parameters interactively
I’ve made a python utility that lets you play around with the learning rate, a number of iterations and initial start point to see how these affect the algorithm performance for the 1D example. The code for this, as well as the 2D function plots, is available at the following repository.
So far we’ve seen the algorithm and taken 1D and 2D examples to analyse how their choice affects the convergence.
Gradient Descent in Deep Learning 
In the context of deep learning gradient descent can be classified into following categories.
- Stochastic Gradient Descent
- Mini-batch Gradient Descent
- Batch Gradient Descent
Stochastic Gradient Descent: Suppose you want to minimize an objective function that is written as a sum of differentiable functions.
Each term Qᵢ is usually associated with the iᵗʰ data point.
Standard gradient descent (batch gradient descent) is given as follows.
where η is the learning rate (step size).
Stochastic Gradient Descent (SGD) considers only a subset of summand functions at every iteration. This can be effective for large-scale problems. The gradient of Q(w) is approximated by a gradient at a single example:
This update needs to be done for each training example. Several passes might be necessary over the training set until the algorithm converges.
Pseudo-code for SGD:
- Choose an initial value of w and η.
- Repeat until converged.
- Randomly shuffle data points in the training set.
- For i = 1,2,3,…,n, do:
- w = w − η ∇ Qᵢ(w)
Let us consider the following equation:
The objective function is:
The update rule for SGD then is as follows.
Mini Batch Gradient Descent:
- Batch gradient descent uses all n data points in each iteration.
- Stochastic gradient descent uses 1 data point in each iteration.
- Mini-batch gradient descent uses b data points in each iteration. b is a parameter called mini-batch size.
Pseudo-code for Mini-Batch Gradient Descent:
- Choose an initial value of w and η.
- Choose for example b = 10
- Repeat until converged.
- Randomly shuffle data points in the training set.
- For i = 1,11,21,…,n-9, do:
Discussion on the three types of Gradient Descent
Few points on the three variants.
- SGD is noisier than the other two variants because it is updated per example.
- SGD is often called an online machine learning algorithm because it updates the model for each training example.
- SGD is computationally more expensive than the other variants because it per example.
- SGD is jittery and causes convergence to a minimum difficult.
- Mini-batch gradient descent allows you to take the advantage of your GPU and process multiple examples in one go therefore reducing the training time.
- Batch gradient descent is practically infeasible unless you have a small data-set which can fit into the available memory.
Choice of mini-batch size parameter b depends on your GPU, data-set and the model. A rule of thumb is to choose a size in multiples of 8. Generally, 32 is a good number to start with. If you choose your number too high or low then the algorithm can become slower. In the former case, computations could become slower because you are putting a lot of load on the GPU. While in the latter case, lower mini-batch size underutilizes your GPU. Finding the right balance for your case is important.
Tuning the learning rate η:
If η is too high, the algorithm diverges. If η is too low, it makes the algorithm slow to converge.
A common practice is to make ηₙ a decreasing function of iteration number n. Following are two examples.
- ηₙ = k/(n+c) , where k and c are constants.
- ηₙ = η₀ e⁻ᵏⁿ, where η₀ is the initial learning rate and k decides the steepness of the exponential decay.
One can use interval based learning rate schedule too. For example:
where n is the iteration number.
The first iterations cause large changes in w, while the later ones do only fine tuning.
There is something called as cyclical learning rates where you use a periodic function for scheduling. You can read more about it here.
This wraps up the discussion for Gradient Descent.
- Gradient descent is an optimization algorithm that is used in deep learning to minimize the cost function w.r.t. the model parameters.
- It does not guarantee convergence to the global minimum.
- The convergence depends on the start point, learning rate and number of iterations.
- In practice, mini-batch gradient descent is used with a batch size value of 32 (this can vary depending on your application).
Hope this was useful. Your suggestions, feedback and comments are welcome.
 University of Waterloo STAT 946 Deep Learning Course https://www.youtube.com/playlist?list=PLEAYkSg4uSQ0MAdcjYsKFEo3guPTuUH66