## Understanding how gradient descent optimization works right from the basics

In the current era of deep learning, you have probably come across the term gradient descent. If you are not sure what it is or how it works, this post is for you: I’ll explain it from the ground up.

## Maxima vs Minima and Global vs Local

First, let us begin with the concepts of maxima, minima, global and local.

I’ll explain these concepts for functions of a single variable because they are easy to visualize. However, they extend to multivariate cases.

Let us start with a few definitions. 

• Global Maximum: A real-valued function f defined on a domain X has a global (or absolute) maximum point at x∗ if f(x∗) ≥ f(x) for all x in X.
• Global Minimum: A real-valued function f defined on a domain X has a global (or absolute) minimum point at x∗ if f(x∗) ≤ f(x) for all x in X.
• Local Maximum: If the domain X is a metric space, then f is said to have a local (or relative) maximum point at x∗ if there exists some ε > 0 such that f(x∗) ≥ f(x) for all x in X within distance ε of x∗.
• Local Minimum: If the domain X is a metric space, then f is said to have a local (or relative) minimum point at x∗ if there exists some ε > 0 such that f(x∗) ≤ f(x) for all x in X within distance ε of x∗.
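These definitions can be checked numerically. Here is a minimal sketch (the function and grid below are illustrative, not from the post) that samples a polynomial on a finite domain and locates its local and global extrema by comparing each point with its neighbors:

```python
import numpy as np

# Sample f(x) = x**4 - 3*x**2 + x on a finite domain (illustrative choice).
x = np.linspace(-2.5, 2.5, 10001)
f = x**4 - 3*x**2 + x

# An interior grid point is a local minimum if it is lower than both
# neighbors, and a local maximum if it is higher than both neighbors.
local_min = np.where((f[1:-1] < f[:-2]) & (f[1:-1] < f[2:]))[0] + 1
local_max = np.where((f[1:-1] > f[:-2]) & (f[1:-1] > f[2:]))[0] + 1

print("local minima near x =", x[local_min])
print("local maxima near x =", x[local_max])
# The global minimum on this restricted domain is simply the lowest sample.
print("global minimum near x =", x[np.argmin(f)])
```

This function has two local minima (near x ≈ −1.30 and x ≈ 1.14) and one local maximum; the lower of the two minima is the global minimum on this domain.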

Graphics tend to make the concepts easier to understand. I’ve summarized these four types of points in the following figure.

As the names suggest, a minimum is the lowest value in a set and a maximum is the highest value. Global means it holds over the entire set, and local means it holds in some vicinity. A function can have multiple local maxima and minima. However, it can have only one global maximum value and one global minimum value. Note that for Figures (a) and (b) the function’s domain is restricted to the values shown. If the domain were infinite, the graph in Figure (a) would have no global minimum.

Now that we understand these concepts, the next step is to find these extremum points.

It turns out that derivatives from calculus are useful for finding such points. I won’t go into the full details of derivatives, but I’ll explain enough to follow the discussion.

A derivative gives the rate of change of one quantity with respect to another. For example, how quickly a medicine is absorbed by your system can be modeled and analyzed using calculus.

Now, let us understand what a critical point is.

A critical point of a function is a point in its domain where the derivative is zero or does not exist. For a differentiable function, local and global maxima and minima can occur only at critical points (or at the boundary of the domain), although a critical point may also be neither, for example an inflection point. So the next step is to identify which category a given critical point belongs to.

You can use either of two tests, the first derivative test or the second derivative test, to classify the maximum and minimum values. When I was in high school I found the second derivative test faster, since it requires evaluating only one value (without using a calculator). I’ll show you an example of how it is done.

To determine whether a point is the global minimum, you evaluate the function at all the critical points and see which value is the lowest. In our examples we have dealt with a polynomial function: it is smooth and differentiable, there were only a few points to test, and evaluating the function is easy when you have the equation.
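The whole procedure can be sketched in a few lines. This is a hedged illustration, not the post’s original worked example: the polynomial f(x) = x³ − 3x is my own choice, its critical points are the roots of f′(x) = 3x² − 3 = 0, and the second derivative test (approximated here with central differences) classifies each one:

```python
def fprime(f, x, h=1e-5):
    """Central-difference approximation of f'(x)."""
    return (f(x + h) - f(x - h)) / (2 * h)

def fsecond(f, x, h=1e-4):
    """Central-difference approximation of f''(x)."""
    return (f(x + h) - 2 * f(x) + f(x - h)) / h**2

f = lambda x: x**3 - 3*x   # illustrative polynomial

for c in (-1.0, 1.0):      # roots of f'(x) = 3x**2 - 3 = 0
    curv = fsecond(f, c)   # second derivative test: sign of f''
    kind = "local minimum" if curv > 0 else "local maximum"
    print(f"x = {c}: f'' = {curv:.2f} -> {kind}, f(x) = {f(c)}")
```

Here f″(1) = 6 > 0, so x = 1 is a local minimum, and f″(−1) = −6 < 0, so x = −1 is a local maximum; comparing f at the critical points then tells you which extremum is global on a bounded domain.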

However, now let us move to the real world. We rarely know the actual equation of the real-life processes we deal with, and there are usually many variables involved. The tests above are not practical in such cases. To train a neural network you need to minimize the loss with respect to the network parameters; the loss is a high-dimensional surface, and multiple factors come into play, so the tests I discussed above are not effective. Instead, we turn to optimization for this task.

## Optimization for finding minima/maxima of a function

What is optimization?

Maximizing or minimizing some function relative to some set, often representing a range of choices available in a certain situation. The function allows a comparison of the different choices for determining which might be “best.”

Common applications: Minimal cost, maximal profit, minimal error, optimal design, optimal management, variational principles.

In mathematics, computer science and operations research, mathematical optimization or mathematical programming is the selection of the best element (with regard to some criterion) from some set of available alternatives.

Optimization is a vast ocean in itself and is extremely interesting. In the context of deep learning the optimization objective is to minimize the cost function with respect to the model parameters i.e. the weight matrices.
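To make the deep-learning objective concrete, here is a minimal, purely illustrative sketch of minimizing a cost function with respect to a model parameter: a one-weight linear model y = w·x fit to toy data (generated with w = 2) by repeatedly stepping against the gradient of the mean squared error. The data, learning rate, and iteration count are all assumptions for illustration:

```python
# Toy data generated from y = 2 * x (illustrative).
xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]

w, lr = 0.0, 0.01  # initial weight and learning rate (illustrative values)
for _ in range(500):
    # For L = (1/n) * sum((w*x - y)**2), the derivative is
    # dL/dw = (2/n) * sum((w*x - y) * x).
    grad = 2 * sum((w * x - y) * x for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad  # step opposite the gradient to decrease the cost

print(f"learned w = {w:.4f}")  # approaches 2.0
```

A real network does exactly this, just with millions of parameters (the weight matrices) instead of a single scalar w.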

## What is a Gradient?

Gradient: In vector calculus, the gradient is the multi-variable generalization of the derivative. The gradient of a scalar function f(x₁, x₂, x₃, …, xₙ) [hereafter referred to as f] is denoted by ∇f, where ∇ (the nabla symbol) is known as the del operator. It packages all the partial-derivative information into a vector.
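The “vector of partial derivatives” idea can be sketched numerically: approximate each partial derivative with a central difference, one coordinate at a time. The helper name and the test function below are my own illustrations, not from the post:

```python
def grad(f, point, h=1e-5):
    """Approximate the gradient of f at a point via central differences:
    one partial derivative per coordinate, collected into a vector."""
    g = []
    for i in range(len(point)):
        plus, minus = list(point), list(point)
        plus[i] += h
        minus[i] -= h
        g.append((f(plus) - f(minus)) / (2 * h))
    return g

# f(x1, x2) = x1**2 + 3*x1*x2, so ∇f = (2*x1 + 3*x2, 3*x1).
f = lambda p: p[0]**2 + 3 * p[0] * p[1]
print(grad(f, [1.0, 2.0]))  # analytically, ∇f(1, 2) = (8, 3)
```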
