Linear regression and gradient descent for absolute beginners | by Lily Chen | Nov, 2020
Gradient Descent Algorithm
In machine learning terminology, the sum of squared error is called the “cost”. This cost equation is:
This equation is therefore roughly “sum of squared errors” as it computes the sum of predicted value minus actual value squared.
1/2mis to “average” the squared error over the number of data points so that the number of data points doesn’t affect the function. See this explanation for why we divide by 2.
In gradient descent, the goal is to minimize the cost function. We do this by trying different values of slope and intercept. But which values to try and how do you go about changing those values?
We change their values according to the gradient descent formula, which comes from taking the partial derivative of the cost function. The exact math can be found in this link.
By taking the partial derivative, you arrive at the formula:
This formula computes by how much you change your theta with each iteration.
The alpha (α) is called the learning rate. The learning rate determines how big the step would be on each iteration. It’s critical to have a good learning rate because if it’s too large your algorithm will not arrive at the minimum, and if it’s too small, your algorithm will take forever to get there. For my example, I picked the alpha to be 0.001
To summarize, the steps are:
- Estimate θ
- Compute cost
- Tweak θ
- Repeat 2 and 3 until you reach convergence.
Here’s my implementation for simple linear regression using gradient descent.
I started at 0,0 for both the slope and intercept.
Note: In machine learning, we use theta to represent the vector [y-intercept, slope]. Theta0 = y-intercept. Theta1=slope. That’s why you see theta as variable name in the implementation below.
Using this algorithm and the dataset above for mothers and daughters heights, I got to a cost of 3.4 after 500 iterations.
The equation after 500 iterations is
y = 0.998x + 0.078. The actual regression line is
y = 1.2x -12.87 with cost of approximately 3.1.
With an estimate of [0,0] as initial value for [y-intercept, slope], it’s impractical to get to
y = 1.2x -12.87 . To get close to that without tons and tons of iterations, you’d have to start with a better estimate.
For example, [-10, 1] will get roughly
y = 1.153x — 10 and cost of 3.1 after less than 10 iterations.
Adjusting parameters like learning rate and starting estimate is commonplace in the world of machine learning.
There it is, the gist of gradient descent in linear regression.
Gradient descent is an algorithm that approaches the least squared regression line via minimizing sum of squared errors through multiple iterations.