Coding Linear Regression from Scratch | by Kumud Lakara | Jan, 2021


This post follows the linear regression post in the ‘Basics and Beyond’ series, so if you are just getting started with machine learning I would recommend going through that post first and then coming back to this tutorial. If you already have an idea of what linear regression is, let’s get started!

In this post we will be coding the entire linear regression algorithm from absolute scratch using Python, so we will really be getting our hands dirty today!

Let’s go!

The first step for any machine learning problem is getting the data. There is no machine “learning” if there is nothing to “learn” from. So for this tutorial we will be using a very common dataset for linear regression i.e. the house-price prediction dataset. The dataset can be found here.

This is a simple dataset containing housing prices in Portland, Oregon. The first column is the size of the house (in square feet), the second column is the number of bedrooms, and the third column is the price of the house. You might have noticed that we have more than one feature in our dataset (i.e. the house_size(in sqft) and the number of rooms) hence we will be looking at multivariate linear regression and the label (y) will be the house price as that is what we are going to be predicting.

Let’s define the function for loading the dataset:

We will be calling the above function later to load the dataset. This function returns x and y. Note that x is made up of the first two columns of the dataset, whereas y is the last column, as that is the price column; hence the function returns data[:,:2] and data[:,-1] respectively.

Normalize the data

The above code not only loads the data but also normalizes it and plots the data points. We will look at the plot of the data in a bit, but first let’s understand what the normalize(data) line is doing above. If you look at the raw dataset you will notice that the values in the second column (the number of rooms) are much smaller than those in the first (the house size). Our model does not evaluate this data as number of rooms or size of house; to the model it’s all just numbers. This can create an unwanted bias in your machine learning model towards the columns (or features) that have higher numerical values than the others. It can also create an imbalance in the variance and the mean of the features. For these reasons, and also to make training easier, it is always advised to scale or normalize your features so that they all lie within the same range (e.g. [-1, 1] or [0, 1]). So for our purpose here we will be using feature normalization, which in the mathematical sense means:

z = (x − μ) / σ

μ : mean

σ : standard deviation

In the above formula, z is our normalized feature and x is the non-normalized feature. Don’t worry if you are not very familiar with these mathematical concepts; a quick review should get you going. Alright, now that we have our normalization formula, let’s make a function for normalization:
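The original code isn’t shown here; a sketch of a column-wise normalization function, directly translating the formula above, could be:

```python
import numpy as np

def normalize(data):
    # Apply z = (x - mu) / sigma to every column so each feature
    # ends up with zero mean and unit standard deviation.
    for i in range(data.shape[1]):
        data[:, i] = (data[:, i] - np.mean(data[:, i])) / np.std(data[:, i])
    return data
```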

This code does exactly what we have discussed. It goes through each column and normalizes all data elements of that column using the mean and standard deviation of those elements.

Plot the data

Now before we jump to coding our linear regression model one thing we need to ask is WHY?

Why are we solving this problem using linear regression? This is a very valid question, and before jumping to any concrete code you should be clear about which algorithm you want to use and whether it really is the best option given the dataset and the problem you are trying to solve. One way we can see why linear regression will work for our current dataset is by plotting it. For that purpose we called the plot_data function in load_data above. Let’s define the plot_data function now:
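A minimal sketch of such a plotting function (the axis labels are assumptions based on the plot described below):

```python
import matplotlib.pyplot as plt

def plot_data(x, y):
    # Scatter-plot house size against price to eyeball whether
    # a straight line is a reasonable fit for the data.
    plt.scatter(x, y)
    plt.xlabel('house size (sqft)')
    plt.ylabel('price')
    plt.show()
```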

This function on being called generates the following plot:

Plot of house size vs house price (source: image by the author)

You can see that it is possible to roughly fit a line through the above plot. This means a linear approximation will actually allow us to make pretty accurate predictions and hence we go for linear regression.

Well, now that we have the data ready, let’s move on to the fun part: coding the algorithm!

First of all we need to define what our hypothesis function looks like because we will be using this hypothesis for calculating the cost later on. We know for linear regression our hypothesis is:

hθ(x) = θ0 + θ1x1 + θ2x2 + θ3x3 + … + θnxn

Our dataset however has only 2 features, so for our current problem the hypothesis is:

hθ(x) = θ0 + θ1x1 + θ2x2

where x1 and x2 are the two features (i.e. size of house and number of rooms). Let’s put this in a simple Python function which returns the hypothesis:
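A one-line sketch of that function, assuming x already carries a leading column of ones (something the post explains in detail later):

```python
import numpy as np

def hypothesis(theta, x):
    # With a leading column of ones in x, x @ theta evaluates
    # theta0 + theta1*x1 + theta2*x2 for every example at once.
    return np.matmul(x, theta)
```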

Woah what’s with the matrix multiplication?! Don’t worry it still gives us the same hypothesis equation and we will take a deeper look into why this is mathematically correct later in this post.

Okay so now we have the hypothesis function, the next important thing is the cost function.

To evaluate the quality of our model we make use of the cost function. Again this post is the exact “code version” of:

So you can go through it if anything here doesn’t make sense, or just follow along with both posts. Alright, so the equation for the cost function is:

J(θ) = (1/2m) Σi=1..m (hθ(x(i)) − y(i))²

source: holehouse

and the code for our cost function is:
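The snippet isn’t reproduced here; a direct translation of the squared-error cost above might be:

```python
import numpy as np

def cost_function(x, y, theta):
    # J(theta) = (1 / 2m) * sum((h(x) - y)^2)
    m = x.shape[0]
    h = np.matmul(x, theta)
    return (1 / (2 * m)) * np.sum((h - y) ** 2)
```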

On a closer look you will probably notice that all the python functions we have defined so far are exactly the same as the mathematics we had defined earlier for linear regression. Now that we have the cost we must minimize it and for that we use… yes gradient descent indeed!

Gradient descent in our context is an optimization algorithm that aims to adjust the parameters in order to minimize the cost function.

The main update step for gradient descent is:

θj := θj − α · ∂J(θ)/∂θj = θj − (α/m) Σi=1..m (hθ(x(i)) − y(i)) xj(i)

source: holehouse

So we multiply the derivative of the cost function by the learning rate (α) and subtract it from the present value of the parameters (θ) to get the new updated parameters (θ).
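Put together, a sketch of gradient descent matching the description below (it takes theta, a learning rate and a number of epochs, and returns the updated theta along with a cost history) could look like:

```python
import numpy as np

def gradient_descent(x, y, theta, learning_rate, num_epochs):
    # Apply theta := theta - (alpha / m) * X^T (X theta - y) once per
    # epoch, recording the cost after every update.
    m = x.shape[0]
    J_all = []
    for _ in range(num_epochs):
        h = np.matmul(x, theta)
        theta = theta - (learning_rate / m) * np.matmul(x.T, h - y)
        J_all.append((1 / (2 * m)) * np.sum((h - y) ** 2))
    return theta, J_all
```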

The gradient_descent function returns theta and J_all. theta is obviously our parameter vector which contains the values of θs for the hypothesis and J_all is a list containing the cost function after each epoch. The J_all variable isn’t exactly essential but it helps to analyze the model better as you will see later in the post.

Now all that’s left to do is call our functions in the correct order:

We first call the load_data function to load the x and y values. x contains the training examples and y contains the labels (the house prices in our case).

You might have noticed that throughout the code we have been using matrix multiplication to achieve the expressions we want. For example, to compute the hypothesis we had to multiply each parameter (θ) with each feature (x). We could use for loops for this, looping over each example and performing the multiplication each time, but this would not be the most efficient method if we had, say, 10 million training examples. A more efficient approach here is matrix multiplication. If you aren’t very familiar with matrix multiplication I would suggest you go over it once; it’s fairly simple. For our dataset we have two features (the house size and the number of rooms), so we will have 2 + 1 = 3 parameters. The extra parameter θ0 accounts for the intercept: graphically, the hypothesis is nothing but a line, and θ0 shifts this line up or down so it can fit the data as required.

Plot of a favorable hypothesis function (source: image by the author)

Okay, so we have 3 parameters and 2 features. This means our θ or parameter vector will have the dimensions (3, 1), but our feature vector will have the dimensions (46, 2) according to our dataset. You have probably noticed by now that it’s not mathematically possible to multiply these two matrices. Let’s take a look at our hypothesis once again:

hθ(x) = θ0 + θ1x1 + θ2x2

If you look closely, it is actually quite intuitive that if we add an extra column of ones at the beginning of our feature vector x, making it have the dimensions (46, 3), and then perform matrix multiplication on x and theta, we will in fact arrive at the above equation for hθ(x). If it still isn’t obvious, just try working out an example on a piece of paper.

Remember that when we actually run our code, we won’t be returning a symbolic expression like the one above for hθ(x); instead we return the numerical value that this expression evaluates to.

In the above code the line x = np.hstack((np.ones((x.shape[0],1)), x)) adds an extra column of ones to the beginning of x in order to allow matrix multiplication as required.

After this we initialize our theta vector with zeros. You can also initialize it with some small random values. We also specify the learning rate and the number of epochs (an epoch is one full pass of the algorithm through the entire dataset) we want to train for.

Once we have all our hyper-parameters defined, we call the gradient descent function which returns a history of all the cost functions and the final vector of parameters theta. This theta vector is essentially what defines our final hypothesis. You may observe that the shape of the theta vector that is returned by the gradient descent function has the dimensions (3,1). Remember our hypothesis function?

hθ(x) = θ0 + θ1x1 + θ2x2

Well, we needed 3 θs and our theta vector has the dimensions (3, 1), hence theta[0], theta[1] and theta[2] are in fact θ0, θ1 and θ2 respectively. The J_all variable is nothing but the history of all the cost values. You can print the J_all array to see how the cost progressively decreases over the epochs of gradient descent.

Plot of cost vs number of epochs (source: image by the author)

This graph can be plotted by defining and calling a plot_cost function like so:
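A minimal sketch of plot_cost (the axis labels are assumptions based on the plot above):

```python
import matplotlib.pyplot as plt

def plot_cost(J_all):
    # Plot the cost recorded after each epoch; a steadily falling
    # curve is a sign that gradient descent is converging.
    plt.plot(range(len(J_all)), J_all)
    plt.xlabel('epochs')
    plt.ylabel('cost')
    plt.show()
```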

Now we can use these parameters to find the label i.e. the price of any house (in Portland, Oregon) given the house size and number of rooms.

You may now test your code by calling a test function that takes as input the size of the house, the number of rooms and the final theta vector returned by our linear regression model, and gives us the price of the house.

Believe it or not, that’s actually all there is to coding linear regression. Congratulations! You have now successfully coded a linear regression model from absolute scratch. Being able to understand and code the entire algorithm is not easy, so you can pat yourself on the back for getting through. Linear regression is usually the first algorithm we start machine learning with, so if you understood what we did here I would suggest you pick up another dataset (for linear regression) and try to apply linear regression on your own. Happy Coding 🙂
