How Lasso Regression Works in Machine Learning
Whenever we hear the term “regression,” two things come to mind: linear regression and logistic regression. Logistic regression is actually a classification algorithm, but the name still puts it on the list.
These two topics are well known and are usually the first ones introduced in machine learning. There are other types of regression, like
- Lasso regression,
- Ridge regression,
- Polynomial regression,
- Stepwise regression,
- ElasticNet regression
The techniques mentioned above are mainly used for regression problems.
When we increase the degrees of freedom of a regression model (for example, by adding polynomial terms to the equation), the model tends to overfit. Regularization techniques help us overcome overfitting.
Two popular methods for that are lasso and ridge regression. Our ridge regression article explains the theory behind ridge regression and its implementation in Python.
This article focuses on lasso regression.
Before we dive in, let’s recap regression.
What is regression?
Regression is a statistical technique used to determine the relationship between one dependent variable and one or many independent variables. In simple words, a regression analysis will tell you how your result varies for different factors.
What determines a person’s salary?
Many factors, like educational qualification, experience, skills, job role, company, etc., play a role in salary.
You can use regression analysis to predict the dependent variable – salary using the mentioned factors.
y = mx+c
Do you remember this equation from our school days?
It is nothing but a linear regression equation. In the above equation, the independent variable is used to estimate the dependent variable.
In mathematical terms,
- Y is the dependent value,
- X is the independent value,
- m is the slope of the line,
- c is the constant value.
The same terms are named slightly differently in machine learning and statistics.
- Y is the predicted value,
- X is feature value,
- m is coefficients or weights,
- c is the bias value.
The line in the above graph represents the linear regression model. You can see how well the model fits the data. It looks like a good model, but sometimes the model fits the data too much, resulting in overfitting.
To create the line (red) from the actual values, the regression model iteratively recalculates the m (coefficient) and c (bias) values, trying to reduce the loss computed by a suitable loss function.
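The iterative fitting described above can be sketched with plain gradient descent. This is a minimal illustration, not the method any particular library uses: the function name `fit_line`, the learning rate, the iteration count, and the toy data are all assumptions chosen for the example, and the loss is assumed to be the mean squared error.

```python
import numpy as np

def fit_line(x, y, learning_rate=0.05, n_iter=2000):
    """Fit y = m*x + c by gradient descent on the mean squared error."""
    m, c = 0.0, 0.0
    for _ in range(n_iter):
        error = (m * x + c) - y                       # prediction minus actual value
        grad_m = 2 * np.mean(error * x)               # d(MSE)/dm
        grad_c = 2 * np.mean(error)                   # d(MSE)/dc
        m -= learning_rate * grad_m
        c -= learning_rate * grad_c
    return m, c

# Toy data lying exactly on the line y = 2x + 1 (made up for illustration)
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2.0 * x + 1.0

m, c = fit_line(x, y)
print(round(m, 3), round(c, 3))
```

Each pass nudges m and c in the direction that lowers the loss, which is the "iterate and recalculate" idea in its simplest form.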
An overfitted model has low bias and high variance. It fits the training data well, but it does not give good predictions on test data. Regularization comes into play to tackle this issue.
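To see overfitting in numbers, here is a small sketch that fits a straight line and a degree-9 polynomial to the same noisy data. The data generator, the noise level, and the degrees are assumptions made up for the illustration. The more flexible model can never have a higher training error than the straight line, while its test error is typically worse; the exact values depend on the random seed.

```python
import numpy as np
from numpy.polynomial import Polynomial

rng = np.random.default_rng(0)

def make_data(n):
    """Noisy samples around the line y = 2x + 1 (synthetic, for illustration)."""
    x = rng.uniform(-1, 1, n)
    y = 2.0 * x + 1.0 + rng.normal(0, 0.3, n)
    return x, y

def mse(model, x, y):
    """Mean squared error of a fitted polynomial on (x, y)."""
    return float(np.mean((model(x) - y) ** 2))

x_train, y_train = make_data(15)    # small training set
x_test, y_test = make_data(200)     # larger held-out set

results = {}
for degree in (1, 9):
    model = Polynomial.fit(x_train, y_train, degree)   # least-squares polynomial fit
    results[degree] = (mse(model, x_train, y_train), mse(model, x_test, y_test))
    print("degree", degree, "train MSE, test MSE:", results[degree])
```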
What Is Regularization?
Regularization reduces the problem of overfitting. Overfitting lowers model accuracy on unseen data. It happens when the model learns the noise in the training set along with the underlying pattern.
Noise is random variation in the training set that doesn’t represent the actual properties of the data.
Y ≈ C0 + C1X1 + C2X2 + …+ CpXp
Y represents the dependent variable, X represents the independent variables and C represents the coefficient estimates for different variables in the above linear regression equation.
Model fitting involves a loss function known as the residual sum of squares. The coefficients in the equation are chosen to reduce this loss to its minimum value. If the training set contains a lot of irrelevant data, the coefficients end up fitting that noise.
Such a model will not predict well on future data.
In cases like this, we can use regularization to shrink these wrongly learned coefficients towards zero. Lasso regression is one of the popular techniques used to improve model performance in this way.
Definition of lasso regression
Lasso regression is like linear regression, but it uses a technique called “shrinkage,” where the regression coefficients are shrunk towards zero.
Linear regression gives you the regression coefficients that best fit the observed dataset. Lasso regression lets you shrink or regularize these coefficients to avoid overfitting, so the model works better on different datasets.
This type of regression is used when the dataset shows high multicollinearity or when you want to automate variable elimination and feature selection.
When to use lasso regression?
Choosing a model depends on the dataset and the problem statement you are dealing with. It is essential to understand the dataset and how features interact with each other.
Lasso regression penalizes less important features of your dataset and makes their respective coefficients zero, thereby eliminating them. Thus it provides you with the benefit of feature selection and simple model creation.
So, if the dataset has high dimensionality and high correlation, lasso regression can be used.
The Statistics of lasso regression
d1, d2, d3, etc., represents the distance between the actual data points and the model line in the above graph.
The least-squares value is the sum of the squared distances between the data points and the plotted curve.
In linear regression, the best model is the one that minimizes this least-squares value.
In lasso regression, we add a penalty term to the least-squares value. That is, the model is chosen to minimize the loss function below.
D = least-squares + lambda * summation (absolute values of the magnitude of the coefficients)
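The loss above can be written as a few lines of Python to make each term concrete. This is a sketch under stated assumptions: the function name `lasso_loss` and the toy numbers are made up, and the intercept is left out of the penalty. Note that scikit-learn's `Lasso` scales the least-squares term by 1/(2n), so its `alpha` is not numerically identical to the lambda used here.

```python
import numpy as np

def lasso_loss(X, y, coef, intercept, lam):
    """Least-squares term plus lambda times the sum of absolute coefficients."""
    residuals = y - (X @ coef + intercept)
    least_squares = np.sum(residuals ** 2)
    penalty = lam * np.sum(np.abs(coef))      # the intercept is not penalized
    return float(least_squares + penalty)

# Toy example: a perfect fit, so only the penalty contributes to the loss
X = np.array([[1.0], [2.0]])
y = np.array([2.0, 4.0])
loss = lasso_loss(X, y, coef=np.array([2.0]), intercept=0.0, lam=0.5)
print(loss)  # 0 (least squares) + 0.5 * |2.0| = 1.0
```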
The lasso penalty includes all the estimated coefficients (the intercept is usually left unpenalized). Lambda can take any value from zero to infinity. This value decides how aggressively regularization is applied. It is usually chosen using cross-validation.
Lasso penalizes the sum of absolute values of coefficients. As the lambda value increases, coefficients decrease and eventually become zero. This way, lasso regression eliminates insignificant variables from our model.
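The shrinking effect can be observed with scikit-learn's `Lasso` on synthetic data. In this sketch the data, the true coefficients (3 and -2 on the first two features, zero on the rest), and the alpha values are all assumptions chosen for illustration. In scikit-learn, the lambda of this article is the `alpha` parameter.

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)

# Synthetic data: only the first two of five features influence the target
X = rng.normal(size=(200, 5))
y = 3.0 * X[:, 0] - 2.0 * X[:, 1] + rng.normal(scale=0.5, size=200)

coefs = {}
for alpha in (0.01, 0.1, 1.0, 10.0):
    model = Lasso(alpha=alpha).fit(X, y)
    coefs[alpha] = model.coef_
    print("alpha =", alpha, "coefficients:", np.round(model.coef_, 3))
```

With a small alpha the coefficients stay close to the values used to generate the data; with a large enough alpha every coefficient is driven to exactly zero.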
Our regularized model may have a slightly higher bias than linear regression but lower variance for future predictions.
How to implement lasso regression in Python
Let us take a regression problem statement and solve it using lasso regression to learn the implementation in Python.
Real estate is a fairly big industry and the housing prices keep varying regularly based on different factors. The problem statement here is to predict housing prices as accurately as possible.
The housing dataset has 506 rows and 13 numerical inputs and one numerical output.
Data Attribute Information
1. CRIM – the per capita crime rate by town
2. ZN – the proportion of residential land zoned for lots over 25,000 sq.ft.
3. INDUS – the proportion of non-retail business acres per town
4. CHAS – Charles River dummy variable (= 1 if tract bounds river; 0 otherwise)
5. NOX – nitric oxides concentration (parts per 10 million)
6. RM – the average number of rooms per dwelling
7. AGE – the proportion of owner-occupied units built before 1940
8. DIS – weighted distances to five Boston employment centers
9. RAD – index of accessibility to radial highways
10. TAX – full-value property-tax rate per $10,000
11. PTRATIO – pupil-teacher ratio by town
12. B – 1000(Bk – 0.63)^2 where Bk is the proportion of blacks by town
13. LSTAT – % lower status of the population
14. MEDV – Median value of owner-occupied homes in $1000s
We will use two evaluation metrics, RMSE and R-squared, to evaluate our model performance. Root Mean Squared Error (RMSE) is the standard deviation of the residuals. Residuals are the distances between the predicted data points and the actual data points.
RMSE shows how good the built regression model is. Classification models have their own set of evaluation metrics that serve the same purpose.
Since RMSE is a direct measure of prediction error, we should aim for a low value. The R-squared value represents how good the model fit is, that is, how close the data are to the regression line. A high R-squared shows a good model fit.
If we are building regression models with many features, we should use the Adjusted R-squared measure along with the R-squared value.
If you want to learn about the R-squared and Adjusted R-squared measures, you can read this article.
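The metrics above follow directly from their definitions, as this short sketch shows. The helper names and the three data points are made up for illustration; the adjusted R-squared uses the standard formula 1 - (1 - R²)(n - 1)/(n - p - 1), where n is the number of samples and p the number of features.

```python
import numpy as np

def rmse(y_true, y_pred):
    """Root mean squared error: square root of the mean squared residual."""
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))

def r_squared(y_true, y_pred):
    """1 minus the ratio of residual to total sum of squares."""
    ss_res = np.sum((y_true - y_pred) ** 2)
    ss_tot = np.sum((y_true - np.mean(y_true)) ** 2)
    return float(1.0 - ss_res / ss_tot)

def adjusted_r_squared(r2, n_samples, n_features):
    """R-squared corrected for the number of features in the model."""
    return 1.0 - (1.0 - r2) * (n_samples - 1) / (n_samples - n_features - 1)

# Toy values, made up for illustration
y_true = np.array([3.0, 5.0, 7.0])
y_pred = np.array([2.5, 5.0, 7.5])

r2 = r_squared(y_true, y_pred)
adj_r2 = adjusted_r_squared(r2, n_samples=3, n_features=1)
print(rmse(y_true, y_pred), r2, adj_r2)
```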
We will follow these steps to produce a lasso regression model in Python:
- Step 1 – Load the required modules and libraries
- Step 2 – Load and analyze the dataset given in the problem statement
- Step 3 – Create training and test dataset
- Step 4 – Build the model and find predictions for the test dataset
- Step 5 – Evaluate the lasso model
Let’s start the workflow with the first step, loading the required libraries.
Load the required modules and libraries
We will import the pandas and numpy modules to handle the dataset and the train_test_split function to create training and test datasets.
The r2_score, mean_squared_error and sqrt functions are imported to calculate the evaluation metrics. The Lasso class from scikit-learn will be used to build our lasso regression model.
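A set of imports matching the description above might look like this. The module paths are the standard scikit-learn locations for these names; taking `sqrt` from Python's `math` module is an assumption, since the text does not say where it comes from.

```python
import numpy as np
import pandas as pd
from math import sqrt   # assumption: sqrt taken from the standard library

from sklearn.model_selection import train_test_split
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.linear_model import Lasso
```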
Load and analyze the dataset given in the problem statement
Let us load the dataset and analyze the basics like shape and summary statistics of the dataset.
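Loading and inspecting the data could be sketched as follows. The helper name `load_housing` and the file name in the usage comment are hypothetical; the sketch assumes the housing data is available as a CSV file with the 14 columns listed earlier.

```python
import pandas as pd

def load_housing(csv_source):
    """Read the housing data and print its shape and summary statistics."""
    df = pd.read_csv(csv_source)
    print("Shape:", df.shape)        # the full dataset should give (506, 14)
    print(df.describe())             # count, mean, std, min, quartiles, max
    return df

# Usage, assuming a local file with this (hypothetical) name exists:
# df = load_housing("housing.csv")
```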
Create training and test dataset
We are going to split the dataset into a training set and test set. We will build our lasso model on the training set and evaluate it using our test set.
Specify the input columns as X and the target column as Y, and use the test_size argument of the train_test_split function to split the dataset. We are splitting our dataset into 70% training data and 30% test data here.
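The split can be sketched like this. To keep the example runnable on its own, it builds a random stand-in DataFrame with the same shape and column names as the housing data; that stand-in and the `random_state` value are assumptions, and with the real data you would use the loaded DataFrame instead.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

columns = ["CRIM", "ZN", "INDUS", "CHAS", "NOX", "RM", "AGE",
           "DIS", "RAD", "TAX", "PTRATIO", "B", "LSTAT", "MEDV"]

# Stand-in data with the same shape as the housing dataset (random values)
rng = np.random.default_rng(0)
df = pd.DataFrame(rng.normal(size=(506, 14)), columns=columns)

X = df.drop(columns="MEDV")   # the 13 input columns
Y = df["MEDV"]                # the target column

# 70% training data, 30% test data; random_state fixed for reproducibility
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=42
)
print(X_train.shape, X_test.shape)
```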
Build the model and find predictions for the test dataset
Let us instantiate the lasso model and fit the model to the training set. We will use this fitted model to predict the housing prices for the training set and test set.
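A minimal sketch of this step is shown below. It uses synthetic data from `make_regression` as a stand-in for the housing dataset (an assumption made so the block runs on its own), with the same number of rows and input columns.

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data: 506 rows, 13 inputs
X, Y = make_regression(n_samples=506, n_features=13, n_informative=5,
                       noise=10.0, random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=42
)

model = Lasso(alpha=1.0)           # alpha is the regularization strength
model.fit(X_train, Y_train)        # learn coefficients on the training set

pred_train = model.predict(X_train)
pred_test = model.predict(X_test)
print(pred_test[:5])
```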
Evaluate the lasso model
Evaluate the model by finding the RMSE and R-Square for both the training and test predictions.
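The evaluation could look like the sketch below. The helper name `rmse_and_r2` is hypothetical, and the data is again a synthetic stand-in from `make_regression`, so the printed scores will not match those of the housing dataset.

```python
from math import sqrt

from sklearn.datasets import make_regression
from sklearn.linear_model import Lasso
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

def rmse_and_r2(y_true, y_pred):
    """Return (RMSE, R-squared) for a set of predictions."""
    return sqrt(mean_squared_error(y_true, y_pred)), r2_score(y_true, y_pred)

# Synthetic stand-in for the housing data: 506 rows, 13 inputs
X, Y = make_regression(n_samples=506, n_features=13, n_informative=5,
                       noise=10.0, random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=42
)

model = Lasso(alpha=1.0).fit(X_train, Y_train)

train_scores = rmse_and_r2(Y_train, model.predict(X_train))
test_scores = rmse_and_r2(Y_test, model.predict(X_test))
print("Train RMSE, R-squared:", train_scores)
print("Test  RMSE, R-squared:", test_scores)
```

Comparing the two lines gives a quick check for overfitting: a training score far better than the test score is a warning sign.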
As you can see, we have set the lasso hyperparameter alpha to 1, which is also the scikit-learn default. This alpha value gives us a decent RMSE for now, but a different alpha value might give better results.
Let us tune our model to check this.
The scikit-learn library has a built-in estimator called LassoCV which will do the tuning for us. It uses cross-validation to find the best alpha value during training itself. Predictions can then be made using the fitted model.
By default, the model will do the tuning using 100 alpha values. We can control this by specifying the alphas argument with a grid of alpha values. Here the grid of alpha values runs from 0 to 1 in steps of 0.02.
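A sketch of the tuning step follows. Two things here are assumptions rather than the original setup: the data is a synthetic stand-in from `make_regression`, and the grid starts at 0.02 instead of 0, because scikit-learn warns that an alpha of exactly 0 is better handled by plain `LinearRegression`. The 5-fold cross-validation is written out explicitly.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the housing data: 506 rows, 13 inputs
X, Y = make_regression(n_samples=506, n_features=13, n_informative=5,
                       noise=10.0, random_state=0)
X_train, X_test, Y_train, Y_test = train_test_split(
    X, Y, test_size=0.3, random_state=42
)

# Grid of candidate alphas: 0.02, 0.04, ..., 1.0
alphas = np.linspace(0.02, 1.0, 50)

model_cv = LassoCV(alphas=alphas, cv=5, random_state=0)
model_cv.fit(X_train, Y_train)     # picks alpha by cross-validation, then refits

print("Best alpha:", model_cv.alpha_)
pred_test = model_cv.predict(X_test)
```

After fitting, `alpha_` holds the selected value, and the estimator can be used for prediction like any other fitted model.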
LassoCV has chosen 0 as the best alpha value, meaning zero penalty, which makes the model equivalent to ordinary linear regression. You can see that the RMSE and R-Square scores have improved slightly with the selected alpha value.
We have learned about the lasso regression model in machine learning in this article. We have also covered a few interesting topics like regression, overfitting, regularization, lasso model evaluation and tuning.
- Regression is a popular statistical technique used in machine learning to predict an output.
- Overfitting happens in regression when the model learns the noise in the training dataset.
- Regularization can be used to avoid overfitting by regularizing the regression models.
- Lasso regression is a regularization technique that shrinks the coefficients of irrelevant features to zero, performing feature selection and thereby regularizing the model.
- Evaluation of the lasso model can be done using metrics like RMSE and R-Square.
- Alpha is a hyperparameter of the lasso model that controls the regularization and can be tuned using LassoCV.