How Ridge Regression Works
Often, people in the field of analytics or data science limit themselves to a basic understanding of regression algorithms such as linear regression and multiple linear regression. Very few of them are aware of ridge regression and lasso regression.
Most of the time, when I was interviewing candidates for various data science roles, people said they had worked with advanced techniques like transformers, zero-shot learning, and meta-learning. But when I asked,
“Could you please give examples of four or five different regression algorithms or methods?”
the answers were limited to simple linear regression and multiple linear regression.
So I would suggest you learn these basic regression algorithms while you are preparing for data scientist roles.
This gave me the motivation to explain how ridge regression works.
The core aim of this article is to help you understand how ridge regression works, rather than limiting yourself to plugging in a library function to get the job done.
So, how can we define,
“What is ridge regression?”
The best answer could be:
“Ridge regression is the regularized form of linear regression.”
If you are not convinced by this answer, don’t worry at all. By the end of this article, you will understand why it is justified.
In the last section, we will learn how to implement a ridge regression algorithm in Python.
In summary, after completing this article, you will gain a better insight into the following concepts.
- How ridge regression works
- How to use ridge regression effectively
- Where ridge regression comes into play
- How to implement a ridge regression model in Python
Before we dive into the details of how ridge regression works, let’s see the flow of concepts you are going to learn in this article.
Let’s start our discussion with the basic building block: linear regression.
Before we learn about ridge regression, we should understand how linear regression works. Don’t forget, these families of regression algorithms fall under the supervised learning category.
Any modeling task that involves predicting a numerical value given a set of input features is termed regression. In other words, regression tries to estimate the expected target value from the known input features.
Linear regression is considered the standard algorithm for identifying a linear relationship between the target variable and the input features.
In the above image, the green dots are the actual values, and the red line is the regression line fitted to the actual data. To describe it, we use the equation of a line.
Y = mX + c
In mathematical terms,
- Y is the dependent value,
- X is the independent value,
- m is the slope of the line,
- c is the constant value.
The same equation terms are named slightly differently in the machine learning and statistical worlds.
- Y is the predicted value,
- X is the feature value,
- m represents the coefficients or weights,
- c is the bias value.
To fit the (red) line to the actual values, the regression model iterates and recalculates the m (coefficient) and c (bias) values, trying to reduce the loss computed by a suitable loss function.
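The iterative recalculation described above can be sketched with plain gradient descent on the mean squared error; the toy data, learning rate, and iteration count below are illustrative assumptions, not values from the article:

```python
import numpy as np

# Toy data that roughly follows y = 2x + 1 (illustrative)
X = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.array([1.1, 2.9, 5.2, 7.1, 8.8])

m, c = 0.0, 0.0          # start with arbitrary coefficient and bias
lr = 0.01                # learning rate
for _ in range(5000):    # iterate, recalculating m and c each step
    y_pred = m * X + c
    error = y_pred - y
    # Gradients of the mean squared error loss with respect to m and c
    m -= lr * 2 * np.mean(error * X)
    c -= lr * 2 * np.mean(error)

print(round(m, 2), round(c, 2))  # slope and bias after fitting
```

Each pass nudges m and c in the direction that lowers the loss, which is exactly the recalculation loop the paragraph describes.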
As an extension to linear regression that encourages models with small coefficient values, penalties are added to the loss function during training.
These extensions are termed penalized linear regression or regularized linear regression.
So, ridge regression is a well-known regularized linear regression that uses the L2 penalty. This penalty shrinks the coefficients of input variables that contribute less to the prediction task.
With this understanding, let’s learn about ridge regression.
How Ridge Regression Works
In linear regression, a linear relationship exists between the input features and the target variable. The association is a line in the case of a single input variable.
With higher dimensions, the relationship can instead be thought of as a hyperplane connecting the input features to the target variable. The coefficients are found by an optimization method that minimizes the error between the predicted output (ŷ) and the expected output (y).
Linear regression may encounter problems in which the model’s estimated coefficients become relatively large, making the model so unstable that it is sensitive to its inputs. This typically happens when there are few observations or many variables.
To regain the regression model’s stability, the loss function can be modified to include an additional cost for models with relatively large coefficients.
Linear regression models with this revised loss function are referred to as “penalized” or “regularized” linear regression.
Ordinary Least Squares (OLS) and Ridge Regression
The analysis method that estimates the relationship between independent variables (features) and a dependent variable (target) is termed ordinary least squares (OLS) regression. It estimates this relationship by minimizing the sum of squared differences between the observed and predicted values of the dependent variable.
On the other hand, a linear regression model whose coefficients are estimated not by OLS but by the ridge estimator, which is biased but has lower variance than the OLS estimator, is termed ridge regression.
In regression modeling, the presence of multicollinearity leads to inconsistent parameter estimates. OLS, the standard method in regression analysis, yields an inaccurate and unstable model because it is not robust to multicollinearity.
Several methods have been proposed in the literature to address this model instability issue, and the most common one is ridge regression.
Finding Unbiased Coefficients with OLS
We know that the coefficients that best fit the data are found by the least squares method, which also finds unbiased coefficients. Here, the word unbiased means that OLS does not discriminate between independent variables.
It does not consider whether one independent variable is more significant than another; it merely finds the coefficients for a given data set. In short, only one set of betas is found, resulting in the lowest residual sum of squares (RSS).
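As a sketch of this, the single set of betas can be computed directly from the normal equations; the small data set below is an illustrative assumption, chosen so the fit is exact:

```python
import numpy as np

# Small data set with two features (illustrative); targets follow y = x1 + 2*x2
X = np.array([[1.0, 2.0],
              [2.0, 1.0],
              [3.0, 4.0],
              [4.0, 3.0]])
y = np.array([5.0, 4.0, 11.0, 10.0])

# Add an intercept column, then solve the normal equations:
# beta = (X^T X)^{-1} X^T y
Xb = np.hstack([np.ones((X.shape[0], 1)), X])
beta = np.linalg.solve(Xb.T @ Xb, Xb.T @ y)

# The resulting betas minimize the residual sum of squares (RSS)
rss = np.sum((Xb @ beta - y) ** 2)
print(beta.round(2), round(rss, 6))
```

OLS treats both features identically: it simply returns the one set of betas with the lowest RSS, with no preference between predictors.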
Here the question arises
Is the best model the one which has the lowest RSS?
To answer this, we need to examine how the regression model performs in both the training and the testing phase. This is where the concept of bias and variance comes in.
Bias & Variance Tradeoff
So still, the question is
Is the best model the one that has the lowest RSS?
The answer to the question above is, “Not really.”
In the word ‘unbiased,’ we also need to consider ‘bias.’ Bias refers to the equal care a model gives to its predictors. Say there are two models predicting mango prices with two predictors, ‘sweetness’ and ‘shine’; one model is unbiased while the other is biased.
First, the unbiased model finds the relationship between the two features and the prices, just as the OLS method does. This model will fit the observations in such a way as to minimize the RSS entirely.
However, the consequences may not be favorable, and overfitting may arise. In other words, the model will not perform well on a new data set,
because it is so specifically built for the given data set that it may not fit new data.
We can think of bias as related to a model’s failure to fit the training set, and variance as related to a model’s failure to fit the testing set.
Bias and variance trade off against one another over the model’s complexity: a simple model has low variance and high bias, and vice versa.
Overfitting problems may lead to inaccurate and unstable model building. So, a technique that helps minimize the overfitting problem in machine learning models is known as regularization.
We call it regularization because it keeps the parameters regular, or regularized. Different regression models use different regularization techniques. The regression model using the L1 regularization technique is termed lasso regression,
while the regression model using L2 is termed ridge regression.
In this article, our focus is on ridge regression, so let’s discuss L2 regularization in detail. We will explain L1 regularization in the lasso regression article.
In the above figure, the error function is computed on the training data set. When our model fits too closely to the training data, it is called overfitting.
In this scenario, the model’s performance is excellent on the training dataset but highly inadequate on the testing data set.
Regularization comes into play and helps in keeping the parameters regular in optimizing the errors.
Elements in L2 Regularization
In the figure below, the highlighted part represents the L2 regularization element: ridge regression adds the “squared magnitude” of the coefficients to the loss function as a penalty term.
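The formula itself came from a figure that is not reproduced here; the standard ridge loss it describes, RSS plus the squared-magnitude penalty, can be written as:

```latex
\text{Loss} = \sum_{i=1}^{n} \left( y_i - \hat{y}_i \right)^2
            + \lambda \sum_{j=1}^{p} \beta_j^2
```

The first sum is the ordinary residual sum of squares; the second is the L2 penalty weighted by lambda.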
In the formula above, if lambda is zero, then we get OLS.
However, a high value of lambda adds too much weight to the penalty, which results in model under-fitting.
Therefore, how we choose the lambda parameter for our model is important. We cannot cover lasso in this article, so we will only give a high-level comparison between lasso and ridge regression.
Difference between Lasso and Ridge regression
The significant difference between lasso and ridge regression is the penalty term. The other differences are listed in the tabular form below.
Use of Ridge Regression
We know that the ordinary least squares (OLS) method treats all the variables in an unbiased manner. So, as more variables are incorporated, the OLS model becomes more complicated.
In the figure below, the OLS model sits on the right side, with low bias and high variance. The OLS model’s position is fixed, but it can move when ridge regression comes into play.
In ridge regression, the model coefficients will change as we tune the lambda parameter.
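This tuning behavior can be sketched with scikit-learn’s Ridge class; the synthetic data and the alpha values (scikit-learn’s name for lambda) below are illustrative assumptions:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

# Synthetic regression data (sizes and noise are illustrative)
X, y = make_regression(n_samples=100, n_features=5, noise=10.0, random_state=0)

# Fit ridge models with increasing regularization strength
norms = []
for alpha in [0.01, 1.0, 100.0, 10000.0]:
    model = Ridge(alpha=alpha).fit(X, y)
    norms.append(np.linalg.norm(model.coef_))

# Larger alpha shrinks the coefficient vector toward zero
print([round(n, 2) for n in norms])
```

The printed norms shrink as alpha grows, which is exactly the coefficient movement the paragraph describes.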
Geometric Understanding of Ridge Regression
The figure below represents the geometric interpretation to compare OLS and ridge regression.
Each contour connects points where the RSS is the same, centered on the OLS estimate, where the RSS is lowest. The OLS estimate is also the point that best fits the training set (low bias).
The vector norm is defined as follows.
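The definition came from a figure that is not reproduced here; the standard L2 norm it refers to is:

```latex
\lVert \beta \rVert_2 = \sqrt{\sum_{j=1}^{p} \beta_j^2}
```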
The subscript ‘2’ stands for the ‘L2 norm.’ We only care about the L2 norm at the moment, so we can construct the equation we’ve already seen.
In the following equation, the first term is the OLS objective, and the second term, with the lambda parameter, makes it ridge regression.
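The equation itself came from a figure that is not reproduced here; in standard notation, the objective being described is:

```latex
\hat{\beta}^{\text{ridge}} = \arg\min_{\beta}\;
  \lVert y - X\beta \rVert_2^2 + \lambda \lVert \beta \rVert_2^2
```

The first term is the OLS residual sum of squares; the second is the L2 penalty scaled by lambda.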
What We Really Want to Find
The lambda term is often called the “penalty,” as it increases the RSS. We iterate over candidate values of lambda and evaluate each model with a measurement like
mean squared error (MSE). The lambda value that minimizes MSE is selected as the final one. The resulting ridge regression model often predicts better than the OLS model.
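A minimal sketch of this selection loop, assuming scikit-learn and synthetic data (the candidate alpha values, data sizes, and seeds are illustrative):

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic data split into train and test sets (illustrative)
X, y = make_regression(n_samples=200, n_features=20, noise=25.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=1)

# Try several candidate lambdas (alpha in scikit-learn) and
# keep the one with the lowest test-set MSE
candidates = [0.01, 0.1, 1.0, 10.0, 100.0]
scores = {}
for alpha in candidates:
    model = Ridge(alpha=alpha).fit(X_tr, y_tr)
    scores[alpha] = mean_squared_error(y_te, model.predict(X_te))

best_alpha = min(scores, key=scores.get)
print(best_alpha)
```

There is no closed-form rule for the best lambda, so this try-and-evaluate loop is the usual approach (scikit-learn’s `RidgeCV` automates the same idea with cross-validation).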
In the formula below, if lambda equals zero (i.e., no penalty), ridge regression becomes identical to OLS.
Ridge Regression Python Implementation
So, let’s start with a basic ridge regression implementation in Python. First of all, we have to import the following libraries.
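The article’s original import cell is not shown; assuming the examples that follow use NumPy, Matplotlib, and scikit-learn, a plausible set is:

```python
import numpy as np                            # numerical arrays
import matplotlib.pyplot as plt               # plotting the regression line
from sklearn.datasets import make_regression  # sample data generation
from sklearn.linear_model import Ridge        # scikit-learn ridge model
```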
To create the sample data, we use the scikit-learn library.
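A minimal sketch using scikit-learn’s `make_regression`; the sample size, noise level, and random seed are illustrative assumptions:

```python
from sklearn.datasets import make_regression

# One input feature so the fitted model can be drawn as a line
X, y = make_regression(n_samples=100, n_features=1, noise=15.0, random_state=42)
print(X.shape, y.shape)
```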
Now, we will define alpha (i.e., the hyperparameter) that determines the strength of regularization. The larger the value of the hyperparameter, the stronger the regularization.
In short, when alpha is large, the model has very high bias. With alpha = 0, the model acts identically to linear regression.
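The article’s original snippet is not shown; a minimal from-scratch sketch using the closed-form ridge solution could look like this (the synthetic data and alpha value are illustrative assumptions):

```python
import numpy as np
from sklearn.datasets import make_regression

# Illustrative one-feature data set
X, y = make_regression(n_samples=100, n_features=1, noise=15.0, random_state=42)

def ridge_fit(X, y, alpha):
    # Center the data so the intercept is not penalized
    X_mean, y_mean = X.mean(axis=0), y.mean()
    Xc, yc = X - X_mean, y - y_mean
    # Closed-form ridge solution: beta = (X^T X + alpha I)^{-1} X^T y
    p = Xc.shape[1]
    coef = np.linalg.solve(Xc.T @ Xc + alpha * np.eye(p), Xc.T @ yc)
    intercept = y_mean - X_mean @ coef
    return coef, intercept

coef, intercept = ridge_fit(X, y, alpha=0.5)
print(coef.round(3), round(float(intercept), 3))
```

With alpha = 0 the added `alpha * np.eye(p)` term vanishes and the formula reduces to the OLS normal equations, matching the observation above.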
Now, let’s have a look at how the regression line will fit the data.
Now, we do the same thing using the scikit-learn implementation of ridge regression. For this, we first need to create and train an instance of the Ridge class.
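A sketch of that step, assuming the same illustrative synthetic data as above:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=100, n_features=1, noise=15.0, random_state=42)

# Create and train a Ridge instance; alpha=0.5 matches the value discussed below
model = Ridge(alpha=0.5)
model.fit(X, y)

print(model.coef_.round(3), round(model.intercept_, 3))
```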
As we can see, the regression line fits much better when we opt for 0.5 as the value of alpha. Now, let’s try 10 as the value of our hyperparameter.
Next, we set alpha = 100 and observe the results:
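The three runs can be sketched in one loop; the synthetic data is the same illustrative assumption as above:

```python
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge

X, y = make_regression(n_samples=100, n_features=1, noise=15.0, random_state=42)

# As alpha grows (0.5 -> 10 -> 100), the fitted slope shrinks toward zero
slopes = [Ridge(alpha=a).fit(X, y).coef_[0] for a in [0.5, 10.0, 100.0]]
print([round(s, 3) for s in slopes])
```

Plotting each fitted line over the scatter of points would show the line flattening as alpha increases.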
As alpha tends towards positive infinity, the coefficients tend to zero and the regression line flattens towards the mean of the target, effectively minimizing the variance across different data sets.
For the complete code, please check out our Github account.
So, we studied ridge regression and compared it with lasso regression and the least squares method. We dived deeply into ridge regression, viewing it from different angles: the mathematical formula, vectorized notation, and a geometric explanation.
We saw that ridge regression is linear regression with a penalty, and learned that no closed-form rule gives the best value of lambda. We iterated over several different values and evaluated the prediction performance with MSE.
Consequently, we found that the ridge regression model performed better than the simple regression model. Finally, we implemented ridge regression in Python.