Five Key Assumptions of Linear Regression Algorithm



Linear Regression Assumptions

  Nearly 80% of the people build linear regression models without checking the basic assumptions of linear regression.

Just hold for a second and think. How many times have you built linear regression models without checking the linear regression assumptions?

If you are not familiar with linear regression, it is a well-known supervised machine learning algorithm that models the linear relationship between a dependent variable and one or more independent variables.

It is easy to understand and implement. However, just writing a few lines of code won’t work as expected, because before implementing linear regression we have to take care of the assumptions it makes.


It is important to understand these assumptions to improve the regression model’s performance.

So in this article, we are going to discuss these assumptions in depth, along with ways to fix them if they are violated. With a proper understanding of the linear regression assumptions, you can significantly improve your regression models.


Linear Regression Algorithm

Before explaining the algorithm, let’s see what regression is.

Linear Regression

Regression is a method used to determine the degree of relationship between a dependent variable (y) and one or more independent variables (x).

Linear regression determines the linear relationship between one or more independent variables and one target variable.

In machine learning, linear regression is a commonly used supervised algorithm for regression problems. It is easy to implement and understand.

Supervised means that the algorithm makes predictions based on the labeled data fed to it.

Mathematically, linear regression can be represented as:

y = mx + c

Here,

  •  y = dependent variable (Target variable)
  •  x = independent variable
  •  m = regression coefficient 
  •  c  = intercept of the line
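To make the equation concrete, here is a minimal sketch that estimates m and c by least squares with NumPy. The hours and marks values are made up for illustration; they are not the student dataset mentioned later in the article.

```python
import numpy as np

# Hypothetical data (an assumption for illustration): hours studied (x)
# and marks scored (y). The marks follow y = 2x + 1 exactly.
hours = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
marks = np.array([3.0, 5.0, 7.0, 9.0, 11.0])

# np.polyfit with deg=1 returns [slope, intercept] of the least-squares line.
m, c = np.polyfit(hours, marks, deg=1)


def predict(x):
    """Predict y for new x values using the fitted line y = m*x + c."""
    return m * np.asarray(x, dtype=float) + c


print(f"slope m = {m:.2f}, intercept c = {c:.2f}")
print("prediction for 6 hours:", predict([6.0])[0])
```

With real data the points will not sit exactly on a line, and the fitted m and c are the values that minimize the sum of squared residuals.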

In linear regression, the target variable has continuous or real values.

For example, suppose we are predicting the price of houses based on certain features. Here, the house price is the target (dependent) variable, and the features that determine the price are the independent variables.

When the target variable can be determined using one independent variable, it is known as simple linear regression.

When the target depends on multiple independent variables, it is known as multiple linear regression.

I hope this gives you a high-level overview of the linear regression algorithm.

Generally, most people don’t check the linear regression assumptions before building a linear regression model, but we need to check them.

Let me list the linear regression assumptions we need to check; then we can discuss each of them in detail.

  1. Linear Relationship
  2. Normal Distribution of Residuals
  3. Little or No Multicollinearity
  4. No Autocorrelation
  5. Homoscedasticity

Ideally you need to check these for Lasso regression and Ridge regression models too.

Linear Relationship

This is the first and most important assumption of linear regression. It states that the dependent and independent variables should be linearly related. It is also necessary to check for outliers because linear regression is sensitive to outliers. 

Now the question is: how do we check whether the linearity assumption is met?

To determine this, we can use scatter plots. Scatter plots help you visualize whether there is a linear relationship between variables. Let me take an example to elaborate on it.

Suppose you have to check the relationship between the student’s marks and the number of hours they study.

You can find this student’s marks dataset in our Github repo. Go to the inputs folder to download the file.
House Price Linear Relationship

From the above plot, we can see that more study hours do not guarantee higher marks for every single student, but the overall relationship is still a linear one.

Let’s take another example where the linear relationship doesn’t hold. 

In the given plot (Ozone vs. Radiation), we can see that the relationship between ozone and radiation is not linear.

Ozone Radiation Linear Relationship


It is important to check this assumption because if you fit a linear model to a non-linear relationship, the regression algorithm will fail to capture the trend.

Hence, it will result in a poorly fitting model. This will also lead to erroneous predictions on unseen data.

Now comes the question: what should we do if the relationship between the features and the target is not linear?

What to do if linear relationship assumption isn’t met

Let us discuss the options you can go with. 

  1. You can apply nonlinear transformations to the independent and dependent variables.
  2. You can add another feature to the model.
    1. For example, if the plot of x vs. y has a parabolic shape, then it might be possible to add x² as an additional feature in the model.
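The second option can be sketched with NumPy as follows. The parabolic data below is synthetic and chosen only for illustration; the sketch compares the R² of a straight-line fit with the R² of a fit that also uses x² as a feature.

```python
import numpy as np


def r_squared(y, y_hat):
    """Coefficient of determination: 1 - SS_residual / SS_total."""
    ss_res = np.sum((y - y_hat) ** 2)
    ss_tot = np.sum((y - np.mean(y)) ** 2)
    return 1.0 - ss_res / ss_tot


# Synthetic parabolic relationship (an assumption for illustration).
x = np.linspace(-3.0, 3.0, 25)
y = 1.0 + 0.5 * x + 2.0 * x ** 2

# Model 1: intercept and x only.
X_lin = np.column_stack([np.ones_like(x), x])
beta_lin, *_ = np.linalg.lstsq(X_lin, y, rcond=None)
r2_lin = r_squared(y, X_lin @ beta_lin)

# Model 2: intercept, x, and the extra feature x**2.
X_quad = np.column_stack([np.ones_like(x), x, x ** 2])
beta_quad, *_ = np.linalg.lstsq(X_quad, y, rcond=None)
r2_quad = r_squared(y, X_quad @ beta_quad)

print(f"R^2 with x only:     {r2_lin:.3f}")
print(f"R^2 with x and x**2: {r2_quad:.3f}")
```

Note that the second model is still a linear regression: it is linear in its coefficients, even though one of the features is a non-linear function of x.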

Normal Distribution of Residuals

The second assumption of linear regression is that the residuals (error terms) should be normally distributed. If the residuals are not normally distributed, the confidence intervals of the estimates may become too wide or too narrow.

If the residuals are not normally distributed, there are probably some unusual data points that we have to examine closely to build a good model.

Ways to Check Normal Distribution 

To check the normal distribution, we can use two kinds of plots:

  • Distribution Plots
  • Q-Q Plots

Distribution Plot

We can use a distribution plot of the residuals to check whether they are normally distributed.

Normal Distribution

Here, the black line shows the reference normal distribution, and the blue line shows the actual distribution of the residuals.

We can see that there is a slight shift between the normal and the actual distribution. If the residuals are not normally distributed, we can apply a non-linear transformation to the given features.

Q-Q Plot

A Q-Q plot, which stands for “quantile-quantile” plot, can also be used to check whether the residuals of a model follow a normal distribution.

If the residuals are normally distributed, the points in the plot will fall along a straight line. Deviations from the straight line indicate a departure from normality.

Normality can also be checked with statistical tests such as the Kolmogorov-Smirnov test, the Jarque-Bera test, or the D’Agostino-Pearson test.

Residuals Q-Q Plots
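As a rough sketch of one of these tests, the Jarque-Bera statistic can be computed by hand from the skewness and kurtosis of the residuals. The function name and the synthetic samples below are assumptions for illustration; in practice you would pass in the residuals of your own fitted model, and libraries such as SciPy ship ready-made versions of these tests.

```python
import math

import numpy as np


def jarque_bera(residuals):
    """Return the Jarque-Bera statistic and its asymptotic p-value.

    JB = n/6 * (S^2 + (K - 3)^2 / 4), where S is the sample skewness and
    K the sample kurtosis. Under normality JB is approximately chi-square
    with 2 degrees of freedom, whose survival function is exp(-x / 2).
    """
    e = np.asarray(residuals, dtype=float)
    n = e.size
    centered = e - e.mean()
    m2 = np.mean(centered ** 2)
    skew = np.mean(centered ** 3) / m2 ** 1.5
    kurt = np.mean(centered ** 4) / m2 ** 2
    jb = n / 6.0 * (skew ** 2 + (kurt - 3.0) ** 2 / 4.0)
    p_value = math.exp(-jb / 2.0)
    return jb, p_value


# Synthetic "residuals" (assumptions for illustration only).
rng = np.random.default_rng(0)
normal_resid = rng.normal(size=500)       # drawn from a normal distribution
skewed_resid = rng.exponential(size=500)  # strongly right-skewed

jb_norm, p_norm = jarque_bera(normal_resid)
jb_skew, p_skew = jarque_bera(skewed_resid)
print(f"normal sample: JB = {jb_norm:.2f}, p = {p_norm:.3f}")
print(f"skewed sample: JB = {jb_skew:.2f}, p = {p_skew:.3g}")
```

A small p-value (for example, below 0.05) suggests that the residuals are not normally distributed. The chi-square approximation is only reliable for reasonably large samples.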

Multicollinearity

The next assumption of linear regression is that there should be little or no multicollinearity in the given dataset.

Multicollinearity occurs when the features or independent variables of a given dataset are highly correlated with each other.

In a model with correlated variables, it becomes difficult to determine which variable is actually contributing to the prediction of the target variable. In addition, the standard errors of the coefficients tend to increase in the presence of correlated variables.

Also, when independent variables are highly correlated, the estimated regression coefficient of a correlated variable depends on which other variables are included in the model.

If you drop one correlated variable from the model, the estimated regression coefficients of the remaining variables will change. This can lead to wrong conclusions and poor model performance.

How to Test Multicollinearity

We can test multicollinearity by using the following approaches.

  • Correlation Matrix
  • Tolerance
  • Variance Inflation Factor

Let’s discuss the above in detail.

Correlation matrix

Correlation measures how strongly two variables move together. When calculating Pearson’s bivariate correlation matrix, the correlation coefficients among all independent variables should be well below 1 in absolute value. A common rule of thumb treats values above roughly 0.8 as a warning sign.

Let us check the correlation of the variables in our student_score dataset. 

Heatmap

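In this dataset, we have only one independent variable (hours) to determine our target variable (score), so there is no other feature it could be collinear with. The heatmap simply shows that the hours devoted are highly correlated with the marks scored by the student, which is a feature-target correlation rather than multicollinearity.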
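To see what a multicollinearity check looks like when there are several features, here is a small sketch using NumPy’s corrcoef. The three feature columns and the 0.8 cut-off are assumptions for illustration; they are not part of the student_score dataset.

```python
import numpy as np

# Hypothetical features (assumptions for illustration): each row is a student.
names = ["hours_studied", "hours_slept", "practice_tests"]
features = np.array([
    [2.0, 7.0, 1.0],
    [4.0, 6.0, 2.0],
    [6.0, 8.0, 3.0],
    [8.0, 7.0, 4.0],
    [10.0, 6.0, 5.0],
    [12.0, 8.0, 7.0],
])

# rowvar=False tells NumPy that the columns (not the rows) are the variables.
corr = np.corrcoef(features, rowvar=False)

# Flag feature pairs whose absolute correlation exceeds a rule-of-thumb cut-off.
threshold = 0.8
flagged = []
for i in range(len(names)):
    for j in range(i + 1, len(names)):
        if abs(corr[i, j]) > threshold:
            flagged.append((names[i], names[j]))

print(np.round(corr, 2))
print("highly correlated pairs:", flagged)
```

Here only the study hours and the number of practice tests are flagged, so one of them would be a candidate for dropping or combining.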

Tolerance

Tolerance measures how much of the variance of one independent variable is not explained by the other independent variables.

Mathematically, it can be defined as T = 1 - R², where R² is computed by regressing the independent variable of concern onto the remaining independent variables. If the value of T is less than 0.01, i.e., T < 0.01, then your data almost certainly has multicollinearity. Values below about 0.1 are often already treated as a warning sign.

Variance Inflation Factor

The VIF approach takes each feature and regresses it against the remaining features. It is calculated using the formula

VIF = 1 / (1 - R²)

  • If VIF value <=4, it implies no multicollinearity
  • If VIF value>=10, it implies significant multicollinearity
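Both tolerance and VIF follow directly from the definitions above, so they can be sketched with plain NumPy. The function name and the small example matrices are assumptions for illustration.

```python
import numpy as np


def tolerance_and_vif(X):
    """Return a list of (tolerance, VIF) pairs, one per column of X.

    Each column is regressed (with an intercept) on the remaining columns.
    Tolerance is 1 - R^2 of that regression and VIF is 1 / tolerance.
    """
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    results = []
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])
        beta, *_ = np.linalg.lstsq(design, target, rcond=None)
        ss_res = np.sum((target - design @ beta) ** 2)
        ss_tot = np.sum((target - target.mean()) ** 2)
        tol = ss_res / ss_tot  # equals 1 - R^2
        vif = np.inf if tol == 0 else 1.0 / tol
        results.append((tol, vif))
    return results


# Hypothetical features: x2 is almost a copy of x1, x3 is unrelated.
x1 = np.arange(10.0)
x2 = x1 + 0.1 * np.where(np.arange(10) % 2 == 0, 1.0, -1.0)
x3 = np.array([3.0, 1.0, 4.0, 1.0, 5.0, 9.0, 2.0, 6.0, 5.0, 3.0])

for name, (tol, vif) in zip(["x1", "x2", "x3"],
                            tolerance_and_vif(np.column_stack([x1, x2, x3]))):
    print(f"{name}: tolerance = {tol:.4f}, VIF = {vif:.1f}")
```

Because VIF is just the reciprocal of tolerance, a VIF of 10 corresponds to a tolerance of 0.1.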

Methods to handle Multicollinearity

  1. You can drop one of the highly correlated features from the data.
  2. You can derive a new feature from the collinear features and then drop the original features that were used to create it.

Autocorrelation

One of the assumptions of linear regression is that the residuals should not be autocorrelated. Autocorrelation occurs when the residuals or error terms are not independent of each other.

In simple terms, it occurs when the value of f(x+1) is not independent of the value of f(x). This situation usually arises with time series such as stock prices, where the price of a stock depends on its previous price.

Auto Correlation Example

How to Test if Autocorrelation Assumption is met?

The easiest way to check whether this assumption is met is to look at a residual time series plot, that is, a plot of residuals vs. time.

In a plot of the residual autocorrelations, most of the values should fall within the 95% confidence bands around zero, which are located at about ±2/√N, where N is the dataset’s size.
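The ±2/√N rule can be sketched as follows. The helper names and the alternating example residuals are assumptions for illustration; in practice you would pass in the residuals of your fitted model, ordered by time.

```python
import numpy as np


def autocorrelation(residuals, lag):
    """Sample autocorrelation of the residuals at a positive integer lag."""
    e = np.asarray(residuals, dtype=float)
    e = e - e.mean()
    return np.sum(e[lag:] * e[:-lag]) / np.sum(e ** 2)


def significant_lags(residuals, max_lag=10):
    """Lags whose autocorrelation falls outside the approximate 95% band."""
    bound = 2.0 / np.sqrt(len(residuals))
    return [k for k in range(1, max_lag + 1)
            if abs(autocorrelation(residuals, k)) > bound]


# Hypothetical residuals that flip sign at every step, which is a clear
# sign of negative autocorrelation.
alternating = np.where(np.arange(100) % 2 == 0, 1.0, -1.0)

print("lag-1 autocorrelation:", autocorrelation(alternating, 1))
print("band: +/-", 2.0 / np.sqrt(len(alternating)))
print("lags outside the band:", significant_lags(alternating, max_lag=3))
```

If only an occasional lag falls outside the band, that is usually not a concern; a run of significant lags is what signals autocorrelation.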

It can also be checked using the Durbin-Watson test.

The Durbin-Watson test statistic can be computed using the durbin_watson() function from statsmodels (statsmodels.stats.stattools.durbin_watson).

Formula:

Durbin watson test

Output :  0.07975460122699386

  • If the value of durbin_watson is close to 2, it implies no autocorrelation.
  • If the value of durbin_watson lies between 0 and 2, it implies positive autocorrelation.
  • If the value of durbin_watson lies between 2 and 4, it implies negative autocorrelation.
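To show where these ranges come from, here is a minimal NumPy sketch of the statistic itself: the sum of squared differences between consecutive residuals divided by the sum of squared residuals. The function name and the tiny residual sequences are assumptions for illustration; the durbin_watson function in statsmodels.stats.stattools computes the same quantity.

```python
import numpy as np


def durbin_watson_stat(residuals):
    """Durbin-Watson statistic: sum((e_t - e_{t-1})^2) / sum(e_t^2)."""
    e = np.asarray(residuals, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)


# Hypothetical residual sequences, ordered by time.
slow_drift = [1.0, 1.0, 1.0, -1.0, -1.0, -1.0]  # neighbours tend to agree
zig_zag = [1.0, -1.0, 1.0, -1.0]                # neighbours tend to disagree

print("slow drift:", durbin_watson_stat(slow_drift))  # below 2
print("zig-zag:   ", durbin_watson_stat(zig_zag))     # above 2
```

Residuals that stay on the same side of zero for long stretches push the statistic toward 0, while residuals that keep flipping sign push it toward 4.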

Methods to Handle Autocorrelation

  1. Include dummy variables in the data.
  2. Use estimated (feasible) generalized least squares.
  3. Include a linear trend term if the residuals show a consistently increasing or decreasing pattern.

Homoscedasticity

The fifth assumption of linear regression analysis is homoscedasticity. Homoscedasticity describes a situation in which the variance of the residuals (that is, the “noise” or error term in the relationship between the independent variables and the dependent variable) is the same across all values of the independent variables.

Simply put, residuals should have constant variance. If this condition is not met, it is known as heteroscedasticity.

Heteroscedasticity leads to an unbalanced scatter of the residuals or error terms. Generally, non-constant variance arises in the presence of outliers.

These values get too much weight and thereby disproportionately impact the model’s performance. The presence of heteroscedasticity in a regression analysis makes it difficult to trust the results of the analysis.

How to Test if Homoscedasticity Assumption is met?

The most basic approach to test for heteroscedasticity is by plotting fitted values against residual values.

The plot will show a funnel-shaped pattern if heteroscedasticity exists.

Homoscedasticity Vs Heteroscedasticity

Heteroscedasticity can also be detected with statistical tests, such as the following:

The Breusch-Pagan Test:

It determines whether the variance of the residuals from a regression depends on the values of the independent variables. If it does, heteroscedasticity is present.

White Test:

White test determines if the variance of the residuals in a regression analysis model is fixed or constant.
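The idea behind the Breusch-Pagan test can be sketched with NumPy for the simple case of a single feature: regress the squared residuals on the feature and use N times the R² of that auxiliary regression as the test statistic (the studentized form of the test). The function name and the example residuals are assumptions for illustration. statsmodels provides het_breuschpagan and het_white in statsmodels.stats.diagnostic for real use.

```python
import numpy as np

# 5% critical value of the chi-square distribution with 1 degree of freedom.
CHI2_CRIT_1DF_5PCT = 3.841


def breusch_pagan_lm(residuals, x):
    """LM statistic (N * R^2) from regressing squared residuals on x."""
    e2 = np.asarray(residuals, dtype=float) ** 2
    x = np.asarray(x, dtype=float)
    design = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(design, e2, rcond=None)
    ss_res = np.sum((e2 - design @ beta) ** 2)
    ss_tot = np.sum((e2 - e2.mean()) ** 2)
    r2 = 1.0 - ss_res / ss_tot
    return len(x) * r2


x = np.arange(1.0, 51.0)
signs = np.where(np.arange(50) % 2 == 0, 1.0, -1.0)

# Hypothetical residuals: the spread of the first set grows with x (funnel
# shape), while the spread of the second set does not depend on x.
funnel_resid = signs * x
steady_resid = np.where(np.arange(50) % 2 == 0, 1.0, 2.0) * signs

for label, resid in [("funnel", funnel_resid), ("steady", steady_resid)]:
    lm = breusch_pagan_lm(resid, x)
    verdict = "heteroscedastic" if lm > CHI2_CRIT_1DF_5PCT else "no evidence"
    print(f"{label}: LM = {lm:.2f} -> {verdict}")
```

With more than one feature, the auxiliary regression uses all of them and the statistic is compared against a chi-square distribution with as many degrees of freedom as there are features.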

Methods to handle Heteroscedasticity

There are two ways to handle heteroscedasticity; let’s understand both.

Transform the Dependent Variable

We can transform the dependent variable to reduce heteroscedasticity. The most commonly used transformation is taking the log of the dependent variable.

For instance, suppose we are using independent variables (input features) to predict the number of cosmetic shops in a city (target variable). We may instead use the input features to predict the log of the number of cosmetic shops in a city.

Using the log of the target variable helps to reduce heteroscedasticity, at least to some extent.

Use weighted regression

Another approach to deal with heteroscedasticity is weighted regression. In this method, a weight is assigned to each data point based on the variance of its fitted value, so that observations with higher variance get smaller weights.
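Here is a minimal sketch of weighted least squares with NumPy for a single feature. The function name, the synthetic data, and the choice of weights (1/x², which assumes the noise standard deviation grows in proportion to x) are all assumptions for illustration.

```python
import numpy as np


def weighted_least_squares(x, y, weights):
    """Fit y = c + m*x by minimizing sum(w_i * (y_i - c - m*x_i)^2).

    Scaling each row by sqrt(w_i) turns the weighted problem into an
    ordinary least-squares problem. Returns [intercept, slope].
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    sw = np.sqrt(np.asarray(weights, dtype=float))
    design = np.column_stack([np.ones_like(x), x])
    beta, *_ = np.linalg.lstsq(design * sw[:, None], y * sw, rcond=None)
    return beta


# Synthetic data: true line y = 3 + 2x with noise that grows with x.
rng = np.random.default_rng(42)
x = np.arange(1.0, 51.0)
y = 3.0 + 2.0 * x + rng.normal(scale=0.2 * x)

# Assumed variance model: Var(error_i) is proportional to x_i^2,
# so each point gets the weight 1 / x_i^2.
intercept, slope = weighted_least_squares(x, y, weights=1.0 / x ** 2)
print(f"intercept = {intercept:.2f}, slope = {slope:.2f}")
```

The weights only help if they reflect how the variance really changes, so it is worth re-checking the residual plot after refitting.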

Conclusion 

This is the end of this article. We discussed the assumptions of linear regression analysis, ways to check whether the assumptions are met, and what to do if they are violated.

It is necessary to consider these assumptions whenever you use linear regression for statistical analysis. The model’s estimates and predictions are far more trustworthy when these assumptions are met.

The classical linear regression model is one of the most efficient estimators when all the assumptions hold.

The best thing about this is that the efficiency improves as the sample size grows.

What next

After reading the article, please take any regression model you have built in the past and check these linear regression assumptions.

To implement linear regression and understand its concepts in a more practical way, I would suggest reading this article.

Also, explore the remaining machine learning algorithms on our platform to enhance your knowledge.
