## Linear Regression Analysis in R

*A walk-through about setup, diagnostic… | by Jinhang Jiang | Dec 2020*

R is a great free software environment for statistical analysis and graphics. In this blog, I will demonstrate how to do linear regression analysis in R by analyzing correlations between the independent variables and the dependent variable, estimating and fitting a model, and evaluating the usefulness and effectiveness of the results.

I think RStudio's interface (the most commonly used editor for R) is more user-friendly than the editors I have used with Python. I also personally believe that, even though Python is a more powerful programming language in general, R is much easier for beginners to learn, and its way of coding is more straightforward.

I have used both Python and R to do linear regression analysis for different projects and competitions. In a previous blog, I talked about conducting a linear regression analysis in Python. You may find the article here: Linear Regression Analysis on House Price in Python. If you are interested, you may read both articles and compare the similarities and differences.

I used a dataset from one of my assignments, which has information about 392 vehicles. The goal is to find a model to predict the variable mpg (= miles per gallon).

Here is the data dictionary of the possible independent variables:

cylinders = Number of cylinders between 4 and 8

displacement = Engine displacement (cu. inches)

horsepower = Engine horsepower

weight = Vehicle weight (lbs.)

acceleration = Time to accelerate from 0 to 60 mph (sec.)

year = Model year (modulo 100)

origin = Origin of the car (1. American, 2. European, 3. Japanese)

```r
# Read the data
auto = read.csv("Auto.csv")
head(auto)
```

Since we are not using a high-dimensional dataset, a common practice is to plot either scatter plots or boxplots to help understand the possible relationship between each of the independent variables and the dependent variable mpg.

```r
# Tell R which table you are manipulating
attach(auto)

# Set the plotting area into a 4*2 array
par(mfrow = c(4, 2))

# When you use "~" to connect the variables, the Y variable goes first
plot(mpg ~ cylinders)
plot(mpg ~ displacement)

# "origin" is a categorical variable, so we use a boxplot
boxplot(mpg ~ as.factor(origin))

# When you use "," to connect the variables, the X variable goes first
plot(horsepower, mpg)
plot(weight, mpg)
plot(acceleration, mpg)
plot(year, mpg)
```

Judging from the scatter plots above, I concluded that there was a curved relationship between mpg and three of the variables: weight, displacement, and horsepower; and a linear relationship between mpg and the other three variables: cylinders, acceleration, and year. For the boxplot, the medians were clearly different among the different origins, and the variation seemed relatively consistent; thus, origin would also be a good independent variable.

When estimating the model, I applied the poly() command, which avoids writing out a long formula with explicit powers of the variables that have a curved relationship with mpg.

```r
fit = lm(formula = mpg ~ cylinders + poly(displacement, 2) + poly(horsepower, 2) +
           poly(weight, 2) + acceleration + year + as.factor(origin), data = auto)
summary(fit)
```
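As a side note on poly(): by default it builds orthogonal polynomial terms, which are numerically more stable than raw powers, but the fitted model is the same either way. A minimal sketch with synthetic data (not the Auto dataset):

```r
# Synthetic stand-in for a curved relationship like mpg ~ horsepower
set.seed(1)
x <- runif(100, 50, 200)
y <- 40 - 0.2 * x + 0.0004 * x^2 + rnorm(100)

fit_poly <- lm(y ~ poly(x, 2))   # orthogonal polynomial terms
fit_raw  <- lm(y ~ x + I(x^2))   # explicit powers of x

# The two parameterizations span the same column space,
# so the fitted values agree to numerical precision
all.equal(fitted(fit_poly), fitted(fit_raw))
```

The coefficients of the two fits differ (orthogonal terms are recombinations of the raw powers), but predictions, R-squared, and the overall F-test are identical.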

In this case, we mainly look into the p-values and the R-squared scores.

P-value: It comes with 0 to 3 stars on the side, which indicate the variable's level of statistical significance. We are looking for at least one star to keep the variable in the model. For the p-value itself, we are looking for values smaller than 0.05. Note: < 2.2e-16 is the smallest value R will print; it indicates a p-value that is essentially zero.

R-squared: It tells you how much of the variation in Y is explained by the X variables in your model; normally we are looking for values of at least 0.5. Note that the R-squared score will never decrease when you add more independent variables to your model, even useless ones.
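This property of R-squared is easy to demonstrate: adding a predictor that is pure noise still cannot lower it. A minimal sketch with synthetic data (not the Auto dataset):

```r
set.seed(42)
n <- 100
x <- rnorm(n)
y <- 2 * x + rnorm(n)
noise <- rnorm(n)                 # pure noise, unrelated to y

r2_base  <- summary(lm(y ~ x))$r.squared
r2_noise <- summary(lm(y ~ x + noise))$r.squared

r2_noise >= r2_base               # always TRUE: R-squared never decreases
```

This is exactly why R-squared alone cannot tell you whether a new variable is worth keeping; that is the job of the adjusted R-squared and the individual p-values.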

Adjusted R-squared: It is a version of R-squared adjusted for the number of independent variables in the model, so it penalizes variables that do not improve the fit. The adjusted R-squared score will always be less than or equal to the R-squared score.
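For a model with an intercept, n observations, and p predictors, the adjustment R applies is 1 - (1 - R²)(n - 1)/(n - p - 1). A minimal sketch with synthetic data verifying this against summary():

```r
set.seed(7)
n <- 50
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- x1 + 0.5 * x2 + rnorm(n)

s <- summary(lm(y ~ x1 + x2))
p <- 2                                           # number of predictors

# Recompute the adjusted R-squared by hand
adj_manual <- 1 - (1 - s$r.squared) * (n - 1) / (n - p - 1)
all.equal(adj_manual, s$adj.r.squared)
```

Because the penalty grows with p, the adjusted R-squared can go down when a weak variable is added, which makes it a better yardstick for comparing models of different sizes.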

I excluded the variables that were not statistically significant. The summary of the new model is in Figure 4:

```r
# Exclude the variables that are not statistically significant
fit1 = lm(mpg ~ poly(horsepower, 2) + poly(weight, 2) + year +
            poly(acceleration, 2) + origin, data = auto)
summary(fit1)
```

Now all the variables are statistically significant.

For the diagnostic analysis, I essentially looked for evidence of whether the residuals of the model followed a normal distribution.

```r
## Diagnostic analysis (on the reduced model fit1)
par(mfrow = c(2, 2))
plot(fit1$res ~ fit1$fitted)
hist(fit1$res)
qqnorm(fit1$res)
qqline(fit1$res, col = 2, lwd = 2, lty = 2)
shapiro.test(fit1$res)
```

Residuals vs. Fitted Plot: I mainly looked for two things in this plot. First, whether the pattern violates linearity. Second, whether the residuals are randomly spread with constant variance.

Histogram: This plot also checks whether the residuals follow a normal distribution. A good plot should show a roughly symmetric bell shape with a peak in the middle.

Normal Q-Q Plot: This scatter plot also checks whether the residuals follow a normal distribution. In a good plot, the points fall along the reference line.

Shapiro-Wilk Normality Test: If the p-value of the test is larger than 0.05, we fail to reject the null hypothesis that the data follow a normal distribution; a p-value below 0.05 is evidence against normality.
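To see how the test behaves on data that is and is not normal, here is a minimal sketch with synthetic samples (not the residuals from the Auto model):

```r
set.seed(3)
normal_sample <- rnorm(200)   # drawn from a normal distribution
skewed_sample <- rexp(200)    # exponential: clearly non-normal

# p-value for the normal sample is usually well above 0.05,
# so we fail to reject normality
shapiro.test(normal_sample)$p.value

# p-value for the skewed sample is far below 0.05: reject normality
shapiro.test(skewed_sample)$p.value
```

Note that with very large samples the test becomes extremely sensitive and can flag trivial departures from normality, so it is best read alongside the histogram and Q-Q plot rather than on its own.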

Judging from the test results above, I had enough evidence that the model needed a transformation. Then, I applied the boxcox() function.

```r
## Box-Cox procedure
library(MASS)
boxcox(mpg ~ poly(horsepower, 2) + poly(weight, 2) + year +
         poly(acceleration, 2) + origin, data = auto)
```

Lambda: This value indicates what kind of transformation the dependent variable needs; the transformed variable is Y^lambda (with lambda = 0 meaning log(Y)). For example, if the plot suggests lambda equals 0.5, we take the square root of Y; if it suggests lambda equals -1, we use 1/Y.

In this case, the plot suggests lambda equals 0, which means a log transformation is needed.
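Instead of eyeballing the plot, the suggested lambda can also be read off programmatically from the grid that boxcox() returns. A minimal sketch with synthetic data built so that log(Y) is linear in X (not the Auto dataset):

```r
library(MASS)

set.seed(9)
x <- runif(200, 1, 10)
y <- exp(0.3 * x + rnorm(200, sd = 0.2))   # log(y) is linear in x

# boxcox() returns the profile log-likelihood over a grid of lambdas
bc <- boxcox(y ~ x, plotit = FALSE)
lambda <- bc$x[which.max(bc$y)]            # lambda maximizing the likelihood

lambda   # close to 0, pointing to a log transformation
```

Common practice is then to round the maximizing lambda to an interpretable value (0, 0.5, 1, -1, and so on) rather than using the raw grid point.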

```r
## Log transformation
auto$logmpg = log(auto$mpg)
fit2 = lm(logmpg ~ horsepower + weight + year + poly(acceleration, 2) + origin,
          data = auto)
summary(fit2)
```

All the variables were statistically significant.

The results of the diagnostic tests on the transformed model were still not perfect, but they were a great improvement over the previous model.

In this blog, I talked about implementing linear regression analysis in R in three steps: explore the data, estimate a model, and evaluate the results.

For estimating the linear regression model, I talked about the p-value and R-squared score.

I talked about residuals vs. fitted value plot, qqnorm plot, histogram plot, Shapiro-Wilk normality test, and boxcox plot for diagnostic analysis.

Special thanks to Prof. Steve Hillmer, one of my favorite professors at KU business school, who taught me everything about data mining in R.