Multiple regression as a machine learning algorithm
You’ve got the intuition from a simplified example of how multiple regression predicts the price of a used car based on two features: horsepower and highway mpg.
But the world we live in is complex and messy. Each of the steps I’ve shown above needs to be branched out further. For example, in Step 2 we imported data and assigned features to the X and y variables without doing any further analysis. And as everyone knows, data wrangling alone can take upwards of 80% of the effort in any machine learning project.
I’m not going to go into every situation you’ll encounter as a data scientist in the real world, but I’ll talk about some fundamental issues that are unavoidable.
1. Import libraries
I’m going to use a few libraries and modules to do some mandatory tasks:
- data wrangling: pandas and numpy
- multiple regression model: sklearn’s linear_model
- splitting training and testing data: sklearn’s train_test_split
- model evaluation: sklearn’s r2_score
import pandas as pd
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score
2. Data wrangling
First I’m reading the dataset into my environment (I’m using a Jupyter notebook, by the way) and, as a ritual, taking a peek at the first few rows.
# import data
data = pd.read_csv("../automobile.csv")

# peek at the first few rows
data.head()
As you can see, there are many columns that barely fit in the window. So I’m going to get a list of all columns.
# list all column names
data.columns

>> 'symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration', 'num-of-doors', 'body-style', 'drive-wheels', 'engine-location', 'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-type', 'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke', 'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg', 'highway-mpg', 'price'
(I’m not a car expert, so I really don’t know what some of those columns represent. But in real-world data science that’s not a good excuse; some domain knowledge is essential.)
I did a few other things as part of data exploration and preparation that I’m skipping here (such as checking and converting data types and removing a couple of rows that had placeholder symbols like ‘?’), but suffice it to say, getting the data into the right format can be a real pain in the neck.
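For the curious, here is a minimal sketch of what that skipped cleanup might look like. Which columns actually contained ‘?’ is an assumption on my part, since the article doesn’t show this step:

# a sketch of the skipped cleanup; the affected columns are assumed
data = data.replace("?", np.nan)                    # turn placeholder symbols into NaN
data = data.dropna(subset=["price", "horsepower"])  # drop rows missing key values
data[["price", "horsepower"]] = data[["price", "horsepower"]].astype(float)  # fix data types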
3. Preparing input data
We want to predict price, so the dependent variable is already set. Now comes the question of which features to use for prediction. Ideally I’d include all of the features in the initial model, but for this demo I’m choosing one categorical feature (i.e. make) and two numeric features (i.e. horsepower and highway-mpg), which I think most people would care about while choosing a car.
Following the ML convention, I’m designating the independent variables as X and the dependent variable as y.
# select data subset
df = data[["make", "horsepower", "highway-mpg", "price"]]

# select data for modeling
X = df[["make", "horsepower", "highway-mpg"]]
y = df["price"]
I’ve purposefully chosen a categorical variable (i.e. make) to highlight that you need to do some extra work to convert it into a machine-readable format (a.k.a. numbers!). There are several ways to do that, such as with LabelEncoder or OneHotEncoder, both available in the sklearn module (the OneHotEncoder route is sketched after the code below). But I’m going with the classical “dummy variable” approach, which converts categorical features into numerical dichotomous variables (0s and 1s).
At this stage, I’m also splitting data into training and testing set for model evaluation.
# create dummy variables
make_dummy = pd.get_dummies(X["make"], drop_first = True)
X = X.drop("make", axis = 1)
X = pd.concat([X, make_dummy], axis = 1)

# split data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)
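For comparison, here is a minimal sketch of the OneHotEncoder route mentioned above. It produces the same 0/1 columns as pd.get_dummies, just through the sklearn API (drop="first" mirrors drop_first=True):

from sklearn.preprocessing import OneHotEncoder

# sklearn alternative to pd.get_dummies; note that in newer sklearn
# versions the sparse argument is named sparse_output
encoder = OneHotEncoder(drop="first", sparse=False)
make_encoded = encoder.fit_transform(df[["make"]])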
4. Model building
It’s always amazing to think that, within the whole complex machine learning pipeline, the easiest part is (in my opinion, of course!) actually specifying the model. It’s just three easy steps, instantiate, fit and predict, as in most ML algorithms. Of course, you have to parameterize the model and iterate several times until you get one that satisfies your criteria, but still, this step of the model building process gives me the least headache.
# instantiate, fit and predict
model = linear_model.LinearRegression()
model.fit(X_train, y_train)
y_pred = model.predict(X_test)
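Once fitted, the model’s intercept_ and coef_ attributes spell out the actual regression equation, with one coefficient per feature (including each dummy column). A quick way to inspect them:

# the fitted equation: intercept plus one coefficient per feature
print(model.intercept_)
print(dict(zip(X_train.columns, model.coef_)))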
5. Model evaluation
Now comes the moment of truth: how well does the model perform? There are many ways to evaluate model performance, but in classical statistics the performance of linear regression models is evaluated with R², which gives a value between 0 and 1; the higher the R², the better the model.
# model evaluation
score = r2_score(y_test, y_pred)
In our demonstration we are getting an R² value of 0.81, meaning 81% of the variation in the dependent variable (i.e. used car price) can be explained by the three independent variables (i.e. make, horsepower and highway-mpg). Of course, this metric can be improved by including more variables, trying different combinations of them and tuning model parameters, but that’s the topic of a separate discussion.
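There is no magic inside r2_score, by the way: R² is one minus the ratio of the residual sum of squares to the total sum of squares. A minimal sketch, reusing y_test and y_pred from above:

# R² = 1 - (residual sum of squares / total sum of squares)
ss_res = np.sum((y_test - y_pred) ** 2)
ss_tot = np.sum((y_test - y_test.mean()) ** 2)
r2_manual = 1 - ss_res / ss_tot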
I felt like I could write forever on multiple regression; there are so many areas to cover, but I have to stop somewhere. Here are a few additional things to keep in mind while building a linear regression model for real-world applications:
- while selecting features, check the correlation between the dependent variable and each independent variable separately. If they are not correlated, remove the feature from the model (see the sketch after this list);
- check for multicollinearity, i.e. correlation among the independent variables themselves. Remove correlated features to keep coefficient estimates stable and avoid model overfitting;
- there are a few ways to choose variables for the model; forward selection and backward elimination are two of them. As the names suggest, you add or remove one variable at a time and check model performance;
- I used R² for model performance evaluation, but some people choose other metrics such as AIC, BIC, p-values etc. (also sketched below).
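To make the first and last of those points concrete, here is a minimal sketch. It reuses df, X_train and y_train from above; statsmodels isn’t used elsewhere in this article, so treat it as just one option for getting p-values, AIC and BIC:

# correlation of each numeric feature with price (and with each other)
print(df[["horsepower", "highway-mpg", "price"]].corr())

import statsmodels.api as sm

# refit the same regression with statsmodels, which reports p-values,
# AIC and BIC out of the box
X_sm = sm.add_constant(X_train.astype(float))  # add the intercept term
ols = sm.OLS(y_train.astype(float), X_sm).fit()
print(ols.pvalues)        # one p-value per coefficient
print(ols.aic, ols.bic)   # information criteria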