Multiple regression as a machine learning algorithm | by Mahbubul Alam | Nov, 2020


But the world we live in is complex and messy. Each of the steps I’ve shown above needs to be broken down further. For example, in Step 2 we imported data and assigned features to the X and y variables without doing any further analysis. Yet data wrangling alone is often said to take up as much as 80% of the effort in a machine learning project.

I’m not going to go into every situation you’ll encounter as a data scientist in the real world, but I’ll talk about some fundamental issues that are unavoidable.

1. Import libraries

  • data wrangling: pandas and numpy
  • multiple regression model: linear_model from sklearn
  • splitting training and testing data: train_test_split
  • model evaluation: r2_score

import pandas as pd
import numpy as np
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import r2_score

2. Data wrangling

# import data
data = pd.read_csv("../automobile.csv")

The dataset has so many columns that they barely fit in the window, so I’m going to get a list of all of them.

# all column names
data.columns
>> 'symboling', 'normalized-losses', 'make', 'fuel-type', 'aspiration', 'num-of-doors', 'body-style', 'drive-wheels', 'engine-location', 'wheel-base', 'length', 'width', 'height', 'curb-weight', 'engine-type', 'num-of-cylinders', 'engine-size', 'fuel-system', 'bore', 'stroke', 'compression-ratio', 'horsepower', 'peak-rpm', 'city-mpg', 'highway-mpg', 'price'

(I’m not a car expert, so I really don’t know what some of those columns represent. In real-world data science that is not a good excuse: some domain knowledge is essential.)

I did a few other things as part of data exploration and preparation that I’m skipping here (such as checking and converting data types, and removing a couple of rows that had symbols like ‘?’). Suffice it to repeat my previous point: getting the data into the right format can be a real pain in the neck.
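
The cleanup steps mentioned above are not shown in the article, so here is a minimal sketch of what they could look like. The small `raw` table is made up for illustration (it is not the real automobile data), and the choice to drop incomplete rows rather than impute them is an assumption.

```python
import pandas as pd

# Made-up sample standing in for the real file; values are illustrative only
raw = pd.DataFrame({
    "make": ["audi", "bmw", "audi", "toyota"],
    "horsepower": ["102", "?", "115", "62"],
    "price": ["13950", "16430", "?", "5348"],
})

num_cols = ["horsepower", "price"]

# convert text to numbers; anything unparseable (such as '?') becomes NaN
clean = raw.copy()
clean[num_cols] = raw[num_cols].apply(pd.to_numeric, errors="coerce")

# drop the rows that are missing a value needed for modeling
clean = clean.dropna(subset=num_cols).reset_index(drop=True)

print(clean.dtypes)
print(clean)
```

Only the two rows with complete numeric values survive. On a real dataset it is worth counting how many rows are dropped before deciding whether dropping is acceptable.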

3. Preparing input data

Following the ML convention, I’m designating the independent variables as X and the dependent variable as y.

# select data subset
df = data[["make", "horsepower", "highway-mpg", "price"]]
# select data for modeling
X = df[["make", "horsepower", "highway-mpg"]]
y = df["price"]

I’ve purposefully chosen a categorical variable (i.e. make) to highlight that you need to do some extra work to convert it into a machine-readable format (a.k.a. numbers!). There are several ways to do that, such as LabelEncoder or OneHotEncoder, both available in sklearn. But I’m going with the classical “dummy variable” approach, which converts a categorical feature into a set of numerical dichotomous variables (0s and 1s), one per category.
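
To make the dummy-variable idea concrete, here is a small sketch on a made-up `toy` column (not the real data). It shows what `drop_first=True` does and, for comparison, that sklearn’s `OneHotEncoder` with `drop="first"` produces the same 0/1 matrix.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Made-up categorical column for illustration
toy = pd.DataFrame({"make": ["audi", "bmw", "toyota", "bmw"]})

# one column per category
all_levels = pd.get_dummies(toy["make"])

# drop_first=True removes the first category (alphabetically, "audi");
# a row of all zeros then means "audi", so no information is lost
dropped = pd.get_dummies(toy["make"], drop_first=True)

print(list(all_levels.columns))
print(list(dropped.columns))
print(dropped.astype(int))

# the sklearn equivalent; .toarray() converts the sparse result to a dense array
encoded = OneHotEncoder(drop="first").fit_transform(toy[["make"]]).toarray()
print(np.array_equal(encoded, dropped.astype(float).to_numpy()))
```

Dropping one level matters for linear regression: with an intercept in the model, keeping every dummy column makes the columns perfectly collinear.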

At this stage, I’m also splitting the data into training and testing sets for model evaluation.

# create dummy variables
make_dummy = pd.get_dummies(X["make"], drop_first = True)
X = X.drop("make", axis = 1)
X = pd.concat([X, make_dummy], axis = 1)
# split data into train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.3, random_state = 0)

4. Model building

# instantiate
model = linear_model.LinearRegression()
# fit
model.fit(X_train, y_train)
# predict
y_pred = model.predict(X_test)

5. Model evaluation

# model evaluation
score = r2_score(y_test, y_pred)
# score
>> 0.8135478081839133

In our demonstration we get an R² value of 0.81, meaning that 81% of the variation in the dependent variable (i.e. used car price) in the test set is explained by the three independent variables (i.e. make, horsepower and highway-mpg). Of course, this metric can be improved by including more variables, trying different combinations of them, and tuning model parameters, which is the topic of a separate discussion.
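
For readers who want to see where the “variation explained” reading comes from, here is a short sketch that computes R² by hand as 1 minus the ratio of the residual sum of squares to the total sum of squares, and checks it against `r2_score`. The two arrays are made-up numbers, not predictions from the model above.

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up actual and predicted values for illustration
y_true = np.array([10.0, 12.0, 15.0, 19.0, 24.0])
y_hat = np.array([11.0, 12.5, 14.0, 18.0, 25.0])

# variation left unexplained by the predictions
ss_res = np.sum((y_true - y_hat) ** 2)

# total variation around the mean of the actual values
ss_tot = np.sum((y_true - y_true.mean()) ** 2)

r2_manual = 1 - ss_res / ss_tot

print(r2_manual)
print(r2_score(y_true, y_hat))  # same value as the manual calculation
```

A model that always predicts the mean of y gets an R² of 0, and a perfect model gets 1. On test data the value can even be negative if the model does worse than predicting the mean.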

  • while selecting features, check the correlation between the dependent variable and each independent variable separately. If a feature is not correlated with the target, consider removing it from the model;
  • check for multicollinearity, i.e. strong relationships among the independent variables themselves. Highly correlated features make the coefficient estimates unstable and hard to interpret, so consider removing or combining them;
  • there are a few ways to choose variables for the model. Forward selection and backward elimination are two of them. As the names suggest, in this process you add or remove one variable at a time and check model performance;
  • I used R² for model performance evaluation, but some people choose other criteria such as AIC, BIC, p-values, etc.
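
The first two checks in the list above can be sketched as follows. Everything here is an assumption for illustration: the data is synthetic (with `engine_size` deliberately built as a near copy of `horsepower`), and the variance inflation factor (VIF) is one common way to quantify multicollinearity, not something the article itself uses. The helper `vif` is a hypothetical function written for this example.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic data: engine_size is nearly a rescaled copy of horsepower
rng = np.random.default_rng(0)
n = 200
horsepower = rng.normal(100, 30, n)
engine_size = 1.2 * horsepower + rng.normal(0, 5, n)
highway_mpg = rng.normal(30, 5, n)
price = 100 * horsepower - 200 * highway_mpg + rng.normal(0, 1000, n)

X = pd.DataFrame({
    "horsepower": horsepower,
    "engine_size": engine_size,
    "highway_mpg": highway_mpg,
})
y = pd.Series(price, name="price")

# 1) correlation of each feature with the target
target_corr = X.corrwith(y)
print(target_corr)

# 2) VIF: regress each feature on the others; VIF = 1 / (1 - R²)
def vif(features: pd.DataFrame) -> pd.Series:
    scores = {}
    for col in features.columns:
        others = features.drop(columns=col)
        r2 = LinearRegression().fit(others, features[col]).score(others, features[col])
        scores[col] = 1.0 / (1.0 - r2)
    return pd.Series(scores)

vif_scores = vif(X)
print(vif_scores)
```

A frequently quoted rule of thumb treats a VIF above roughly 5 to 10 as a sign of problematic multicollinearity; in this synthetic example the two engine-related features should stand out while `highway_mpg` stays close to 1.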

