Logistic Regression Explained. A High-Level Overview of Logistic… | by Jason Wong | Dec, 2020


Let’s see how the concepts above can be implemented with scikit-learn. Once again, I will be using the well-known Titanic dataset, obtained from Kaggle. The goal is to predict whether a given passenger survived.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import OneHotEncoder, StandardScaler, LabelEncoder
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import confusion_matrix
df = pd.read_csv("titanic.csv")
df.drop(["Name", "Ticket", "Cabin"], axis=1, inplace=True)
df.Embarked = df.Embarked.fillna(value='S')
numerical = ['Age', 'Fare']
categorical = ["Pclass", "Sex", "SibSp", "Parch", "Embarked"]
#Performing train test split
X = df.drop(['Survived'], axis=1)
y = df.Survived
y = LabelEncoder().fit_transform(y)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
#Defining train and test index variables for casting the scaled
#numerical values in a dataframe

X_train_index = X_train.index
X_test_index = X_test.index
#Instantiating OneHotEncoder and defining the train and test
#features to be encoded

ohe = OneHotEncoder()
X_train_ohe = X_train[categorical]
X_test_ohe = X_test[categorical]
#Fitting the encoder to the train set and transforming both
#the train and test set

X_train_encoded = ohe.fit_transform(X_train_ohe)
X_test_encoded = ohe.transform(X_test_ohe)
#Instantiating StandardScaler and defining continuous variables
#to be scaled

ss = StandardScaler()
X_train_cont = X_train[numerical].astype(float)
X_test_cont = X_test[numerical].astype(float)
#Scaling the continuous features and casting results as dataframes
X_train_scaled = pd.DataFrame(ss.fit_transform(X_train_cont), columns=X_train_cont.columns, index=X_train_index)
X_test_scaled = pd.DataFrame(ss.transform(X_test_cont), columns=X_test_cont.columns, index=X_test_index)
#Defining the columns for the train and test splits
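#get_feature_names_out requires scikit-learn 1.0 or newer; older versions use get_feature_names
train_columns = ohe.get_feature_names_out(input_features=X_train_ohe.columns)
test_columns = ohe.get_feature_names_out(input_features=X_test_ohe.columns)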
#Casting the encoded X_train and X_test as dataframes
X_train_processed = pd.DataFrame(X_train_encoded.todense(), columns=train_columns, index=X_train_index)
X_test_processed = pd.DataFrame(X_test_encoded.todense(), columns=test_columns, index=X_test_index)
#combining the encoded and scaled dataframes for a preprocessed
#X_train and X_test
X_train = pd.concat([X_train_scaled, X_train_processed], axis=1)
X_test = pd.concat([X_test_scaled, X_test_processed], axis=1)
#Fitting the logistic regression model
log_reg = LogisticRegression()
log_reg.fit(X_train, y_train)
#Defining the predictions as y_hat
y_hat_train = log_reg.predict(X_train)
y_hat_test = log_reg.predict(X_test)
#Scoring the model on the test set by mean accuracy
train_acc = log_reg.score(X_train, y_train)
test_acc = log_reg.score(X_test, y_test)
print('Training Accuracy: \n{}\n'.format(train_acc))
print('Testing Accuracy: \n{}'.format(test_acc))
#Probability estimates
log_reg.predict_proba(X_test)[:10]
Class 0 (left) | Class 1 (right)
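To make the probability output concrete, here is a minimal sketch on synthetic data. `make_classification` stands in for the Titanic features, and the `*_demo` names are my own, so the numbers are illustrative only. It checks that each row of `predict_proba` sums to 1 and that `predict` matches a 0.5 threshold on the class 1 column.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption: not the Titanic features)
X_demo, y_demo = make_classification(n_samples=200, n_features=5, random_state=42)

demo_model = LogisticRegression().fit(X_demo, y_demo)
demo_probas = demo_model.predict_proba(X_demo)

# Column 0 is P(class 0), column 1 is P(class 1); each row sums to 1
row_sums = demo_probas.sum(axis=1)

# predict() picks the more probable class, which is P(class 1) > 0.5 here
manual_preds = (demo_probas[:, 1] > 0.5).astype(int)
default_preds = demo_model.predict(X_demo)

print(demo_probas[:5])
print(np.allclose(row_sums, 1.0))
print((manual_preds == default_preds).all())
```

Because the default prediction is just a 0.5 cut on the second column, any other cut-off can be applied to that column directly.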

Hyperparameter Tuning the Inverse Regularization Strength (C)

To find the best inverse regularization strength, we can use NumPy to create an array of 1,000 evenly spaced values from 1 to 1000. In case the model performs better with very strong regularization (that is, a very small C), we also insert 0.0001 at the front of this array. By running a for loop and fitting a logistic regression model with each C value, we can store each C value and its validation score in a dictionary for reference.

#Performing a secondary train test split for tuning
X_train, X_val, y_train, y_val = train_test_split(X_train, y_train, random_state=42)
#Defining C_values list containing the inverse regularization strengths
C_values = np.linspace(1, 1000, 1000)
C_values = np.insert(C_values, 0, .0001)
#For loop to obtain the best C value with ridge regularization
C_scores = {}
for c in C_values:
    lr = LogisticRegression(penalty='l2', C=c)
    lr.fit(X_train, y_train)
    acc = lr.score(X_val, y_val)
    C_scores[c] = acc
#Defining the best c value and printing the value with the accuracy
#score of the model
optimal_c = max(C_scores, key=C_scores.get)
print('Best C value: {}\n'.format(optimal_c))
print('Accuracy score (Validation set) {}'.format(C_scores[optimal_c]))
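#Fitting the final model; C=10 is hard-coded here, pass optimal_c instead
#to reuse the value found by the search
final_model = LogisticRegression(penalty='l2', C=10)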
final_model.fit(X_train, y_train)
train_acc = final_model.score(X_train, y_train)
test_acc = final_model.score(X_test, y_test)
print('Training Accuracy: \n{}\n'.format(train_acc))
print('Testing Accuracy: \n{}'.format(test_acc))
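As a possible alternative to a single validation split, the search over C could be scored with `cross_val_score`, which the imports include but this section never uses. The sketch below is not the author's method: it assumes a log-spaced grid, 5 folds, and synthetic data from `make_classification`, and all names in it are my own.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic stand-in data (assumption: not the Titanic features)
X_demo, y_demo = make_classification(n_samples=300, n_features=8, random_state=42)

# Log-spaced candidates from 1e-4 to 1e3 (range chosen for illustration)
C_grid = np.logspace(-4, 3, 8)

cv_scores = {}
for c in C_grid:
    # The default penalty is L2 (ridge); max_iter is raised to help convergence
    model = LogisticRegression(C=c, max_iter=1000)
    # Mean accuracy over 5 cross-validation folds
    cv_scores[c] = cross_val_score(model, X_demo, y_demo, cv=5, scoring='accuracy').mean()

best_c = max(cv_scores, key=cv_scores.get)
print('Best C value: {}'.format(best_c))
print('Mean CV accuracy: {}'.format(cv_scores[best_c]))
```

A log-spaced grid covers several orders of magnitude with only a handful of fits, and averaging over folds makes the chosen C less sensitive to one particular split.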
probas = final_model.predict_proba(X_test)
#Setting a higher threshold
higher_threshold = probas[:,1] > .65
#Defining the predictions for evaluation
y_hat_higher = higher_threshold.astype(int)
#evaluation is a helper function that is not defined in this excerpt
evaluation(y_test, y_hat_higher)
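The `evaluation` helper called above is not shown in this section. A hypothetical stand-in, assuming it reports a confusion matrix plus accuracy, precision, and recall, might look like the sketch below. The toy labels and probabilities are made up purely to show the effect of moving the threshold from 0.5 to 0.65.

```python
import numpy as np
from sklearn.metrics import confusion_matrix, accuracy_score, precision_score, recall_score

def evaluation(y_true, y_pred):
    """Hypothetical stand-in for the article's helper: print and return basic metrics."""
    cm = confusion_matrix(y_true, y_pred)
    results = {
        'accuracy': accuracy_score(y_true, y_pred),
        'precision': precision_score(y_true, y_pred, zero_division=0),
        'recall': recall_score(y_true, y_pred, zero_division=0),
    }
    print('Confusion matrix:\n{}'.format(cm))
    for name, value in results.items():
        print('{}: {}'.format(name, value))
    return cm, results

# Toy labels and class 1 probabilities (made up for illustration)
y_true_demo = np.array([0, 0, 1, 1, 1, 0])
p_class1_demo = np.array([0.10, 0.60, 0.70, 0.55, 0.90, 0.30])

# Default 0.5 threshold versus the stricter 0.65 threshold
cm_default, res_default = evaluation(y_true_demo, (p_class1_demo > 0.5).astype(int))
cm_higher, res_higher = evaluation(y_true_demo, (p_class1_demo > 0.65).astype(int))
```

On this toy data the higher threshold removes the one false positive but misses one true positive, so precision rises while recall falls.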
