Chapter 3 - Regression Models
Segment 3 - Logistic regression
Logistic Regression
Logistic regression is a simple machine learning method you can use to predict the value of a categorical target variable (typically a binary one) based on its relationship with one or more predictor variables.
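Under the hood, logistic regression models the probability of the positive class with the sigmoid (logistic) function, p = 1 / (1 + e^-(b0 + b1*x1 + ... + bk*xk)), and classifies by thresholding that probability. A minimal sketch of the idea, with made-up coefficients purely for illustration:
import numpy as np

def sigmoid(z):
    # squashes any real number into the (0, 1) range
    return 1.0 / (1.0 + np.exp(-z))

b0, b1, b2 = -1.5, 0.8, -0.3   # hypothetical intercept and coefficients
x1, x2 = 2.0, 1.0              # one observation's predictor values

p = sigmoid(b0 + b1*x1 + b2*x2)  # probability of the positive class
print(p)                         # ~0.45 for these made-up values
print(int(p > 0.5))              # predicted class at the usual 0.5 threshold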
Logistic Regression Use Cases
- Customer Churn Prediction
- Employee Attrition Modeling
- Hazardous Event Prediction
- Purchase Propensity vs. Ad Spend Analysis
Logistic Regression Assumptions
- Data is free of missing values
- The predicted (target) variable is binary (that is, it takes only two values) or ordinal (that is, a categorical variable with ordered values)
- All predictors are independent of each other
- There are at least 50 observations per predictor variable (to ensure reliable results)
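Before fitting a model, it's worth verifying these assumptions directly. A quick sketch of the kinds of up-front checks you might run, where df and 'target' are placeholders for your own data:
df.isnull().sum()              # any missing values?
df['target'].nunique()         # should be 2 for a binary target
df.corr(numeric_only=True)     # scan for highly correlated predictor pairs
len(df) / (df.shape[1] - 1)    # observations per predictor; aim for 50+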
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib.pyplot as plt
import sklearn
from pandas import Series, DataFrame
from pylab import rcParams
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_predict
from sklearn import metrics
from sklearn.metrics import classification_report
from sklearn.metrics import confusion_matrix
from sklearn.metrics import precision_score, recall_score
%matplotlib inline
rcParams['figure.figsize'] = 5, 4
sb.set_style('whitegrid')
Logistic regression on the Titanic dataset
address = '~/Data/titanic-training-data.csv'
titanic_training = pd.read_csv(address)
titanic_training.columns = ['PassengerId','Survived','Pclass','Name','Sex','Age','SibSp','Parch','Ticket','Fare','Cabin','Embarked']
print(titanic_training.head())
PassengerId Survived Pclass
0 1 0 3
1 2 1 1
2 3 1 3
3 4 1 1
4 5 0 3
Name Sex Age SibSp
0 Braund, Mr. Owen Harris male 22.0 1
1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1
2 Heikkinen, Miss. Laina female 26.0 0
3 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1
4 Allen, Mr. William Henry male 35.0 0
Parch Ticket Fare Cabin Embarked
0 0 A/5 21171 7.2500 NaN S
1 0 PC 17599 71.2833 C85 C
2 0 STON/O2. 3101282 7.9250 NaN S
3 0 113803 53.1000 C123 S
4 0 373450 8.0500 NaN S
print(titanic_training.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 891 non-null int64
1 Survived 891 non-null int64
2 Pclass 891 non-null int64
3 Name 891 non-null object
4 Sex 891 non-null object
5 Age 714 non-null float64
6 SibSp 891 non-null int64
7 Parch 891 non-null int64
8 Ticket 891 non-null object
9 Fare 891 non-null float64
10 Cabin 204 non-null object
11 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB
None
VARIABLE DESCRIPTIONS
Survived - Survival (0 = No; 1 = Yes)
Pclass - Passenger Class (1 = 1st; 2 = 2nd; 3 = 3rd)
Name - Name
Sex - Sex
Age - Age
SibSp - Number of Siblings/Spouses Aboard
Parch - Number of Parents/Children Aboard
Ticket - Ticket Number
Fare - Passenger Fare (British pound)
Cabin - Cabin
Embarked - Port of Embarkation (C = Cherbourg, France; Q = Queenstown, Ireland (now Cobh); S = Southampton, UK)
Checking that your target variable is binary
sb.countplot(x='Survived', data=titanic_training, palette='hls')
[Countplot of Survived: two bars, for values 0 and 1, confirming that the target variable is binary.]
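If you prefer a numeric check to a plot, value_counts gives the same information (a quick aside, not part of the original notebook):
titanic_training['Survived'].value_counts()
# 0    549
# 1    342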
Checking for missing values
titanic_training.isnull().sum()
PassengerId 0
Survived 0
Pclass 0
Name 0
Sex 0
Age 177
SibSp 0
Parch 0
Ticket 0
Fare 0
Cabin 687
Embarked 2
dtype: int64
titanic_training.describe()
|       | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|-------|-------------|----------|--------|-----|-------|-------|------|
| count | 891.000000 | 891.000000 | 891.000000 | 714.000000 | 891.000000 | 891.000000 | 891.000000 |
| mean | 446.000000 | 0.383838 | 2.308642 | 29.699118 | 0.523008 | 0.381594 | 32.204208 |
| std | 257.353842 | 0.486592 | 0.836071 | 14.526497 | 1.102743 | 0.806057 | 49.693429 |
| min | 1.000000 | 0.000000 | 1.000000 | 0.420000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 223.500000 | 0.000000 | 2.000000 | 20.125000 | 0.000000 | 0.000000 | 7.910400 |
| 50% | 446.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
| 75% | 668.500000 | 1.000000 | 3.000000 | 38.000000 | 1.000000 | 0.000000 | 31.000000 |
| max | 891.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 6.000000 | 512.329200 |
Taking care of missing values
Dropping missing values
So let's go ahead and drop the variables that aren't relevant for predicting survival. We should keep at least the following:
- Survived - This variable is obviously relevant.
- Pclass - Does a passenger's class on the boat affect their survivability?
- Sex - Could a passenger's gender impact their survival rate?
- Age - Does a person's age impact their survival rate?
- SibSp - Does the number of relatives on the boat (siblings or a spouse) affect a person's survivability? Probably.
- Parch - Does the number of relatives on the boat (children or parents) affect a person's survivability? Probably.
- Fare - Does the fare a person paid affect their survivability? Maybe - let's keep it.
- Embarked - Does a person's point of embarkation matter? It depends on how the boat was filled... Let's keep it.
What about a person's name, ticket number, and passenger ID number? They're irrelevant for predicting survivability. And as you recall, the Cabin variable is almost all missing values, so we can drop all of these.
titanic_data = titanic_training.drop(['Name','Ticket','Cabin'], axis=1)
titanic_data.head()
|   | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S |
| 1 | 2 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C |
| 2 | 3 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S |
| 3 | 4 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S |
| 4 | 5 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S |
Imputing missing values
sb.boxplot(x='Parch', y='Age', data=titanic_data, palette='hls')
[Boxplot of Age by Parch: the age distribution shifts noticeably from one Parch value to the next, so Parch is a reasonable variable to impute Age from.]
Parch_groups = titanic_data.groupby(titanic_data['Parch'])
Parch_groups.mean()
| Parch | PassengerId | Survived | Pclass | Age | SibSp | Fare |
|-------|---|---|---|---|---|---|
| 0 | 445.255162 | 0.343658 | 2.321534 | 32.178503 | 0.237463 | 25.586774 |
| 1 | 465.110169 | 0.550847 | 2.203390 | 24.422000 | 1.084746 | 46.778180 |
| 2 | 416.662500 | 0.500000 | 2.275000 | 17.216912 | 2.062500 | 64.337604 |
| 3 | 579.200000 | 0.600000 | 2.600000 | 33.200000 | 1.000000 | 25.951660 |
| 4 | 384.000000 | 0.000000 | 2.500000 | 44.500000 | 0.750000 | 84.968750 |
| 5 | 435.200000 | 0.200000 | 3.000000 | 39.200000 | 0.600000 | 32.550000 |
| 6 | 679.000000 | 0.000000 | 3.000000 | 43.000000 | 1.000000 | 46.900000 |
def age_approx(cols):
    # Unpack the Age and Parch values for this row
    Age = cols[0]
    Parch = cols[1]
    # If Age is missing, substitute the (rounded) mean age of the
    # passenger's Parch group, taken from the group means above
    if pd.isnull(Age):
        if Parch == 0:
            return 32
        elif Parch == 1:
            return 24
        elif Parch == 2:
            return 17
        elif Parch == 3:
            return 33
        elif Parch == 4:
            return 45
        else:
            return 30
    # Otherwise keep the observed age
    else:
        return Age
titanic_data['Age'] = titanic_data[['Age','Parch']].apply(age_approx, axis=1)
titanic_data.isnull().sum()
PassengerId 0
Survived 0
Pclass 0
Sex 0
Age 0
SibSp 0
Parch 0
Fare 0
Embarked 2
dtype: int64
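As an aside, the same Parch-based imputation can be written more compactly with groupby and transform, which fills each missing Age with its group's exact (unrounded) mean; a sketch, not what this notebook uses:
titanic_data['Age'] = titanic_data['Age'].fillna(
    titanic_data.groupby('Parch')['Age'].transform('mean'))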
titanic_data.dropna(inplace=True)
titanic_data.reset_index(inplace=True, drop=True)
print(titanic_data.info())
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 889 entries, 0 to 888
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 889 non-null int64
1 Survived 889 non-null int64
2 Pclass 889 non-null int64
3 Sex 889 non-null object
4 Age 889 non-null float64
5 SibSp 889 non-null int64
6 Parch 889 non-null int64
7 Fare 889 non-null float64
8 Embarked 889 non-null object
dtypes: float64(2), int64(5), object(2)
memory usage: 62.6+ KB
None
Converting categorical variables to dummy indicators
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
gender_cat = titanic_data['Sex']
gender_encoded = label_encoder.fit_transform(gender_cat)
gender_encoded[0:5]
array([1, 0, 0, 0, 1])
titanic_data.head()
|   | PassengerId | Survived | Pclass | Sex | Age | SibSp | Parch | Fare | Embarked |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | male | 22.0 | 1 | 0 | 7.2500 | S |
| 1 | 2 | 1 | 1 | female | 38.0 | 1 | 0 | 71.2833 | C |
| 2 | 3 | 1 | 3 | female | 26.0 | 0 | 0 | 7.9250 | S |
| 3 | 4 | 1 | 1 | female | 35.0 | 1 | 0 | 53.1000 | S |
| 4 | 5 | 0 | 3 | male | 35.0 | 0 | 0 | 8.0500 | S |
# 1 = male / 0 = female
gender_DF = pd.DataFrame(gender_encoded, columns=['male_gender'])
gender_DF.head()
|   | male_gender |
|---|---|
| 0 | 1 |
| 1 | 0 |
| 2 | 0 |
| 3 | 0 |
| 4 | 1 |
embarked_cat = titanic_data['Embarked']
embarked_encoded = label_encoder.fit_transform(embarked_cat)
embarked_encoded[0:100]
array([2, 0, 2, 2, 2, 1, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 1, 2, 2, 0, 2, 2,
1, 2, 2, 2, 0, 2, 1, 2, 0, 0, 1, 2, 0, 2, 0, 2, 2, 0, 2, 2, 0, 0,
1, 2, 1, 1, 0, 2, 2, 2, 0, 2, 0, 2, 2, 0, 2, 2, 0, 2, 2, 0, 0, 2,
2, 2, 2, 2, 2, 2, 0, 2, 2, 2, 2, 2, 2, 2, 2, 1, 2, 2, 2, 2, 2, 2,
2, 2, 2, 2, 2, 2, 2, 0, 0, 2, 2, 2])
from sklearn.preprocessing import OneHotEncoder
binary_encoder = OneHotEncoder(categories='auto')
embarked_1hot = binary_encoder.fit_transform(embarked_encoded.reshape(-1,1))
embarked_1hot_mat = embarked_1hot.toarray()
embarked_DF = pd.DataFrame(embarked_1hot_mat, columns = ['C','Q','S'])
embarked_DF.head()
|   | C | Q | S |
|---|---|---|---|
| 0 | 0.0 | 0.0 | 1.0 |
| 1 | 1.0 | 0.0 | 0.0 |
| 2 | 0.0 | 0.0 | 1.0 |
| 3 | 0.0 | 0.0 | 1.0 |
| 4 | 0.0 | 0.0 | 1.0 |
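Worth knowing: pandas can build both sets of dummy columns in one step with get_dummies, which also keeps the row index aligned automatically. A sketch of the equivalent (not the approach used above):
pd.get_dummies(titanic_data[['Sex', 'Embarked']]).head()
# produces columns Sex_female, Sex_male, Embarked_C, Embarked_Q, Embarked_S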
titanic_data.drop(['Sex','Embarked'], axis=1, inplace=True)
titanic_data.head()
|   | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| 0 | 1 | 0 | 3 | 22.0 | 1 | 0 | 7.2500 |
| 1 | 2 | 1 | 1 | 38.0 | 1 | 0 | 71.2833 |
| 2 | 3 | 1 | 3 | 26.0 | 0 | 0 | 7.9250 |
| 3 | 4 | 1 | 1 | 35.0 | 1 | 0 | 53.1000 |
| 4 | 5 | 0 | 3 | 35.0 | 0 | 0 | 8.0500 |
titanic_dmy = pd.concat([titanic_data, gender_DF, embarked_DF], axis=1, verify_integrity=True).astype(float)
titanic_dmy[0:5]
|   | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare | male_gender | C | Q | S |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 3.0 | 22.0 | 1.0 | 0.0 | 7.2500 | 1.0 | 0.0 | 0.0 | 1.0 |
| 1 | 2.0 | 1.0 | 1.0 | 38.0 | 1.0 | 0.0 | 71.2833 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | 3.0 | 1.0 | 3.0 | 26.0 | 0.0 | 0.0 | 7.9250 | 0.0 | 0.0 | 0.0 | 1.0 |
| 3 | 4.0 | 1.0 | 1.0 | 35.0 | 1.0 | 0.0 | 53.1000 | 0.0 | 0.0 | 0.0 | 1.0 |
| 4 | 5.0 | 0.0 | 3.0 | 35.0 | 0.0 | 0.0 | 8.0500 | 1.0 | 0.0 | 0.0 | 1.0 |
Checking for independence between features
sb.heatmap(titanic_dmy.corr())
[Correlation heatmap of the titanic_dmy features. Fare and Pclass show clear correlation with each other (fare is largely determined by passenger class), which works against the independence assumption, so we drop them.]
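If you prefer exact numbers to colors, you can print the correlation matrix directly (a quick aside):
print(titanic_dmy.corr().round(2))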
titanic_dmy.drop(['Fare','Pclass'], axis=1, inplace=True)
titanic_dmy.head()
|   | PassengerId | Survived | Age | SibSp | Parch | male_gender | C | Q | S |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 1.0 | 0.0 | 22.0 | 1.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 1 | 2.0 | 1.0 | 38.0 | 1.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 |
| 2 | 3.0 | 1.0 | 26.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 3 | 4.0 | 1.0 | 35.0 | 1.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 4 | 5.0 | 0.0 | 35.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
Checking that your dataset size is sufficient
titanic_dmy.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 889 entries, 0 to 888
Data columns (total 9 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 PassengerId 889 non-null float64
1 Survived 889 non-null float64
2 Age 889 non-null float64
3 SibSp 889 non-null float64
4 Parch 889 non-null float64
5 male_gender 889 non-null float64
6 C 889 non-null float64
7 Q 889 non-null float64
8 S 889 non-null float64
dtypes: float64(9)
memory usage: 62.6 KB
With 889 observations and 8 predictors, we comfortably exceed the guideline of at least 50 observations per predictor (8 × 50 = 400). Now split off a 20% hold-out test set:
X_train, X_test, y_train, y_test = train_test_split(titanic_dmy.drop('Survived', axis=1),
                                                    titanic_dmy['Survived'], test_size=0.2,
                                                    random_state=200)
print(X_train.shape)
print(y_train.shape)
(711, 8)
(711,)
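Aside: because the classes are imbalanced (roughly 38% of passengers survived), you could also pass stratify to preserve the class ratio in both splits. A sketch; the split above is a plain random one:
X_train, X_test, y_train, y_test = train_test_split(
    titanic_dmy.drop('Survived', axis=1), titanic_dmy['Survived'],
    test_size=0.2, random_state=200, stratify=titanic_dmy['Survived'])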
X_train[0:5]
|     | PassengerId | Age | SibSp | Parch | male_gender | C | Q | S |
|-----|---|---|---|---|---|---|---|---|
| 719 | 721.0 | 6.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 165 | 167.0 | 24.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 1.0 |
| 879 | 882.0 | 33.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 1.0 |
| 451 | 453.0 | 30.0 | 0.0 | 0.0 | 1.0 | 1.0 | 0.0 | 0.0 |
| 181 | 183.0 | 9.0 | 4.0 | 2.0 | 1.0 | 0.0 | 0.0 | 1.0 |
Deploying and evaluating the model
LogReg = LogisticRegression(solver='liblinear')
LogReg.fit(X_train, y_train)
LogisticRegression(solver='liblinear')
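To see what the model learned, you can inspect the fitted coefficients; positive values push a prediction toward survival, and exponentiating a coefficient gives an odds ratio. A sketch (the exact numbers depend on the split):
coef_table = pd.DataFrame({'feature': X_train.columns,
                           'coef': LogReg.coef_[0],
                           'odds_ratio': np.exp(LogReg.coef_[0])})
print(coef_table)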
y_pred = LogReg.predict(X_test)
Model Evaluation
Classification report without cross-validation
print(classification_report(y_test, y_pred))
precision recall f1-score support
0.0 0.83 0.88 0.85 109
1.0 0.79 0.71 0.75 69
accuracy 0.81 178
macro avg 0.81 0.80 0.80 178
weighted avg 0.81 0.81 0.81 178
K-fold cross-validation & confusion matrices
y_train_pred = cross_val_predict(LogReg, X_train, y_train, cv=5)
confusion_matrix(y_train, y_train_pred)
array([[377, 63],
[ 91, 180]])
precision_score(y_train, y_train_pred)
0.7407407407407407
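scikit-learn lays the confusion matrix out as [[TN, FP], [FN, TP]], so here TP = 180 and FP = 63, and precision = 180 / (180 + 63) ≈ 0.741, which matches the score above. recall_score was imported earlier but never called; for completeness (recall = TP / (TP + FN)):
recall_score(y_train, y_train_pred)
# 180 / (180 + 91) ≈ 0.664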
Make a test prediction
titanic_dmy[863:864]
|     | PassengerId | Survived | Age | SibSp | Parch | male_gender | C | Q | S |
|-----|---|---|---|---|---|---|---|---|---|
| 863 | 866.0 | 1.0 | 42.0 | 0.0 | 0.0 | 0.0 | 0.0 | 0.0 | 1.0 |
test_passenger = np.array([866,40,0,0,0,0,0,1]).reshape(1,-1)
print(LogReg.predict(test_passenger))
print(LogReg.predict_proba(test_passenger))
[1.]
[[0.26351831 0.73648169]]
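The model predicts class 1 (survived) with roughly 74% probability for this test passenger, consistent with the actual record above (Survived = 1.0). One practical note: because the model was fit on a DataFrame, recent versions of scikit-learn warn when you then predict on a bare NumPy array; wrapping the row in a DataFrame with matching column names avoids that (a sketch):
test_passenger = pd.DataFrame([[866, 40, 0, 0, 0, 0, 0, 1]],
                              columns=X_train.columns)
print(LogReg.predict(test_passenger))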