Classification with Logistic Regression

CIS 241, Dr. Ladd

What is Logistic Regression?

AKA Logit Regression, Maximum-Entropy Classification

Even though it’s got “Regression” in its name, Logistic Regression is our first classification method.

How do we predict a category?

Regressors predict a numerical value.
Classifiers predict a category class.

Logistic Regression is a classifier.

Logistic regression is a linear model.

Image via Towards Data Science, Data Camp

Instead of predicting a value, we predict the probability of a category.
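One way to see how a linear model can produce a probability is the logistic (sigmoid) function, which squeezes any number into the range 0 to 1. This is a minimal sketch: the intercept and coefficient are made-up values chosen for illustration, not fitted to the penguins data.

```python
import numpy as np

def sigmoid(z):
    """Map any real number to a value between 0 and 1."""
    return 1 / (1 + np.exp(-z))

# Hypothetical intercept and coefficient, for illustration only
intercept = -4.0
coefficient = 0.5

x = np.array([0, 4, 8, 12, 16])          # values of a single predictor
log_odds = intercept + coefficient * x   # the linear part of the model
probabilities = sigmoid(log_odds)        # probability of the category

print(probabilities.round(3))
```

The linear part can be any number, but the output always stays between 0 and 1, so it can be read as a probability.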

Traditional logistic regression is a binary classifier.

Calculating Logistic Regression

Let’s create a model to classify penguins by species.

First, load the penguins dataset (the same data that ships with Seaborn) into a pandas dataframe.

import pandas as pd

penguins = pd.read_csv("https://jrladd.com/CIS241/data/penguins.csv")

Now create a scatter plot showing two numeric variables from this dataset, using the species variable as different colors for the dots.
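One possible version of this plot, as a sketch. The small `demo` dataframe is made-up stand-in data so the example runs on its own, and its column names are assumed to match the CSV. In class, pass the `penguins` dataframe instead.

```python
import pandas as pd
import seaborn as sns

# Made-up stand-in rows (not real measurements); column names are assumed
demo = pd.DataFrame({
    "species": ["Adelie", "Adelie", "Chinstrap", "Chinstrap", "Gentoo", "Gentoo"],
    "bill_length_mm": [39.1, 40.3, 46.5, 50.0, 46.1, 50.0],
    "bill_depth_mm": [18.7, 18.0, 17.9, 19.5, 13.2, 16.3],
})

# One dot per penguin, colored by species
ax = sns.scatterplot(data=demo, x="bill_length_mm", y="bill_depth_mm", hue="species")
ax.set_title("Bill length vs. bill depth by species")
```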

Make this about just two species.

We will learn to train a multiclass logistic regression later. For now, we should filter our data so we have just two categories to predict. Let’s create a gentoo_chinstrap dataframe that contains only the Gentoo and Chinstrap species.
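A minimal sketch of the filter using `isin`. The `demo` rows are made up and the `species` column name is assumed. With the real data, filter `penguins` the same way.

```python
import pandas as pd

# Made-up stand-in rows (not real measurements); column names are assumed
demo = pd.DataFrame({
    "species": ["Adelie", "Chinstrap", "Gentoo", "Gentoo", "Adelie", "Chinstrap"],
    "body_mass_g": [3750, 3500, 5000, 5700, 3800, 3650],
})

# Keep only the rows whose species is Gentoo or Chinstrap
gentoo_chinstrap = demo[demo["species"].isin(["Gentoo", "Chinstrap"])]

print(gentoo_chinstrap["species"].value_counts())
```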

Now let’s select some predictors.

Make a pairplot showing the relationship between all the numerical variables in this dataset. Also visualize the correlation matrix for the same variables.

Do we have any multicollinearity here? What should we do about it?

Split the data into training and test sets.

This works just like it did for linear regression. We don’t have any categorical predictors this time, but that would be the same too.

Run the train_test_split function now. What should you use as a test size?
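A sketch of the split on stand-in data. The `toy` dataframe is randomly generated (not real penguin measurements), the predictor names are assumed, and `test_size=0.25` is only one reasonable choice.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up stand-in data so the sketch runs on its own
rng = np.random.default_rng(0)
n = 100
toy = pd.DataFrame({
    "species": ["Gentoo"] * n + ["Chinstrap"] * n,
    "bill_length_mm": np.concatenate([rng.normal(47.5, 3, n), rng.normal(48.8, 3, n)]),
    "bill_depth_mm": np.concatenate([rng.normal(15.5, 1.5, n), rng.normal(18.0, 1.5, n)]),
})

X = toy[["bill_length_mm", "bill_depth_mm"]]  # predictors
y = toy["species"]                            # target category

# Hold out a quarter of the rows for testing
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)
print(X_train.shape, X_test.shape)
```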

Fit a logistic regression model

# We need a different class from sklearn
from sklearn.linear_model import LogisticRegression

Do you need to drop null values?
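LogisticRegression raises an error if the predictors contain missing values, so it is worth checking first. A minimal sketch with made-up rows and assumed column names:

```python
import numpy as np
import pandas as pd

# Made-up stand-in rows with some missing values
demo = pd.DataFrame({
    "species": ["Gentoo", "Gentoo", "Chinstrap", "Chinstrap"],
    "bill_length_mm": [46.1, np.nan, 46.5, 50.0],
    "bill_depth_mm": [13.2, 16.3, 17.9, np.nan],
})
predictors = ["bill_length_mm", "bill_depth_mm"]

# Count the missing values in each predictor
print(demo[predictors].isna().sum())

# Drop only the rows missing a predictor or the target
clean = demo.dropna(subset=predictors + ["species"])
print(len(demo), "rows before,", len(clean), "rows after")
```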

Setting model hyperparameters

penalty: By default, scikit-learn regularizes the model’s coefficients (an L2 penalty). This could lead to unpredictable results for non-normalized data! For now, always set this to None.
solver: This is the underlying algorithm scikit-learn will use to calculate the coefficients. The lbfgs solver is the default and is good for small datasets.
random_state: As in train_test_split, this should always be set to ensure repeatability.

For more on this, read Scikit-learn’s Defaults Are Wrong.

Interpreting and Validating Logistic Regression Results

We can print the intercept and the coefficients, just like in linear regression.

How do the odds change for each unit of the predictor?

Instead of predicting a value, we can predict the probability that our new data will fall into each category.

We need to store predictions, probabilities, and categories (i.e. classes).

There is no RMSE or \(R^{2}\) for logistic regression.

So how do we assess our model instead?

Validate classifiers with the confusion matrix.

Confusion matrix for our penguin model.

From the confusion matrix, we get scores for our model.

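accuracy: the proportion of all cases classified correctly
precision: the proportion of cases predicted to be in a category that truly belong to it
recall: the proportion of cases truly in a category that are correctly identified (also called sensitivity)
specificity: the recall score for the other category (the proportion of true negatives correctly identified)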
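The arithmetic behind these scores can be worked through by hand. The four counts below are hypothetical, not taken from the penguin model.

```python
# Hypothetical counts from a 2x2 confusion matrix
tp = 40  # true positives: actual positive, predicted positive
fp = 5   # false positives: actual negative, predicted positive
fn = 10  # false negatives: actual positive, predicted negative
tn = 45  # true negatives: actual negative, predicted negative

accuracy = (tp + tn) / (tp + fp + fn + tn)  # correct / all cases
precision = tp / (tp + fp)                  # correct positives / predicted positives
recall = tp / (tp + fn)                     # correct positives / actual positives
specificity = tn / (tn + fp)                # correct negatives / actual negatives

print(f"accuracy={accuracy:.3f} precision={precision:.3f} "
      f"recall={recall:.3f} specificity={specificity:.3f}")
```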

Use the classification report to get all of these metrics in scikit-learn.

There are individual functions for these, too.

Cross-validation lets you compare multiple runs of the model with different training data.

This works the same as it did for regression, but it returns accuracy instead of \(R^2\) score.

Plot the model’s recall against its false positive rate with the ROC Curve.

This only works for binary classifiers!

ROC: Receiver Operating Characteristic
AUC: Area Under the Curve (This measure is written right on the ROC Curve plot!)