CIS 241, Dr. Ladd
AKA Logit Regression, Maximum-Entropy Classification
Even though it’s got “Regression” in its name, Logistic Regression is our first classification method.
Logistic Regression is a classifier.
Image via Towards Data Science, Data Camp
First, load in the penguins dataset in Seaborn.
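A minimal sketch of the loading step (seaborn fetches this dataset from its online data repository the first time you call it):

```python
import seaborn as sns

# Load the built-in Palmer penguins dataset
penguins = sns.load_dataset("penguins")

print(penguins.head())
print(penguins["species"].unique())  # three species: Adelie, Chinstrap, Gentoo
```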
Now create a scatter plot showing two numeric variables from this dataset, using the species variable as different colors for the dots.
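One way to do this with seaborn's scatterplot; the two columns chosen here are just one possible pair of numeric variables:

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import seaborn as sns

penguins = sns.load_dataset("penguins")

# hue= colors the dots by species
ax = sns.scatterplot(
    data=penguins,
    x="bill_length_mm",       # any two numeric columns work here
    y="flipper_length_mm",
    hue="species",
)
plt.savefig("penguins_scatter.png")
```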
We will learn to train a multiclass logistic regression later. For now, we should filter our data so we have just two classes. Let's create a gentoo_chinstrap dataframe that has just those two species.
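The filtering step can be done with isin, which keeps only the rows whose species is in the given list:

```python
import seaborn as sns

penguins = sns.load_dataset("penguins")

# Keep only the Gentoo and Chinstrap rows
gentoo_chinstrap = penguins[penguins["species"].isin(["Gentoo", "Chinstrap"])]

print(gentoo_chinstrap["species"].unique())
```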
Make a pairplot showing the relationship between all the numerical variables in this dataset. Also visualize the correlation matrix for the same variables.
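A sketch of both plots, done on the full dataset for simplicity (the same calls work on the filtered gentoo_chinstrap dataframe):

```python
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import seaborn as sns

penguins = sns.load_dataset("penguins").dropna()

# Pairwise scatter plots of every numeric column
grid = sns.pairplot(penguins)
grid.savefig("penguins_pairplot.png")

# Correlation matrix of the numeric columns, shown as a heatmap
corr = penguins.corr(numeric_only=True)
plt.figure()
sns.heatmap(corr, annot=True, cmap="vlag", vmin=-1, vmax=1)
plt.savefig("penguins_corr.png")
```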
Do we have any multicollinearity here? What should we do about it?
This works just like it did for linear regression. We don’t have any categorical predictors this time, but that would be the same too.
Run the train_test_split function now. What should you use as a test size?
Do you need to drop null values?
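Yes, you need to drop nulls: scikit-learn estimators will error on NaNs. A sketch of the split, where the two predictor columns and the 0.25 test size are example choices, not the only reasonable ones:

```python
import seaborn as sns
from sklearn.model_selection import train_test_split

penguins = sns.load_dataset("penguins")
gentoo_chinstrap = penguins[penguins["species"].isin(["Gentoo", "Chinstrap"])]

# Drop rows with missing values before modeling
gentoo_chinstrap = gentoo_chinstrap.dropna()

X = gentoo_chinstrap[["bill_length_mm", "bill_depth_mm"]]  # example predictors
y = gentoo_chinstrap["species"]

# test_size=0.25 is a common choice; random_state makes the split repeatable
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42
)
print(X_train.shape, X_test.shape)
```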
penalty: None
solver: lbfgs is the default and is good for small datasets.
random_state: like in train_test_split, this should always be set to ensure repeatability.
For more on this, read "Scikit-learn's Defaults Are Wrong."
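A minimal fitting sketch using those settings. Note that penalty=None is the spelling in scikit-learn 1.2+ (older versions write penalty="none"), and the predictor columns are an example choice:

```python
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

penguins = sns.load_dataset("penguins").dropna()
gentoo_chinstrap = penguins[penguins["species"].isin(["Gentoo", "Chinstrap"])]

X = gentoo_chinstrap[["bill_length_mm", "bill_depth_mm"]]
y = gentoo_chinstrap["species"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

# penalty=None turns off the regularization scikit-learn applies by default;
# solver="lbfgs" is already the default; random_state for repeatability;
# max_iter is raised because unscaled features can converge slowly
model = LogisticRegression(
    penalty=None, solver="lbfgs", random_state=42, max_iter=5000
)
model.fit(X_train, y_train)
```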
How do the odds change for each unit of the predictor?
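Each coefficient is a change in *log-odds*, so exponentiating it gives the factor by which the odds are multiplied for a one-unit increase in that predictor. A sketch, assuming a fitted model like the one above:

```python
import numpy as np
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

penguins = sns.load_dataset("penguins").dropna()
gentoo_chinstrap = penguins[penguins["species"].isin(["Gentoo", "Chinstrap"])]
X = gentoo_chinstrap[["bill_length_mm", "bill_depth_mm"]]
y = gentoo_chinstrap["species"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = LogisticRegression(penalty=None, random_state=42, max_iter=5000)
model.fit(X_train, y_train)

# exp(coefficient) = odds ratio per one-unit increase in the predictor
odds = np.exp(model.coef_[0])
for name, ratio in zip(X.columns, odds):
    print(f"one-unit increase in {name} multiplies the odds by {ratio:.3f}")
```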
We need to store predictions, probabilities, and categories (i.e. classes).
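A sketch of storing all three, assuming a fitted model like the one above; model.classes_ gives the column order of predict_proba:

```python
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

penguins = sns.load_dataset("penguins").dropna()
gentoo_chinstrap = penguins[penguins["species"].isin(["Gentoo", "Chinstrap"])]
X = gentoo_chinstrap[["bill_length_mm", "bill_depth_mm"]]
y = gentoo_chinstrap["species"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = LogisticRegression(penalty=None, random_state=42, max_iter=5000)
model.fit(X_train, y_train)

predictions = model.predict(X_test)          # predicted species labels
probabilities = model.predict_proba(X_test)  # one column of probabilities per class
classes = model.classes_                     # the column order for predict_proba
```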
So how do we assess our model instead?
Confusion matrix for our penguin model.
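A sketch of computing the confusion matrix with scikit-learn; rows are true classes, columns are predicted classes, in the order of model.classes_:

```python
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import train_test_split

penguins = sns.load_dataset("penguins").dropna()
gentoo_chinstrap = penguins[penguins["species"].isin(["Gentoo", "Chinstrap"])]
X = gentoo_chinstrap[["bill_length_mm", "bill_depth_mm"]]
y = gentoo_chinstrap["species"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = LogisticRegression(penalty=None, random_state=42, max_iter=5000)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# rows = actual species, columns = predicted species
cm = confusion_matrix(y_test, predictions, labels=model.classes_)
print(model.classes_)
print(cm)
```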
There are individual functions for these, too.
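Assuming "these" refers to precision, recall, and F1 score, scikit-learn exposes each as its own function; with string labels you must name the pos_label (Gentoo here is an arbitrary choice):

```python
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, precision_score, recall_score
from sklearn.model_selection import train_test_split

penguins = sns.load_dataset("penguins").dropna()
gentoo_chinstrap = penguins[penguins["species"].isin(["Gentoo", "Chinstrap"])]
X = gentoo_chinstrap[["bill_length_mm", "bill_depth_mm"]]
y = gentoo_chinstrap["species"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = LogisticRegression(penalty=None, random_state=42, max_iter=5000)
model.fit(X_train, y_train)
predictions = model.predict(X_test)

# pos_label says which class counts as "positive" for these binary metrics
precision = precision_score(y_test, predictions, pos_label="Gentoo")
recall = recall_score(y_test, predictions, pos_label="Gentoo")
f1 = f1_score(y_test, predictions, pos_label="Gentoo")
print(precision, recall, f1)
```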
This works the same as it did for regression, but it returns accuracy instead of \(R^2\) score.
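A sketch of scoring on the held-out test set; for a classifier, .score is simply the fraction of predictions that match the true labels:

```python
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

penguins = sns.load_dataset("penguins").dropna()
gentoo_chinstrap = penguins[penguins["species"].isin(["Gentoo", "Chinstrap"])]
X = gentoo_chinstrap[["bill_length_mm", "bill_depth_mm"]]
y = gentoo_chinstrap["species"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)
model = LogisticRegression(penalty=None, random_state=42, max_iter=5000)
model.fit(X_train, y_train)

# Accuracy, not R^2: the fraction of test penguins classified correctly
accuracy = model.score(X_test, y_test)
print(accuracy)
```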
This only works for binary classifiers!