CIS 241, Dr. Ladd
AKA Logit Regression, Maximum-Entropy Classification
Logistic Regression is our first classification method.
First, load the penguins dataset from Seaborn.
Now create a scatter plot showing two numeric variables from this dataset, using the species variable to color the dots.
We will learn to train a multiclass logistic regression later. For now, we should filter our data so that we have just two classes. Let’s create a gentoo_chinstrap dataframe that keeps only the Gentoo and Chinstrap species.
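One way to do this filtering is with `isin`. The sketch below uses a few made-up stand-in rows rather than the real dataset (the column names match Seaborn's penguins data, but the values are invented for illustration), so swap in your own loaded dataframe.

```python
import pandas as pd

# Made-up stand-in rows; in class, use the penguins dataframe you loaded.
penguins = pd.DataFrame({
    "species": ["Adelie", "Gentoo", "Chinstrap", "Gentoo", "Adelie", "Chinstrap"],
    "bill_length_mm": [39.1, 46.1, 46.5, 50.0, 38.2, 50.0],
    "body_mass_g": [3750.0, 4500.0, 3500.0, 5700.0, 3800.0, 3900.0],
})

# Keep only the rows whose species is one of the two classes we want.
two_species = ["Gentoo", "Chinstrap"]
gentoo_chinstrap = penguins[penguins["species"].isin(two_species)]

# Check that only two classes remain.
print(gentoo_chinstrap["species"].value_counts())
```

Printing the value counts is a quick sanity check that the third species is really gone before you start modeling.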
Make a pairplot showing the relationship between all the numerical variables in this dataset. Also visualize the correlation matrix for the same variables.
Do we have any multicollinearity here? What should we do about it?
This works just like it did for linear regression. We don’t have any categorical predictors this time, but that would be the same too.
Run the train_test_split function now. What should you use as a test size?
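As a reminder of the call's shape, here is a minimal sketch on synthetic stand-in data. The `test_size=0.3` value is only one common choice, not the required answer, and `stratify=y` is an optional extra (not mentioned in these slides) that keeps the class proportions the same in both splits.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in predictors and target, for illustration only.
rng = np.random.default_rng(0)
X = pd.DataFrame({
    "bill_length_mm": rng.normal(47, 3, 100),
    "body_mass_g": rng.normal(4500, 600, 100),
})
y = pd.Series(["Gentoo"] * 60 + ["Chinstrap"] * 40, name="species")

# Hold out 30% of the rows for testing; random_state makes the split repeatable.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=42
)

print(len(X_train), len(X_test))
```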
# We need a different class from sklearn
from sklearn.linear_model import LogisticRegression
# See next slide for discussion of parameters
logit_model = LogisticRegression(penalty=None,
                                 solver='lbfgs',
                                 random_state=42)
logit_model.fit(X_train, y_train)
Do you need to drop null values?
The lbfgs solver is the default and is good for small datasets.
As with train_test_split, random_state should always be set to ensure repeatability.
For more on this, read Scikit-learn's Defaults Are Wrong.
print(f"Intercept: {logit_model.intercept_[0]:.3f}")
print("Coefficients:")
for name, coef in zip(X_train.columns, logit_model.coef_[0]):
    print(f"\t{name}: {coef:.4f}")
How do the odds change for each unit of the predictor?
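A logistic regression coefficient is a change in log-odds, so exponentiating it gives the factor by which the odds are multiplied for each one-unit increase in that predictor. The sketch below uses hypothetical coefficient values and predictor names (not results from the penguins model) to show the arithmetic.

```python
import numpy as np

# Hypothetical coefficients, for illustration only (not fitted results).
coefs = {"predictor_a": 0.5, "predictor_b": -1.2}

# exp(coefficient) is the odds ratio for a one-unit increase in the predictor.
odds_ratios = {name: float(np.exp(c)) for name, c in coefs.items()}

for name, ratio in odds_ratios.items():
    print(f"{name}: odds multiply by {ratio:.3f} per one-unit increase")
```

A ratio above 1 means the odds of the positive class go up as the predictor increases, and a ratio below 1 means they go down. With your own model, you could apply `np.exp` to `logit_model.coef_[0]` in the same way.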
# We can get prediction probabilities
probabilities = logit_model.predict_proba(X_test)
# We can get the predictions themselves
predictions = logit_model.predict(X_test)
# We can get the categories or classes we predicted
categories = logit_model.classes_
# Let's make the probabilities look nicer
probabilities = pd.DataFrame(probabilities, columns=categories)
probabilities
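To see how the probabilities and the predictions relate, here is a small sketch on synthetic data from `make_classification` (a stand-in for the penguins split, fitted with scikit-learn's default settings rather than the parameters above). Each row of probabilities sums to 1, and the predicted class is simply the one with the highest probability.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression

# Synthetic two-class data, for illustration only.
X, y = make_classification(n_samples=200, n_features=4, random_state=42)
model = LogisticRegression().fit(X, y)

probabilities = model.predict_proba(X)
predictions = model.predict(X)

# One column per class in model.classes_, and each row sums to 1.
row_sums = probabilities.sum(axis=1)

# Choosing the most probable class reproduces predict().
from_probs = model.classes_[np.argmax(probabilities, axis=1)]
matches = bool((from_probs == predictions).all())
print(matches)
```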
So how do we assess our model instead?
# Add cross_val_score to your train_test_split line
from sklearn.model_selection import train_test_split, cross_val_score
# These replace the r-squared score and RMSE
# You could put these all on one line
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import RocCurveDisplay
# You'll also need matplotlib this time
import matplotlib.pyplot as plt
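Since `cross_val_score` is imported above, here is a hedged sketch of how it might be called. It uses synthetic data as a stand-in, and `cv=5` is an assumed number of folds. For a classifier, the default score reported for each fold is accuracy.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score

# Synthetic two-class data, for illustration only.
X, y = make_classification(n_samples=200, n_features=4, random_state=42)
model = LogisticRegression(random_state=42)

# Fit and score the model on 5 different train/validation splits.
scores = cross_val_score(model, X, y, cv=5)

print(f"Mean accuracy: {scores.mean():.3f} (std {scores.std():.3f})")
```

Looking at the spread of the fold scores, not just the mean, gives a sense of how much the result depends on which rows land in the validation fold.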
# Calculate confusion matrix and transform data
conf_mat = confusion_matrix(y_test,predictions)
conf_mat = pd.DataFrame(conf_mat,index=categories,columns=categories)
conf_mat = conf_mat.melt(ignore_index=False).reset_index()
# Create heatmap
heatmap = alt.Chart(conf_mat).mark_rect().encode(
    x=alt.X("variable:N").title("Predicted Response"),
    y=alt.Y("index:N").title("True Response"),
    color=alt.Color("value:Q", legend=None).scale(scheme="blues")
).properties(
    width=400,
    height=400
)
# Add text labels for numbers
text = heatmap.mark_text(baseline="middle").encode(
    text=alt.Text("value:Q"),
    color=alt.value("black"),
    size=alt.value(50)
)
heatmap + text
There are individual functions for these, too.
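Assuming "these" refers to the metrics summarized in the classification report (accuracy, precision, recall, and F1), the sketch below calls the individual scikit-learn functions on a tiny hand-made set of labels. The labels are invented for illustration, and `pos_label` tells scikit-learn which class counts as "positive" when the labels are strings.

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Hand-made true and predicted labels, for illustration only.
y_true = ["Gentoo", "Gentoo", "Gentoo", "Chinstrap", "Chinstrap", "Chinstrap"]
y_pred = ["Gentoo", "Gentoo", "Chinstrap", "Chinstrap", "Chinstrap", "Gentoo"]

# Share of all predictions that are correct.
accuracy = accuracy_score(y_true, y_pred)
# Of the rows predicted Gentoo, the share that really are Gentoo.
precision = precision_score(y_true, y_pred, pos_label="Gentoo")
# Of the rows that really are Gentoo, the share predicted Gentoo.
recall = recall_score(y_true, y_pred, pos_label="Gentoo")
# Harmonic mean of precision and recall.
f1 = f1_score(y_true, y_pred, pos_label="Gentoo")

print(accuracy, precision, recall, f1)
```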
This only works for binary classifiers!
# Create our ROC Curve plot
RocCurveDisplay.from_predictions(y_test,
                                 probabilities[categories[0]],
                                 pos_label=categories[0])
# Draw a green diagonal reference line (what random guessing would look like)
plt.plot([0, 1], [0, 1], color='g')
ROC: Receiver Operating Characteristic
This measure is written right on the ROC Curve plot!
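Assuming the measure in question is the area under the ROC curve (AUC), you can also compute it directly with `roc_auc_score`. The sketch below uses four hand-made labels and scores, for illustration only. AUC is the share of positive/negative pairs in which the positive example gets the higher score.

```python
from sklearn.metrics import roc_auc_score

# Hand-made labels (1 = positive class) and predicted scores.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]

# Three of the four positive/negative pairs are ranked correctly.
auc = roc_auc_score(y_true, scores)
print(auc)
```

An AUC of 0.5 corresponds to the diagonal reference line on the plot, and 1.0 is a perfect ranking.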