CIS 241, Dr. Ladd
AKA Logit Regression, Maximum-Entropy Classification
Logistic Regression is our first classification method.
First, load in the penguins dataset from Seaborn.
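For example:
import seaborn as sns

# Load the penguins dataset that ships with Seaborn
penguins = sns.load_dataset("penguins")
penguins.head()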
Now create a scatter plot showing two numeric variables from this dataset, using the species variable as different colors for the dots.
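For example, using bill length and bill depth (any two numeric columns will work):
# Color the points by species
sns.scatterplot(data=penguins, x="bill_length_mm", y="bill_depth_mm", hue="species")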
We will learn to train a multiclass logistic regression later. For now, we should filter our data so we have just two classes. Let's create a gentoo_chinstrap dataframe that has just those two species.
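One way to do this, assuming the capitalized species labels used in the Seaborn dataset:
# Keep only the Gentoo and Chinstrap rows
gentoo_chinstrap = penguins[penguins["species"].isin(["Gentoo", "Chinstrap"])]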
Make a pairplot showing the relationship between all the numerical variables in this dataset. Also show the correlation matrix for the same variables.
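For example:
# Pairwise scatter plots of the numeric columns, colored by species
sns.pairplot(gentoo_chinstrap, hue="species")
# Correlation matrix for the numeric columns only
gentoo_chinstrap.corr(numeric_only=True)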
Do we have any multicollinearity here? What should we do about it?
This works just like it did for linear regression. We don't have any categorical predictors this time, but those would be handled the same way.
Run the train_test_split function now. What should you use as a test size?
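A sketch, using bill_length_mm and bill_depth_mm as example predictors and species as the response (the 0.3 test size is just one reasonable choice):
from sklearn.model_selection import train_test_split

X = gentoo_chinstrap[["bill_length_mm", "bill_depth_mm"]]
y = gentoo_chinstrap["species"]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)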
from sklearn.linear_model import LogisticRegression

# See next slide for discussion of parameters
logit_model = LogisticRegression(penalty=None,
                                 solver='lbfgs',
                                 random_state=42)
logit_model.fit(X_train, y_train)
Do you need to drop null values?
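If there are missing values in the predictors or the response, one simple option (dropping is an assumption here; imputation is another choice) is to remove those rows before splitting and fitting:
# Drop any row with a missing value
gentoo_chinstrap = gentoo_chinstrap.dropna()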
penalty=None: scikit-learn regularizes by default, so we turn that off here. For more on this, read "Scikit-learn's Defaults Are Wrong."
solver='lbfgs': the lbfgs solver is the default and is good for small datasets.
random_state=42: as with train_test_split, this should always be set to ensure repeatability.
print(f"Intercept: {logit_model.intercept_[0]:.3f}")
print("Coefficients:")
for name, coef in zip(X_train.columns, logit_model.coef_[0]):
print(f"\t{name}: {coef:.4f}")
How do the odds change for each unit of the predictor?
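Because the coefficients are in log-odds, exponentiating them gives odds ratios. A quick sketch:
import numpy as np

# Each value is the multiplicative change in the odds for a one-unit increase in that predictor
for name, coef in zip(X_train.columns, logit_model.coef_[0]):
    print(f"{name}: odds ratio = {np.exp(coef):.3f}")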
So how do we assess our model instead?
# Add cross_val_score to your train_test_split line
from sklearn.model_selection import train_test_split, cross_val_score
# These replace the r-squared score and RMSE
# You could put these all on one line
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import RocCurveDisplay
# You'll also need matplotlib this time
import matplotlib.pyplot as plt
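The confusion matrix code below assumes a few objects we haven't created yet. A minimal sketch, assuming the names predictions, probabilities, and categories match the slides that follow (cv=5 is just an example fold count):
import pandas as pd

# Class labels in the order scikit-learn stores them
categories = logit_model.classes_

# Hard class predictions and per-class probabilities for the test set
predictions = logit_model.predict(X_test)
probabilities = pd.DataFrame(logit_model.predict_proba(X_test), columns=categories)

# Cross-validated accuracy as an overall check
scores = cross_val_score(logit_model, X, y, cv=5)
print(f"Mean cross-validated accuracy: {scores.mean():.3f}")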
# Calculate confusion matrix and transform data
conf_mat = confusion_matrix(y_test, predictions)
conf_mat = pd.DataFrame(conf_mat, index=categories, columns=categories)
conf_mat = conf_mat.melt(ignore_index=False).reset_index()
import altair as alt

# Create heatmap
heatmap = alt.Chart(conf_mat).mark_rect().encode(
    x=alt.X("variable:N").title("Predicted Response"),
    y=alt.Y("index:N").title("True Response"),
    color=alt.Color("value:Q", legend=None).scale(scheme="blues")
).properties(
    width=400,
    height=400
)

# Add text labels for numbers
text = heatmap.mark_text(baseline="middle").encode(
    text=alt.Text("value:Q"),
    color=alt.value("black"),
    size=alt.value(50)
)

heatmap + text
There are individual functions for these, too.
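For example (a sketch; treating categories[0] as the positive label is just one choice for a binary problem):
from sklearn.metrics import accuracy_score, precision_score, recall_score

print(accuracy_score(y_test, predictions))
print(precision_score(y_test, predictions, pos_label=categories[0]))
print(recall_score(y_test, predictions, pos_label=categories[0]))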
This only works for binary classifiers!
# Create our ROC Curve plot
RocCurveDisplay.from_predictions(y_test,
                                 probabilities[categories[0]],
                                 pos_label=categories[0])

# Draw a green diagonal reference line (a classifier that guesses at random)
plt.plot([0, 1], [0, 1], color='g')
ROC: Receiver Operating Characteristic
This measure (the AUC) is written right on the ROC Curve plot!
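You can also compute it directly with roc_auc_score; a sketch that treats categories[0] as the positive class, matching the curve above:
from sklearn.metrics import roc_auc_score

# Compare the probabilities for the positive class against the true labels
auc = roc_auc_score(y_test == categories[0], probabilities[categories[0]])
print(f"AUC: {auc:.3f}")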