Logistic Regression

CIS 241, Dr. Ladd

What is Logistic Regression?

AKA Logit Regression, Maximum-Entropy Classification

How do we predict a category?

Logistic Regression is our first classification method.

Logistic regression is a linear model.

Image via Towards Data Science, Data Camp

Instead of predicting a value, we predict the probability of a category.

Traditional logistic regression is a binary classifier.

Calculating Logistic Regression

Let’s create a model to classify penguins by species.

First, load in the penguins dataset in Seaborn.

penguins = pd.read_csv("https://jrladd.com/CIS241/data/penguins.csv")

Now create a scatter plot showing two numeric variables from this dataset, using the species variable as different colors for the dots.

Make this about just two variables.

We will learn to train a multiclass logistic regression later. For now, we should filter our data so we have just two variables. Let’s create a gentoo_chinstrap dataframe that has just those two species.

Now let’s select some predictors.

Make a pairplot showing the relationship between all the numerical variables in this dataset. Also show the correlation matrix for the same variables.

Do we have any multicollinearity here? What should we do about it?

Split the data into training and test sets.

This works just like it did for linear regression. We don’t have any categorical predictors this time, but that would be the same too.

Run the train_test_split function now. What should you use as a test size?

Fit a logistic regression model

# We need a different class from sklearn
from sklearn.linear_model import LogisticRegression
# See next slide for discussion of parameters
logit_model = LogisticRegression(penalty=None, 
                                 solver='lbfgs', 
                                 random_state=42)
logit_model.fit(X_train, y_train)

Do you need to drop null values?

Setting model parameters

  • penalty: By default, scikit-learn regularizes your predictors. This could lead to unpredictable results for non-normalized data! For now, always set this to ‘none’.
  • solver: This is the underlying algorithm scikit-learn will use to calculate the coefficients. The lbfgs solver is the default and is good for small datasets.
  • random_state: As in train_test_split, this should always be set to ensure repeatability.

For more on this, read Scikit-learns Defaults Are Wrong.

Interpreting Logistic Regression Results

We can print the intercept and the coefficents, just like in linear regression.

print(f"Intercept: {logit_model.intercept_[0]:.3f}")
print("Coefficients:")
for name, coef in zip(X_train.columns, logit_model.coef_[0]):
    print(f"\t{name}: {coef:.4f}")

How do the odds change for each unit of the predictor?

Instead of predicting a value, we can predict the probability that our new data will fall into category.

# We can get prediction probabilities
probabilities = logit_model.predict_proba(X_test)
# We can get the predictions themselves
predictions = logit_model.predict(X_test)
# We can get the categories or classes we predicted
categories = logit_model.classes_
# Let's make the probabilities look nicer
probabilities = pd.DataFrame(probabilities, 
                    columns=categories)
probabilities

There is no RMSE or \(R^{2}\) for logistic regression.

So how do we assess our model instead?

We need another set of metric functions

# Add cross_val_score to your train_test_split line
from sklearn.model_selection import train_test_split, cross_val_score
# These replace the r-squared score and RMSE
# You could put these all on one line
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metric import RocCurveDisplay

# You'll also need matplotlib this time
import matplotlib.pyplot as plt

Validate classifiers with the confusion matrix.

Confusion matrix for our penguin model.
# Calculate confusion matrix and transform data
conf_mat = confusion_matrix(y_test,predictions)
conf_mat = pd.DataFrame(conf_mat,index=categories,columns=categories)
conf_mat = conf_mat.melt(ignore_index=False).reset_index()
# Create heatmap
heatmap = alt.Chart(conf_mat).mark_rect().encode(
    x=alt.X("variable:N").title("Predicted Response"),
    y=alt.Y("index:N").title("True Response"),
    color=alt.Color("value:Q", legend=None).scale(scheme="blues")
).properties(
    width=400,
    height=400
)
# Add text labels for numbers
text = heatmap.mark_text(baseline="middle").encode(
    text=alt.Text("value:Q"),
    color=alt.value("black"),
    size=alt.value(50)
)

heatmap + text

From the confusion matrix, we get scores for our model.

  • accuracy: the proportion of cases classified correctly
  • precision: the proportion of predicted values that are correct
  • recall: the proportion of all values that are correctly classified
  • specificity: the recall score for the other category

Calculating these scores is simple in scikit-learn.

# You must use print to make this readable
print(classification_report(y_test, predictions))

There are individual functions for these, too.

Cross-validation lets you compare multiple runs of the model with different training data.

scores = cross_val_score(logit_model, X, y, cv=5)
print(f"{scores.mean():.2} accuracy with standard deviation {scores.std():.2}")

Plot the model’s recall with the ROC Curve.

This only works for binary classifiers!

# Create our ROC Curve plot
RocCurveDisplay.from_predictions(y_test,
                                 probabilities[categories[0]],
                                 pos_label=categories[0])

# Draw a green line for 0
plt.plot([0, 1], [0, 1], color = 'g')

ROC: Receiver Operating Characteristics

Another helpful measure is the area under the ROC curve (AUC).

This measure is written right on the ROC Curve plot!