CIS 241, Dr. Ladd

AKA Logit Regression, Maximum-Entropy Classification

Logistic Regression is our first **classification**
method.

First, load in the `penguins` dataset in Seaborn.
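A minimal sketch of that loading step, using the copy of the dataset that ships with Seaborn:

```python
# Load the penguins dataset bundled with Seaborn
import seaborn as sns

penguins = sns.load_dataset("penguins")
print(penguins.head())
```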

Now create a scatter plot showing two numeric variables from this dataset, using the `species` variable as different colors for the dots.
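One possible version, with two numeric columns chosen as an example (any pair of numeric variables works):

```python
# Color the dots by species with the hue parameter
import seaborn as sns

penguins = sns.load_dataset("penguins")
ax = sns.scatterplot(data=penguins, x="flipper_length_mm",
                     y="bill_length_mm", hue="species")
```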

We will learn to train a multiclass logistic regression later. For now, we should filter our data so we have just two species. Let’s create a `gentoo_chinstrap` dataframe that has just those two species.
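A sketch of that filtering step, assuming the species labels are spelled as they are in Seaborn's copy of the data:

```python
# Keep only the two species we want to classify
import seaborn as sns

penguins = sns.load_dataset("penguins")
gentoo_chinstrap = penguins[penguins["species"].isin(["Gentoo", "Chinstrap"])]
print(gentoo_chinstrap["species"].unique())
```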

Make a pairplot showing the relationship between all the numerical variables in this dataset. Also show the correlation matrix for the same variables.
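A sketch of both steps (the correlation matrix is restricted to numeric columns):

```python
# Pairwise scatter plots plus a correlation matrix of numeric columns
import seaborn as sns

penguins = sns.load_dataset("penguins")
gentoo_chinstrap = penguins[penguins["species"].isin(["Gentoo", "Chinstrap"])]
sns.pairplot(gentoo_chinstrap, hue="species")
corr = gentoo_chinstrap.corr(numeric_only=True)
print(corr)
```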

Do we have any multicollinearity here? What should we do about it?

This works just like it did for linear regression. We don’t have any categorical predictors this time, but that would be the same too.

Run the `train_test_split` function now. What should you use as a test size?
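One way this split might look. The predictor columns here are assumptions (pick whichever numeric variables you settled on above), and 0.25 is just a common starting test size:

```python
# Split the two-species data into training and test sets
import seaborn as sns
from sklearn.model_selection import train_test_split

penguins = sns.load_dataset("penguins").dropna()
gentoo_chinstrap = penguins[penguins["species"].isin(["Gentoo", "Chinstrap"])]
X = gentoo_chinstrap[["flipper_length_mm", "bill_length_mm", "body_mass_g"]]
y = gentoo_chinstrap["species"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=42)
```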

```
# See next slide for discussion of parameters
logit_model = LogisticRegression(penalty=None,
                                 solver='lbfgs',
                                 random_state=42)
logit_model.fit(X_train, y_train)
```
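The confusion matrix and ROC code further down assumes `predictions`, `categories`, and `probabilities` already exist. A self-contained sketch of how they might be created (predictor column names are assumptions):

```python
# Fit the model, then get class predictions and per-class probabilities
import pandas as pd
import seaborn as sns
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

penguins = sns.load_dataset("penguins").dropna()
two = penguins[penguins["species"].isin(["Gentoo", "Chinstrap"])]
X = two[["flipper_length_mm", "bill_length_mm"]]
y = two["species"]
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=42)

logit_model = LogisticRegression(penalty=None, solver='lbfgs',
                                 random_state=42)
logit_model.fit(X_train, y_train)

predictions = logit_model.predict(X_test)
categories = logit_model.classes_
# One probability column per class, in the order of logit_model.classes_
probabilities = pd.DataFrame(logit_model.predict_proba(X_test),
                             columns=categories)
```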

Do you need to drop null values?

**penalty**: By default, scikit-learn regularizes your predictors. This could lead to unpredictable results for non-normalized data! For now, always set this to `None`.

**solver**: This is the underlying algorithm scikit-learn will use to calculate the coefficients. The `lbfgs` solver is the default and is good for small datasets.

**random_state**: As in `train_test_split`, this should always be set to ensure repeatability.

For more on this, read *Scikit-learn’s Defaults Are Wrong*.

```
print(f"Intercept: {logit_model.intercept_[0]:.3f}")
print("Coefficients:")
for name, coef in zip(X_train.columns, logit_model.coef_[0]):
    print(f"\t{name}: {coef:.4f}")
```

How do *the odds* change for each unit of the predictor?
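Because logistic regression models the *log*-odds, exponentiating a coefficient gives the multiplier on the odds for each one-unit increase in that predictor. A toy sketch with a made-up coefficient:

```python
# A hypothetical coefficient of 0.5 on some predictor
import numpy as np

coef = 0.5
odds_ratio = np.exp(coef)
# Each one-unit increase multiplies the odds by e^coef
print(f"Odds multiply by {odds_ratio:.3f} per unit")
```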

So how do we assess our model instead?

```
# Add cross_val_score to your train_test_split line
from sklearn.model_selection import train_test_split, cross_val_score
# These replace the r-squared score and RMSE
# You could put these all on one line
from sklearn.metrics import confusion_matrix, classification_report
from sklearn.metrics import RocCurveDisplay
# You'll also need matplotlib this time
import matplotlib.pyplot as plt
```

```
# Calculate confusion matrix and transform data
# (assumes pandas as pd and altair as alt are imported)
import pandas as pd
import altair as alt

conf_mat = confusion_matrix(y_test, predictions)
conf_mat = pd.DataFrame(conf_mat, index=categories, columns=categories)
conf_mat = conf_mat.melt(ignore_index=False).reset_index()
# Create heatmap
heatmap = alt.Chart(conf_mat).mark_rect().encode(
    x=alt.X("variable:N").title("Predicted Response"),
    y=alt.Y("index:N").title("True Response"),
    color=alt.Color("value:Q", legend=None).scale(scheme="blues")
).properties(
    width=400,
    height=400
)
# Add text labels for numbers
text = heatmap.mark_text(baseline="middle").encode(
    text=alt.Text("value:Q"),
    color=alt.value("black"),
    size=alt.value(50)
)
heatmap + text
```

**accuracy**: the proportion of cases classified correctly

**precision**: the proportion of predicted values that are correct

**recall**: the proportion of all values that are correctly classified

**specificity**: the recall score for the other category

There are individual functions for these, too.
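A sketch of those individual functions on made-up labels (`pos_label` tells scikit-learn which category counts as "positive"):

```python
# Accuracy, precision, and recall computed one at a time
from sklearn.metrics import accuracy_score, precision_score, recall_score

y_true = ["Gentoo", "Gentoo", "Chinstrap", "Chinstrap", "Gentoo"]
y_pred = ["Gentoo", "Chinstrap", "Chinstrap", "Chinstrap", "Gentoo"]

acc = accuracy_score(y_true, y_pred)                       # 4 of 5 correct
prec = precision_score(y_true, y_pred, pos_label="Gentoo") # 2 of 2 predicted Gentoos correct
rec = recall_score(y_true, y_pred, pos_label="Gentoo")     # 2 of 3 true Gentoos found
print(acc, prec, rec)
```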

This only works for *binary* classifiers!

```
# Create our ROC Curve plot
RocCurveDisplay.from_predictions(y_test,
                                 probabilities[categories[0]],
                                 pos_label=categories[0])
# Draw a green diagonal line for chance performance
plt.plot([0, 1], [0, 1], color='g')
```

ROC: Receiver Operating Characteristic

This measure, the AUC (Area Under the Curve), is written right on the ROC Curve plot!
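The AUC can also be computed directly, without drawing the plot. A toy sketch with made-up positive-class probabilities:

```python
# Area under the ROC curve from true labels and predicted scores
from sklearn.metrics import roc_auc_score

y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc_score(y_true, scores)
print(auc)
```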