CIS 241, Dr. Ladd

🪁🪁🪁

Supervised learning means predicting a known target from a set of predictor variables.

e.g. Logistic Regression, Naive Bayes, KNN, Random Forest, etc.

Unsupervised learning, by contrast, constructs a model of the data without learning from existing labels.

- Dimension reduction: get a more manageable set of variables.
- Clustering: identify meaningful categories in the data.
- Exploration: analyze variables and discover relationships.

- Principal component analysis (PCA): dimension reduction
- Correspondence analysis: dimension reduction
- **K-Means Clustering: clustering and exploration**
- Hierarchical Clustering: clustering and exploration

It can’t find clusters that aren’t there!

Just like in KNN, you have to choose K *based on the data*!
(More on this later.)

For a dataset of all quantitative variables, use `StandardScaler()` as usual.
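A minimal sketch of that scaling step (the column names here are made up for illustration, not from the assignment data):

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Toy all-quantitative dataset (illustrative values)
df = pd.DataFrame({
    "bill_length_mm": [39.1, 46.5, 50.0, 38.9],
    "body_mass_g": [3750, 4500, 5700, 3625],
})

# Each column comes out with mean 0 and standard deviation 1
X_std = StandardScaler().fit_transform(df)
```

The scaled array `X_std` is what you would hand to the clustering model.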

It is best to stick with numerical variables for clustering, but if you must use categorical variables, remember to use one-hot encoding before scaling. A `MinMaxScaler()` may also be useful in this case.
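A sketch of that order of operations, encode first and then scale (column names are assumptions for illustration):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy data with one categorical column (illustrative values)
df = pd.DataFrame({
    "flipper_length_mm": [181, 195, 210, 190],
    "island": ["Torgersen", "Biscoe", "Dream", "Biscoe"],
})

# One-hot encode the categorical column first...
encoded = pd.get_dummies(df, columns=["island"])

# ...then scale every column to the 0-1 range
X_scaled = MinMaxScaler().fit_transform(encoded)
```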

Because this is an unsupervised method, there’s no need to reserve a test set. There would be nothing to test on!

This isn’t the whole workflow; it’s just the model code.

`n_clusters`: the number of clusters the model will produce. This is “K”!

`n_init`: the number of times the algorithm will be run with different starting centroids. The best run is kept.

`max_iter`: the maximum number of iterations a single run will take to find the centroids.

`random_state`: keeps the model the same every time.
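Putting those hyperparameters together, a minimal fit might look like this (the toy data here stands in for the scaled `X_std` from the workflow, and K=3 is just a guess):

```python
import numpy as np
from sklearn.cluster import KMeans

# Toy standardized data standing in for X_std from the scaling step
rng = np.random.default_rng(0)
X_std = rng.standard_normal((60, 2))

# n_clusters is "K"; n_init restarts; max_iter caps each run; random_state fixes results
kmeans = KMeans(n_clusters=3, n_init=10, max_iter=300, random_state=0)
kmeans.fit(X_std)

# kmeans.labels_ holds each row's cluster; kmeans.cluster_centers_ the centroids
```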

First, we can look at the relative size of the clusters. Are they relatively balanced? Unbalanced clusters may mean we need to try again.
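One quick way to check cluster sizes is to count the labels (a sketch; the toy fit below stands in for your own fitted `kmeans` model):

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans

# Toy fit standing in for the model from the workflow
rng = np.random.default_rng(1)
X_std = rng.standard_normal((60, 2))
kmeans = KMeans(n_clusters=3, n_init=10, random_state=0).fit(X_std)

# Count how many points landed in each cluster
sizes = pd.Series(kmeans.labels_).value_counts().sort_index()
print(sizes)
```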

Next, we can look at the cluster means for each cluster. This tells us where the centroid is and gives us a sense of how the different variables interact.

```
# Using the .cluster_centers_ attribute of our model
# Get the centers into a dataframe:
centers = pd.DataFrame(kmeans.cluster_centers_, columns=predictors)
# Tidy our dataset with .melt() (the opposite of pivot):
centers = centers.melt(ignore_index=False).reset_index()
# Create bar plots to compare the centers (one panel per cluster)
sns.catplot(data=centers, x="value", y="variable", col="index", kind="bar")
```

Try it yourself with the `penguins` dataset.

- Select features and prepare the data. Consider standardization as well as null values.
- Run K-Means clustering. Be thoughtful about the hyperparameters, especially K (`n_clusters`)!
- Assess your model using the cluster sizes and a bar plot of the cluster means.

Good luck! 🐧🐧🐧

You can review the documentation and use every exploratory tool in the book to get a sense of how many clusters there might be in the data.

Maybe your company needs to split customers into exactly 4 categories, for instance.

- Run K-Means multiple times, with a different value for K each time.
- Look at how close the values are to their centroid. (This is called inertia.)
- Create a graph to see at what value for K this inertia measure begins to settle.

```
# Steps 1 & 2: Run K-means with different K and get inertia
inertia = []
for n_clusters in range(2, 14):
    kmeans = KMeans(n_clusters=n_clusters, n_init='auto', random_state=0).fit(X_std)
    inertia.append(kmeans.inertia_ / n_clusters)
# Step 3: Put into a dataframe and create a line plot
inertia = pd.DataFrame({'n_clusters': range(2, 14), 'inertia': inertia})
# Make your line plot here!
```

See how the “elbow” bends around 3 or 4 clusters? 💪