CIS 241, Dr. Ladd
Supervised learning predicts a known target from a set of predictor variables.
e.g. Logistic Regression, Naive Bayes, KNN, Random Forest, etc.
Unsupervised learning constructs a model of the data without learning from existing labels.
It can’t find clusters that aren’t there!
Just like in KNN, you have to choose K based on the data! (More on this later.)
You can review the documentation and use every exploratory tool in the book to get a sense of how many clusters there might be in the data.
Maybe your company needs to split customers into exactly 4 categories, for instance.
# Steps 1 & 2: Run K-means with different K and get inertia
from sklearn.cluster import KMeans
import pandas as pd

inertia = []
for n_clusters in range(2, 14):
    kmeans = KMeans(n_clusters=n_clusters, n_init='auto', random_state=0).fit(X_std)
    inertia.append(kmeans.inertia_ / n_clusters)

# Step 3: Put into a dataframe and create a line plot
inertia = pd.DataFrame({'n_clusters': range(2, 14), 'inertia': inertia})
# Make your line plot here!
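One way to sketch that line plot with matplotlib (the inertia values here are illustrative stand-ins; yours will come from the loop above):

```python
import pandas as pd
import matplotlib.pyplot as plt

# Illustrative inertia values -- replace with the ones from your loop
inertia = pd.DataFrame({'n_clusters': range(2, 8),
                        'inertia': [420.0, 250.0, 180.0, 160.0, 150.0, 145.0]})

# Line plot with a marker at each K so the "elbow" is easy to spot
fig, ax = plt.subplots()
ax.plot(inertia['n_clusters'], inertia['inertia'], marker='o')
ax.set_xlabel('Number of clusters (K)')
ax.set_ylabel('Inertia / K')
plt.show()
```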
See how the “elbow” bends around 3 or 4 clusters? 💪
For a dataset of all quantitative variables, use StandardScaler() as usual.
It is best to stick with numerical variables for clustering, but if you must use categorical variables, remember to use one-hot encoding before scaling. A MinMaxScaler() may also be useful in this case.
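A minimal sketch of that encode-then-scale pipeline, using a small hypothetical dataframe (the column names are made up for illustration):

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Hypothetical data with one categorical column
df = pd.DataFrame({
    'bill_length': [39.1, 46.5, 50.0, 38.6],
    'island': ['Torgersen', 'Dream', 'Biscoe', 'Torgersen'],
})

# One-hot encode the categorical variable first...
df_encoded = pd.get_dummies(df, columns=['island'])

# ...then scale every column into the [0, 1] range
X_scaled = MinMaxScaler().fit_transform(df_encoded)
```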
Because this is an unsupervised method, there’s no need to reserve a test set. There would be nothing to test on!
This isn’t the whole workflow, it’s just the model code.
from sklearn.cluster import KMeans
kmeans = KMeans(n_clusters=3, n_init='auto', max_iter=300, random_state=0)
kmeans.fit(X_std)
Before running this, you should try the elbow method to determine a good K value!
n_clusters: the number of clusters the model will produce. This is “K”!
n_init: the total number of times the model will be run.
max_iter: the number of iterations the model will take to find centroids.
random_state: keeps the model the same every time.
First, we can look at the relative size of the clusters. Are they relatively balanced? Unbalanced clusters may mean we need to try again.
Next, we can look at the cluster means for each cluster. This tells us where the centroid is and gives us a sense of how the different variables interact.
# Using the .cluster_centers_ attribute of our model
# Get the centers into a dataframe
# (predictors is your list of predictor column names):
centers = pd.DataFrame(kmeans.cluster_centers_, columns=predictors)
# Tidy our dataset with .melt() (the opposite of pivot):
centers = centers.melt(ignore_index=False).reset_index()
# Create bar plots to compare the centers
Now it's your turn: try K-means on the penguins dataset (experiment with different values of n_clusters)! Good luck! 🐧🐧🐧