CIS 241, Dr. Ladd
KNN can predict either numerical values or categories!
\[\sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}\]
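To make the formula concrete, here's a small sketch (with made-up points, not from the slides) of the same calculation in Python:

import numpy as np

# Made-up example points (x1, y1) and (x2, y2)
point_a = np.array([1.0, 2.0])
point_b = np.array([4.0, 6.0])

# Euclidean distance: square root of the summed squared differences
distance = np.sqrt(np.sum((point_b - point_a) ** 2))
print(distance)  # 5.0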
How is this different from reference coding?
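One way to see the difference (a small sketch with a made-up column): pandas' get_dummies produces one-hot encoding by default, while drop_first=True switches to reference coding.

import pandas as pd

# A made-up categorical column for illustration
df = pd.DataFrame({"sex": ["male", "female", "female", "male"]})

# One-hot encoding: one 0/1 column per category
print(pd.get_dummies(df))

# Reference coding: the first category is dropped and becomes the baseline
print(pd.get_dummies(df, drop_first=True))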
Also called “normalization”, this keeps variables on the same scale.
Not “how much” but “how different from the average.”
\[z=\frac{x-\bar{x}}{s}\]
The z-score is the original value minus the mean, divided by the standard deviation.
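As a quick sketch (with made-up numbers), you can compute z-scores by hand and check them against StandardScaler, which applies the same transformation:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up values for illustration
x = np.array([2.0, 4.0, 6.0, 8.0])

# z-score by hand: subtract the mean, then divide by the standard deviation
z_manual = (x - x.mean()) / x.std()

# StandardScaler performs the same calculation (both use the population std here)
z_scaler = StandardScaler().fit_transform(x.reshape(-1, 1)).ravel()

print(z_manual)
print(z_scaler)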
# For standardization
from sklearn.preprocessing import StandardScaler
# For KNN
from sklearn.neighbors import KNeighborsRegressor, KNeighborsClassifier
You will also need plenty of classes and functions that we’ve used previously!
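For reference, a sketch of that earlier setup might look like this (assuming the penguins data comes from seaborn's built-in dataset, as its column names suggest):

import pandas as pd
import seaborn as sns
from sklearn.model_selection import train_test_split

# Load the penguins dataset (an assumption: the slides likely use seaborn's copy)
penguins = sns.load_dataset("penguins")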
# Using the penguins dataset
predictors = ["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "sex"]
target = "body_mass_g" # A numerical target for now
# Remove null values and use one-hot encoding
penguins = penguins.dropna()
X = pd.get_dummies(penguins[predictors])
y = penguins[target]
# Split data BEFORE standardizing
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.3,
    random_state=0)
# Standardizing using the training data
scaler = StandardScaler()
scaler.fit(X_train)
X_train_std = scaler.transform(X_train)
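The code above stops after standardizing the training data. A minimal sketch of the remaining regression steps, using the variables above (the names knn_reg and reg_predictions, and n_neighbors=5, are just placeholders):

# Standardize the test data with the scaler fit on the training data
X_test_std = scaler.transform(X_test)

# Fit a KNN regression model (n_neighbors=5 is just a starting point)
knn_reg = KNeighborsRegressor(n_neighbors=5)
knn_reg.fit(X_train_std, y_train)

# Predict body mass (in grams) for the test penguins
reg_predictions = knn_reg.predict(X_test_std)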
# First, split and standardize the data again, this time with a new target.
# Decide on your predictors and targets
predictors = ["bill_length_mm", "bill_depth_mm", "flipper_length_mm", "sex"]
target = "species" # A categorical target now
penguins = penguins.dropna()
X = pd.get_dummies(penguins[predictors])
y = penguins[target]
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.3,
    random_state=0)
# Standardizing using the training data
scaler = StandardScaler()
scaler.fit(X_train)
X_train_std = scaler.transform(X_train)
# Fit the classification model
# Decide on a good value for K (n_neighbors)
knn = KNeighborsClassifier(n_neighbors=5)
knn.fit(X_train_std, y_train)
# Standardize test data
X_test_std = scaler.transform(X_test)
# Get both probabilities and predictions!
probabilities = knn.predict_proba(X_test_std)
predictions = knn.predict(X_test_std)
Evaluating KNN regression works like it would for linear regression: you can use RMSE to understand how your model performed.
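Since the split variables above were reused for classification, here is a self-contained sketch with placeholder numbers showing how RMSE could be computed:

import numpy as np
from sklearn.metrics import mean_squared_error

# Placeholder values standing in for actual and predicted body masses (grams)
y_true = np.array([3750.0, 3800.0, 4200.0])
y_pred = np.array([3700.0, 3900.0, 4100.0])

# RMSE: the square root of the mean squared error
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
print(rmse)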
For KNN classification, all the usual measures (accuracy, precision, recall, etc.) still apply.
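For example, using the classifier's predictions from above (a sketch; accuracy_score and classification_report are scikit-learn's standard metrics):

from sklearn.metrics import accuracy_score, classification_report

# Overall accuracy of the KNN classifier on the test set
print(accuracy_score(y_test, predictions))

# Per-species precision, recall, and F1
print(classification_report(y_test, predictions))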