CIS 241, Dr. Ladd
There are many different regressors, and we’ll learn about three: K-Nearest Neighbors (KNN), Decision Trees, and Random Forests.
These models can also be used for classification, but today we’re only focusing on regression (i.e. predicting a numerical target).
We’ll use sklearn classes to run these.

from sklearn.tree import DecisionTreeRegressor, plot_tree
# For Random Forest
from sklearn.ensemble import RandomForestRegressor
See the Sklearn Guide for everything else you’ll need.
Euclidean distance is the default in scikit-learn.
\[\sqrt{(x_2 - x_1)^2 + (y_2 - y_1)^2}\]
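For example, here’s a minimal sketch of a KNN regressor using scikit-learn’s default Euclidean distance; the tiny arrays are made up just for illustration.

import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Tiny made-up dataset: two features and a numerical target
X = np.array([[1, 2], [2, 1], [3, 4], [5, 5]])
y = np.array([10.0, 12.0, 20.0, 30.0])

# The default metric is Minkowski with p=2, i.e. Euclidean distance
knn = KNeighborsRegressor(n_neighbors=2)
knn.fit(X, y)

# Predicts the average target of the 2 nearest neighbors
print(knn.predict([[2, 2]]))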
How is this different from reference coding?
Also called “normalization”, standardizing puts variables on the same scale.
Not “how much” but “how different from the average.”
\[z=\frac{x-\bar{x}}{s}\]
The z-score is the original value minus the mean, divided by the standard deviation.
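As a quick sketch (with made-up square-footage values), the z-score can be computed by hand or with scikit-learn’s StandardScaler:

import numpy as np
from sklearn.preprocessing import StandardScaler

# Made-up column of house sizes in square feet
sqft = np.array([[1000.0], [1500.0], [2000.0], [3500.0]])

# By hand: subtract the mean, divide by the standard deviation
z_by_hand = (sqft - sqft.mean()) / sqft.std()

# The same idea with scikit-learn, applied column by column to a feature matrix
z_scaler = StandardScaler().fit_transform(sqft)

print(z_by_hand.ravel())
print(z_scaler.ravel())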
Decision trees and random forests are both tree models: a decision tree fits a single tree, while a random forest is an ensemble of many trees.
Let’s find out more from the sklearn documentation.
Use the plot_tree() function to see the tree of split data. See the Sklearn Guide and the documentation for more info.
Decision trees can also show which variables matter most for prediction. This is referred to as “variable (or feature) importance” and takes advantage of decision trees’ skill at finding patterns in the data.
But they are not so reliable one-at-a-time, and often cause overfitting. We need to think about the bias-variance tradeoff!
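Here’s a rough sketch (on made-up data) of fitting a single DecisionTreeRegressor, drawing it with plot_tree(), and reading off feature importances:

import numpy as np
import matplotlib.pyplot as plt
from sklearn.tree import DecisionTreeRegressor, plot_tree

# Made-up features (square feet, bedrooms) and made-up prices
X = np.array([[1000, 2], [1500, 3], [2000, 3], [2500, 4], [3000, 4], [3500, 5]])
y = np.array([200000, 260000, 310000, 380000, 450000, 520000])

tree = DecisionTreeRegressor(max_depth=2, random_state=0)
tree.fit(X, y)

# Draw the tree of split data
plot_tree(tree, feature_names=["sqft", "bedrooms"])
plt.show()

# Feature importance: each feature's (normalized) share of the error reduction
print(tree.feature_importances_)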
To get more accurate predictions, it’s best to use many trees together.
And what do you call a lot of trees? A forest!
You can see all the metaphors here: a forest, a musical ensemble, etc.
The decision trees are put together using “bagging”: bootstrap aggregating.
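A minimal sketch, again on made-up data: a RandomForestRegressor bags many trees (each fit on a bootstrap sample of the rows) and averages their predictions.

import numpy as np
from sklearn.ensemble import RandomForestRegressor

# Same made-up features (square feet, bedrooms) and prices as above
X = np.array([[1000, 2], [1500, 3], [2000, 3], [2500, 4], [3000, 4], [3500, 5]])
y = np.array([200000, 260000, 310000, 380000, 450000, 520000])

# n_estimators is the number of bagged trees in the forest
forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X, y)

# The prediction is the average of all 100 trees' predictions
print(forest.predict([[1800, 3]]))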
Some important hyperparameters for tree models:
min_samples_leaf: the minimum number of records in a terminal node (leaf)
max_leaf_nodes: the maximum number of leaf nodes in the entire tree
splitter and criterion: how splits are chosen and how their quality is measured
Setting these can help you create smaller trees and avoid spurious results!
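For instance, here’s a sketch of a more constrained tree; these particular values are arbitrary choices for illustration, not recommendations, and criterion="squared_error" is the name used in recent scikit-learn versions.

from sklearn.tree import DecisionTreeRegressor

# Constrain the tree so it can't grow huge and chase noise
small_tree = DecisionTreeRegressor(
    min_samples_leaf=5,          # every leaf must contain at least 5 records
    max_leaf_nodes=10,           # the tree can have at most 10 leaves
    criterion="squared_error",   # how split quality is measured (the current default)
    splitter="best",             # how splits are chosen (the default)
    random_state=0,
)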
Let’s try two models, one KNN and one Random Forest, with the Seattle Housing data.
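Here’s one possible sketch of that comparison. The file name and column names below are guesses, so swap in the real ones from the Seattle Housing dataset we’re using in class.

import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error

# Hypothetical file and column names -- replace with the real ones
housing = pd.read_csv("seattle_housing.csv")
X = housing[["sqft_living", "bedrooms", "bathrooms"]]
y = housing["price"]

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# KNN relies on distances, so standardize the features first
scaler = StandardScaler().fit(X_train)
knn = KNeighborsRegressor(n_neighbors=5).fit(scaler.transform(X_train), y_train)
knn_pred = knn.predict(scaler.transform(X_test))

# The random forest can work on the original scale
forest = RandomForestRegressor(n_estimators=100, random_state=0).fit(X_train, y_train)
forest_pred = forest.predict(X_test)

print("KNN MAE:", mean_absolute_error(y_test, knn_pred))
print("Random Forest MAE:", mean_absolute_error(y_test, forest_pred))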