Decision Trees 🌳 and the Random Forest 🌳🌲🌳🌲

CIS 241, Dr. Ladd

spacebar to go to the next slide, esc/menu to navigate

What are Decision Trees?

A tree model is a set of rules to split data into different categories.

Decision trees are trained using recursive partitioning.

Let’s find out more from the sklearn documentation.

Using the documentation and plot_tree(), let’s fit a tree model predicting species and recreate this plot.

Decision Trees create nodes (branching rules) based on optimal split values.

Tree models can be used as both classifiers and regressors.

But we will focus on their more common use as classifiers!

Decision Trees can help you determine which predictors (features) are most important.

This is referred to as “variable (or feature) importance” and takes advantage of decision trees’ skill at finding patterns in the data.

Trees can find hidden patterns and help you interpret interactions between variables.

But they are not so reliable one-at-a-time, and often cause overfitting. We need to think about the bias-variance tradeoff!

What is the Random Forest?

To get more accurate predictions, it’s best to use many trees together.

And what do you call a lot of trees? A forest!

The random forest is an ensemble method.

You can see all the metaphors here: a forest, a musical ensemble, etc.

The decision trees are put together using “bagging”: bootstrap aggregating.

For both decision trees and random forest, pay attention to your model’s hyperparameters.

  • min_samples_leaf: the minimum number of records in a terminal node (leaf)
  • max_leaf_nodes: the maximum number of nodes in the entire tree
  • splitter and criterion: the kind of trees you will create

Setting these can help you create smaller trees and avoid spurious results!

Random Forest in Python

???

By now, you’re equipped to find out how to do this on your own, so let’s try an example.

Here’s a hint:

from sklearn.ensemble import RandomForestClassifier

Create a Random Forest Classifier for the penguins dataset. Good luck! 🌲🌳🌲🌳

  1. Just like the Decision Tree, you will predict the species of the penguins.
  2. Use what you learned from the Decision Tree to determine your predictors and hyperparameters!
  3. Fit a random forest classification model.
  4. Do some out-of-sample validation of your model, using the usual metrics.