{
"cells": [
{
"cell_type": "markdown",
"id": "85e23c09-315b-4214-8086-86478fbd8262",
"metadata": {},
"source": [
"# Classifying Trees with Naive Bayes\n",
"\n",
"**Complete by: Tuesday 2 Apr. at 9am** \n",
"Data: (See below.)\n",
"\n",
"Let's look again at our dataset of Pittsburgh's trees from last week. You might have noticed something about the different predictors you used: they were almost all categorical!\n",
"\n",
"This week in class, we learned about a classifier designed specifically to work well with categorical predictors: Naive Bayes. In this week's workshop, we'll run a naive Bayes classifier on the tree data, and we'll try to determine if this is a better approach for this data than logistic regression.\n",
"\n",
"## Data Wrangling\n",
"\n",
"You'll need to repeat the data wrangling and cleaning steps that we did last time. **Make sure you use comments or markdown cells to explain what you're doing at each step.**\n",
"\n",
"There's one big difference: last time you may have used a numerical variable, `height_M`, as a predictor. As we discussed in class, you can only use a numerical variable with naive Bayes if the variable has been \"binned\" into categories. It turns out that this has already been done for us with the column `height_binned_M`. **Let's use this `height_binned_M` column instead, in all instances.** Could it be that this data was designed with a naive Bayes classifier in mind?\n",
"\n",
"## Building Your Model\n",
"\n",
"- This time you should use one-hot encoding instead of reference coding. **Write some code to create dummy variables for your categorical predictors.**\n",
"- **Split your data into training and test sets.** How large should your test set be?\n",
"- **Create your model and fit it to the data.** Again, remember to explain and comment as you go.\n",
"\n",
"## Prediction and Model Assessment\n",
"\n",
"- **Create a variable to store the names of the two main *categories* we are predicting.**\n",
"- **Predict the probabilities for the test data, put them into a dataframe, and describe what you see.** Based on browsing this list, does it look like the model did a good job?\n",
"- **Predict the categories for the test data, and store this in a variable for the next steps.**\n",
"- **Visualize the confusion matrix for your model, and interpret the visualization.** Are there lots of false positives? false negatives? What could be going on here?\n",
"- **Calculate scores for accuracy, precision, and recall. Explain how your model did based on these scores.**\n",
"- **Run cross-validation and interpret the results.**\n",
"- **Visualize the ROC Curve, and calculate the AUC score for this curve. Explain how your model did based on these metrics.**\n",
"\n",
"## Conclusion\n",
"\n",
"**Write a one-paragraph conclusion that reflects on the two different models you ran over this past two weeks.** Which one performed better, the KNN classifier or the naive Bayes classifier? Which one is better suited to the data, and why? What other data do you wish you had to help classify these trees? What are some possible next steps for this analysis?"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "1e04cc99-489b-4cd7-8d9a-4f68907ac708",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}