# Modeling Trees with K-Nearest Neighbors

**Complete by: Tuesday 26 Oct. at 9am**  
***n.b. BECAUSE THIS ASSIGNMENT IS DUE AFTER THE BREAK, YOU CANNOT RECEIVE AN EXTENSION.***  
Data: (See below.)

How often do you think about the trees you see every day? The "sylvan" part of Pennsylvania refers to trees, after all, and though we're surrounded by trees in urban, suburban, and rural environments, it's easy for them to become simply part of the scenery. But trees are an essential part of our environment and have long been of interest to ecologists.

Recently, a group of ecologists created a dataset of over 5 million trees from 63 US cities in order to better understand the biodiversity of urban landscapes. Here's the abstract from their study:

>Sustainable cities depend on urban forests. City trees -- a pillar of urban forests -- improve our health, clean the air, store CO2, and cool local temperatures. Comparatively less is known about urban forests as ecosystems, particularly their spatial composition, nativity statuses, biodiversity, and tree health. Here, we assembled and standardized a new dataset of N=5,660,237 trees from 63 of the largest US cities. The data comes from tree inventories conducted at the level of cities and/or neighborhoods. Each data sheet includes detailed information on tree location, species, nativity status (whether a tree species is naturally occurring or introduced), health, size, whether it is in a park or urban area, and more (comprising 28 standardized columns per datasheet). This dataset could be analyzed in combination with citizen-science datasets on bird, insect, or plant biodiversity; social and demographic data; or data on the physical environment. Urban forests offer a rare opportunity to intentionally design biodiverse, heterogenous, rich ecosystems.

In this week's lab, we'll use this data to analyze trees with K-nearest neighbors. Can we correctly predict a tree's height? Can we correctly predict whether a tree is naturally occurring or was introduced into its environment?

As usual, you'll follow the steps that we worked through in class.

## Data Wrangling

- Go to the page on Data Dryad for this [urban forests data set](https://doi.org/10.5061/dryad.2jm63xsrf). Instead of having a file posted to our website, I'd like you to find the file for Pittsburgh directly from its source. *Do not simply click on the "Download Dataset" button!* Instead, find the Pittsburgh dataset in the dropdown menu and copy the URL for that file (or download it and upload the file to JupyterHub). **Load your data into this notebook and take a look at it. Describe the dataset and note any unusual features.**
- This data has some issues with NA values. **Write some code to remove columns that have *only* null values in them.** This will require a little research of the `dropna()` method in the Pandas documentation.
- In the `native` column, null values are expressed as the string "no_info". **Write some code to replace any instance of "no_info" with an NA, then drop all the NAs from this column.** Again, you may need to look at the documentation or Google a solution to this.
- This new dataframe contains a mixture of numerical and categorical variables. Note that `ward` is a categorical variable that's expressed as a number, and we'll need to fix that before we proceed. **Using the `astype()` method, write some code to change the `ward` column to the `category` data type. Verify that it worked with the `info()` method.**
- **Finally, drop all null values from this new, curated dataframe.** You're ready to get started!
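
The wrangling steps above can be sketched on a toy dataframe. This is only an illustration under stated assumptions, not the solution for the real file: the `native`, `ward`, and `height_M` column names come from this lab, but `empty_col` and every value below are made up, and the real Pittsburgh data will have many more columns.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the Pittsburgh file. `empty_col` and all values are
# hypothetical; only the native, ward, and height_M names come from the lab.
trees = pd.DataFrame({
    "native": ["naturally_occurring", "introduced", "no_info", "introduced"],
    "ward": [1, 2, 2, 3],
    "height_M": [10.5, 7.2, 4.0, np.nan],
    "empty_col": [np.nan, np.nan, np.nan, np.nan],
})

# Drop columns in which every value is null.
trees = trees.dropna(axis="columns", how="all")

# Treat the "no_info" placeholder as missing, then drop those rows.
trees["native"] = trees["native"].replace("no_info", np.nan)
trees = trees.dropna(subset=["native"])

# Store ward as a categorical variable rather than a number.
trees["ward"] = trees["ward"].astype("category")

# Drop any rows that still contain a null value.
trees = trees.dropna()

# The Dtype column of this summary should list ward as "category".
trees.info()
```

Note the difference between `how="all"` (drop a column only when every value is missing) and the default `how="any"` (drop it when at least one value is missing).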

## Exploratory Data Analysis

Let's learn a little more about this data set before moving on to modeling. **Write some code and/or some markdown to answer the following questions:**

- What are the different categories in the `condition` column? What does condition seem to be telling you?
- What do you think the `overhead_utility` column is showing?
- How do the heights of trees differ across the two `native` categories? Can you visualize this difference and talk about what you see? (Hint: a box plot would probably be best here.)
- What will be the predictor variables in your model? Assemble these variables into a list to use on the next step.
- You don't need to check for multicollinearity this time! Why not? **Write a few sentences explaining this.**
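
One way to start on the questions above is with `value_counts()` and a grouped summary. The sketch below uses a small invented dataframe: the column names match this lab, but the category labels and heights are assumptions for illustration only.

```python
import pandas as pd

# Hypothetical values; only the column names come from the lab.
trees = pd.DataFrame({
    "condition": ["good", "fair", "good", "poor", "fair", "good"],
    "native": ["introduced", "introduced", "naturally_occurring",
               "naturally_occurring", "introduced", "naturally_occurring"],
    "height_M": [6.0, 7.5, 12.0, 9.5, 5.0, 14.0],
})

# How many trees fall into each condition category?
condition_counts = trees["condition"].value_counts()
print(condition_counts)

# Summary statistics of height within each native category.
height_by_native = trees.groupby("native")["height_M"].describe()
print(height_by_native)
```

If matplotlib is available in your notebook, `trees.boxplot(column="height_M", by="native")` draws the corresponding box plot from the same dataframe.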

## KNN Regression

One thing we might want to know is: how tall will trees grow in Pittsburgh under differing conditions? If we had information on a new tree, could we predict how tall that tree might get?

- Use `height_M` as your target variable. **Write down what this variable actually means and what its units are.** You can check the Data Dryad documentation that I linked to above.
- Next, choose some predictor variables, and give some thought to which ones you should include. Remember: categorical variables that generate hundreds of columns in your one-hot encoded data will slow down your model and will likely overload your kernel!
- Run a KNN regression using these variables and targets. Make sure you've scaled variables and done all appropriate pre-processing. **Explain your steps and interpret the results fully.**
- Use out-of-sample validation to check your model's residuals and calculate some appropriate measures. **How did your model perform?**
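
The regression workflow above (split, scale, one-hot encode, fit, validate) might look roughly like the following. This is a sketch on synthetic data, not the approach we used in class: the `diameter` predictor, the generated values, and the choice of `ColumnTransformer` for pre-processing are all assumptions, and `n_neighbors=5` is simply scikit-learn's default.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Synthetic stand-in data: `diameter` and all values are hypothetical.
rng = np.random.default_rng(0)
n = 200
trees = pd.DataFrame({
    "diameter": rng.uniform(5, 80, n),
    "condition": rng.choice(["good", "fair", "poor"], n),
})
trees["height_M"] = 2 + 0.2 * trees["diameter"] + rng.normal(0, 1, n)

X = trees[["diameter", "condition"]]
y = trees["height_M"]

# Hold out 25% of the rows for out-of-sample validation.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0
)

# Scale numeric predictors and one-hot encode categorical ones.
preprocess = ColumnTransformer([
    ("num", StandardScaler(), ["diameter"]),
    ("cat", OneHotEncoder(handle_unknown="ignore"), ["condition"]),
])
model = make_pipeline(preprocess, KNeighborsRegressor(n_neighbors=5))
model.fit(X_train, y_train)

# Evaluate on the held-out rows only.
predictions = model.predict(X_test)
residuals = y_test - predictions
rmse = np.sqrt(mean_squared_error(y_test, predictions))
r2 = r2_score(y_test, predictions)
print(f"RMSE: {rmse:.2f} m, R^2: {r2:.2f}")
```

Putting the scaler inside the pipeline means it is fit on the training rows only, so no information from the test set leaks into the pre-processing.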

## KNN Classification

Beyond predicting an individual value, we might want to know whether a tree is naturally occurring or was introduced into its environment. Given information about a new tree, could we correctly predict its nativity status?

- Use `native` as your target. As above, **write down what this variable means**.
- Choose some predictor variables. Will you use different ones this time, or the same ones you used for regression?
- Run a KNN classification using these variables and targets. Make sure you've scaled variables and done all appropriate pre-processing. **Explain your steps and interpret the results fully.**
- Use out-of-sample validation to check your model's confusion matrix, calculate appropriate measures, and make some visualizations. Remember, there are two categories in this target, making this a binary classifier. **How did your model perform?**
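
The classification steps follow the same pattern, with a confusion matrix in place of residuals. Again, this is only a sketch on synthetic data: the `diameter` predictor, the two label strings, and the generated values are assumptions, so check the real `native` column for its actual categories.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: labels, `diameter`, and values are hypothetical.
rng = np.random.default_rng(1)
n = 200
native = rng.choice(["introduced", "naturally_occurring"], n)
shift = np.where(native == "naturally_occurring", 4.0, 0.0)
trees = pd.DataFrame({
    "native": native,
    "height_M": 8 + shift + rng.normal(0, 1.5, n),
    "diameter": 30 + 5 * shift + rng.normal(0, 8, n),
})

X = trees[["height_M", "diameter"]]
y = trees["native"]

# Stratify so both classes appear in similar proportions in each split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=1, stratify=y
)

# Scale the predictors, then classify by majority vote of 5 neighbors.
model = make_pipeline(StandardScaler(), KNeighborsClassifier(n_neighbors=5))
model.fit(X_train, y_train)

predictions = model.predict(X_test)
labels = ["introduced", "naturally_occurring"]

# Rows are true labels, columns are predicted labels, in `labels` order.
matrix = confusion_matrix(y_test, predictions, labels=labels)
accuracy = accuracy_score(y_test, predictions)
print(matrix)
print(f"Accuracy: {accuracy:.2f}")
```

Passing `labels` explicitly fixes the row and column order of the matrix, which makes it easier to read off false positives and false negatives for a binary target.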

## Conclusion

Write a brief summary of what you learned about Pittsburgh's trees through this analysis. Did you find it easier to predict values or categories with this data? What other approaches might you recommend? Are there additional data on trees that you might want to add to this analysis?