{
"cells": [
{
"cell_type": "markdown",
"id": "cf39dc08-3d5d-4d70-83d7-73317b3fd9f9",
"metadata": {},
"source": [
"# Classifying Airbnb Rentals with Logistic Regression\n",
"\n",
"**Complete by: Tuesday 15 Oct. at 10:55am** \n",
"Data: \n",
"\n",
"First, a reminder of the data and where it comes from:\n",
"\n",
">Airbnb was founded in 2008 to allow people to rent apartments, houses, and spaces directly to one another. The company provides an alternative to traditional hotels and rentals. Recently there's been a lot of public discussion about the effects Airbnb rentals have on neighborhoods: as Airbnb gets more popular, more and more homes and apartments that were previously used as private residences are now becoming full-time rentals. The data activists and journalists at [Inside Airbnb](http://insideairbnb.com/) are using public information about rentals to better understand these phenomena.\n",
"\n",
"In the last workshop, you chose one independent variable (predictor) and explained its relationship to `price` (your dependent variable, or target) using bivariate linear regression. Now that you've gained a better understanding of the data, this week you'll **accurately predict whether Airbnbs in New Orleans are instantly bookable using logistic regression**.\n",
"\n",
"As the [Data Dictionary from Inside Airbnb](http://insideairbnb.com/data-assumptions) says, the `instant_bookable` variable shows \"whether the guest can automatically book the listing without the host requiring to accept their booking request.\" If a property is instantly bookable, that's usually an indicator that it's a commercial property rather than one individually owned. Being able to predict this category correctly aligns with Inside Airbnb's original mission of finding out how much commercial full-time rentals are edging out private residences.\n",
"\n",
"You may need to try a few different models with some different sets of variables before you get one you feel confident in. You will also need to use multiple kinds of visualization for exploration, analysis, and validation.\n",
"\n",
"This should be a polished and clearly-formatted report. Remember that in all of these steps **your interpretations are just as important as your code**. You should be taking time to interpret at each stage of your report, and make sure you are interpreting things *completely, accurately, and in terms of the data*.\n",
"\n",
"## Ethical Considerations\n",
"\n",
"How much should a temporary rental apartment cost? How many rentals should be available? If a city is a tourist destination, should more of its housing be devoted to temporary hotel-like rental space? Since the launch of Airbnb these questions have only gotten more pressing. Imagine you live in a popular neighborhood in a city that's a tourist destination. At first, you and your neighbors can make extra money by putting your homes and apartments up for rent on Airbnb when you're not using them. But over time, larger companies and individuals who *don't live in your neighborhood* start buying up homes solely to rent them on Airbnb full time. What might this do to your neighborhood?\n",
"\n",
"The price and density of Airbnbs has effects not only on individuals who live in those neighborhoods and cities, but also on city officials, the real estate industry, and tourists themselves who may not be able to afford regular hotels and accommodations. How do you balance all of those competing concerns when pricing Airbnbs?\n",
"\n",
"## Data Wrangling\n",
"\n",
"- Import the data and prepare it for analysis. You will likely need to follow the same steps you took in the previous workshop (but you don't need to replicate the same errorsâ€”just skip right to the solutions).\n",
"\n",
"## Exploratory Data Analysis\n",
"\n",
"Let's learn a little more about this data set before moving on to modeling. **Write some code and/or some markdown to answer the following questions:**\n",
"\n",
"- What are the different categories in the `room_type` column? What does room_type seem to be telling you?\n",
"- What do you think the `host_acceptance_rate` column is showing?\n",
"- How do the prices of Airbnb's differ across the two `instant_bookable` categories? Can you visualize this difference and talk about what you see? (Hint: a box plot would probably be best here, and you made need to limit some of your axes to make the plot readable.)\n",
"- What will be the predictor variables in your model? Assemble these variables into a list to use on the next step, and make sure you check for multicollinearity.\n",
"\n",
"## Building Your Model\n",
"\n",
"- Do you need to use reference coding this time? **If so, write some code to create dummy variables for your categorical predictors.**\n",
"- **Split your data into training and test sets.** How large should your test set be?\n",
"- **Create your model and fit it to the data.** You will need to set parameters for your model, but these will be a little different. Since we have a much larger dataset, let's use the `'sag'` solver, but keep penalty set to \"none\". You may get a warning that your model isn't \"converging,\" which means that the coefficients aren't resolving to specific values as the model iterates. To fix this, you can change the `max_iter` parameter: the default is 100 iterations, but you will probably need it much higher than that.\n",
"- **Print the intercept and the coefficients, and interpret some of the coefficients.** If you're using a lot of predictor variables (say, more than 6 or 7), you don't need to interpret *all* of the coefficients. Instead, make sure that your reader understands which predictors seem to be influencing the model the most and the least.\n",
"\n",
"## Prediction and Model Assessment\n",
"\n",
"- **Create a variable to store the names of the two main *categories* we are predicting.**\n",
"- **Predict the probabilities for the test data, put them into a dataframe, and describe what you see.** Based on browsing this list, does it look like the model did a good job?\n",
"- **Predict the categories for the test data, and store this in a variable for the next steps.**\n",
"- **Visualize the confusion matrix for your model, and interpret the visualization.** Are there lots of false positives? false negatives? What could be going on here?\n",
"- **Calculate scores for accuracy, precision, and recall using the classification report. Explain how your model did based on these scores.**\n",
"- **Run at minimum a 5-fold cross-validation and interpret the results. Does your accuracy score hold up when your results are split differently?**\n",
"- **Visualize the ROC Curve, and report the AUC score for this curve. Explain how your model did based on these metrics.**\n",
"- **Write a brief conclusion summarizing how your model did and what you learned about the data. What would you recommend we try as a next step, to get better results?**"
]
},
{
"cell_type": "code",
"execution_count": null,
"id": "b1abca06-d6f9-4936-ba9b-404a19c764c5",
"metadata": {},
"outputs": [],
"source": []
}
],
"metadata": {
"kernelspec": {
"display_name": "Python 3 (ipykernel)",
"language": "python",
"name": "python3"
},
"language_info": {
"codemirror_mode": {
"name": "ipython",
"version": 3
},
"file_extension": ".py",
"mimetype": "text/x-python",
"name": "python",
"nbconvert_exporter": "python",
"pygments_lexer": "ipython3",
"version": "3.9.6"
}
},
"nbformat": 4,
"nbformat_minor": 5
}