CIS 241, Dr. Ladd
When using linear regression for prediction, you typically have more than one \(x\) (predictor) variable—sometimes you have hundreds!
Come up with a rationale for why you think they would be related.
It’s not a good idea to just try to regress any set of variables together.
Correlation does not mean causation!!
Can also be written as: \(Y=b_{1}X+b_{0}\)
\(Y\) is your target (dependent) variable.
\(X\) is your predictor (independent) variable. (There will eventually be many predictors.)
Coefficients:
\(b_{1}\) (or \(m\)) describes the slope of the line (and its direction).
\(b_{0}\) (or \(b\)) describes the height of the line when \(X\) is 0. This is called the y-intercept or simply the intercept.
This is what it means to “fit” a linear model.
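As a quick numeric sketch (the coefficient values here are made up), plugging an \(x\) value into a fitted equation gives a prediction:

```python
# Hypothetical coefficients: slope (b1) and intercept (b0)
b1 = -0.05   # each unit of X lowers the prediction by 0.05
b0 = 40.0    # predicted Y when X is 0

def predict_y(x):
    """Apply the regression equation Y = b1*X + b0."""
    return b1 * x + b0

print(predict_y(0))    # 40.0, the intercept
print(predict_y(100))  # 35.0
```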
Bivariate regression:
\(Y=b_{0}+b_{1}x\)
Multivariate regression:
\(Y=b_{0}+b_{1}x_{1}+b_{2}x_{2}+b_{3}x_{3}+...\)
As many as you want, but make sure you develop a rationale (use your human brain)!
Do you have good reason to believe that a linear regression or predictive model would help? Is there a relationship between variables that’s worth learning about?
Let’s make a scatter plot of engine displacement and fuel efficiency, in the cars dataset.
What do you think?
We can see a general trend: as engine size goes up, fuel efficiency goes down. Now we’re ready to try modeling this relationship as part of a larger linear regression model.
There’s no magic solution! You can try different options, but use your logic and don’t just throw everything in there.
This will confuse the model and mess up your results! It could even result in false predictions.
You must test for multicollinearity before you select your predictors and run your model.
You can do a pairwise comparison of the variables you’re thinking about.
potential_predictors = ['Cylinders', 'Displacement', 'Horsepower',
                        'Weight_in_lbs', 'Acceleration']

alt.Chart(cars).mark_point().encode(
    alt.X(alt.repeat("column"), type='quantitative'),
    alt.Y(alt.repeat("row"), type='quantitative')
).properties(
    width=150,
    height=150
).repeat(
    row=potential_predictors,
    column=potential_predictors
)
Ask yourself: is there an important variable that the data doesn’t account for?
sklearn
Full scikit-learn documentation is available at scikit-learn.org.
A training set will determine the model’s coefficients, and a test set will let us see how well it works on new data that it hasn’t already seen.
That’s why we imported the LinearRegression class above.
# Select target and predictors based on validation
target = "Miles_per_Gallon"  # Our target variable
predictors = ["Displacement", "Horsepower", "Acceleration"]  # A list of predictors

# Create variables and split data
X = cars[predictors]
y = cars[target]
X_train, X_test, y_train, y_test = train_test_split(
    X,
    y,
    test_size=0.4,
    random_state=0)

# Fit the model
our_model = LinearRegression()
our_model.fit(X_train, y_train)

# View coefficients
print(f"Intercept: {our_model.intercept_:.3f}")
for c, p in zip(our_model.coef_, X.columns):
    print(f"Coefficient for {p}: {c:.4f}")
With a coefficient of -0.0384 for displacement, this linear regression provides evidence that as engine displacement increases, fuel efficiency decreases slightly!
For every additional unit of engine displacement, the expected fuel efficiency decreases by 0.0384.
What would the other coefficients mean?
The intercept indicates that if all predictor variables were 0, fuel efficiency would be 45.392 miles per gallon. Why doesn’t this number make any sense?
Be careful not to imply that there is a direct causal link, especially without more evidence or studies.
Reference coding converts categorical variables to a set of binary variables.
The first category should always be left out as the reference (drop_first=True). All the remaining slopes are relative to that reference!
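A small sketch of reference coding with pandas, using a hand-made Origin column like the one in the cars dataset:

```python
import pandas as pd

df = pd.DataFrame({'Origin': ['USA', 'Europe', 'Japan', 'USA']})

# drop_first=True leaves out the first category ('Europe', alphabetically)
# as the reference; the remaining columns are 0/1 indicator variables
dummies = pd.get_dummies(df['Origin'], drop_first=True)
print(dummies.columns.tolist())  # ['Japan', 'USA']
```

Each remaining column's coefficient would then describe the difference from the reference category (Europe here).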
Try to make an effective multivariate linear model to predict housing prices in Seattle.
Take a look at the dataset and logically choose some predictors. Check for multicollinearity before you run your model! When you’re done, try to predict housing price based on some new data points you create.
Load house_sales.tsv. You’ll need to open this with:
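Since the file is tab-separated, pandas needs an explicit separator. A sketch (the in-memory example rows are made up; for the exercise you'd pass the filename instead):

```python
import pandas as pd
from io import StringIO

# A TSV is a table with tab separators; sep="\t" tells pandas that.
# For the real file: house_sales = pd.read_csv("house_sales.tsv", sep="\t")
example = StringIO("price\tbedrooms\n500000\t3\n750000\t4\n")
house_sales = pd.read_csv(example, sep="\t")
print(house_sales.shape)  # (2, 2)
```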
The predict() method uses the fitted coefficients to calculate predicted values for new data.
# Let's focus on out-of-sample prediction for validation
predictions = our_model.predict(X_test)
predictions
We can predict with our training data (in-sample prediction) or with our test data (out-of-sample prediction).
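Both kinds of prediction are a single predict() call; the only difference is which data you pass in. A self-contained sketch on made-up data (so it doesn't depend on the cars variables above):

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

# Tiny synthetic dataset, just to show the two kinds of prediction
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(scale=0.1, size=100)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.4, random_state=0)
model = LinearRegression().fit(X_train, y_train)

in_sample = model.predict(X_train)     # data the model has already seen
out_of_sample = model.predict(X_test)  # held-out data

print(len(in_sample), len(out_of_sample))  # 60 40
```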
Residuals are the differences between the actual observed values and the ones the model predicted.
Think of these as the “errors” that the modeling method produced. If the residuals are symmetrically distributed with the median close to zero, the model may fit the data well.
# The raw value, out-of-sample
np.sqrt(mean_squared_error(y_test, predictions))
# Or neatly printed
print(f"Root mean squared error: {np.sqrt(mean_squared_error(y_test, predictions)):.2f}")
This is a good metric for comparing models.
This is also called the “coefficient of determination.”
\(R^{2}\) ranges from 0 to 1. If it were 1, the predictors would explain the target perfectly (in bivariate regression, the points would make a straight line). If it were 0, the predictors wouldn’t predict the y variable at all.
In this example, \(R^{2}=0.67\), so our predictors account for about 67% of the variation in fuel efficiency.
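sklearn computes this with r2_score (or, equivalently, a fitted model's score() method). A sketch with made-up actual and predicted values:

```python
import numpy as np
from sklearn.metrics import r2_score

# Made-up values; in the example above you'd use
# our_model.score(X_test, y_test) to get R² in one call
actual = np.array([30.0, 25.0, 20.0, 15.0])
predicted = np.array([28.0, 26.0, 19.0, 16.0])
print(round(r2_score(actual, predicted), 3))
```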
There’s no rule for what makes an \(R^{2}\) “good.” Consider the context and purpose of your analysis!
In an analysis of ecology or human behavior (very unpredictable), an \(R^{2}\) of 0.20 or 0.30 might be considered good. In an analysis predicting mechanical repairs or recovery from medical procedures, an \(R^{2}\) of 0.60 or 0.70 might be considered very poor.
RMSE will decrease. But think about how much it increases or decreases.
# First compute the residuals, then put the results into a dataframe
residuals = y_test.values - predictions
results = pd.DataFrame({'Predictions': predictions, 'Residuals': residuals})

alt.Chart(results, title="Histogram of Residuals").mark_bar().encode(
    x=alt.X('Residuals:Q', title="Residuals").bin(maxbins=20),
    y=alt.Y('count():Q', title="Value Counts")
)
You could also use a Q-Q plot for this!
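A quick way to build a Q-Q plot is scipy's probplot (outside the Altair stack used above; the residuals here are made up to keep the sketch self-contained):

```python
import numpy as np
from scipy import stats

# Made-up residuals stand in for the model's residuals
residuals = np.random.default_rng(0).normal(size=200)

# probplot pairs residual quantiles with normal-theory quantiles;
# r close to 1 means the points hug a straight line (roughly normal).
# Passing plot=plt (matplotlib) would draw the Q-Q plot directly.
(osm, osr), (slope, intercept, r) = stats.probplot(residuals, dist="norm")
print(round(r, 3))
```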
Is the variance consistent across the range of predicted values?
# Plot the absolute value of residuals against the predicted values
chart = alt.Chart(results, title="Testing for Heteroskedasticity").mark_point().encode(
    x=alt.X('Predictions:Q').title("Predicted Values"),
    y=alt.Y('y:Q').title("Absolute Value of Residuals")
).transform_calculate(y='abs(datum.Residuals)')

chart + chart.transform_loess('Predictions', 'y').mark_line()
If the line is horizontal, there’s no heteroskedasticity.