CIS 241
Dr. Ladd
Come up with a rationale for why you think they would be related.
It’s not a good idea to just try to regress any set of variables together.
Correlation does not mean causation!!
The familiar line equation \(Y=mX+b\) can also be written as: \(Y=b_{1}X+b_{0}\)
\(Y\) is your target (dependent) variable.
\(X\) is your predictor (independent) variable. (There will eventually be many predictors.)
Coefficients:
\(b_{1}\) (or \(m\)) describes the slope of the line (and its direction).
\(b_{0}\) (or \(b\)) describes the height of the line when \(X\) is 0. This is called the y-intercept or simply the intercept.
Finding the coefficient values that best match the data is what it means to “fit” a linear model.
Bivariate regression:
\(Y=b_{0}+b_{1}X\)
Multivariate regression:
\(Y=b_{0}+b_{1}X_{1}+b_{2}X_{2}+b_{3}X_{3}+...\)
As many as you want, but make sure you develop a rationale (use your human brain)!
Do you have good reason to believe that a linear regression or predictive model would help? Is there a relationship between variables that’s worth learning about?
We can see a general trend: as engine size goes up, fuel efficiency goes down. Now we’re ready to try modeling this relationship as part of a larger linear regression model.
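Here's a minimal sketch of that scatterplot, assuming the cars data is already loaded into a DataFrame called cars (as in the code below) and that Altair is the plotting library (it's used for charts later in this lesson):
import altair as alt
# Plot engine size against fuel efficiency
alt.Chart(cars, title="Engine Size vs. Fuel Efficiency").mark_point().encode(
    x=alt.X('Displacement:Q', title="Engine Displacement"),
    y=alt.Y('Miles_per_Gallon:Q', title="Miles per Gallon")
)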
sklearn
Full scikit-learn documentation here.
A training set will determine the model’s coefficients, and a test set will let us see how well it works on new data that it hasn’t already seen.
That’s why we imported the LinearRegression class above.
target = "Miles_per_Gallon" # Our target variable
predictors = ["Displacement", "Horsepower", "Acceleration"] # A list of predictors
X = cars[predictors]
y = cars[target]
# The train_test_split() function will create 4 variables
X_train, X_test, y_train, y_test = train_test_split(
X, # The first argument is the X variable
y, # The second argument is the Y variable
test_size=0.4, # Proportion of the data in the test set
random_state=0) # Set a random state so it's the same every time
our_model = LinearRegression() # Make an instance of this class
our_model.fit(X_train, y_train) # Run the fit() method on training data
Why did we get an error?
With a coefficient for displacement of -0.0384, this linear regression provides evidence that as engine displacement increases, fuel efficiency decreases slightly!
For every additional unit of engine displacement, the expected fuel efficiency decreases by 0.0384 miles per gallon.
What would the other coefficients mean?
The intercept indicates that if all predictor variables were 0, fuel efficiency would be 45.392 miles per gallon. Why doesn’t this number make any sense?
Be careful not to imply that there is a direct causal link, especially without more evidence or studies.
target = "Miles_per_Gallon" # Our target variable
predictors = ["Displacement", "Horsepower", "Acceleration"] # A list of predictors
X = cars[predictors]
y = cars[target]
# Split the data
X_train, X_test, y_train, y_test = train_test_split(
X,
y,
test_size=0.4,
random_state=0)
# Fit the model
our_model = LinearRegression()
our_model.fit(X_train, y_train)
# View coefficients
print(f"Intercept: {our_model.intercept_:.3f}")
for c, p in zip(our_model.coef_, X.columns):
    print(f"Coefficient for {p}: {c:.4f}")
Try to make an effective multivariate linear model to predict housing prices in Seattle.
Take a look at the dataset and logically choose some predictors. Check for multicollinearity before you run your model! When you’re done, try to predict housing price based on some new data points you create.
Load house_sales.tsv. You’ll need to open this with:
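Here's a sketch, assuming pandas is imported as pd (the DataFrame name is just an example). Because the file is tab-separated, you need the sep argument:
# Read a tab-separated file by specifying the separator
house_sales = pd.read_csv("house_sales.tsv", sep="\t")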
As you add more predictors, RMSE will decrease. But think about how much it increases or decreases with each change.
Multicollinearity (highly correlated predictors) will confuse the model and mess up your results! It could even result in false predictions.
You must test for multicollinearity before you select your predictors and run your model.
You can do a pairwise comparison of the variables you’re thinking about.
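For example, a correlation matrix is a quick pairwise check. This sketch assumes the cars data and predictors from above:
# Pairwise correlations among candidate predictors
cars[["Displacement", "Horsepower", "Acceleration"]].corr()
Pairs with correlations near 1 or -1 are candidates for multicollinearity.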
Reference coding converts categorical variables to a set of binary variables.
The first category should always be left out as the reference (drop_first=True). All the remaining slopes are relative to that reference!
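Here's a minimal sketch with pandas get_dummies(), using an Origin column as a hypothetical categorical predictor:
# Convert a categorical column into binary variables, dropping the first as the reference
dummies = pd.get_dummies(cars["Origin"], drop_first=True)
cars_coded = pd.concat([cars, dummies], axis=1)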
Ask yourself: is there an important variable that the data doesn’t account for?
The predict() method uses the model's coefficients to calculate new values.
# Let's focus on out-of-sample prediction for validation
predictions = our_model.predict(X_test)
predictions
We can predict with our training data (in-sample prediction) or with our test data (out-of-sample prediction).
Residuals are the differences between the actual observed values and the ones the model predicted.
Think of these as the “errors” that the modeling method produced. If the residuals are symmetrically distributed with the median close to zero, the model may fit the data well.
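In code, that's a simple subtraction. This uses the out-of-sample predictions from above:
# Residuals: observed values minus predicted values
residuals = y_test - predictions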
# Import the error metric (numpy is also needed)
import numpy as np
from sklearn.metrics import mean_squared_error

# The raw value, out-of-sample
np.sqrt(mean_squared_error(y_test, predictions))
# Or neatly printed
print(f"Root mean squared error: {np.sqrt(mean_squared_error(y_test, predictions)):.2f}")
This is a good metric for comparing models.
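For instance, you can compare in-sample and out-of-sample RMSE for the same model. A sketch, using the variables defined above:
# In-sample RMSE: how well the model fits data it has already seen
train_predictions = our_model.predict(X_train)
print(f"In-sample RMSE: {np.sqrt(mean_squared_error(y_train, train_predictions)):.2f}")
# Out-of-sample RMSE: how well it generalizes to new data
print(f"Out-of-sample RMSE: {np.sqrt(mean_squared_error(y_test, predictions)):.2f}")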
This is also called the “coefficient of determination.”
\(R^{2}\) ranges from 0 to 1. If it were 1, the predictions would match the observed values exactly. If it were 0, the predictors wouldn't predict the target variable at all.
In this example, \(R^{2}=0.67\), so our predictors account for about 67% of the variation in fuel efficiency.
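One way to get this number in scikit-learn is the model's score() method, which returns \(R^{2}\) for linear regression. A sketch using the fitted model from above:
# R-squared on the test data
print(f"R-squared: {our_model.score(X_test, y_test):.2f}")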
There’s no rule for what makes an \(R^{2}\) “good.” Consider the context and purpose of your analysis!
In an analysis of ecology or human behavior (very unpredictable), an \(R^{2}\) of 0.20 or 0.30 might be considered good. In an analysis predicting mechanical repairs or recovery from medical procedures, an \(R^{2}\) of 0.60 or 0.70 might be considered very poor.
# First put the results into a dataframe
results = pd.DataFrame({'Predictions': predictions, 'Residuals': residuals})
alt.Chart(results, title="Histogram of Residuals").mark_bar().encode(
    x=alt.X('Residuals:Q', title="Residuals").bin(maxbins=20),
    y=alt.Y('count():Q', title="Value Counts")
)
You could also use a Q-Q plot for this!
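Here's one way to sketch a Q-Q plot with scipy and Altair, assuming the residuals variable from above (scipy is an extra dependency):
import numpy as np
import pandas as pd
import altair as alt
from scipy import stats

# Compare sorted residuals to theoretical quantiles of a normal distribution
sorted_resid = np.sort(residuals)
n = len(sorted_resid)
qq = pd.DataFrame({
    'Theoretical': stats.norm.ppf((np.arange(1, n + 1) - 0.5) / n),
    'Observed': sorted_resid
})
alt.Chart(qq, title="Q-Q Plot of Residuals").mark_point().encode(
    x=alt.X('Theoretical:Q', title="Theoretical Quantiles"),
    y=alt.Y('Observed:Q', title="Observed Residuals")
)
If the residuals are normally distributed, the points will fall along a straight line.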
Is the variance consistent across the range of predicted values?
# Plot the absolute value of residuals against the predicted values
chart = alt.Chart(results, title="Testing for Heteroskedasticity").mark_point().encode(
    x=alt.X('Predictions:Q').title("Predicted Values"),
    y=alt.Y('y:Q').title("Absolute value of Residuals")
).transform_calculate(y='abs(datum.Residuals)')
chart + chart.transform_loess('Predictions', 'y').mark_line()
If the line is roughly horizontal, there's no heteroskedasticity.