We have recently looked at how including more than one continuous variable in a regression model works to “control” for the other variables. This means that we obtain an effect for one variable while the model holds the value of all other variables constant.
In this lesson we explore what it means to let these variables “interact” - what does that mean?
Basically, rather than assessing what happens to one variable while other variables are held constant, we now seek to ask whether the effect of one particular \(X\) on a \(Y\) will be influenced by another \(X\).
For example, consider the following variables:
\(Y\) = total amount of writing completed by a PhD student
\(X_1\) = amount of positive energy
\(X_2\) = length of time before thesis submission deadline
We might expect that each \(X\) has an independent effect on \(Y\):
But might there be an interaction between the two? Would the effect of positive energy become stronger or weaker as the time before deadline becomes shorter? If yes, then we would want to model this relationship in our statistical model so that our estimate of positive energy is more accurate across the lifespan of a PhD student.
Perhaps positive energy has a very strong effect at the start of a PhD, and this effect wanes over time? If we knew this, we would tell PhD students near the end of their study to focus more on something besides positive energy - perhaps anxiety becomes a stronger predictor? (this is of course completely hypothetical!)
Anyhow, let’s get back to our penguins:
Let’s think of an interaction in terms of penguins (right?). We will
again use body_mass_g
as our \(Y\), and bill_depth_mm
and
bill_length_mm
as \(X_1\)
and \(X_2\).
Do we think there might be an interaction between bill depth and bill length? We can try to visualise potential effects using ggplot.
bill_length_mm
?Maybe there is a relationship between these two variables that we could explore in a regression model?
ggplot(drop_na(penguins), aes(y = bill_depth_mm, x = bill_length_mm, size = body_mass_g, color = body_mass_g)) +
geom_jitter(alpha = .5) +
scale_color_viridis_c() +
scale_size_continuous(guide = 'none')
We can also explore the correlation between the two variables, here we see they are negatively correlated. This doesn’t guarantee there is an interaction, but we do see they covary!
##
## Pearson's product-moment correlation
##
## data: penguins$bill_depth_mm and penguins$bill_length_mm
## t = -4.4591, df = 340, p-value = 1.12e-05
## alternative hypothesis: true correlation is not equal to 0
## 95 percent confidence interval:
## -0.3328072 -0.1323004
## sample estimates:
## cor
## -0.2350529
Let’s z-score the variables so that we can put them on the same metric / interpretation.
# create manual z-scored versions
penguins <- drop_na(penguins) %>%
mutate(bill_depth_z = (bill_depth_mm - mean(bill_depth_mm)) / sd(bill_depth_mm),
bill_length_z = (bill_length_mm - mean(bill_length_mm)) / sd(bill_length_mm))
Here I will fit 3 models: one per each \(X\) individually, plus a third model with both \(X\) fit as main effets:
m1 <- lm(body_mass_g ~ bill_depth_z, data = penguins)
m2 <- lm(body_mass_g ~ bill_length_z, data = penguins)
m3 <- lm(body_mass_g ~ bill_depth_z + bill_length_z, data = penguins)
Remind yourself of what this output means - bill depth has a significant relationship with penguin body weight…
##
## Call:
## lm(formula = body_mass_g ~ bill_depth_z, data = penguins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1616.08 -510.38 -68.67 486.51 1811.12
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4207.06 38.96 107.986 <2e-16 ***
## bill_depth_z -380.07 39.02 -9.741 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 710.9 on 331 degrees of freedom
## Multiple R-squared: 0.2228, Adjusted R-squared: 0.2205
## F-statistic: 94.89 on 1 and 331 DF, p-value: < 2.2e-16
As does bill length…
##
## Call:
## lm(formula = body_mass_g ~ bill_length_z, data = penguins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1759.38 -468.82 27.79 464.20 1641.00
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4207.06 35.70 117.85 <2e-16 ***
## bill_length_z 474.64 35.75 13.28 <2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 651.4 on 331 degrees of freedom
## Multiple R-squared: 0.3475, Adjusted R-squared: 0.3455
## F-statistic: 176.2 on 1 and 331 DF, p-value: < 2.2e-16
And they both remain significant when added together, but have been adjusted based on coexisting within the model space:
Do you remember how to interpret these coefficients?
##
## Call:
## lm(formula = body_mass_g ~ bill_depth_z + bill_length_z, data = penguins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1806.74 -456.82 15.33 471.07 1541.34
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4207.06 32.30 130.257 < 2e-16 ***
## bill_depth_z -286.54 33.23 -8.624 2.76e-16 ***
## bill_length_z 409.13 33.23 12.313 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 589.4 on 330 degrees of freedom
## Multiple R-squared: 0.4675, Adjusted R-squared: 0.4642
## F-statistic: 144.8 on 2 and 330 DF, p-value: < 2.2e-16
As a reminder, the slopes / coefficients for either \(X\) do not change when values of the other \(X\) change. This means we are not capturing the effect of \(X_1\) at different levels of \(X_2\) (or the other way around).
So how do we fit an interaction term? We simply use multiplication. The effects of \(X_1\) and \(X_2\) are literally multiplied to create a new effect. The inclusion of a multiplication term will introduce the combined and joint effects of the variables above their additive effects.
The formula for an interaction is:
\[ \large Y \sim b_0 + (b_1*X_1) + (b_2*X_2) + b_3 * (X_1 * X_2) + \epsilon \] The new coefficient term, \(b_3\), captures the joint contribution of both variables on \(Y\). This new term is used to determine how \(X_1\) or \(X_2\) will influence \(Y\) at different levels of either \(X\).
So to fit an interaction term, we type this:
And inspect the model per normal:
##
## Call:
## lm(formula = body_mass_g ~ bill_depth_z * bill_length_z, data = penguins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1814.0 -353.6 2.9 352.4 1603.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4117.91 28.90 142.501 < 2e-16 ***
## bill_depth_z -205.16 29.47 -6.961 1.84e-11 ***
## bill_length_z 522.15 30.34 17.211 < 2e-16 ***
## bill_depth_z:bill_length_z -391.10 35.84 -10.911 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 505.8 on 329 degrees of freedom
## Multiple R-squared: 0.609, Adjusted R-squared: 0.6054
## F-statistic: 170.8 on 3 and 329 DF, p-value: < 2.2e-16
The intercept differs because we have added a new term: the interaction between the two variables. This interaction alters how the predictors relate to the outcome depending on the value of each other variable.
As a result, the coefficients for the predictors
(bill_depth
and bill_length
) can no longer be
interpreted as main effects. Instead, they represent
conditional effects, meaning their interpretation
depends on the value of the other predictor.
bill_depth
is associated with a decrease of 205 grams in body mass when bill_length
is at its mean.
bill_length
being at its mean.)The interaction term quantifies how the relationship between each predictor and \(Y\) changes depending on the value of the other predictor.
bill_depth
increases by one unit (e.g., 1 SD in this
case), the slope of bill_length
decreases
by 391 grams.bill_length
increases by one unit, the
slope of bill_depth
decreases by 391
grams.This reflects the reciprocal nature of the interaction term: each predictor’s effect on \(Y\) depends on the level of the other predictor.
The interaction term modifies the slope, not the coefficient values of the main effects directly. The interaction tells us how the slope of one predictor changes as the value of the other predictor increases.
Visualising the interaction helps to demonstrate this. There are
various ways to visualise interactions - a quick way is to use the
plot_model
function from the sjPlot
library.
How do we interpret this plot? The two lines represent the
conditional effects of bill_depth
(on the x-axis) at two
different levels of bill_length
(the legend).
bill_length_z
= -2.17, the slope
for bill_depth_z
is positivebill_length_z
= 2.85, the slope for
bill_depth_z
is negativeSo if someone asked you whether penguin body weight is affected by bill depth, you’d say “yes, but the effect depends (or is conditional) on the length of the penguin’s bill!
## #refugeeswelcome
What are these two values of -2.18 and 2.85? They are the smallest and largest values for this variable in the raw data:
## [1] -2.174715
## [1] 2.853932
The sjPlot package picks the min/max values for us. Using more
sophisticated methods from libraries such as emmeans
, we
can pick and choose to display the effects of one \(X\) at any level of the other variable. On
choice might be to show the effect of \(X_1\) while \(X_1\) is at its mean, as well as +/- 1SD
from the mean. Compare this to the earlier plot, where the slope
remained the same.
Whether the interaction term is “significant” is the same as any other test - whether the estimate / standard error is far enough away from zero to reach a p value below .05.
But we can also look at the model fit in terms of R2 change and other metrics:
Compare the r2 of the models with and without the interaction term:
## [1] 0.4674645
## [1] 0.6089661
There is a formal way to test the fit of two “nested” models using
the anova()
function
## Analysis of Variance Table
##
## Model 1: body_mass_g ~ bill_depth_z + bill_length_z
## Model 2: body_mass_g ~ bill_depth_z * bill_length_z
## Res.Df RSS Df Sum of Sq F Pr(>F)
## 1 330 114633409
## 2 329 84173822 1 30459587 119.05 < 2.2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
Let’s say we have a penguin with a bill depth of 22mm and bill length of 50mm. What would their predicted body weight be? We use the same approach, with this formula. First we’d have to z-score the values (since the model predictions are based on z-scores).
## [1] 2.455336
## [1] 1.098477
\[ \large Y \sim b_0 + (b_1*X_1) + (b_2*X_2) + b_3 * (X_1 * X_2) \]
##
## Call:
## lm(formula = body_mass_g ~ bill_depth_z * bill_length_z, data = penguins)
##
## Residuals:
## Min 1Q Median 3Q Max
## -1814.0 -353.6 2.9 352.4 1603.6
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 4117.91 28.90 142.501 < 2e-16 ***
## bill_depth_z -205.16 29.47 -6.961 1.84e-11 ***
## bill_length_z 522.15 30.34 17.211 < 2e-16 ***
## bill_depth_z:bill_length_z -391.10 35.84 -10.911 < 2e-16 ***
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 505.8 on 329 degrees of freedom
## Multiple R-squared: 0.609, Adjusted R-squared: 0.6054
## F-statistic: 170.8 on 3 and 329 DF, p-value: < 2.2e-16
First put the intercept and coefficients into the formula:
\[ \large \text{body weight} = 4117.91 + (-205.16*X_1) + (522.15*X_2) + -391 * (X_1 * X_2) \]
Now put our new values in (remember, we z-scored them)
\[ \large \text{body weight} = 4117.91 + (-205.16* 2.455) + (522.15*1.099) + -391 * (2.455 * 1.099) \]
Then complete the calculation:
\[ \large \text{body weight} = 4117.91 + (-503.6678) + (573.8428) + (-391 * (2.698)) \] Another step…
\[ \large \text{body weight} = 4117.91 + (-503.6678) + (573.8428) + -1054.918 \]
The final step…
\[ \large \text{body weight} = 3133.167 \]
If you knew that one of your \(X\) variables had a different effect depending on the value of another \(X\) variable, would you still care about the simple/main/singular effect?