library(tidyverse)
library(sjPlot)
## number of storks, birth rate and land area for 17 European countries
storks <- tibble(
  storks = c(100,300,1,5000,9,140,3300,2500,4,5000,5,30000,1500,5000,8000,150,25000),
  birth_rate = c(83,87,118,117,59,774,901,106,188,124,551,610,120,367,439,82,1576),
  area_km2 = c(29,84,31,111,43,544,357,132,42,93,301,312,92,237,505,41,779)
)
Controlling for confounders
Using multiple regression to better test causal relationships
Controlling for spurious correlations
In the section about causality we discussed the difference between correlation and causation, and what techniques we can use to distinguish between the two. One of those techniques is to use multivariate analysis techniques, such as multiple regression. In this tutorial we will show you how to use multiple regression to control for confounding variables, and how this helps us deal with spurious correlations. As example data we’ll be using a famous example of a spurious correlation: the number of storks and the birth rate of a country.
Do storks deliver babies?
There is a famous example of confounding that points to the surprisingly high correlation between the number of storks and the number of newborn babies across European countries (\(\rho = 0.62\)). Based on this empirical finding, could one actually argue that there might be some truth to the old folk tale that storks deliver babies? The answer to this question is obviously no, but it does raise an important issue: how can we distinguish between correlation and causation? The example of storks and babies is sufficiently ridiculous to make it clear that this is a spurious correlation, but how can we be sure in more realistic cases?
One of the techniques in our tool belt is to control for confounding variables in a multiple regression model. Let’s see how this works in practice using the storks and babies data from the article.
Correlation analysis
The data we have here contains the number of storks, the birth_rate, and the area_km2 of each country. Let’s first have a look at the correlations.
tab_corr(storks)
           | storks   | birth_rate | area_km2
storks     |          | 0.620**    | 0.579*
birth_rate | 0.620**  |            | 0.922***
area_km2   | 0.579*   | 0.922***   |

Computed correlation used pearson-method with listwise-deletion.
As reported in the article, the correlation between the number of storks and the birth rate is high at \(r = 0.62\). But we also see that the area of the country is correlated with both the number of storks and the birth rate (\(r = 0.58\) and \(r = 0.92\) respectively). This is a good indication that the area of the country is likely a confounding variable.
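To make the role of the confounder more concrete, we can also compute the partial correlation between storks and birth_rate while holding area_km2 constant. The sketch below is not part of the original analysis; it uses only base R and the storks data defined above, correlating the residuals of both variables after regressing each of them on area_km2.

## partial correlation of storks and birth_rate, controlling for area_km2:
## correlate the residuals that remain after regressing each variable on area_km2
res_storks <- resid(lm(storks ~ area_km2, data = storks))
res_birth  <- resid(lm(birth_rate ~ area_km2, data = storks))
cor(res_storks, res_birth)

If area really is the confounder, this partial correlation should be much closer to zero than the raw \(r = 0.62\), which is exactly what the regression below will show in a different way.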
Multiple regression analysis
Now let’s see what happens when we put the number of storks into a regression model. Our dependent variable is birth_rate, our independent variable is storks, and we’ll control for area_km2. The variables are standardized first, so that you can see the relation to the correlation coefficients. We’ll run two models: one without and one with the control variable.
## standardize all variables
storks_z <- storks |> scale() |> as_tibble()

## model 1: storks only; model 2: storks plus the area_km2 control
m1 <- lm(birth_rate ~ storks, data = storks_z)
m2 <- lm(birth_rate ~ storks + area_km2, data = storks_z)

tab_model(m1, m2)
                 | birth_rate (m1)                    | birth_rate (m2)
Predictors       | Estimates     | CI           | p     | Estimates     | CI           | p
(Intercept)      | -0.00         | -0.42 – 0.42 | 1.000 | 0.00          | -0.21 – 0.21 | 1.000
storks           | 0.62          | 0.19 – 1.05  | 0.008 | 0.13          | -0.13 – 0.39 | 0.304
area_km2         |               |              |       | 0.85          | 0.59 – 1.11  | <0.001
Observations     | 17            |              |       | 17            |              |
R2 / R2 adjusted | 0.385 / 0.344 |              |       | 0.862 / 0.842 |              |
In the first model we see a positive effect of the number of storks on the birth rate (\(\beta = 0.62\)). The effect is significant (p < 0.01), and the standardized coefficient is the same as the correlation coefficient. So here you see a spurious effect firsthand.
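You can verify this equivalence yourself with a quick check (a minimal sketch using the m1 and storks objects created above): the slope of the standardized bivariate model should match the Pearson correlation.

## the standardized slope of m1 should equal the Pearson correlation (both about 0.62)
coef(m1)["storks"]
cor(storks$storks, storks$birth_rate)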
In the second model, where we control for the area of the country, the effect of the number of storks on the birth rate is no longer significant (p = 0.304). Instead, we see a very strong effect of the area of the country on the birth rate (\(\beta = 0.85\), p < 0.001). So here you see the power of multiple regression in action: by controlling for area_km2, the spurious effect of storks on birth_rate disappears.
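If you want a formal comparison of the two models, one option (not shown in the original output) is an F-test for nested models, which asks whether adding area_km2 significantly improves the fit.

## F-test comparing the model without and with the control variable
anova(m1, m2)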
Why controlling for confounding is important
If we do not control for confounding variables, the coefficients in our regression model might be biased. Sometimes the effect is overestimated, as we saw here with the effect of storks on the birth rate. But the effect can also be suppressed, or even reversed!
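To see that omitting a confounder can even flip the sign of a coefficient, here is a small simulated example (hypothetical data, not related to the stork dataset): the true effect of x on y is negative, but both x and y depend strongly on a confounder z, so the bivariate model reports a positive slope.

## hypothetical simulation of sign reversal due to an omitted confounder
set.seed(1)
z <- rnorm(500)
x <- z + rnorm(500, sd = 0.5)
y <- -0.5 * x + 2 * z + rnorm(500, sd = 0.5)

coef(lm(y ~ x))      # slope of x is positive: the sign is reversed
coef(lm(y ~ x + z))  # controlling for z recovers the negative effect of about -0.5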
Being able to control for confounding variables enables us to better explore possible causal relationships between variables. Based on the first model, we might have concluded that if the number of storks increases or decreases, the birth rate of the country will follow. Imagine the policy implications of such a finding! If a government wanted to increase the birth rate, it would simply have to breed more storks.
The second model, however, shows that this relation between storks and birth rate is spurious. This suggests that our stork breeding program would not have the desired effect. Instead, the model gives reason to believe that the birth rate is influenced by the size of the country, which is a much more plausible explanation. If a government wanted to increase the birth rate, it would have to increase the size of the country. And this is why we have too many wars, and too few storks.