library(tidyverse)
d <- read_csv("https://tinyurl.com/R-practice-data") |>
  rename(news_consumption = `news consumption`)
Correlation test
Testing the relationship between numeric variables
What is the correlation test?
The correlation test can be used to see if there is a relationship between two numeric variables. For example, we might want to know if there is a relationship between the amount of time people spend on social media and their feelings of loneliness.
The relationship is expressed as a measure (the correlation coefficient) on a scale from -1 to 1 that tells us how strong the relationship is, and whether it is positive or negative. We also get a p-value that tells us whether the relationship is significant. In this tutorial we focus on conducting the test. For more details on the concept of correlation and how to interpret the correlation coefficient \(r\), see the covariance and correlation section.
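As a minimal illustration of that scale, the base R cor function returns just the coefficient. This sketch uses simulated data (the variable names social_media and loneliness are made up for this example; they are not in the practice data):

```r
# Simulated example data: loneliness is built to increase with social
# media use, so the correlation should come out positive
set.seed(42)
social_media <- rnorm(100, mean = 3, sd = 1)
loneliness <- 0.5 * social_media + rnorm(100, sd = 1)
cor(social_media, loneliness)  # a value between -1 and 1, here positive
```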
What the correlation test does not tell us is why the relationship exists. If we find that there is a correlation between social media use and loneliness, it does not mean that social media use causes loneliness. For more detail on this, see the causality section.
How to use
We’ll show two ways of conducting the correlation test: using the cor.test function, and using the tab_corr function from the sjPlot package. The cor.test function is part of base R, and is useful if you just have two variables. The tab_corr function is part of the sjPlot package, and is useful if you have multiple variables and want to see the correlations between all of them.
To demonstrate the correlation test, we’ll use the practice data that we’ve used before.
Between two variables: cor.test
In the practice data we have variables for the news_consumption of the participants, and their trust_t1 in the news media (measured before the experiment). These are both numeric variables, so we can use the correlation test to see if there is a relationship between them.
To perform the correlation test using the cor.test function, we pass the two variables to the function. Recall that we can use the $ symbol to access variables in a data frame. So here we access the news_consumption and trust_t1 variables from the d data frame.
cor.test(d$news_consumption, d$trust_t1)
Pearson's product-moment correlation
data: d$news_consumption and d$trust_t1
t = 3.334, df = 598, p-value = 0.000909
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.05564478 0.21283051
sample estimates:
cor
0.1350875
The output gives us the correlation coefficient under sample estimates, which in our case is \(r = 0.135\). This means that the correlation is positive (> 0), but quite weak. The test also reports a p-value (0.001), which tells us that the correlation is significant at the \(95\%\) confidence level (p < 0.05). Also note that the degrees of freedom are reported as 598, which is the number of observations minus 2.
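If you want to reuse these numbers (for example when reporting), note that cor.test returns a list, so you can pull the values out by name. A small sketch on simulated data, since the practice data is loaded from a URL:

```r
# Simulated stand-in for the practice data: 600 observations with a
# weak positive relationship, mirroring the result above
set.seed(1)
x <- rnorm(600)
y <- 0.14 * x + rnorm(600)
res <- cor.test(x, y)
res$estimate   # the correlation coefficient r
res$parameter  # the degrees of freedom (n - 2)
res$p.value    # the p-value
```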
It is also possible to use the formula notation in the cor.test function. However, this looks a bit weird, because in the correlation test there are no dependent variables, since we are just looking at the relationship. The convention for formulas in R is to use the ~ symbol to separate the dependent and independent variables (dependent ~ independent_1 + independent_2 + ...). For the correlation test, we just omit the dependent variable. So for the example above, the code would look like this:
cor.test(~ news_consumption + trust_t1, data = d)
Between multiple variables: tab_corr
If we need to see the correlations between multiple variables, we can use the tab_corr function from the sjPlot package. For example, let’s look at the correlations between the 5 items used to calculate the trust_t1 scale. To use tab_corr, we can simply pass it a data frame with the variables we want to use. Here we use the select function to select all columns from trust_t1_item1 to trust_t1_item5.
library(sjPlot)
d |>
  select(trust_t1_item1:trust_t1_item5) |>
  tab_corr()
tab_corr()
               | trust_t1_item1 | trust_t1_item2 | trust_t1_item3 | trust_t1_item4 | trust_t1_item5
trust_t1_item1 |                | 0.335***       | -0.707***      | 0.715***       | 0.801***
trust_t1_item2 | 0.335***       |                | -0.301***      | 0.288***       | 0.321***
trust_t1_item3 | -0.707***      | -0.301***      |                | -0.633***      | -0.756***
trust_t1_item4 | 0.715***       | 0.288***       | -0.633***      |                | 0.744***
trust_t1_item5 | 0.801***       | 0.321***       | -0.756***      | 0.744***       |
Computed correlation used pearson-method with listwise-deletion.
This is great for getting a quick overview of the relationships between multiple variables. If you need to get the exact p-values, you can also use tab_corr(p.numeric = TRUE).
Conditions and assumptions
The correlation test can be used when you have two numeric variables and you want to know if there is a relationship between them. There is no dependent or independent variable in the correlation test, because we are just looking at the relationship between the variables.
There are four main conditions that need to be met for the correlation test to be valid:
- Continuous variables: Both variables should be numeric, and measured on an interval or ratio scale.
- Linearity: The relationship between the variables should be linear.
- Normality: The variables should be (roughly) normally distributed.
- No outliers: Extreme values can have a big impact on the correlation coefficient.
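A quick way to eyeball these conditions is to plot the data before running the test. This is just a sketch on simulated x and y; with the practice data you would plot d$news_consumption and d$trust_t1 instead:

```r
# Simulated data for illustration
set.seed(1)
x <- rnorm(200)
y <- 0.3 * x + rnorm(200)

plot(x, y)  # linearity and outliers: points should form a straight, even band
hist(x)     # normality: histograms should look roughly bell-shaped
hist(y)
```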
If these conditions are not met, you might want to use Spearman’s \(\rho\) or Kendall’s \(\tau\) instead of the Pearson correlation coefficient. Spearman’s \(\rho\) looks at the rank order of the variables instead of the specific scores, which makes it suitable for ordinal data, and makes it more robust to outliers, non-linear relationships, and non-normality. Kendall’s \(\tau\) is similar to Spearman’s \(\rho\), but can be preferable if there are many ties in the data (when two or more observations have the same value, due to which they can’t be ranked perfectly).
Spearman’s \(\rho\) and Kendall’s \(\tau\)
To use Spearman’s \(\rho\) or Kendall’s \(\tau\), you need to specify the method argument in the cor.test function, or the corr.method argument in the tab_corr function.
cor.test(d$news_consumption, d$trust_t1, method = "spearman")
cor.test(d$news_consumption, d$trust_t1, method = "kendall")
d |> select(trust_t1_item1:trust_t1_item5) |> tab_corr(corr.method = "spearman")
d |> select(trust_t1_item1:trust_t1_item5) |> tab_corr(corr.method = "kendall")
How to report
For APA style reporting of the correlation test you need to know the correlation coefficient, p-value, and the degrees of freedom. You can find all these values in the output of the cor.test function.1
The formula for reporting the results of a correlation test in APA style is:
r(degrees of freedom) = correlation coefficient, p = p-value.
- The correlation coefficient is rounded to two decimal places, and the number before the decimal point is omitted if zero (.11 instead of 0.11).
- The p-value is reported for the smallest significance level (p < 0.05, p < 0.01 or p < 0.001), or in full with three decimal places if not significant (p = 0.123).
For example, if we observe a correlation coefficient of \(r = 0.107\) with a p-value of \(p = 0.009\) and \(298\) degrees of freedom, we would report this as:
There is a positive correlation between age and trust in journalists, \(r(298) = .11\), \(p < 0.01\).
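If you report many correlations, you can let R build the APA string for you. The apa_cor helper below is hypothetical (not part of any package); it simply applies the rounding rules above to a cor.test-style result:

```r
# Hypothetical helper: format a correlation result in APA style
apa_cor <- function(res) {
  # round r to two decimals and drop the leading zero (.11 instead of 0.11)
  r <- sub("^(-?)0\\.", "\\1.", sprintf("%.2f", unname(res$estimate)))
  p <- res$p.value
  p_txt <- if (p < 0.001) "p < 0.001" else if (p < 0.01) "p < 0.01" else
    if (p < 0.05) "p < 0.05" else sprintf("p = %.3f", p)
  sprintf("r(%d) = %s, %s", as.integer(unname(res$parameter)), r, p_txt)
}

# The numbers from the example above; a real cor.test result has the
# same $estimate, $parameter, and $p.value elements
res <- list(estimate = 0.107, parameter = 298, p.value = 0.009)
apa_cor(res)  # "r(298) = .11, p < 0.01"
```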
Footnotes
1. When using tab_corr, you can use the p.numeric = TRUE argument to get the p-values. The degrees of freedom is always the number of observations minus 2, but be careful not to count missing (NA) values.