library(tidyverse)
d <- read_csv("https://tinyurl.com/R-practice-data") |>
  rename(news_consumption = `news consumption`)
Correlation test
Testing the relationship between numeric variables
What is the correlation test?
The correlation test can be used to see if there is a relationship between two numeric variables. For example, we might want to know if there is a relationship between the amount of time people spend on social media and their feelings of loneliness.
The relationship is expressed as a measure (the correlation coefficient) on a scale from -1 to 1 that tells us how strong the relationship is, and whether it is positive or negative. We also get a p-value that tells us whether the relationship is significant. In this tutorial we focus on conducting the test. For more details on the concept of correlation and how to interpret the correlation coefficient \(r\), see the covariance and correlation section.
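As a minimal illustration of that scale, the base R cor function returns just the coefficient. This sketch uses simulated data (the variable names social_media and loneliness are made up for this example; they are not in the practice data):

```r
# Simulated example data: loneliness is built to increase with social
# media use, so the correlation should come out positive
set.seed(42)
social_media <- rnorm(100, mean = 3, sd = 1)
loneliness <- 0.5 * social_media + rnorm(100, sd = 1)
cor(social_media, loneliness)  # a value between -1 and 1, here positive
```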
What the correlation test does not tell us is why the relationship exists. If we find that there is a correlation between social media use and loneliness, it does not mean that social media use causes loneliness. For more detail on this, see the causality section.
How to use
We’ll show two ways of conducting the correlation test: using the cor.test function, and using the tab_corr function from the sjPlot package. The cor.test function is part of base R, and is useful if you just have two variables. The tab_corr function is part of the sjPlot package, and is useful if you have multiple variables and want to see the correlations between all of them.
To demonstrate the correlation test, we’ll use the practice data that we’ve used before.
Between two variables: cor.test
In the practice data we have variables for the news_consumption of the participants, and their trust_t1 in the news media (measured before the experiment). These are both numeric variables, so we can use the correlation test to see if there is a relationship between them.
To perform the correlation test using the cor.test function, we pass the two variables to the function. Recall that we can use the $ symbol to access variables in a data frame. So here we access the news_consumption and trust_t1 variables from the d data frame.
cor.test(d$news_consumption, d$trust_t1)
Pearson's product-moment correlation
data: d$news_consumption and d$trust_t1
t = 3.334, df = 598, p-value = 0.000909
alternative hypothesis: true correlation is not equal to 0
95 percent confidence interval:
0.05564478 0.21283051
sample estimates:
cor
0.1350875
The output gives us the correlation coefficient under sample estimates, which in our case is \(r = 0.135\). This means that the correlation is positive (> 0), but quite weak. The test also reports a p-value (0.001), which tells us that the correlation is significant at the \(95\%\) confidence level (p < 0.05). Also note that the degrees of freedom are reported as 598, which is the number of observations minus 2.
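If you want to reuse these numbers (for example when reporting), note that cor.test returns a list, so you can pull the values out by name. A small sketch on simulated data, since the practice data is loaded from a URL:

```r
# Simulated stand-in for the practice data: 600 observations with a
# weak positive relationship, mirroring the result above
set.seed(1)
x <- rnorm(600)
y <- 0.14 * x + rnorm(600)
res <- cor.test(x, y)
res$estimate   # the correlation coefficient r
res$parameter  # the degrees of freedom (n - 2)
res$p.value    # the p-value
```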
It is also possible to use the formula notation in the cor.test function. However, this looks a bit weird, because in the correlation test there are no dependent variables, since we are just looking at the relationship. The convention for formulas in R is to use the ~ symbol to separate the dependent and independent variables (dependent ~ independent_1 + independent_2 + ...). For the correlation test, we just omit the dependent variable. So for the example above, the code would look like this:
cor.test(~ news_consumption + trust_t1, data = d)
Between multiple variables: tab_corr
If we need to see the correlations between multiple variables, we can use the tab_corr function from the sjPlot package. For example, let’s look at the correlations between the 5 items used to calculate the trust_t1 scale. To use tab_corr, we can simply pass it a data frame with the variables we want to use. Here we use the select function to select all columns from trust_t1_item1 to trust_t1_item5.
library(sjPlot)
d |>
  select(trust_t1_item1:trust_t1_item5) |>
  tab_corr()
tab_corr()
               | trust_t1_item1 | trust_t1_item2 | trust_t1_item3 | trust_t1_item4 | trust_t1_item5
trust_t1_item1 |                | 0.335***       | -0.707***      | 0.715***       | 0.801***
trust_t1_item2 | 0.335***       |                | -0.301***      | 0.288***       | 0.321***
trust_t1_item3 | -0.707***      | -0.301***      |                | -0.633***      | -0.756***
trust_t1_item4 | 0.715***       | 0.288***       | -0.633***      |                | 0.744***
trust_t1_item5 | 0.801***       | 0.321***       | -0.756***      | 0.744***       |
Computed correlation used pearson-method with listwise-deletion.
This is great for getting a quick overview of the relationships between multiple variables. If you need to get the exact p-values, you can also use tab_corr(p.numeric = TRUE).
Conditions and assumptions
The correlation test can be used when you have two numeric variables and you want to know if there is a relationship between them. There is no dependent or independent variable in the correlation test, because we are just looking at the relationship between the variables.
There are four main conditions that need to be met for the correlation test to be valid:
- Continuous variables: Both variables should be numeric, and measured on an interval or ratio scale.
- Linearity: The relationship between the variables should be linear.
- Normality: The variables should be (roughly) normally distributed.
- No outliers: Extreme values can have a big impact on the correlation coefficient.
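A quick way to eyeball these conditions is to plot the data before running the test. This is just a sketch on simulated x and y; with the practice data you would plot d$news_consumption and d$trust_t1 instead:

```r
# Simulated data for illustration
set.seed(1)
x <- rnorm(200)
y <- 0.3 * x + rnorm(200)

plot(x, y)  # linearity and outliers: points should form a straight, even band
hist(x)     # normality: histograms should look roughly bell-shaped
hist(y)
```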
If these conditions are not met, you might want to use Spearman’s \(\rho\) or Kendall’s \(\tau\) instead of the Pearson correlation coefficient. Spearman’s \(\rho\) looks at the rank order of the variables instead of the specific scores, which makes it suitable for ordinal data, and makes it more robust to outliers, non-linear relationships, and non-normality. Kendall’s \(\tau\) is similar to Spearman’s \(\rho\), but can be preferable if there are many ties in the data (when two or more observations have the same value, due to which they can’t be ranked perfectly).
Spearman’s \(\rho\) and Kendall’s \(\tau\)
To use Spearman’s \(\rho\) or Kendall’s \(\tau\), you need to specify the method argument in the cor.test function, or the corr.method argument in the tab_corr function.
cor.test(d$news_consumption, d$trust_t1, method = "spearman")
cor.test(d$news_consumption, d$trust_t1, method = "kendall")
d |> select(trust_t1_item1:trust_t1_item5) |> tab_corr(corr.method = "spearman")
d |> select(trust_t1_item1:trust_t1_item5) |> tab_corr(corr.method = "kendall")
How to report
For APA style reporting of the correlation test you need to know the correlation coefficient, p-value, and the degrees of freedom. You can find all these values in the output of the cor.test function.1
The formula for reporting the results of a correlation test in APA style is:
r(degrees of freedom) = correlation coefficient, p = p-value.
- The correlation coefficient is rounded to two decimal places, and the number before the decimal point is omitted if zero (.11 instead of 0.11).
- The p-value is reported for the smallest significance level (p < 0.05, p < 0.01 or p < 0.001), or in full with three decimal places if not significant (p = 0.123).
For example, if we observe a correlation coefficient of \(r = 0.107\) with a p-value of \(p = 0.009\) and \(298\) degrees of freedom, we would report this as:
There is a positive correlation between age and trust in journalists, \(r(298) = .11\), \(p < 0.01\).
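If you report many correlations, you can let R build the APA string for you. The apa_cor helper below is hypothetical (not part of any package); it simply applies the rounding rules above to a cor.test-style result:

```r
# Hypothetical helper: format a correlation result in APA style
apa_cor <- function(res) {
  # round r to two decimals and drop the leading zero (.11 instead of 0.11)
  r <- sub("^(-?)0\\.", "\\1.", sprintf("%.2f", unname(res$estimate)))
  p <- res$p.value
  p_txt <- if (p < 0.001) "p < 0.001" else if (p < 0.01) "p < 0.01" else
    if (p < 0.05) "p < 0.05" else sprintf("p = %.3f", p)
  sprintf("r(%d) = %s, %s", as.integer(unname(res$parameter)), r, p_txt)
}

# The numbers from the example above; a real cor.test result has the
# same $estimate, $parameter, and $p.value elements
res <- list(estimate = 0.107, parameter = 298, p.value = 0.009)
apa_cor(res)  # "r(298) = .11, p < 0.01"
```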
Footnotes
1. When using tab_corr, you can use the p.numeric = TRUE argument to get the p-values. The degrees of freedom is always the number of observations minus 2, but be careful not to count missing (NA) values.