library(tidyverse)
d <- read_csv("https://tinyurl.com/R-practice-data")
Mutate and recode
Create new variables and recode existing ones
In this tutorial we use the tidyverse package and the simulated practice data.
Creating and modifying columns with mutate()
The mutate function allows you to create or modify columns in a data frame. You specify which columns you want to create or modify, and provide an expression that computes the new values. In this expression you can refer to other columns in the table and use any R function you like, which makes mutate a very powerful tool.
Creating new variables
To create a new variable you use the following syntax:
mutate(d, new_variable = expression)
The expression can be anything that returns a valid column: a vector with one value per row, or a single value that is repeated for every row. For example, in the practice data we have the columns trust_t1 and trust_t2, which represent trust in journalists before and after the experiment. We can create a new variable trust_change that represents the change in trust from before to after the experiment.
d <- mutate(d, trust_change = trust_t2 - trust_t1)
select(d, trust_t1, trust_t2, trust_change)
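The same pattern can be tried on a small self-contained table. This is only a sketch: the tibble toy, its values, and the improved column are made up for illustration and are not part of the practice data. It also shows that one mutate call can create several columns, and that later columns can refer to ones created earlier in the same call.

```r
library(dplyr)

# Hypothetical toy data: three participants with trust measured twice
toy <- tibble(
  id       = 1:3,
  trust_t1 = c(4, 6, 5),
  trust_t2 = c(5, 6, 3)
)

# mutate() returns a new table, so assign the result to keep the new columns
toy <- mutate(toy,
  trust_change = trust_t2 - trust_t1,  # difference per row
  improved     = trust_change > 0      # uses the column created just above
)

select(toy, trust_t1, trust_t2, trust_change, improved)
```

Note that mutate does not change d (or toy) in place. If you do not assign the result, the new column is only printed and then lost.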
Mutate existing variables
To mutate an existing variable, you can simply overwrite the column with the same name. For example, let’s say that we want to standardize the trust_change variable that we just made. We can standardize a variable with the scale function, so we can use that inside of mutate.
d <- mutate(d, trust_change = scale(trust_change))
Now the trust_change variable is standardized, which means that it has a mean of 0 and a standard deviation of 1. A nice way to get a quick overview of the distribution of a single variable is to plot a histogram.
hist(d$trust_change)
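One detail that may be worth knowing: scale returns a one-column matrix rather than a plain vector. That works fine with hist, but it can be surprising in later steps. The sketch below uses a made-up vector x (not the practice data) to check the standardization and to convert the result back to a plain numeric vector with as.numeric.

```r
# Made-up values for illustration
x <- c(2, 4, 6, 8)

z <- scale(x)            # one-column matrix with centering/scaling attributes
is.matrix(z)             # TRUE

z_vec <- as.numeric(z)   # drop the matrix structure and attributes
mean(z_vec)              # approximately 0
sd(z_vec)                # approximately 1
```

Inside mutate you could apply the same idea by wrapping the call, as in as.numeric(scale(trust_change)), if you prefer an ordinary numeric column.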
Recoding variables
Recoding variables means changing the values of a variable based on some condition. This is a common operation in data management, because often you want to change the values of a variable to make them more interpretable, correct errors, or prepare the data for analysis. To recode variables in R, you can use the mutate function in combination with the case_match and case_when functions.
Recode with case_match
The case_match function is a simple way to recode specific values into new values. For example, in our practice data we have a column with the experimental groups, which are control, positive, and negative. Let’s say we want to clarify that positive means positive_movie, and negative means negative_movie. We could then use case_match to change these values.
d <- mutate(d, experiment_group = case_match(experiment_group,
  "positive" ~ "positive_movie",
  "negative" ~ "negative_movie",
  .default = experiment_group))
Here we say: overwrite the experiment_group column with the output of the case_match function. Inside the case_match function, we specify three things:
- The column we want to recode (experiment_group).
- The conditions for recoding the values. We have two conditions:
  - If the value is "positive", recode it into "positive_movie".
  - If the value is "negative", recode it into "negative_movie".
- A .default value for values that are not matched by any condition. Here we say that in that case we want to use the current value of the experiment_group column.
If you check the unique values of the experiment_group column, you will see that positive and negative have been changed to positive_movie and negative_movie, and that control remains the same.
unique(d$experiment_group)
[1] "negative_movie" "control" "positive_movie"
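Two further behaviours of case_match are easy to try on a standalone vector. This is a sketch, assuming dplyr 1.1.0 or later (where case_match is available). The vector group and the labels "movie" and "no_movie" are made up for illustration.

```r
library(dplyr)

# Hypothetical group labels, mirroring the values in the practice data
group <- c("control", "positive", "negative", "positive")

# Several old values can map to one new value by listing them in a vector
saw_movie <- case_match(group,
  c("positive", "negative") ~ "movie",
  "control" ~ "no_movie"
)

# Without .default, values that match no condition become NA
partial <- case_match(group, "positive" ~ "positive_movie")

saw_movie
partial
```

The second call shows why the .default argument matters: if you leave it out, every value you did not list is replaced by NA instead of being kept.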
More flexible recoding with case_when
The case_match function is great if you need to recode many values, but sometimes you need more flexibility. For example, if we want to recode the age variable into categories (e.g., <= 20, 20-30), it would be really tiresome to recode every individual age value. With the case_when function, we can specify the conditions using logical expressions. Each condition is evaluated in order, and the first one that is TRUE is used.
d <- mutate(d, age_category = case_when(
  age < 20 ~ "<= 20",
  age < 30 ~ "20-30",
  age < 40 ~ "30-40",
  age < 50 ~ "40-50",
  age < 60 ~ "50-60",
  .default = ">= 60"
))
table(d$age_category)
<= 20 >= 60 20-30 30-40 40-50 50-60
12 78 124 138 129 119
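Because the first TRUE condition wins, the order of the conditions matters. The sketch below illustrates this on a made-up vector of ages, assuming dplyr 1.1.0 or later for the .default argument. The labels and the names in_order and swapped are hypothetical.

```r
library(dplyr)

# Made-up ages for illustration
age <- c(18, 20, 29, 30, 65)

# Conditions are checked top to bottom; the first TRUE one is used
in_order <- case_when(
  age < 20 ~ "under 20",
  age < 30 ~ "20-29",
  .default = "30 or older"
)

# Putting the broader condition first swallows the narrower one:
# "under 20" can never be reached, because age < 30 is already TRUE
swapped <- case_when(
  age < 30 ~ "20-29",
  age < 20 ~ "under 20",
  .default = "30 or older"
)

in_order
swapped
```

Also note that with strict < comparisons, an age of exactly 20 falls into the second category, not the first.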
Binary cases with if_else
If you only have two categories, you can use the if_else function. You could technically also use case_when for this, but if_else is more concise and easier to read. The syntax for if_else is:
if_else(condition, value_if_true, value_if_false)
A common use case is performing an operation on only a subset of the data. For example, in our data there are a few participants who accidentally entered their birth year instead of their age. To correct this, we can use if_else to set the age to 2024 - birthyear, but only if the number the participant entered is above 1000 (which is only the case if it’s a birth year).
d <- mutate(d, age = if_else(age > 1000, 2024 - age, age))
So this reads: if the age is above 1000, return 2024 - age, otherwise return the current age.
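To see the correction at work outside the practice data, here is a minimal sketch on a made-up vector in which one entry is a birth year. The names age and fixed and the values are hypothetical.

```r
library(dplyr)

# Made-up entries: two real ages and one birth year typed in by mistake
age <- c(25, 1990, 41)

# Only the entry above 1000 is converted; the others are returned unchanged
fixed <- if_else(age > 1000, 2024 - age, age)

fixed
```

Unlike base R's ifelse, if_else checks that the two possible results have compatible types, so mixing, say, a number and a character string raises an error instead of silently converting the column.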