Mutate and recode

Create new variables and recode existing ones

Required packages and data for this tutorial

In this tutorial we use the tidyverse package and the simulated practice data.

library(tidyverse)
d <- read_csv("https://tinyurl.com/R-practice-data")

Creating and modifying columns with mutate()

The mutate function allows you to create or modify columns in a data frame. You just specify which columns you want to create or modify, and then provide an expression for how to compute the new values. In this expression you can refer to other columns in the table, and use any R functions you like, which makes mutate a very powerful tool.

Creating new variables

To create a new variable you use the following syntax:

mutate(d, new_variable = expression)

The expression can be anything that returns a valid column. For example, in the practice data we have the columns trust_t1 and trust_t2, which represent trust in journalists before and after the experiment. We can create a new variable trust_change that represents the change in trust from before to after the experiment.

d <- mutate(d, trust_change = trust_t2 - trust_t1)

select(d, trust_t1, trust_t2, trust_change)

Mutate existing variables

To mutate an existing variable, you can simply overwrite the column with the same name. For example, let’s say that we want to standardize the trust_change variable that we just made. We can standardize a variable with the scale function, so we can use that inside of mutate.

d <- mutate(d, trust_change = scale(trust_change))

Now the trust_change variable is standardized, which means that it has a mean of 0 and a standard deviation of 1. A nice way to get a quick overview of the distribution of a single variable is to plot a histogram.

hist(d$trust_change)

Recoding variables

Recoding variables means changing the values of a variable based on some condition. This is a common operation in data management, because often you want to change the values of a variable to make them more interpretable, correct errors, or prepare the data for analysis. To recode variables in R, you can use the mutate function in combination with the case_match and case_when functions.

Recode with `case_match`

The case_match function is a simple way to recode specific values into new values. For example, in our practice data we have a column with the experimental groups, which are control, positive, and negative. Let’s say we want to clarify that positive means positive_movie, and negative means negative_movie. We could then use case_match to change these values.

d <- mutate(d, experiment_group = case_match(experiment_group,
                                            "positive" ~ "positive_movie",
                                            "negative" ~ "negative_movie",
                                            .default = experiment_group))

Here we say: overwrite the experiment_group column with output of the case_match function. Inside the case_match function, we specify three things:

The column we want to recode (experiment_group).
The conditions for recoding the values. We have two conditions:
1. If value is "positive", recode into "positive_movie".
2. If value is "negative", recode into "negative_movie".
We specify a .default value for values that are not matched in the conditions. Here we say that in that case we want to use the current value of the experiment_group column.

If you check the unique values of the experiment_group column, you will see that positive and negative have been changed to positive_movie and negative_movie, and that control remains the same.

unique(d$experiment_group)

[1] "negative_movie" "control"        "positive_movie"

More flexible recoding with `case_when`

The case_match function is great if you need to recode many values, but sometimes you need more flexibility. For example, if we want to recode the age variable into categories (e.g., <= 20, 20-30), it would be really tiresome to recode every individual age value. With the case_when function, we can specify the conditions using logical expressions. Each condition is evaluated in order, and the first one that is TRUE is used.

d <- mutate(d, age_category = case_when(
  age < 20 ~ "<= 20",
  age < 30 ~ "20-30",
  age < 40 ~ "30-40",
  age < 50 ~ "40-50",
  age < 60 ~ "50-60",
  .default = ">= 60"
))

table(d$age_category)


<= 20 >= 60 20-30 30-40 40-50 50-60 
   12    78   124   138   129   119

Binary cases with `if_else`

If you only have two categories, you can use the if_else function. You could technically also use case_when for this, but if_else is more concise and easier to read. The syntax for if_else is:

if_else(condition, value_if_true, value_if_false)

A common use case is that sometimes you want to perform an operation only on a subset of the data. For example, in our data there are a few participants that accidentally entered their birthyear instead of their age. To correct this, we can use if_else to set the age to 2024 - birthyear, but only if the number the participants entered is above 1000 (which is only the case if it’s a birthyear).

d <- mutate(d, age = if_else(age > 1000, 2024 - age, age))

So this reads: if the age is above 1000, return 2024 - age, otherwise return the current age.