Select and rename

Required packages and data for this tutorial

In this tutorial we use the tidyverse package and the simulated practice data.

library(tidyverse)
d <- read_csv("https://tinyurl.com/R-practice-data")

Selecting columns with select

Often you do not need to use all columns in your data, or you only need a subset of the columns for a specific analysis. You can do this with the select function.

First, let’s see what columns are in our data using the colnames function, which returns the column names of a data frame:

colnames(d)
 [1] "id"                    "age"                   "political_orientation"
 [4] "political_interest"    "np_subscription"       "news consumption"     
 [7] "experiment_group"      "trust_t1"              "trust_t2"             
[10] "trust_t1_item1"        "trust_t1_item2"        "trust_t1_item3"       
[13] "trust_t1_item4"        "trust_t1_item5"        "trust_t2_item1"       
[16] "trust_t2_item2"        "trust_t2_item3"        "trust_t2_item4"       
[19] "trust_t2_item5"       

Selecting specific columns

The simplest way of using select is to explicitly specify the columns you want to keep. The first argument is the tibble (data frame) you want to select from, and the following arguments are the columns you want to keep. The following code returns a new tibble with only the columns id, age, and np_subscription.

select(d, id, age, np_subscription)
# A tibble: 600 × 3
      id   age np_subscription
   <dbl> <dbl> <chr>          
 1     1    52 yes            
 2     2    49 yes            
 3     3    44 yes            
 4     4    34 no             
 5     5    28 no             
 6     6    58 yes            
 7     7    32 yes            
 8     8    57 yes            
 9     9    36 yes            
10    10    29 no             
# ℹ 590 more rows

As with any output, you can assign it to a variable to store the result.

ds <- select(d, id, age, np_subscription)

Here we create a new tibble ds that only contains the columns id, age, and np_subscription. Whenever you import data into R, it is often a good idea to first select only the columns you need. While you could also overwrite the original tibble (d <- select(d, ...)) it is usually better to create a new tibble. There is no harm in having multiple tibbles in your environment, and if you give clear names to your tibbles, it will make your code more readable.

Selecting a range of columns

You can also specify a range of columns using the syntax first_column:last_column. For example, to select all columns from experiment_group to trust_t2:

select(d, experiment_group:trust_t2)

This will return a new tibble with only the columns experiment_group, trust_t1, and trust_t2.

Note that here we did not assign the result to anything. So in this case R will just print the result to the console, but not store it in a variable.

Selecting and renaming columns

When you select a column, you can also rename it using the syntax new_name = old_name. The following code selects the columns experiment_group, trust_t1, and trust_t2, and renames them to group, trust_before, and trust_after:

select(d, group = experiment_group, 
          trust_before = trust_t1, 
          trust_after = trust_t2)

Selecting columns that have spaces in the name

Sometimes columns names have spaces in them. This is a bit annoying to work with in R, because you need to then tell R where a name starts and ends. You can do this by using backticks (reverse quotes) around the column name. In our practice data, we need this to select the news consumption column. It is then often smart to immediately rename the column to something without spaces, such as just replacing them with underscores:

select(d, news_consumption = `news consumption`)

Dropping columns

Instead of selecting which column to keep, you can also specify which columns to drop. You can do this by adding a minus sign in front of the column name. The following code drops the columns np_subscription and trust_t1:

select(d, -np_subscription, -trust_t1)

This will return a new tibble with all columns except np_subscription and trust_t1.

Renaming columns with rename

Sometimes you only want to rename columns without selecting or dropping any. You can do this with the rename function, which works similarly to how you rename columns with select:

rename(d, group = experiment_group, 
          trust_before = trust_t1, 
          trust_after = trust_t2)

In this case, we do rename the columns, but without dropping all the other columns.

Back to top