library(tidyverse)
<- read_csv("https://tinyurl.com/R-practice-data") d
Select and rename
In this tutorial we use the tidyverse
package and the simulated practice data.
Selecting columns with select
Often you do not need to use all columns in your data, or you only need a subset of the columns for a specific analysis. You can do this with the select
function.
First, let’s see what columns are in our data using the colnames
function, which returns the column names of a data frame:
colnames(d)
[1] "id" "age" "political_orientation"
[4] "political_interest" "np_subscription" "news consumption"
[7] "experiment_group" "trust_t1" "trust_t2"
[10] "trust_t1_item1" "trust_t1_item2" "trust_t1_item3"
[13] "trust_t1_item4" "trust_t1_item5" "trust_t2_item1"
[16] "trust_t2_item2" "trust_t2_item3" "trust_t2_item4"
[19] "trust_t2_item5"
Selecting specific columns
The simplest way of using select
is to explicitly specify the columns you want to keep. The first argument is the tibble (data frame) you want to select from, and the following arguments are the columns you want to keep. The following code returns a new tibble with only the columns id
, age
, and np_subscription
.
select(d, id, age, np_subscription)
# A tibble: 600 × 3
id age np_subscription
<dbl> <dbl> <chr>
1 1 52 yes
2 2 49 yes
3 3 44 yes
4 4 34 no
5 5 28 no
6 6 58 yes
7 7 32 yes
8 8 57 yes
9 9 36 yes
10 10 29 no
# ℹ 590 more rows
As with any output, you can assign it to a variable to store the result.
<- select(d, id, age, np_subscription) ds
Here we create a new tibble ds
that only contains the columns id
, age
, and np_subscription
. Whenever you import data into R, it is often a good idea to first select only the columns you need. While you could also overwrite the original tibble (d <- select(d, ...)
) it is usually better to create a new tibble. There is no harm in having multiple tibbles in your environment, and if you give clear names to your tibbles, it will make your code more readable.
Selecting a range of columns
You can also specify a range of columns using the syntax first_column:last_column
. For example, to select all columns from experiment_group
to trust_t2
:
select(d, experiment_group:trust_t2)
This will return a new tibble with only the columns experiment_group
, trust_t1
, and trust_t2
.
Note that here we did not assign the result to anything. So in this case R will just print the result to the console, but not store it in a variable.
Selecting and renaming columns
When you select a column, you can also rename it using the syntax new_name = old_name
. The following code selects the columns experiment_group
, trust_t1
, and trust_t2
, and renames them to group
, trust_before
, and trust_after
:
select(d, group = experiment_group,
trust_before = trust_t1,
trust_after = trust_t2)
Selecting columns that have spaces in the name
Sometimes columns names have spaces in them. This is a bit annoying to work with in R, because you need to then tell R where a name starts and ends. You can do this by using backticks (reverse quotes) around the column name. In our practice data, we need this to select the news consumption
column. It is then often smart to immediately rename the column to something without spaces, such as just replacing them with underscores:
select(d, news_consumption = `news consumption`)
Dropping columns
Instead of selecting which column to keep, you can also specify which columns to drop. You can do this by adding a minus sign in front of the column name. The following code drops the columns np_subscription
and trust_t1
:
select(d, -np_subscription, -trust_t1)
This will return a new tibble with all columns except np_subscription
and trust_t1
.
Renaming columns with rename
Sometimes you only want to rename columns without selecting or dropping any. You can do this with the rename
function, which works similarly to how you rename columns with select
:
rename(d, group = experiment_group,
trust_before = trust_t1,
trust_after = trust_t2)
In this case, we do rename the columns, but without dropping all the other columns.