The tidyverse

A swiss army knife for data management

Installing the tidyverse

The tidyverse is a collection of R packages that makes it much easier to import, manage and visualize data. To use the tidyverse, you only need to load the tidyverse package, and it will automatically load the core tidyverse packages for you.

Like any normal package, you need to first install it once:

install.packages('tidyverse')

Then, in every script in which you use the package, you open it with library:

library(tidyverse)

When you first run this code, it will print a message that shows you all the packages that it opened for you. Some of the most important ones that we’ll be using are:

  • tibble. An optimized way for structuring rectangular data (basically: a spreadsheet of rows and columns)
  • dplyr. Functions for manipulating tibbles: select and rename columns, filter rows, mutate values, etc.
  • readr. Read data into R.
  • ggplot2. One of the best visualization tools out there. Check out the gallery

When opening the tidyverse, and when opening packages in general, you may see a message about conflicts. A very common one for the tidyverse is:

✖ dplyr::filter() masks stats::filter()
✖ dplyr::lag()    masks stats::lag()

So what does this mean, and should we be worried?

Since anyone can write new packages for R, it can happen that two packages provide functions with the same name. In this example, we see that the filter function exists in both the dplyr package (which we opened by opening the tidyverse) and the stats package (which is included in base R). So now R needs to decide which version of the function to use when you type filter(). In this case, the message says that dplyr::filter() masks stats::filter(), meaning that R will now use the dplyr version.

In practice, this will rarely be a problem, because you seldom need two versions of a function in the same script. But if you ever do, there is a simple solution. Instead of just using filter(), you can type dplyr::filter() to specifically use this version. In the following code, we use this notation to specifically open the help page for dplyr::filter and stats::filter.

?dplyr::filter
?stats::filter
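To make the difference concrete, here is a minimal sketch that calls both versions explicitly. The small tibble and the names tall and smoothed are made up for illustration only. The two functions do entirely different things, which is why it matters which one R picks.

```r
library(tidyverse)

# A small example tibble (hypothetical data, for illustration only)
d <- tibble(resp_id = c(1, 2, 3, 4),
            height  = c(176, 165, 172, 160))

# dplyr::filter() keeps the rows of a data frame that match a condition
tall <- dplyr::filter(d, height > 170)

# stats::filter() is unrelated: it applies a linear filter to a numeric
# vector or time series. Here it computes a 3-point moving average.
smoothed <- stats::filter(c(1, 2, 3, 4, 5), rep(1/3, 3))

tall
smoothed
```

After library(tidyverse), a plain filter() call behaves like dplyr::filter(). The stats version stays available, but only through the stats:: prefix.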

Many of the things that the tidyverse allows you to do are also possible in base R (i.e. the basic installation of R). Base R also provides functions for importing, managing and visualizing data. So why do we need the tidyverse?

The tidyverse is an opinionated framework, which means that it doesn’t just enable you to do things, but also suggests how you should do things. The authors have thought long and hard about how to make data management easy, effective and intuitive (they have even written papers about it). This not only makes the tidyverse much easier and more intuitive to learn, but also encourages everyone to write their code in the same way, which improves transparency and shareability.

This is different from base R, which is designed to be a highly flexible programming language that allows you to do almost anything. Accordingly, it is still worthwhile to learn base R at some point if you want to specialize more in computational research methods. But for our Communication Science program, and for many data science applications in general, you can do all your data management in the tidyverse.

Data management with the tidyverse

The tidyverse is built around the concept of tidy data (Wickham 2014). The main principles of tidy data are:

  1. Each variable must have its own column.
  2. Each observation must have its own row.
  3. Each value must have its own cell.

This type of data is also called a data frame, or a spreadsheet. What the tidyverse does is provide a set of tools that make it easy to work with this type of data. At the core of this is the tibble data structure. As a simple example, the following code creates a tibble containing respondents, their gender, and their height. We’ll call our tibble d (short for data).

d <- tibble(resp_id = c(  1,   2,   3,   4), 
            gender  = c("M", "M", "F", "F"), 
            height  = c(176, 165, 172, 160))

The name d now refers to a tibble with 4 rows and 3 columns. Like any name in R, we can print it to see what it looks like:

d
# A tibble: 4 × 3
  resp_id gender height
    <dbl> <chr>   <dbl>
1       1 M         176
2       2 M         165
3       3 F         172
4       4 F         160

The vast majority of data that we work with in the social sciences can be structured in this way. The rows typically represent our units of analysis (e.g., respondents, participants, texts, etc.), and the columns represent the variables that we measure on these units. This makes it imperative for us to learn how we can manage this type of data effectively. We need to be able to select columns, filter rows, mutate values, and summarize data. Sometimes we also need to pivot the data, or join it with other data. In this chapter you will learn how to do all of this with the tidyverse.
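As a first taste of these operations, here is a minimal sketch that applies each of the four basic verbs to the tibble d from above. The result names (heights, tall, with_m, by_gender) and the height_m column are made up for illustration.

```r
library(tidyverse)

d <- tibble(resp_id = c(  1,   2,   3,   4),
            gender  = c("M", "M", "F", "F"),
            height  = c(176, 165, 172, 160))

# select: keep only the resp_id and height columns
heights <- select(d, resp_id, height)

# filter: keep only the rows where height is above 170
tall <- filter(d, height > 170)

# mutate: add a new column with height in meters
with_m <- mutate(d, height_m = height / 100)

# summarize: compute the mean height per gender
by_gender <- summarize(group_by(d, gender), mean_height = mean(height))

by_gender
```

Each function takes a tibble as its first argument and returns a new tibble, leaving d itself unchanged.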


References

Wickham, Hadley. 2014. “Tidy Data.” Journal of Statistical Software 59 (10): 1–23. https://doi.org/10.18637/jss.v059.i10.