Data cleaning

garbage in, data out

Required packages and data for this tutorial

In this tutorial we use the tidyverse and summarytools packages, and the simulated practice data

library(tidyverse)
library(summarytools)
d <- read_csv("https://tinyurl.com/R-practice-data")

Inspect and clean your data

A wise saying in data science is “garbage in, garbage out”.1 If you don’t take the time to inspect your data, and clean it up where necessary, you run the risk of ending up with incorrect results.

In this section we’ll show you some techniques for inspecting your data, and how you can tidy it up using the data management techniques we’ve learned so far.

First, get to know your data

When you start working with a new dataset, the first thing you should do is get to know it. What columns does it contain? What are their data types? How many rows are there?

If your tibble doesn’t have a huge number of columns, you can simply print it to the console to get a quick overview.

d
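If there are more columns than fit in your console, the glimpse function (loaded as part of the tidyverse) is a handy alternative: it prints one line per column, showing its name, data type, and first few values. A quick sketch:

glimpse(d)   # one line per column: name, data type, and first values
nrow(d)      # number of rows
ncol(d)      # number of columns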

Another nice way to look at the data is to use the View function (or to click on the name of the tibble in the Environment tab in the top-right corner of RStudio). This will open a new tab in RStudio where you can see the data in a nice table. Here you can also filter and sort columns, and search for specific values. Do note that if there are many columns, RStudio might not show them all at once, but you can click through to see the rest.

View(d)

Viewing a summary of every column

Viewing the top of your data is a good start, but many problems are hard to spot this way: missing values, incorrect data types, or outliers can easily hide from view. So in addition you will want to view summaries of all the columns that you intend to use in your analysis.2

One way to get a quick summary is to use the summary function from base R. This is a generic function that you can use on many types of objects, and when you use it on a tibble (or regular data frame) it will give you a summary of each column.

summary(d)

The summary includes various useful statistics, like the minimum and maximum values, the median, and the number of missing values (NA). However, this output can be a bit hard to read, especially when you have many columns. A great alternative is the summarytools package, which can create summaries in a nice table, including a small graph for each column that gives a quick impression of its distribution.

library(summarytools)
dfSummary(d)

This will print the summary in your console, but you can also render it in a more readable format with the view function (with a lowercase v, not to be confused with RStudio’s View).

dfSummary(d) |> view()

Data Frame Summary: d
Dimensions: 600 x 18
Duplicates: 0
[The rendered table also includes a small distribution graph per column, omitted here.]

1. id [numeric]
   Mean (sd): 300.5 (173.3); min ≤ med ≤ max: 1 ≤ 300.5 ≤ 600; IQR (CV): 299.5 (0.6); 600 distinct values
   Valid: 600 (100.0%), Missing: 0 (0.0%)

2. age [numeric]
   Mean (sd): 51.3 (137.7); min ≤ med ≤ max: 17 ≤ 41 ≤ 1987; IQR (CV): 22 (2.7); 52 distinct values
   Valid: 595 (99.2%), Missing: 5 (0.8%)

3. political_orientation [character]
   center: 202 (33.7%), left: 196 (32.7%), right: 202 (33.7%)
   Valid: 600 (100.0%), Missing: 0 (0.0%)

4. np_subscription [character]
   no: 263 (43.8%), yes: 337 (56.2%)
   Valid: 600 (100.0%), Missing: 0 (0.0%)

5. news consumption [numeric]
   Mean (sd): 10.1 (3.2); min ≤ med ≤ max: 0 ≤ 11 ≤ 17; IQR (CV): 4 (0.3); 18 distinct values
   Valid: 600 (100.0%), Missing: 0 (0.0%)

6. experiment_group [character]
   control: 200 (33.3%), negative: 200 (33.3%), positive: 200 (33.3%)
   Valid: 600 (100.0%), Missing: 0 (0.0%)

7. trust_t1 [numeric]
   Mean (sd): 3.9 (1); min ≤ med ≤ max: 1.4 ≤ 3.8 ≤ 8.2; IQR (CV): 1.4 (0.3); 28 distinct values
   Valid: 600 (100.0%), Missing: 0 (0.0%)

8. trust_t2 [numeric]
   Mean (sd): 3.8 (1.3); min ≤ med ≤ max: 1 ≤ 3.8 ≤ 8; IQR (CV): 2 (0.4); 34 distinct values
   Valid: 600 (100.0%), Missing: 0 (0.0%)

9. trust_t1_item1 [numeric]
   Mean (sd): 4.1 (1.2); min ≤ med ≤ max: 1 ≤ 4 ≤ 8; IQR (CV): 2 (0.3)
   Freqs: 1: 2 (0.3%), 2: 46 (7.7%), 3: 133 (22.2%), 4: 206 (34.3%), 5: 145 (24.2%), 6: 57 (9.5%), 7: 10 (1.7%), 8: 1 (0.2%)
   Valid: 600 (100.0%), Missing: 0 (0.0%)

10. trust_t1_item2 [numeric]
    Mean (sd): 2.7 (1.5); min ≤ med ≤ max: 1 ≤ 3 ≤ 7; IQR (CV): 3 (0.5)
    Freqs: 1: 161 (26.8%), 2: 130 (21.7%), 3: 142 (23.7%), 4: 95 (15.8%), 5: 44 (7.3%), 6: 24 (4.0%), 7: 4 (0.7%)
    Valid: 600 (100.0%), Missing: 0 (0.0%)

11. trust_t1_item3 [numeric]
    Mean (sd): 6.7 (1.3); min ≤ med ≤ max: 2 ≤ 7 ≤ 10; IQR (CV): 2 (0.2)
    Freqs: 2: 1 (0.2%), 3: 6 (1.0%), 4: 17 (2.8%), 5: 83 (13.8%), 6: 154 (25.7%), 7: 179 (29.8%), 8: 117 (19.5%), 9: 37 (6.2%), 10: 6 (1.0%)
    Valid: 600 (100.0%), Missing: 0 (0.0%)

12. trust_t1_item4 [numeric]
    Mean (sd): 3.7 (1.1); min ≤ med ≤ max: 1 ≤ 4 ≤ 8; IQR (CV): 1 (0.3)
    Freqs: 1: 12 (2.0%), 2: 69 (11.5%), 3: 184 (30.7%), 4: 207 (34.5%), 5: 101 (16.8%), 6: 24 (4.0%), 7: 2 (0.3%), 8: 1 (0.2%)
    Valid: 600 (100.0%), Missing: 0 (0.0%)

13. trust_t1_item5 [numeric]
    Mean (sd): 4.6 (1.3); min ≤ med ≤ max: 1 ≤ 5 ≤ 9; IQR (CV): 2 (0.3)
    Freqs: 1: 4 (0.7%), 2: 16 (2.7%), 3: 99 (16.5%), 4: 151 (25.2%), 5: 175 (29.2%), 6: 124 (20.7%), 7: 28 (4.7%), 8: 2 (0.3%), 9: 1 (0.2%)
    Valid: 600 (100.0%), Missing: 0 (0.0%)

14. trust_t2_item1 [numeric]
    Mean (sd): 4 (1.5); min ≤ med ≤ max: 1 ≤ 4 ≤ 8; IQR (CV): 2 (0.4)
    Freqs: 1: 28 (4.7%), 2: 80 (13.3%), 3: 115 (19.2%), 4: 146 (24.3%), 5: 144 (24.0%), 6: 61 (10.2%), 7: 20 (3.3%), 8: 6 (1.0%)
    Valid: 600 (100.0%), Missing: 0 (0.0%)

15. trust_t2_item2 [numeric]
    Mean (sd): 2.7 (1.5); min ≤ med ≤ max: 1 ≤ 3 ≤ 8; IQR (CV): 3 (0.6)
    Freqs: 1: 176 (29.3%), 2: 121 (20.2%), 3: 127 (21.2%), 4: 89 (14.8%), 5: 53 (8.8%), 6: 29 (4.8%), 7: 4 (0.7%), 8: 1 (0.2%)
    Valid: 600 (100.0%), Missing: 0 (0.0%)

16. trust_t2_item3 [numeric]
    Mean (sd): 6.8 (1.7); min ≤ med ≤ max: 1 ≤ 7 ≤ 10; IQR (CV): 2 (0.3)
    Freqs: 1: 1 (0.2%), 2: 4 (0.7%), 3: 17 (2.8%), 4: 32 (5.3%), 5: 70 (11.7%), 6: 122 (20.3%), 7: 140 (23.3%), 8: 107 (17.8%), 9: 71 (11.8%), 10: 36 (6.0%)
    Valid: 600 (100.0%), Missing: 0 (0.0%)

17. trust_t2_item4 [numeric]
    Mean (sd): 3.6 (1.4); min ≤ med ≤ max: 1 ≤ 4 ≤ 8; IQR (CV): 1 (0.4)
    Freqs: 1: 39 (6.5%), 2: 97 (16.2%), 3: 135 (22.5%), 4: 180 (30.0%), 5: 108 (18.0%), 6: 26 (4.3%), 7: 12 (2.0%), 8: 3 (0.5%)
    Valid: 600 (100.0%), Missing: 0 (0.0%)

18. trust_t2_item5 [numeric]
    Mean (sd): 4.5 (1.7); min ≤ med ≤ max: 1 ≤ 5 ≤ 9; IQR (CV): 3 (0.4)
    Freqs: 1: 21 (3.5%), 2: 52 (8.7%), 3: 100 (16.7%), 4: 110 (18.3%), 5: 154 (25.7%), 6: 102 (17.0%), 7: 40 (6.7%), 8: 14 (2.3%), 9: 7 (1.2%)
    Valid: 600 (100.0%), Missing: 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.4.1), 2024-09-16

If you’re working in Quarto or R Markdown, the normal way of using summarytools doesn’t create a nice table. Instead, you need to tell Quarto to render the output as HTML, by setting the method argument of print to 'render'.

print(dfSummary(d),
      method = 'render',
      max.tbl.height = 500)

Note that we also set the max.tbl.height argument to 500, to make sure that very large tables are put inside a scrollable window.

What to look for in the summary

There are a few things in particular that you should look out for:

  • Missing values: Are there any columns with a lot of missing values? If so, you might need to think about how to deal with these.
  • Data types: Are the data types of the columns correct? For example, are numbers stored as numbers, and not as text?
  • Outliers: Are there any columns with very high or very low values? These might be errors in the data.
  • Distribution: Are the values in the columns distributed in a way that you would expect? For example, if you have a column with ages, are there any values that are negative, or over 100?
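Several of these checks can also be performed directly in code. A minimal sketch using base R functions:

colSums(is.na(d))            # number of missing values per column
sapply(d, class)             # data type of each column
range(d$age, na.rm = TRUE)   # lowest and highest age, ignoring missing values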

Taking this into account, most of the columns in our data look pretty good. Our trust variables (trust_t1 and trust_t2) are on a scale from 1 to 10, with a pretty normal distribution. We have a perfectly balanced distribution over the three experimental groups. The news consumption variable is a bit skewed, but nobody’s perfect. However, there is one column that clearly stands out: age.

The first sign that something is off is the histogram, which shows a very strange distribution where pretty much all observations are in the same bin on the left side. The reason for this becomes clear when we look at the maximum value, which is 1987. This is clearly an error in the data, as it is very unlikely that someone in our study is 1987 years old. What is much more likely is that at least one person entered their birth year instead of their age.
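We can confirm this directly by asking R for the maximum value (the na.rm = TRUE argument tells max to ignore missing values):

max(d$age, na.rm = TRUE)
[1] 1987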

To see if this is the case, we can look at the unique values in the age column. Recall that we can refer to a vector in a data frame using the $ operator, like d$age. When we run the unique function on this vector, we see several values that indeed look like birth years.

unique(d$age)
 [1]   21   42   18   32   33   46   50   63   52   26   40   56   37   64   48
[16]   62   61   60   55   51   29   41   58   53   23   30   34   36   31   59
[31]   45   19   35   38   20   39   44   28   49   54   22   57   24   25   43
[46]   NA   27 1987   47   17   65 1970 1967

Finally, one thing to look out for when inspecting the data is missing values, which are often coded as NA (not available) in R. In the summary we see that there are 5 missing values in the age column. This isn’t too bad, but we do need to be transparent about how we deal with these missing values in our analysis.
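You can verify this count yourself with is.na, which returns TRUE for every missing value, combined with sum:

sum(is.na(d$age))
[1] 5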

Cleaning up our data

Now that we’ve identified some issues in our data, we need to clean it up. For this we can use the techniques you learned in the Data Management chapter.

Selecting columns

The first thing we can do is to select only the columns that we need for our analysis, and where needed rename them. For the current example, let’s say that we just need the columns id, age and news consumption. While selecting them, we’ll immediately also rename the news consumption column to news_consumption, because spaces in column names can be a bit annoying to work with.

d_clean <- select(d, id, age, news_consumption = `news consumption`)

Note that we assigned the result to a new tibble called d_clean. This is a good practice, because it keeps the original data frame intact, and you can always go back to it if you need to.

Mutating columns

Next, we need to deal with the age column. When inspecting the data we saw that there were some birth years in there. Let’s consider some ways to address this.

Option 1: Subtract the birth year from the year of the study

The easiest (and in this case best) way is to subtract the birth year from the year in which the study took place. We need to do this only for the birth years, and not for the actual ages, so we’ll use the if_else function. This lets us say: if some condition is met, do this, otherwise do that. The format is: if_else(condition, if_true, if_false).
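For example, applied to a small toy vector (not our actual data):

x <- c(1, 6, 3)
if_else(x > 5, x * 10, x)   # values over 5 are multiplied by 10, the rest stay as they are
[1]  1 60  3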

In this case, our condition is that the number is a birth year, which we can detect by checking whether it is unrealistically high, say over 200. Then we can say: if the age is greater than 200, subtract it from the year of the study, otherwise just return the age.

d_clean <- mutate(d_clean, age = if_else(age > 200, 2024 - age, age))

Notice that this time we also used the d_clean tibble as input, and assigned the result back to d_clean to overwrite it. A common mistake would be to use d as input, but then we’d lose our previous cleaning steps.

Option 2: Recode specific values

Another option would be to recode the specific values that are wrong. This is more work, but it is a versatile technique that you can use in many situations.

Using the recode function, we can specify which values we want to recode into what. The syntax is recode(column, current_value = new_value, ...). Since our current values are numbers rather than names, we need to wrap them in backticks.

d_clean <- mutate(d_clean, age = recode(age, `1987` = 37, `1970` = 54, `1967` = 57))
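As a side note: in recent versions of dplyr (1.1.0 and up), recode is superseded by case_match, which uses a formula syntax and doesn’t require backticks. A sketch of the same recoding:

d_clean <- mutate(d_clean, age = case_match(age,
  1987 ~ 37,
  1970 ~ 54,
  1967 ~ 57,
  .default = age   # keep all other values as they are
))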

Removing missing values

It is often a good idea to explicitly remove missing values from your data. If you run any statistical analyses using variables with missing data, R will often just ignore these cases. But it’s better to remove them yourself, and be transparent about this in your report.

In R we refer to missing values as NA (Not Available). Let’s look at two ways to remove rows with missing values. The first one is to use the drop_na function from the tidyverse, which will remove any rows that have missing values in at least one column. To demonstrate that this works, we use the nrow function to count the number of rows before and after removing missing values.

nrow(d_clean)
[1] 600
drop_na(d_clean) |> nrow()
[1] 595

The drop_na function works well if your tibble only contains columns that you want to use in your analysis. But note that it removes the entire row even if just one column has a missing value. So sometimes you want to remove rows only if a specific column has a missing value.

In the current data, we know that only the age column has missing values. We can then use the filter function with the expression: !is.na(age). This means: select all rows where it is not (!) the case that the value of the age column is missing (is.na(age)).

d_clean <- filter(d_clean, !is.na(age))
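As an aside, drop_na can also do this directly: if you pass it column names, it only drops rows with missing values in those columns. So the following would be equivalent:

drop_na(d_clean, age)   # only drop rows where age is missing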

Filtering rows

Next to missing values, there might also be other conditions for removing rows from your data. For example, you might only want to include participants that are at least 18 years old. We can do this with the filter function, where we specify the condition for the rows that we want to keep.

d_clean <- filter(d_clean, age >= 18)
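Note that filter can take multiple conditions at once, and in fact all of the cleaning steps above can be chained into a single pipeline:

d_clean <- d |>
  select(id, age, news_consumption = `news consumption`) |>
  mutate(age = if_else(age > 200, 2024 - age, age)) |>
  filter(!is.na(age), age >= 18)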

Check again!

After cleaning your data, it’s always a good idea to check if everything went as planned. So just use the same tricks to inspect your data:

dfSummary(d_clean) |> view()

Data Frame Summary

d_clean

Dimensions: 594 x 3
Duplicates: 0
No Variable Stats / Values Freqs (% of Valid) Graph Valid Missing
1 id [numeric]
Mean (sd) : 300.2 (173.7)
min ≤ med ≤ max:
1 ≤ 301.5 ≤ 600
IQR (CV) : 299.5 (0.6)
594 distinct values 594 (100.0%) 0 (0.0%)
2 age [numeric]
Mean (sd) : 41.7 (13.3)
min ≤ med ≤ max:
18 ≤ 41 ≤ 65
IQR (CV) : 22 (0.3)
48 distinct values 594 (100.0%) 0 (0.0%)
3 news_consumption [numeric]
Mean (sd) : 10.1 (3.2)
min ≤ med ≤ max:
0 ≤ 11 ≤ 17
IQR (CV) : 4 (0.3)
18 distinct values 594 (100.0%) 0 (0.0%)

Generated by summarytools 1.0.1 (R version 4.4.1)
2024-09-16

If done right, your age column should now have a more reasonable distribution, the missing values should be gone, and the minimum age should be 18.
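You can also verify these points with a few targeted checks:

sum(is.na(d_clean$age))   # should be 0
min(d_clean$age)          # should be 18
nrow(d_clean)             # 594 rows remain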


Footnotes

  1. Or the more common version: “Sh*t in, sh*t out”.

  2. If you have a very large dataset, you might first want to select only the columns that you need for your analysis. We’ll show you how to do this in the next section.