# Data frame Manipulation with dplyr

## Overview

Teaching:30 min

Exercises:10 minQuestions

How can I manipulate dataframes without repeating myself?

Objectives

To be able to use the six main dataframe manipulation ‘verbs’ with pipes in

`dplyr`

.To understand how

`group_by()`

and`summarize()`

can be combined to summarize datasets.Be able to analyze a subset of data using logical filtering.

Manipulation of dataframes means many things to many researchers, we often select certain observations (rows) or variables (columns), we often group the data by a certain variable(s), or we even calculate summary statistics. We can do these operations using the normal base R operations:

```
mean(gapminder[gapminder$continent == "Africa", "gdpPercap"])
```

```
[1] 2193.755
```

```
mean(gapminder[gapminder$continent == "Americas", "gdpPercap"])
```

```
[1] 7136.11
```

```
mean(gapminder[gapminder$continent == "Asia", "gdpPercap"])
```

```
[1] 7902.15
```

But this isn’t very efficient, and can become tedious quickly because there is a fair bit of repetition. Repeating yourself will cost you time, both now and later, and potentially introduce some nasty bugs.

## The `dplyr`

package

Luckily, the `dplyr`

package provides a number of
very useful functions for manipulating dataframes in a way that will reduce the
above repetition, reduce the probability of making errors, and probably even
save you some typing. As an added bonus, you might even find the `dplyr`

grammar
easier to read.

Here we’re going to cover 6 of the most commonly used functions as well as using
pipes (`%>%`

) to combine them.

`select()`

`filter()`

`group_by()`

`summarize()`

`mutate()`

If you have have not installed this package earlier, please do so:

```
install.packages('dplyr')
```

Now let’s load the package:

```
library("dplyr")
```

## Using `select()`

If, for example, we wanted to move forward with only a few of the variables in
our dataframe we could use the `select()`

function. This will keep only the
variables you select.

```
year_country_gdp <- select(gapminder, year, country, gdpPercap)
```

If we open up `year_country_gdp`

we’ll see that it only contains the year,
country and gdpPercap. Above we used ‘normal’ grammar, but the strengths of
`dplyr`

lie in combining several functions using pipes. Since the pipes grammar
is unlike anything we’ve seen in R before, let’s repeat what we’ve done above
using pipes.

```
year_country_gdp <- gapminder %>% select(year,country,gdpPercap)
```

To help you understand why we wrote that in that way, let’s walk through it step
by step. First we summon the `gapminder`

data frame and pass it on, using the
pipe symbol `%>%`

, to the next step, which is the `select()`

function. In this
case we don’t specify which data object we use in the `select()`

function since
in gets that from the previous pipe. **Fun Fact**: You may have encountered
pipes before in the shell. In R, a pipe symbol is `%>%`

while in the shell it is
`|`

but the concept is the same!

## Using `filter()`

If we now wanted to move forward with the above, but only with European
countries, we can combine `select`

and `filter`

```
year_country_gdp_euro <- gapminder %>%
filter(continent == "Europe") %>%
select(year, country, gdpPercap)
```

## Challenge 1

Write a single command (which can span multiple lines and includes pipes) that will produce a dataframe that has the African values for

`lifeExp`

,`country`

and`year`

, but not for other Continents. How many rows does your dataframe have and why?## Solution to Challenge 1

`year_country_lifeExp_Africa <- gapminder %>% filter(continent=="Africa") %>% select(year,country,lifeExp)`

As with last time, first we pass the gapminder dataframe to the `filter()`

function, then we pass the filtered version of the gapminder data frame to the
`select()`

function. **Note:** The order of operations is very important in this
case. If we used ‘select’ first, filter would not be able to find the variable
continent since we would have removed it in the previous step.

## Using `group_by()`

and `summarize()`

Now, we were supposed to be reducing the error prone repetitiveness of what can
be done with base R, but up to now we haven’t done that since we would have to
repeat the above for each continent. Instead of `filter()`

, which will only pass
observations that meet your criteria (in the above: `continent=="Europe"`

), we
can use `group_by()`

, which will essentially use every unique criteria that you
could have used in filter.

```
str(gapminder)
```

```
'data.frame': 1704 obs. of 6 variables:
$ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
$ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
$ pop : num 8425333 9240934 10267083 11537966 13079460 ...
$ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
$ lifeExp : num 28.8 30.3 32 34 36.1 ...
$ gdpPercap: num 779 821 853 836 740 ...
```

```
gapminder %>% group_by(continent) %>% str()
```

```
Classes 'grouped_df', 'tbl_df', 'tbl' and 'data.frame': 1704 obs. of 6 variables:
$ country : Factor w/ 142 levels "Afghanistan",..: 1 1 1 1 1 1 1 1 1 1 ...
$ year : int 1952 1957 1962 1967 1972 1977 1982 1987 1992 1997 ...
$ pop : num 8425333 9240934 10267083 11537966 13079460 ...
$ continent: Factor w/ 5 levels "Africa","Americas",..: 3 3 3 3 3 3 3 3 3 3 ...
$ lifeExp : num 28.8 30.3 32 34 36.1 ...
$ gdpPercap: num 779 821 853 836 740 ...
- attr(*, "groups")=Classes 'tbl_df', 'tbl' and 'data.frame': 5 obs. of 2 variables:
..$ continent: Factor w/ 5 levels "Africa","Americas",..: 1 2 3 4 5
..$ .rows :List of 5
.. ..$ : int 25 26 27 28 29 30 31 32 33 34 ...
.. ..$ : int 49 50 51 52 53 54 55 56 57 58 ...
.. ..$ : int 1 2 3 4 5 6 7 8 9 10 ...
.. ..$ : int 13 14 15 16 17 18 19 20 21 22 ...
.. ..$ : int 61 62 63 64 65 66 67 68 69 70 ...
..- attr(*, ".drop")= logi TRUE
```

You will notice that the structure of the dataframe where we used `group_by()`

(`grouped_df`

) is not the same as the original `gapminder`

(`data.frame`

). A
`grouped_df`

can be thought of as a `list`

where each item in the `list`

is a
`data.frame`

which contains only the rows that correspond to the a particular
value `continent`

(at least in the example above).

## Using `summarize()`

The above was a bit on the uneventful side but `group_by()`

is much more
exciting in conjunction with `summarize()`

. This will allow us to create new
variable(s) by using functions that repeat for each of the continent-specific
data frames. That is to say, using the `group_by()`

function, we split our
original dataframe into multiple pieces, then we can run functions
(e.g. `mean()`

or `sd()`

) within `summarize()`

.

```
gdp_bycontinents <- gapminder %>%
group_by(continent) %>%
summarize(mean_gdpPercap = mean(gdpPercap))
gdp_bycontinents
```

```
# A tibble: 5 x 2
continent mean_gdpPercap
<fct> <dbl>
1 Africa 2194.
2 Americas 7136.
3 Asia 7902.
4 Europe 14469.
5 Oceania 18622.
```

That allowed us to calculate the mean gdpPercap for each continent, but it gets even better.

## Challenge 2

Calculate the average life expectancy per country. Which has the longest average life expectancy and which has the shortest average life expectancy?

## Solution to Challenge 2

`lifeExp_bycountry <- gapminder %>% group_by(country) %>% summarize(mean_lifeExp=mean(lifeExp)) lifeExp_bycountry %>% filter(mean_lifeExp == min(mean_lifeExp) | mean_lifeExp == max(mean_lifeExp))`

`# A tibble: 2 x 2 country mean_lifeExp <fct> <dbl> 1 Iceland 76.5 2 Sierra Leone 36.8`

Another way to do this is to use the

`dplyr`

function`arrange()`

, which arranges the rows in a data frame according to the order of one or more variables from the data frame. It has similar syntax to other functions from the`dplyr`

package. You can use`desc()`

inside`arrange()`

to sort in descending order.`lifeExp_bycountry %>% arrange(mean_lifeExp) %>% head(1)`

`# A tibble: 1 x 2 country mean_lifeExp <fct> <dbl> 1 Sierra Leone 36.8`

`lifeExp_bycountry %>% arrange(desc(mean_lifeExp)) %>% head(1)`

`# A tibble: 1 x 2 country mean_lifeExp <fct> <dbl> 1 Iceland 76.5`

The function `group_by()`

allows us to group by multiple variables. Let’s group by `year`

and `continent`

.

```
gdp_bycontinents_byyear <- gapminder %>%
group_by(continent, year) %>%
summarize(mean_gdpPercap = mean(gdpPercap))
```

That is already quite powerful, but it gets even better! You’re not limited to defining 1 new variable in `summarize()`

.

```
gdp_pop_bycontinents_byyear <- gapminder %>%
group_by(continent,year) %>%
summarize(mean_gdpPercap = mean(gdpPercap),
sd_gdpPercap = sd(gdpPercap),
mean_pop = mean(pop),
sd_pop = sd(pop))
```

`count()`

and `n()`

A very common operation is to count the number of observations for each group.
The `dplyr`

package comes with two related functions that help with this.

For instance, if we wanted to check the number of countries included in the
dataset for the year 2002, we can use the `count()`

function. It takes the name
of one or more columns that contain the groups we are interested in, and we can
optionally sort the results in descending order by adding `sort=TRUE`

:

```
gapminder %>%
filter(year == 2002) %>%
count(continent, sort = TRUE)
```

```
# A tibble: 5 x 2
continent n
<fct> <int>
1 Africa 52
2 Asia 33
3 Europe 30
4 Americas 25
5 Oceania 2
```

If we need to use the number of observations in calculations, the `n()`

function
is useful. For instance, if we wanted to get the standard error of the life
expectancy per continent:

```
gapminder %>%
group_by(continent) %>%
summarize(se_le = sd(lifeExp)/sqrt(n()))
```

```
# A tibble: 5 x 2
continent se_le
<fct> <dbl>
1 Africa 0.366
2 Americas 0.540
3 Asia 0.596
4 Europe 0.286
5 Oceania 0.775
```

You can also chain together several summary operations; in this case calculating the `minimum`

, `maximum`

, `mean`

and `se`

of each continent’s per-country life-expectancy:

```
gapminder %>%
group_by(continent) %>%
summarize(
mean_le = mean(lifeExp),
min_le = min(lifeExp),
max_le = max(lifeExp),
se_le = sd(lifeExp)/sqrt(n()))
```

```
# A tibble: 5 x 5
continent mean_le min_le max_le se_le
<fct> <dbl> <dbl> <dbl> <dbl>
1 Africa 48.9 23.6 76.4 0.366
2 Americas 64.7 37.6 80.7 0.540
3 Asia 60.1 28.8 82.6 0.596
4 Europe 71.9 43.6 81.8 0.286
5 Oceania 74.3 69.1 81.2 0.775
```

## Using `mutate()`

We can also create new variables prior to (or even after) summarizing information using `mutate()`

.

```
gdp_pop_bycontinents_byyear <- gapminder %>%
mutate(gdp_billion = gdpPercap*pop/10^9) %>%
group_by(continent, year) %>%
summarize(mean_gdpPercap = mean(gdpPercap),
sd_gdpPercap = sd(gdpPercap),
mean_pop = mean(pop),
sd_pop = sd(pop),
mean_gdp_billion = mean(gdp_billion),
sd_gdp_billion = sd(gdp_billion))
```

## Other great resources

- R for Data Science
- Data Wrangling Cheat sheet
- Introduction to dplyr
- Data wrangling with R and RStudio

## Key Points

Use the

`dplyr`

package to manipulate dataframes.Use

`select()`

to choose variables from a dataframe.Use

`filter()`

to choose data based on values.Use

`group_by()`

and`summarize()`

to work with subsets of data.Use

`mutate()`

to create new variables.