Setup
install.packages(c('dplyr', 'readr'))
download.file("https://ndownloader.figshare.com/files/2292172", "surveys.csv")
download.file("https://ndownloader.figshare.com/files/3299474", "plots.csv")
download.file("https://ndownloader.figshare.com/files/3299483", "species.csv")
download.file("https://www.datacarpentry.org/semester-biology/data/shrub-volume-data.csv", "shrub-volume-data.csv")
Basic aggregation
surveys <- read_csv("surveys.csv")
- Aggregation combines rows into groups based on one of more columns.
- Calculates combined values for each group.
- First step, group the data frame.
- Let’s group it by
year
group_by
- Arguments: 1) table to work on; 2) columns to group by
group_by(surveys, year)
- The tibble produced by this function has grouping information
- Store the data frame in a variable to use in the next step
surveys_by_year <- group_by(surveys, year)
- After grouping a data frame use
summarize()
to calculate values for each group. -
Count the number of rows for each group (individuals in each species).
- First argument is the table to work on
- Needs to be a grouped table
- One additional argument for each calculation we want to do for each group
- Column name to store calculated value,
=
, calculation to perform for each group - We’ll use the function
n
which is a special function that counts the rows in the table
counts_by_year <- summarize(surveys_by_year, abundance = n())
- Can group by multiple columns
- Count the number of individuals in each plot in each year
surveys_by_plot_year <- group_by(surveys, plot_id, year)
counts_by_plot_year <- summarize(surveys_by_plot_year, abundance = n())
- Just like with other
dplyr
functions we could write this using pipes instead
plot_year_counts <- surveys |>
group_by(plot_id, year) |>
summarize(abundance = n())
- We can also do multiple calculations using summarize
- Use any function that returns a single value from one or more vectors
- E.g., mean, max, min
- We’ll calculate the number of individuals in each plot year combination and their average weight
size_abundance_data <- surveys |>
group_by(plot_id, year) |>
summarize(abundance = n(), avg_weight = mean(weight))
- Open table
- Why did we get
NA
? mean(weight)
returnsNA
whenweight
has missing values (NA
)- Can fix using
drop_na(weight)
size_abundance_data <- surveys |>
drop_na(weight) |>
group_by(plot_id, year) |>
summarize(abundance = n(), avg_weight = mean(weight))
- Also note the message about “grouped output”
- It says that the resulting data frame is grouped by
year
- When we group by more than one column the resulting data frame is grouped by all but the last group
- Can be useful in some more complicated circumstances
- Can also make things not work if functions don’t support grouped data frames
- To remove these groups add
ungroup()
to the end of the pipeline
size_abundance_data <- surveys |>
drop_na(weight) |>
group_by(plot_id, year) |>
summarize(abundance = n(), avg_weight = mean(weight)) |>
ungroup()
- The message still prints because it happens as part of the
summarize
step - But looking at the resulting data frame
size_abundance_data
- Shows us that the final data frame is ungrouped