Setup

install.packages(c('dplyr', 'readr', 'tidyr'))
download.file("https://ndownloader.figshare.com/files/2292172", "surveys.csv")
download.file("https://ndownloader.figshare.com/files/3299474", "plots.csv")
download.file("https://ndownloader.figshare.com/files/3299483", "species.csv")
download.file("https://www.datacarpentry.org/semester-biology/data/shrub-volume-data.csv", "shrub-volume-data.csv")

Basic aggregation

library(dplyr)
library(readr)
library(tidyr)
surveys <- read_csv("surveys.csv")
  • Aggregation combines rows into groups based on one or more columns.
  • Calculates combined values for each group.
  • First step, group the data frame.
  • Let’s group it by year
  • group_by
  • Arguments: 1) table to work on; 2) columns to group by
group_by(surveys, year)
  • The tibble produced by this function has grouping information
  • Store the data frame in a variable to use in the next step
surveys_by_year <- group_by(surveys, year)
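To make the "grouping information" concrete, here is a minimal sketch on a small made-up tibble (the `toy` data and its values are hypothetical stand-ins, not rows from surveys.csv). It suggests that group_by() leaves the rows and columns alone and only attaches metadata, which dplyr helpers such as group_vars() and n_groups() can report.

```r
library(dplyr)

# Hypothetical toy data standing in for surveys.csv
toy <- tibble(
  year   = c(1977, 1977, 1978, 1978, 1978),
  weight = c(40, 48, 29, 46, 36)
)

toy_by_year <- group_by(toy, year)

# Grouping does not add, remove, or reorder rows and columns
same_shape <- identical(dim(toy), dim(toy_by_year))

# What changes is the metadata attached to the tibble
is_grouped_df(toy_by_year)  # is this a grouped data frame?
group_vars(toy_by_year)     # which columns define the groups
n_groups(toy_by_year)       # how many distinct groups exist
```

Printing toy_by_year also shows a "Groups:" line in the tibble header, which is the quickest visual check.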
  • After grouping a data frame use summarize() to calculate values for each group.
  • Count the number of rows for each group (individuals in each year).

  • First argument is the table to work on
  • Needs to be a grouped table
  • One additional argument for each calculation we want to do for each group
  • Column name to store calculated value, =, calculation to perform for each group
  • We’ll use n(), a special function that counts the rows in each group
counts_by_year <- summarize(surveys_by_year, abundance = n())
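A small sketch of what summarize() with n() produces, using a made-up `toy` tibble (hypothetical data, not the Portal surveys). The result should have one row per group and one column per calculation.

```r
library(dplyr)

# Hypothetical toy data: five individuals observed across two years
toy <- tibble(year = c(1977, 1977, 1978, 1978, 1978))

toy_by_year <- group_by(toy, year)

# "abundance" is the new column name; n() is evaluated once per group
toy_counts <- summarize(toy_by_year, abundance = n())

toy_counts
```

Here the five input rows collapse to two output rows, one for each year.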
  • Can group by multiple columns
  • Count the number of individuals in each plot in each year
surveys_by_plot_year <- group_by(surveys, plot_id, year)
counts_by_plot_year <- summarize(surveys_by_plot_year, abundance = n())
  • Just like with other dplyr functions we could write this using pipes instead
plot_year_counts <- surveys |>
  group_by(plot_id, year) |>
  summarize(abundance = n())
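As a rough check that the piped version is just a rewrite of the step-by-step version, the sketch below runs both on a made-up `toy` tibble (hypothetical data) and compares the results. It assumes R 4.1 or later, since that is when the native pipe |> was introduced.

```r
library(dplyr)

# Hypothetical toy data: plot and year for five individuals
toy <- tibble(
  plot_id = c(1, 1, 1, 2, 2),
  year    = c(1977, 1977, 1978, 1977, 1977)
)

# Step by step, storing the grouped table in a variable
toy_grouped  <- group_by(toy, plot_id, year)
counts_steps <- summarize(toy_grouped, abundance = n())

# The same calculation written as a pipeline
counts_piped <- toy |>
  group_by(plot_id, year) |>
  summarize(abundance = n())

# Both approaches give one row per plot-year combination with the same counts
same_counts <- identical(counts_steps$abundance, counts_piped$abundance)
```

The pipeline avoids the intermediate variable, which is mostly a readability choice.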

Do Portal Data Aggregation 1-2.

  • We can also do multiple calculations using summarize
  • Use any function that returns a single value from one or more vectors
  • E.g., mean, max, min
  • We’ll calculate the number of individuals in each plot-year combination and their average weight
size_abundance_data <- surveys |>
  group_by(plot_id, year) |>
  summarize(abundance = n(), avg_weight = mean(weight))
  • Open table
  • Why did we get NA?
  • mean(weight) returns NA when weight has missing values (NA)
  • Can fix using drop_na(weight) from the tidyr package, which removes the rows where weight is NA
size_abundance_data <- surveys |>
  drop_na(weight) |>
  group_by(plot_id, year) |>
  summarize(abundance = n(), avg_weight = mean(weight))
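The sketch below reproduces the NA problem on a made-up `toy` tibble (hypothetical values) and compares drop_na() with one possible alternative, the na.rm = TRUE argument of mean(). The alternative is not part of the lesson above; it is shown only because the two choices give different abundance values.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: one individual in 1977 was never weighed
toy <- tibble(
  year   = c(1977, 1977, 1978, 1978),
  weight = c(40, NA, 30, 50)
)

# A single missing value makes the whole mean NA
mean_with_na <- mean(toy$weight)

# Option 1 (as in the lesson): drop rows with missing weight first,
# so abundance counts only the individuals that were weighed
dropped <- toy |>
  drop_na(weight) |>
  group_by(year) |>
  summarize(abundance = n(), avg_weight = mean(weight))

# Option 2 (alternative): keep every row and ignore NA only inside mean(),
# so abundance still counts all individuals
kept <- toy |>
  group_by(year) |>
  summarize(abundance = n(), avg_weight = mean(weight, na.rm = TRUE))
```

Both options give the same avg_weight, but abundance for 1977 is 1 in `dropped` and 2 in `kept`, so the choice depends on whether unweighed individuals should be counted.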
  • Also note the message about “grouped output”
  • It says that the resulting data frame is grouped by plot_id
  • When we group by more than one column, summarize() drops only the last grouping column, so the result stays grouped by all the others
  • Can be useful in some more complicated circumstances
  • Can also make things not work if functions don’t support grouped data frames
  • To remove these groups add ungroup() to the end of the pipeline
size_abundance_data <- surveys |>
  drop_na(weight) |>
  group_by(plot_id, year) |>
  summarize(abundance = n(), avg_weight = mean(weight)) |>
  ungroup()
  • The message still prints because it happens as part of the summarize step
  • But looking at the resulting data frame
size_abundance_data
  • Shows us that the final data frame is ungrouped
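To see which groups survive summarize(), the sketch below inspects the grouping columns with group_vars() on a made-up `toy` tibble (hypothetical data). It also shows the .groups = "drop" argument of summarize() as one possible alternative to ungroup(); that alternative is not part of the lesson above.

```r
library(dplyr)

# Hypothetical toy data
toy <- tibble(
  plot_id = c(1, 1, 2, 2),
  year    = c(1977, 1978, 1977, 1977),
  weight  = c(40, 48, 29, 46)
)

# After summarize() the last grouping column (year) is dropped,
# but the result is still grouped by plot_id
still_grouped <- toy |>
  group_by(plot_id, year) |>
  summarize(avg_weight = mean(weight))
group_vars(still_grouped)

# ungroup() removes the remaining grouping
ungrouped <- still_grouped |>
  ungroup()
group_vars(ungrouped)

# Alternative: ask summarize() to drop all groups itself,
# which also avoids the "grouped output" message
dropped <- toy |>
  group_by(plot_id, year) |>
  summarize(avg_weight = mean(weight), .groups = "drop")
```

Either approach yields an ungrouped tibble; .groups = "drop" just does it in one step.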

Do Portal Data Aggregation 3.