Learning Objectives

Following this assignment students should be able to:

  • start to combine multiple computing concepts to solve bigger problems
  • start to debug errors in code
  • use reproducible computing practices

Reading

Lecture Notes

Setup

install.packages(c('dplyr', 'ggplot2', 'readr'))
download.file("https://ndownloader.figshare.com/files/2292172",
  "surveys.csv")
download.file("https://ndownloader.figshare.com/files/3299474",
  "plots.csv")
download.file("https://ndownloader.figshare.com/files/3299483",
  "species.csv")
download.file("https://datacarpentry.org/semester-biology/data/mammal-size-data-clean.tsv",
  "mammal-size-data-clean.tsv")
download.file("https://datacarpentry.org/semester-biology/data/ramesh2010-macroplots.csv",
  "ramesh2010-macroplots.csv")
download.file("https://datacarpentry.org/semester-biology/data/ramesh2010-species-list.tsv",
  "ramesh2010-species-list.tsv")

Lecture Notes

  1. Problem Decomposition
  2. Basic Debugging
  3. Searching For Help
  4. Paths
  5. Basic Reproducibility

Exercises

  1. Bird Banding Multiple Vectors (20 pts)

    The number of birds banded at a series of sampling sites has been counted by your field crew and entered into the following vector. Counts are entered in order and sites are numbered starting at one. There is also information on the number of trees at each site. Cut and paste the vector into your assignment and then answer the following questions by using code and printing the result to the screen.

    number_of_birds <- c(28, 32, 1, 0, 10, 22, 30, NA, 145, 27, 
    36, 25, 9, 38, 21, 12, 122, 87, 36, 3, 0, 5, 55, 62, 98, 32, 
    900, 33, 14, 39, 56, 81, 29, 38, 1, 0, 143, 37, 98, 77, 92, 
    83, 34, 98, 40, 45, 51, 17, 22, 37, 48, NA, 91, 73, 54, 46,
    102, 273, 600, 10, 11)
    number_of_trees <- c(10, 12, 2, 3, 10, 8, 19, 19, 14, 3, 
    4, 5, 8, 4, 8, 1, 12, 10, 3, 1, 2, 3, 5, 6, 8, 2, 
    90, 3, 4, 3, 6, 8, NA, 4, 0, 1, 14, 3, 10, NA, 9, 
    8, 4, 8, 4, 4, 5, 1, 2, 3, 5, 4, 10, 7, 5, 8,
    10, 30, 26, 1, 6)
    
    1. How many sites are there?
    2. How many birds were counted at the 26th site?
    3. What is the largest number of birds counted?
    4. What is the average number of birds seen at a site?
    5. What is the total number of trees counted across all of the sites?
    6. What is the smallest number of trees counted?
    7. Produce a vector with the number of birds counted on sites with at least 10 trees.
    8. Produce a vector with the number of trees counted on sites with at least 10 trees.
    9. Combine the number_of_birds and number_of_trees vectors into a dataframe that also includes a year column with the year 2012 in every row and site column containing the numbers 1 through 61.
    Expected outputs for Bird Banding Multiple Vectors
  2. Portal Data Review (20 pts)

    If surveys.csv, species.csv, and plots.csv are not available in your workspace download them:

    Load them into R using read_csv().

    1. Create a data frame with only data for the species_id DO, with the columns year, month, day, species_id, and weight.
    2. Create a data frame with only data for species IDs PP and PB and for years starting in 1995, with the columns year, species_id, and hindfoot_length, with no null values for hindfoot_length.
    3. Create a data frame with the average hindfoot_length for each species_id in each year with no null values.
    4. Create a data frame with the year, genus, species, weight and plot_type for all cases where the genus is "Dipodomys".
    5. Make a scatter plot with weight on the x-axis and hindfoot_length on the y-axis. Use a log10 scale on the x-axis. Color the points by species_id. Include good axis labels.
    6. Make a histogram of weights with a separate subplot for each species_id. Do not include species with no weights. Set the scales argument to "free_y" so that the y-axes can vary. Include good axis labels.
    7. (Challenge) Make a plot with histograms of the weights of three species, PP, PB, and DM, colored by species_id, with a different facet (i.e., subplot) for each of three plot_type’s Control, Long-term Krat Exclosure, and Short-term Krat Exclosure. Include good axis labels and a title for the plot. Export the plot to a png file.
    Expected outputs for Portal Data Review
  3. Megafaunal Extinction (50 pts)

    There were a relatively large number of extinctions of mammalian species roughly 10,000 years ago. To help understand why these extinctions happened scientists are interested in understanding if there were differences in the size of the species that went extinct and those that did not. You are going to reproduce the three main figures from one of the major papers on this topic Lyons et al. 2004.

    You will do this using a large dataset of mammalian body sizes that has data on the mass of recently extinct mammals as well as extant mammals (i.e., those that are still alive today).

    1. Import the data into R. As with most real world data there are a some things about the dataset that you’ll need to identify and address during the import process. Print out the structure of the resulting data frame.
    2. Create a plot showing histograms of masses for mammal species that are still present and those that went extinct during the pleistocene (extant and extinct in the status column). There should be one sub-plot for each continent and that sub-plot should show the histograms for both groups as a stacked histogram. To match the original analysis don’t include islands (Insular and Oceanic in the continent column) and or the continent labeled EA (because EA had no species that went extinct in the pleistocene). Scale the x-axis logarithmically and use 25 bins to roughly match the original figure. Use good axis labels.
    3. The 2nd figure in the original paper looks in more detail at two orders, Xenarthra and Carnivora, which showed extinctions in North and South America. Create a figure similar to the one in Part 2, but that shows 4 sub-plots, one for each order on each of the two continents. Still scale the x-axis logarithmically, but use 19 bins to roughly match the original figure.
    4. The 3rd figure in the original paper explores Australia as a case study. Australia is interesting because there is good data on both Pleistocene extinctions (extinct in the status column) and more modern extinctions occurring over the last 300 years (historical in the status column). Make single stacked histogram that compares the sizes of extinct, extant, and historical statuses. Scale the x-axis logarithmically and use 25 bins to roughly match the original figure. Use good axis labels.
    5. (optional) Instead of excluding continent EA by name in your analysis (in part 2), modify your code to determine from the data which continents had species that went extinct in the pleistocene and only include those continents.
    Expected outputs for Megafaunal Extinction
  4. Check That Your Code Runs (10 pts)

    Sometimes you think you’re code runs, but it only actually works because of something else you did previously. To make sure it actually runs you should save your work and then run it in a clean environment.

    Follow these steps in RStudio to make sure your code really runs:

    1. Restart R (see above) by clicking Session in the menu bar and selecting Restart R:

    Screenshot showing clicking session from the menu bar and selecting Restart R

    2. If the Environment tab isn’t empty click on the broom icon to clear it:

    Screenshot showing the Environment tab with the cursor hovering over the broom icon

    The Environment tab should now say “Environment Is Empty”:

    Screenshot showing the Environment tab with only the words Environment Is Empty

    3. Rerun your entire homework assignment using “Source with Echo” to make sure it runs from start to finish and produces the expected results.

    Screenshot showing the RStudio Source with Echo item hovered in the Source dropdown

    Expected outputs for Check That Your Code Runs
  5. Tree Biomass Challenge (Challenge - optional)

    Understanding the total amount of biomass (the total mass of all individuals) in forests is important for understanding the global carbon budget and how the earth will respond to increases in carbon dioxide emissions.

    We don’t normally measure the mass of a tree, but take a measurement of the diameter of the trunk and then estimate mass using equations like mass = 0.124 * diameter2.53.

    Make a histogram of the estimated tree biomass (using the above equation) for each species in a 96 hectare area of the Western Ghats in India using the data in ramesh2010-macroplots.csv and ramesh2010-species-list.tsv (if they aren’t already in your workspace then download them). Make one subplot for each family, only include families with at least 10 species (because otherwise there isn’t enough information for the histogram), scale the x axis logarithmically, color the bars purple and outline them in black, and provide good descriptive axis labels.

    Expected outputs for Tree Biomass Challenge

Assignment submission & checklist