Learning Objectives

Following this assignment students should be able to:

  • connect to a remote database and execute simple queries
  • integrate database and R workflow
  • export output data from R to database
  • tidy data table with redundant fields or overfilled cells


Lecture Notes


  1. Simple WHERE (10 pts)

    A population biologist (Dr. Undomiel) who studies the population dynamics of Dipodomys spectabilis would like to use some data from the Portal Project, but she doesn’t know how to work with large datasets.

    Write a query that returns the month, day, year, and mass of each individual for her species of interest Dipodomys spectabilis (DS in the species_id column).Since there is only one species do no include the species_id column in the output. Save this query as a view with the name spectabilis_population_data.

    [click here for output] [click here for output] [click here for output]
  2. COUNT (10 pts)

    Write a query that returns the number of individuals trapped in each year. Count the species_id column so that you only include cases where an individual was identified to species. Name the count column total_abundance and sort it chronologically. Include the year in the output. Save it as total_abundance_by_year. There should only be one value for each year since this is a count of the individuals across all species in that year.

    [click here for output]
  3. Basic Join (20 pts)

    Write a query that returns the year, month, and day for each individual captured as well as it’s genus and species names. This can be accomplished by joining the species table to the surveys table using the species_id column in both tables. Save this query as species_captures_by_date.

    [click here for output]
  4. Multi-table Join (20 pts)

    The plots table in the Portal database can be joined to the surveys table by joining plot_id to plot_id and the species table can be joined to the surveys table by joining species_id to species_id.

    The Portal mammal data include data from a number of different experimental manipulations. You want to do a time-series analysis of the population dynamics of all of the species at the site, taking into account the different experimental manipulations. Write a query that returns the year, month, day, genus and species of every individual as well as the plot_id and plot_type of the plot they are captured on. Save this query as species_plot_data.

    [click here for output]
  5. Link to Databases (20 pts)

    Let’s access an SQL database directly from R using dplyr.

    Either use an existing copy of the portal_mammals.sqlite database or download a new copy. You should be able to link to the surveys table in the database using:

    portaldb <- dbConnect(RSQLite::SQLite(), "portal_mammals.sqlite")
    surveys <- tbl(portaldb, "surveys")

    surveys is actually a connection to the database, which means that the table remains external to the R environment. Also, we won’t need to worry about it printing out huge numbers of rows when we look at it.

    1. Select the year, month, day, and species_id columns in that order.
    2. Create a new data frame with the year, species_id, and weight in kilograms of each individual, with no null weights.
    3. Use the distinct() function to print the species_id for each species in the dataset that has been weighed.
    [click here for output]
  6. Copy to Database (20 pts)

    Dr. Undómiel has decided to focus on the change in size of a few target rodent species over the course of the experiment(1977-2002). She has chosen Dipodymys spectabilis, Onychomys torridus, Perymiscus erimicus, Chaetodipus penicillatus.

    Write a script that uses dplyr to:

    1. Connect to the portal_mammals.sqlite.
    2. Generate a data frame with year, species_id, and the average weight per year (avg_weight) for each target species. You may find the %in% c() construction useful for your filter().
    3. Use copy_to() to include your new data frame in portal_mammals.sqlite. Call it something informative so that Dr. Undómiel can find it easily. Make sure it remains after the connection is terminated using temporary = FALSE.
    [click here for output]