Data Management with SQL for Ecologists: Instructor Notes

Learning objectives

Data

As for all of the Data Carpentry ecology lessons, this lesson uses the Portal Project Teaching Database. The data is available at https://doi.org/10.6084/m9.figshare.1314459 and the download includes a SQLite database file (portal_mammals.sqlite) as well as three .csv files (species.csv, plots.csv, surveys.csv) that can be imported into SQLite.

Note that the figshare download is an archive (.zip) file that rudely explodes all of the files into your current directory.

Motivation and Framing

See this slide deck as a sample intro for the lesson: SQL Intro Deck

Key points:

If you’ve written up a diagram of the data analysis pipeline (raw data -> clean data -> import and analyze -> results -> visualization), it can be helpful to identify that you’re now somewhere between clean data and analysis.

Common Difficult Concepts

Lesson outline

00-sql-introduction

Tips

00-supplement-database-design.md

(optional) The first lesson includes a brief introduction to data design and choosing database systems. This material expands on the database design in the first section.

01-sql-basic-queries

02-sql-aggregation

03-sql-joins-aliases

Alternative activities

Queries on the board

As you teach the lesson, it can be helpful to pause and write up the query keywords on the board. This could look like this:

FROM table

WHERE (conditional statement, applies to row values) (AND/OR)

ORDER BY column/FUNCTION(column) (ASC/DESC)


* After 02-sql-aggregation

SELECT column FUNCTION(column) AGGREGATE_FUNCTION(column) FROM table

WHERE (conditional statement, applies to row values) (AND/OR) (IS (NOT) NULL) GROUP BY column HAVING (conditional statement, applies to group) ORDER BY column/FUNCTION(column) (ASC/DESC)


* After 03-sql-joins-aliases

SELECT column FUNCTION(column) AGGREGATE_FUNCTION(column) FROM table JOIN table ON table.col = table.col WHERE (conditional statement, applies to row values) (AND/OR) (IS (NOT) NULL) GROUP BY column HAVING (conditional statement, applies to group) ORDER BY column/FUNCTION(column) (ASC/DESC)


As a bonus, if you can leave this on the board, it translates nicely into
the `dplyr` portion of the `R` lesson, i.e.:

SQL: dplyr:

SELECT column select(col) FUNCTION(column) mutate(col = fcn(col)) AGGREGATE_FUNCTION(column) summarize(col = fcn(col)) FROM table JOIN table ON table.col = table.col WHERE (conditional statement, applies to row values) filter(condition) (AND/OR) (IS (NOT) NULL) is.na() GROUP BY column group_by(col) HAVING (conditional statement, applies to group) ORDER BY column/FUNCTION(column) (ASC/DESC) arrange() ~~~

“Interactive” database

If you want to try something more active (esp. if you’re teaching SQL in the afternoon!), this is a an interactive activity to try.

Exercise Solutions

WARNING

This file may not be always up to date with regards to the exact exercises instructions and the naming of the columns and tables in the database. Check before you run the workshop!

EXERCISE

Write a query that returns the year, month, day, species ID and weight (in mg).

SOLUTION

SELECT day, month, year, species_id, weight * 1000
FROM surveys;

EXERCISE

Write a query that returns the day, month, year, species ID, and weight (in kg) for individuals caught on Plot 1 that weigh more than 75 g.

SOLUTION

SELECT day, month, year, species_id, weight / 1000
FROM surveys
WHERE plot_id = 1
AND weight > 75;

EXERCISE

Write a query that returns the day, month, year, species ID, and weight (in kg) for individuals caught on Plot 1 that weigh more than 75 g.

SOLUTION

SELECT
	surveys.day,
	surveys.month,
	surveys.year,
	species.species_id,
	surveys.weight / 1000
FROM surveys
JOIN species ON surveys.species_id = species.species_id
WHERE surveys.weight > 75
AND surveys.plot_id = 1;

EXERCISE

Write a query that returns day, month, year, species ID for individuals caught in January, May and July.

SOLUTION

SELECT day, month, year, species_id
FROM surveys
WHERE month IN (1, 5, 7);

EXERCISE

Write a query that returns year, species ID, and weight in kg from the surveys table, sorted with the largest weights at the top.

SOLUTION

SELECT year, species_id, weight / 1000
FROM surveys ORDER BY weight DESC;

EXERCISE

Let’s try to combine what we’ve learned so far in a single query. Using the surveys table write a query to display the three date fields, species ID, and weight in kilograms (rounded to two decimal places), for rodents captured in 1999, ordered alphabetically by the species ID.

SOLUTION

SELECT year, month, day, species_id, ROUND(weight / 1000, 2)
FROM surveys
WHERE year = 1999
ORDER BY species_id;

EXERCISE

Write queries that return:

  1. How many individuals were counted in each year.

  2. Average weight of each species in each year.

SOLUTION

SELECT year, COUNT(*)
FROM surveys
GROUP BY year;

SELECT year, species_id, ROUND(AVG(weight), 2)
FROM surveys
GROUP BY year, species_id;

EXERCISE

Write a query that returns the number of each species caught in each year sorted from most often caught species to the least occurring ones within each year starting from the most recent records.

SOLUTION

SELECT year, species_id, COUNT(*)
FROM surveys
GROUP BY year, species_id
ORDER BY year DESC, COUNT(*) DESC;

EXERCISE

Write a query that returns the genus, the species, and the weight of every individual captured at the site.

SOLUTION

SELECT species.genus, species.species_id, surveys.weight
FROM surveys
JOIN species ON surveys.species_id = species.species_id;

EXERCISE

Write a query that returns the number of genus of the animals caught in each plot in descending order.

SOLUTION

SELECT surveys.plot_id, species.genus, COUNT(*)
FROM surveys
JOIN species ON surveys.species_id = species.species_id
GROUP BY species.genus, surveys.plot_id
ORDER BY surveys.plot_id, COUNT(*) DESC