Background and Metadata


Teaching: 10 min
Exercises: 5 min
  • What data are we using?

  • Why is this experiment important?

  • Why study E. coli?

  • Understand the data set.

  • What is hypermutability?


We are going to use a long-term sequencing dataset from a population of Escherichia coli.


The data

View the metadata

We will be working with three samples from the Ara-3 strain of this experiment: one from 5,000 generations, one from 15,000 generations, and one from 50,000 generations. The population changed substantially over the course of the experiment, and we will use our variant calling workflow to explore two of those changes: the evolution of a Cit+ mutant and hypermutability. The metadata file associated with this lesson can be downloaded directly or viewed on GitHub. Notes and sources describing how the file was created are also available.

The metadata describes the Ara-3 clones. Each column is explained below:

| Column           | Description                                   |
|------------------|-----------------------------------------------|
| strain           | strain name                                   |
| generation       | generation when sample frozen                 |
| clade            | based on parsimony-based tree                 |
| reference        | study the samples were originally sequenced for |
| population       | ancestral population group                    |
| mutator          | hypermutability mutant status                 |
| facility         | facility samples were sequenced at            |
| run              | Sequence Read Archive sample ID               |
| read_type        | library type of reads                         |
| read_length      | length of reads in sample                     |
| sequencing_depth | depth of sequencing                           |
| cit              | citrate-using mutant status                   |


Based on the metadata, can you answer the following questions?

  1. How many different generations exist in the data?
  2. How many rows and how many columns are in this data?
  3. How many citrate+ mutants have been recorded in Ara-3?
  4. How many hypermutable mutants have been recorded in Ara-3?


Solution

  1. 25 different generations
  2. 62 rows, 12 columns
  3. 10 citrate+ mutants
  4. 6 hypermutable mutants
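
The four questions above can also be answered programmatically instead of by reading the spreadsheet by eye. The following is a minimal sketch in Python using only the standard library. It relies on several assumptions that are not stated in this lesson: that the metadata is a comma-delimited text file with a header row, and that a positive status in the `mutator` and `cit` columns is recorded as the string `plus`. The function name `summarize_metadata` and the toy rows are made up for illustration, so check the real file before relying on the counts.

```python
import csv
import io


def summarize_metadata(handle, delimiter=",", positive="plus"):
    """Count rows, columns, generations, Cit+ clones, and mutators.

    `positive` is the label assumed to mark a positive status in the
    `cit` and `mutator` columns; adjust it to match the real file.
    """
    reader = csv.DictReader(handle, delimiter=delimiter)
    rows = list(reader)
    return {
        "n_rows": len(rows),
        "n_columns": len(reader.fieldnames or []),
        "n_generations": len({row["generation"] for row in rows}),
        "n_cit_plus": sum(row["cit"] == positive for row in rows),
        "n_mutators": sum(row["mutator"] == positive for row in rows),
    }


# Toy rows invented for illustration only; they are NOT from the real file
# and include only four of the twelve columns.
TOY_METADATA = """strain,generation,mutator,cit
toy_A,0,None,unknown
toy_B,5000,None,minus
toy_C,5000,plus,plus
"""

summary = summarize_metadata(io.StringIO(TOY_METADATA))
print(summary)
```

To run the same summary on the downloaded metadata, open the file with `open(path, newline="")` and pass the handle to `summarize_metadata`, where `path` is wherever you saved the file. If the file is tab-delimited, pass `delimiter="\t"`.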

Key Points

  • It is important to record and understand your experiment’s metadata.