A Genomics Data Carpentry Workshop

James Madison University

Bioscience Building, room 2007
July 22-23, 2015
9:00 am - 5:00 pm

General Information

Data Carpentry workshops are for any researcher who has data they want to analyze, and no prior computational experience is required. This hands-on workshop teaches basic concepts, skills, and tools for working more effectively with data.

The focus of this workshop will be on working with genomics data and data management and analysis for genomics research. We will cover metadata organization in spreadsheets, connecting to and using cloud computing, the command line for sequence quality control and bioinformatics workflows, and R for data analysis and visualization. We will not be teaching any particular bioinformatics tool, but rather the foundational skills that will allow you to conduct your own analyses and work with the output of a genomics pipeline.

Participants should bring their laptops and plan to participate actively. By the end of the workshop learners should be able to more effectively manage and analyze data and be able to apply the tools and approaches directly to their ongoing research.

Data Carpentry's aim is to teach researchers basic concepts, skills, and tools for working with data so that they can get more done in less time, and with less pain.

Updates will be posted to this website as they become available.

The etherpad for this workshop can be found here.

Organizers: James Herrick (James Madison), Ray Enke (James Madison), Steve Cresawn (James Madison)

Instructors: Stephen Turner (University of Virginia), Amanda Charbonneau (Michigan State University)

Assistants: Pete Nagraj (University of Virginia)

Who: The course is aimed at faculty, research staff, postdocs, graduate students, advanced undergraduates, and other researchers in any field. No prior computational experience is required.

Where: Bioscience Building, room 2007. Get directions with OpenStreetMap or Google Maps.

Requirements: Data Carpentry's teaching is hands-on, so participants are encouraged to bring and use their own laptops to ensure that the tools are properly set up for an efficient workflow once you leave the workshop. (We will provide instructions on setting up the required software several days in advance.) There are no prerequisites, and we will assume no prior knowledge of the tools. Participants are required to abide by Software Carpentry's Code of Conduct.

Contact: Please email tkteal@datacarpentry.org for questions and information not covered here.

Twitter: #datacarpentry

@datacarpentry

Schedule

Wednesday 08:30 Breakfast, and optional help with setup prior to starting
09:00 Data organization and management (Amanda)
Refreshments will be served around 10:30.
10:45 Working with genomics file types (Amanda)
12:30 Lunch break
13:30 Introduction to R (Stephen)
Refreshments will be served around 15:00.
15:15 Advanced manipulation & visualization with R (Stephen)
16:30 Wrap-up
17:00 Evening Mixer (until 7:00 pm)
Thursday 08:30 Breakfast, and optional help with setup prior to starting
09:00 Introduction to command line (Amanda)
Refreshments will be served around 10:30.
10:45 Wrangling and processing genomic data: advanced shell (Amanda)
12:30 Lunch break
13:30 Introduction to cloud computing (Stephen)
Refreshments will be served around 15:00.
15:15 Cloud computing: example analysis & data transfer (Stephen)
16:30 Wrap-up
17:00 Evening Mixer (until 7:00 pm)

Setup

R + Rstudio


Note: R and RStudio are separate downloads and installations. R is the underlying statistical computing environment, but using R alone is no fun. RStudio is a graphical integrated development environment that makes using R much easier. You need R installed before you install RStudio.

  1. Download the data. Click here to download the gapminder data we will use for the R lesson. Save it somewhere easy to remember and find, e.g., your Desktop. (Alternative data source: http://bioconnector.org/data/gapminder.csv)
  2. Install R. You’ll need R version 3.2.0 or higher. Download and install R for Windows or Mac OS X (download the latest R-3.x.x.pkg file for your version of OS X).
  3. Install RStudio. Download and install the latest stable version of RStudio Desktop.
  4. Install R packages. Launch RStudio (RStudio, not R itself). Ensure that you have internet access, then enter the following commands into the Console panel (usually the lower-left panel, by default). Note that these commands are case-sensitive. At any point (especially if you’ve used R/Bioconductor in the past), R may ask whether you want to update old packages with the prompt Update all/some/none? [a/s/n]:. If you see this, type a at the prompt and hit Enter to update any old packages. If you’re using a Windows machine you might get some errors about not having permission to modify the existing libraries; don’t worry about this message. You can avoid this error altogether by running RStudio as an administrator.
# Install packages from CRAN
install.packages("dplyr")
install.packages("ggplot2")
install.packages("tidyr")

# Install Bioconductor base
source("http://bioconductor.org/biocLite.R")
biocLite()

You can check that you’ve installed everything correctly by closing and reopening RStudio and entering the following commands at the console window:

library(dplyr)
library(ggplot2)
library(tidyr)
library(Biobase)

These commands may produce some notes or other output, but as long as they work without an error message, you’re good to go. If you get a message that says something like: Error in library(packageName) : there is no package called 'packageName', then the required packages did not install correctly.

Shell / cloud


Most bioinformatics is done on a computer running a Linux/UNIX operating system. In this workshop we will be doing data analysis on a remote Linux server rather than on our own laptops. To do that we need: (1) a remote computer set up with all the software we’ll need, (2) a way to connect to that computer, and (3) a way to transfer files to and from that computer.

Since most of us don’t have our own Linux server running somewhere, we’ll rent a server from Amazon for the duration of this course.
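In practice, connecting and transferring files is typically done with ssh and scp. As a rough sketch (the hostname, username, and filenames below are placeholders; the actual server address and login credentials will be provided at the workshop):

```shell
# Connect to the remote server over SSH
# (hypothetical username and address -- the real ones are handed out in class)
ssh dcuser@ec2-XX-XX-XX-XX.compute-1.amazonaws.com

# Copy a file from your laptop up to the remote server with scp
scp my_reads.fastq dcuser@ec2-XX-XX-XX-XX.compute-1.amazonaws.com:~/

# ...and copy a results file from the server back down to your laptop
scp dcuser@ec2-XX-XX-XX-XX.compute-1.amazonaws.com:~/results.txt .
```

Windows users will need an SSH client such as PuTTY, since ssh and scp are built in only on Mac OS X and Linux; the setup instructions sent before the workshop will cover this.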

Acknowledgements & Support

Data Carpentry is supported by the Gordon and Betty Moore Foundation and a partnership of several NSF-funded BIO Centers (NESCent, iPlant, iDigBio, BEACON and SESYNC) and Software Carpentry. The structure and objectives of the curriculum as well as the teaching style are informed by Software Carpentry.