A Genomics Data Carpentry Workshop

James Madison University

Bioscience Building, room 2007
July 22-23, 2015
9:00 am - 5:00 pm

General Information

Data Carpentry workshops are for any researcher who has data they want to analyze, and no prior computational experience is required. This hands-on workshop teaches basic concepts, skills, and tools for working more effectively with data.

The focus of this workshop will be on working with genomics data and data management and analysis for genomics research. We will cover metadata organization in spreadsheets, connecting to and using cloud computing, the command line for sequence quality control and bioinformatics workflows, and R for data analysis and visualization. We will not be teaching any particular bioinformatics tool, but rather the foundational skills that will allow you to conduct your own analyses and work with the output of a genomics pipeline.

Participants should bring their laptops and plan to participate actively. By the end of the workshop learners should be able to more effectively manage and analyze data and be able to apply the tools and approaches directly to their ongoing research.

Data Carpentry's aim is to teach researchers basic concepts, skills, and tools for working with data so that they can get more done in less time, and with less pain.

Updates will be posted to this website as they become available.

The etherpad for this workshop can be found here.

Organizers: James Herrick (James Madison), Ray Enke (James Madison), Steve Cresawn (James Madison)

Instructors: Stephen Turner (University of Virginia), Amanda Charbonneau (Michigan State University)

Assistants: Pete Nagraj (University of Virginia)

Who: The course is aimed at faculty, research staff, postdocs, graduate students, advanced undergraduates, and other researchers in any field. No prior computational experience is required.

Where: Bioscience Building, room 2007. Get directions with OpenStreetMap or Google Maps.

Requirements: Data Carpentry's teaching is hands-on, so participants are encouraged to bring and use their own laptops to ensure that the tools are properly set up for an efficient workflow once you leave the workshop. (We will provide instructions on setting up the required software several days in advance.) There are no prerequisites, and we will assume no prior knowledge of the tools. Participants are required to abide by Software Carpentry's Code of Conduct.

Contact: Please email tkteal@datacarpentry.org for questions and information not covered here.

Twitter: #datacarpentry

@datacarpentry

Schedule

Wednesday 08:30 Breakfast, and optional help with setup prior to starting
09:00 Data organization and management (Amanda)
Refreshments will be served around 10:30.
10:45 Working with genomics file types (Amanda)
12:30 Lunch break
13:30 Introduction to R (Stephen)
Refreshments will be served around 15:00.
15:15 Advanced manipulation & visualization with R (Stephen)
16:30 Wrap-up
17:00 Evening Mixer (until 7:00 pm)
Thursday 08:30 Breakfast, and optional help with setup prior to starting
09:00 Introduction to command line (Amanda)
Refreshments will be served around 10:30.
10:45 Wrangling and processing genomic data: advanced shell (Amanda)
12:30 Lunch break
13:30 Introduction to cloud computing (Stephen)
Refreshments will be served around 15:00.
15:15 Cloud computing: example analysis & data transfer (Stephen)
16:30 Wrap-up
17:00 Evening Mixer (until 7:00 pm)

Setup

R + Rstudio


Note: R and RStudio are separate downloads and installations. R is the underlying statistical computing environment, but using R alone is no fun. RStudio is a graphical integrated development environment that makes using R much easier. You need R installed before you install RStudio.

  1. Download the data. Click here to download the gapminder data we will use for the R lesson. Save it somewhere easy to remember and find, e.g., your Desktop. (Alternative data source: http://bioconnector.org/data/gapminder.csv)
  2. Install R. You’ll need R version 3.2.0 or higher. Download and install R for Windows or Mac OS X (download the latest R-3.x.x.pkg file for your version of OS X).
  3. Install RStudio. Download and install the latest stable version of RStudio Desktop.
  4. Install R packages. Launch RStudio (RStudio, not R itself). Ensure that you have internet access, then enter the following commands into the Console panel (usually the lower-left panel, by default). Note that these commands are case-sensitive. At any point (especially if you’ve used R/Bioconductor in the past), R may ask whether you want to update old packages with the prompt Update all/some/none? [a/s/n]:. If you see this, type a at the prompt and hit Enter to update any old packages. If you’re using a Windows machine you might get some errors about not having permission to modify the existing libraries; don’t worry about this message. You can avoid this error altogether by running RStudio as an administrator.
# Install packages from CRAN
install.packages("dplyr")
install.packages("ggplot2")
install.packages("tidyr")

# Install Bioconductor base
source("http://bioconductor.org/biocLite.R")
biocLite()

You can check that you’ve installed everything correctly by closing and reopening RStudio and entering the following commands at the console window:

library(dplyr)
library(ggplot2)
library(tidyr)
library(Biobase)

These commands may produce some notes or other output, but as long as they work without an error message, you’re good to go. If you get a message that says something like: Error in library(packageName) : there is no package called 'packageName', then the required packages did not install correctly.

Shell / cloud


Most bioinformatics is done on a computer running a Linux/UNIX operating system. In this workshop we will be doing data analysis on a remote Linux server rather than on our own laptops. To do that we need: (1) a remote computer set up with all the software we’ll need, (2) a way to connect to that computer, and (3) a way to transfer files to and from that computer.

Since most of us don’t have our own Linux server running somewhere, we’ll rent a server from Amazon for the duration of this course.
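In practice, connecting and transferring files is typically done with ssh and scp. As a rough sketch (the hostname, username, and filenames below are placeholders; the actual server address and login credentials will be provided at the workshop):

```shell
# Connect to the remote server over SSH
# (hypothetical username and address -- the real ones are handed out in class)
ssh dcuser@ec2-XX-XX-XX-XX.compute-1.amazonaws.com

# Copy a file from your laptop up to the remote server with scp
scp my_reads.fastq dcuser@ec2-XX-XX-XX-XX.compute-1.amazonaws.com:~/

# ...and copy a results file from the server back down to your laptop
scp dcuser@ec2-XX-XX-XX-XX.compute-1.amazonaws.com:~/results.txt .
```

Windows users will need an SSH client such as PuTTY, since ssh and scp are built in only on Mac OS X and Linux; the setup instructions sent before the workshop will cover this.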

Acknowledgements & Support

Data Carpentry is supported by the Gordon and Betty Moore Foundation and a partnership of several NSF-funded BIO Centers (NESCent, iPlant, iDigBio, BEACON and SESYNC) and Software Carpentry. The structure and objectives of the curriculum as well as the teaching style are informed by Software Carpentry.