A Genomics Data Carpentry Workshop
James Madison University
Bioscience Building, room 2007
July 22-23, 2015
9:00 am - 5:00 pm
General Information
Data Carpentry workshops are for any researcher who has data they want to
analyze; no prior computational experience is required. This hands-on workshop
teaches basic concepts, skills, and tools for working more effectively with
data.
The focus of this workshop will be on working with genomics data
and data management and analysis for genomics research.
We will cover metadata organization in spreadsheets, connecting to
and using cloud computing, the command line for
sequence quality control and bioinformatics workflows,
and R for data analysis and visualization.
We will not be teaching any particular bioinformatics tool, but rather the
foundational skills that will allow you to conduct an analysis and
interpret the output of a genomics pipeline.
Participants should bring their laptops and plan to participate
actively. By the end of the workshop, learners should be able to manage and
analyze data more effectively and to apply these tools and approaches
directly to their ongoing research.
Data Carpentry's aim is to teach researchers basic concepts, skills,
and tools for working with data so that they can get more done in less
time, and with less pain.
Updates will be posted to this website as they become available.
The etherpad for this workshop can be found here.
Organizers:
James Herrick (James Madison), Ray Enke (James Madison), Steve Cresawn (James Madison)
Instructors:
Stephen Turner (University of Virginia), Amanda Charbonneau (Michigan State University)
Assistants:
Pete Nagraj (University of Virginia)
Who:
The course is aimed at faculty, research staff, postdocs, graduate students, advanced undergraduates, and other researchers in any field. No prior computational experience is required.
Where:
Bioscience Building, room 2007.
Get directions with
OpenStreetMap
or
Google Maps.
Requirements:
Data Carpentry's teaching is hands-on, so participants are encouraged to bring and use their own laptops to ensure that the tools are set up properly for an efficient workflow once they leave the workshop. (We will provide instructions on setting up the required software several days in advance.) There are no prerequisites, and we will assume no prior knowledge of the tools. Participants are required to abide by Software Carpentry's
Code of Conduct.
Contact:
Please email
tkteal@datacarpentry.org
for questions and information not covered here.
Twitter: #datacarpentry
@datacarpentry
Schedule
Wednesday
| 08:30 | Breakfast and optional help with setup before we start |
| 09:00 | Data organization and management (Amanda) |
| | Refreshments will be served around 10:30. |
| 10:45 | Working with genomics file types (Amanda) |
| 12:30 | Lunch break |
| 13:30 | Introduction to R (Stephen) |
| | Refreshments will be served around 15:00. |
| 15:15 | Advanced manipulation & visualization with R (Stephen) |
| 16:30 | Wrap-up |
| 17:00 | Evening Mixer (5pm to 7pm) |
Thursday
| 08:30 | Breakfast and optional help with setup before we start |
| 09:00 | Introduction to command line (Amanda) |
| | Refreshments will be served around 10:30. |
| 10:45 | Wrangling and processing genomic data: advanced shell (Amanda) |
| 12:30 | Lunch break |
| 13:30 | Introduction to cloud computing (Stephen) |
| | Refreshments will be served around 15:00. |
| 15:15 | Cloud computing: example analysis & data transfer (Stephen) |
| 16:30 | Wrap-up |
| 17:00 | Evening Mixer (5pm to 7pm) |
Setup
R + Rstudio
Note: R and RStudio are separate downloads and installations. R is the underlying statistical computing environment, but using R alone is no fun. RStudio is a graphical integrated development environment that makes using R much easier. You need R installed before you install RStudio.
- Download the gapminder data. Click here to download the data that we will use for this section. Save it somewhere you’ll remember.
- Install R. You’ll need R version 3.2.0 or higher. Download and install R for Windows or Mac OS X (download the latest R-3.x.x.pkg file for your appropriate version of OS X).
- Install RStudio. Download and install the latest stable version of RStudio Desktop.
- Install R packages. Launch RStudio (RStudio, not R itself). Ensure that you have internet access, then enter the following commands into the Console panel (usually the lower-left panel by default). Note that these commands are case-sensitive. At any point (especially if you've used R/Bioconductor in the past), R may ask whether you want to update old packages with the prompt Update all/some/none? [a/s/n]:. If you see this, type a at the prompt and hit Enter to update any old packages. If you're using a Windows machine you might get some errors about not having permission to modify the existing libraries; don't worry about this message. You can avoid this error altogether by running RStudio as an administrator.
- Download the data. Click here to download the data that we will use for the R lesson. Save it somewhere easy to remember and find, e.g., your Desktop. (Alternative data source: http://bioconnector.org/data/gapminder.csv).
# Install packages from CRAN
install.packages("dplyr")
install.packages("ggplot2")
install.packages("tidyr")
# Install Bioconductor base
source("http://bioconductor.org/biocLite.R")
biocLite()
You can check that you’ve installed everything correctly by closing and reopening RStudio and entering the following commands at the console window:
library(dplyr)
library(ggplot2)
library(tidyr)
library(Biobase)
These commands may produce some notes or other output, but as long as they run without an error message, you're good to go. If you get a message like Error in library(packageName) : there is no package called 'packageName', then the required packages did not install correctly.
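Once everything loads, you can try a small taste of the kind of analysis we will build up to in the R lessons. This is only a sketch: it assumes the bioconnector.org copy of the gapminder data linked above, with the standard gapminder column names (year, continent, lifeExp, gdpPercap).

```r
# Assumes the bioconnector.org copy of the gapminder data and the
# standard gapminder column layout (year, continent, lifeExp, gdpPercap).
library(dplyr)
library(ggplot2)

# Read the workshop data directly from the alternative URL given above
gm <- read.csv("http://bioconnector.org/data/gapminder.csv")

# Mean life expectancy by continent in 2007, using dplyr
gm %>%
  filter(year == 2007) %>%
  group_by(continent) %>%
  summarize(mean_lifeExp = mean(lifeExp))

# Life expectancy against GDP per capita, using ggplot2
ggplot(gm, aes(x = gdpPercap, y = lifeExp, color = continent)) +
  geom_point() +
  scale_x_log10()
```

If the summary table prints and a plot appears in RStudio's Plots panel, all four pieces of the toolchain (R, RStudio, dplyr, and ggplot2) are working together.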
Shell / cloud
Most bioinformatics is done on a computer running a Linux/UNIX operating system. In this workshop we will be doing data analysis on a remote Linux server rather than on our own laptops. To do that we need: (1) a remote computer set up with all the software we'll need, (2) a way to connect to that computer, and (3) a way to transfer files to and from that computer.
Since most of us don’t have our own Linux server running somewhere, we’ll rent a server from Amazon for the duration of this course.
- (All participants) Download a file transfer program. Download and install Cyberduck (free): https://cyberduck.io. We will use this to transfer files back and forth between our local laptops and our remote Linux server running on Amazon. Note: If using a Mac, download from the website above (free), not from the Mac App Store (paid).
- (All participants) Download a text editor. We may wish to view and/or edit plain text files. To do this, let’s use a better alternative to the built-in Notepad (Windows) or TextEdit (Mac). Download and install Sublime Text (works for Windows and Mac): http://www.sublimetext.com/.
- For Windows users only:
- Bookmark these links: These resources will be useful for logging in and transferring data to your instance.
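The three pieces above fit together as in the sketch below. The username and server address are hypothetical placeholders (the real values are handed out when your Amazon instance is launched), so the snippet only prints the commands you would run rather than running them.

```shell
# Hypothetical placeholders -- the real username and address are provided
# at the workshop when your Amazon instance is launched.
REMOTE_USER=dcuser
REMOTE_HOST=ec2-XX-XX-XX-XX.compute-1.amazonaws.com

# (2) A way to connect: log in over SSH from a terminal
#     (Windows users will use the tools linked above instead):
echo "ssh ${REMOTE_USER}@${REMOTE_HOST}"

# (3) A way to transfer files: Cyberduck does this graphically; scp does
#     the same from the command line, e.g. copying a result file back:
echo "scp ${REMOTE_USER}@${REMOTE_HOST}:results.txt ."
```

Mac users can run ssh and scp from the built-in Terminal; on Windows, the linked resources above provide equivalent tools.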
Acknowledgements & Support
Data Carpentry is supported by the Gordon and Betty Moore Foundation and a partnership of several NSF-funded BIO Centers (NESCent, iPlant, iDigBio, BEACON and SESYNC) and Software Carpentry.
The structure and objectives of the curriculum as well as the teaching style are informed by Software Carpentry.