Before we start

  • R is a programming language and RStudio is the IDE that assists in using R.
  • There are many benefits to learning R, including writing reproducible code, the ability to use a variety of datasets, and a broad, open-source community of practitioners.
  • Files related to analysis should be organized within a single working directory.
  • R uses commands containing functions to tell the computer what to do.
  • Documentation for each function is available within RStudio, and users can also consult cheatsheets or ask for help on one of many online forums or mailing lists.

Introduction to R

  • <- is used to assign values on the right to objects on the left.
  • Code should be saved within the Source pane in RStudio to help you return to your code later.
  • ‘#’ can be used to add comments to your code.
  • Functions can automate more complicated sets of commands, and require arguments as inputs.
  • Vectors are composed of a series of values of the same type, such as numbers, character strings, or logical values.
  • Data structures in R include ‘vector’, ‘list’, ‘matrix’, ‘data.frame’, ‘factor’, and ‘array’.
  • Vectors can be subset by indexing or through logical vectors.
  • Many functions exist to remove missing data from data structures.
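
The points above can be sketched in a few lines of base R. The object names and values here (weight_g, animals, heights) are hypothetical and chosen only for illustration.

```r
# Assign values with <- (value on the right, object name on the left)
weight_g <- c(50, 60, 65, 82)          # numeric vector (hypothetical values)
animals  <- c("mouse", "rat", "dog")   # character vector

# Functions take arguments as inputs; here 'digits' is a named argument
rounded <- round(3.14159, digits = 2)

# Subset by index (positions start at 1)
second_and_third <- animals[c(2, 3)]

# Subset with a logical vector: keep only weights above 55
heavy <- weight_g[weight_g > 55]

# Missing data: many functions accept na.rm, and is.na() can filter out NAs
heights <- c(2, 4, 4, NA, 6)
mean_height <- mean(heights, na.rm = TRUE)
complete_heights <- heights[!is.na(heights)]
```

Running args(round) or ?round in the console shows which arguments a function accepts.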

Starting with data

  • Use read.csv() to read tabular data into R.
  • A data frame is the representation of data in the format of a table where the columns are vectors that all have the same length.
  • dplyr provides many methods for inspecting and summarizing data in data frames.
  • Use factors to represent categorical data in R.
  • The lubridate package has many useful functions for working with dates.
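
As a minimal sketch of these steps, the example below writes a tiny made-up CSV file to a temporary location, reads it back, and then works with a factor and a date column. The file contents and column names are assumptions for illustration, and the lubridate package is assumed to be installed.

```r
library(lubridate)  # assumed to be installed

# Write a tiny hypothetical CSV so the example is self-contained
csv_path <- tempfile(fileext = ".csv")
writeLines(c("record_id,sex,year,month,day",
             "1,M,1977,7,16",
             "2,F,1977,7,16",
             "3,F,1978,1,8"), csv_path)

# Read the tabular data into a data frame
surveys <- read.csv(csv_path)

# Inspect the structure: each column is a vector of the same length
str(surveys)
n_rows <- nrow(surveys)

# Represent categorical data as a factor
surveys$sex <- factor(surveys$sex)
sex_levels <- levels(surveys$sex)

# Build a Date column from the year, month, and day columns
surveys$date <- ymd(paste(surveys$year, surveys$month, surveys$day, sep = "-"))
```

Factor levels are sorted alphabetically by default, so levels(surveys$sex) lists "F" before "M".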

Manipulating, analyzing and exporting data with tidyverse

  • Use the dplyr package to manipulate data frames.
  • Use select() to choose variables from a data frame.
  • Use filter() to choose data based on values.
  • Use mutate() to create new variables.
  • Use group_by() and summarize() to work with subsets of data.
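
The verbs above can be chained with the pipe. This is a minimal sketch on a small made-up data frame; the column names and values are hypothetical, and dplyr is assumed to be installed.

```r
library(dplyr)  # assumed to be installed

# Small hypothetical data frame for illustration
surveys <- data.frame(
  species_id = c("DM", "DM", "PF", "PF", "NL"),
  sex        = c("M", "F", "F", "M", "F"),
  weight     = c(40, 44, 8, 7, 150)
)

# filter() keeps rows by value, select() keeps columns,
# and mutate() adds a new column computed from existing ones
heavy_kg <- surveys %>%
  filter(weight > 7) %>%
  select(species_id, weight) %>%
  mutate(weight_kg = weight / 1000)

# group_by() + summarize() computes one summary row per group
mean_weights <- surveys %>%
  group_by(species_id) %>%
  summarize(mean_weight = mean(weight))
```

Each step returns a new data frame, so the original surveys object is left unchanged.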

Data visualization with ggplot2

  • Start simple and build your plots iteratively.
  • The ggplot() function initiates a plot, and geom_ functions add representations of your data.
  • Use aes() when mapping a variable from the data to a part of the plot.
  • Use facet_ functions to partition a plot into multiple plots based on a factor included in the dataset.
  • Use premade theme_ functions to broadly change appearance, and the theme() function to fine-tune.
  • The patchwork package can combine separate plots into a single figure.
  • Use ggsave() to save plots in your preferred format and dimensions.
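
A minimal sketch of building a plot layer by layer and saving it is shown below. The data frame, its column names, and the output file name are hypothetical, and ggplot2 is assumed to be installed.

```r
library(ggplot2)  # assumed to be installed

# Small hypothetical data frame for illustration
plot_data <- data.frame(
  weight          = c(40, 44, 8, 7, 150, 160),
  hindfoot_length = c(35, 36, 15, 16, 32, 33),
  sex             = c("M", "F", "F", "M", "F", "M")
)

# Start simple, then add layers one at a time
p <- ggplot(data = plot_data,
            mapping = aes(x = weight, y = hindfoot_length)) +
  geom_point(aes(color = sex)) +          # map a variable to point color
  facet_wrap(vars(sex)) +                 # one panel per level of sex
  theme_bw() +                            # premade theme for overall look
  theme(text = element_text(size = 14))   # fine-tune a single element

# Save to a temporary directory with explicit dimensions (in inches)
out_file <- file.path(tempdir(), "scatter.png")
ggsave(out_file, plot = p, width = 6, height = 4, dpi = 150)
```

The file extension passed to ggsave() determines the output format, so changing ".png" to ".pdf" would write a PDF instead.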

SQL databases and R

  • tbl() references a table in a connected database and can also send SQL queries.
  • Use dplyr syntax to extract information from SQL tables.
  • dplyr evaluates database queries lazily, pulling only the data that is needed, which speeds up data retrieval.
  • Use src_sqlite() to create a new empty SQLite database and copy_to() to add data to it.
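
The workflow above can be sketched with an in-memory SQLite database. Note the assumptions: this sketch opens the connection with DBI::dbConnect() and RSQLite rather than src_sqlite(), the table and column names are made up, and the dplyr, dbplyr, DBI, and RSQLite packages are assumed to be installed.

```r
library(dplyr)
library(dbplyr)  # database backend for dplyr; assumed installed with DBI and RSQLite

# Create a new, empty SQLite database held in memory
# (assumption: DBI::dbConnect() is used here instead of src_sqlite())
con <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")

# Hypothetical data to load into the database
surveys_df <- data.frame(
  species_id = c("DM", "DM", "PF"),
  year       = c(1977L, 1978L, 1978L),
  weight     = c(40, 44, 8)
)
copy_to(con, surveys_df, name = "surveys")

# tbl() creates a reference to the table; no rows are pulled yet
surveys_tbl <- tbl(con, "surveys")

# dplyr verbs are translated to SQL and evaluated lazily
light <- surveys_tbl %>%
  filter(weight < 42) %>%
  select(species_id, year, weight)

show_query(light)               # print the SQL that dplyr generated
light_local <- collect(light)   # run the query and bring results into R

# tbl() can also send a hand-written SQL query
sql_result <- tbl(con, sql("SELECT species_id, weight FROM surveys WHERE year = 1978")) %>%
  collect()

DBI::dbDisconnect(con)
```

Until collect() is called, light holds only a description of the query, which is why the data retrieval stays cheap for large tables.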