Code Handout - Starting with Data

Last updated on 2023-07-10 | Edit this page

This document contains all of the functions that were covered in the Introduction to R workshop. Each function is presented alongside an example of how it can be used.

All of the examples below are in the context of the Palmer Penguins, found here (link).

Packages


  • library() – loads packages into your R session

R

library(tidyverse)
library(lubridate)

Importing Data


  • read_csv() – function to import a csv file.
    • First argument is the path to the data, passed as a character (inside quotations).
    • You can specify what values should be considered missing, using the na argument.

R

penguins <- read_csv("data/penguins.csv")

Inspecting Data


  • dim() - returns a vector with the number of rows as the first element, and the number of columns as the second element (the dimensions of the object)

R

dim(penguins)
  • nrow() - returns the number of rows
  • ncol() - returns the number of columns

R

nrow(penguins)
ncol(penguins)
  • head() - displays the first 6 rows of the dataframe
  • tail() - displays the last 6 rows of the dataframe

R

head(penguins)
tail(penguins)
  • names() - returns the all of the names of an object (both row and column)
  • colnames() - returns column names for dataframes (without row names)

R

names(penguins)
colnames(penguins)
  • glimpse() - provides a preview of the data, where column names are presented with their associated data types, and the entries from each column are printed in each row

R

glimpse(penguins)
  • str() - returns the structure of the object and information about the class, the names and data types of each column, and a preview of the first entries of each column

R

str(penguins)
  • summary() - provides summary statistics for each column
    • Note: summary statistics for character variables are not meaningful, as they simply state the number of observations (length) of the variable

R

summary(penguins)

Subsetting Data


  • [] – selects rows and columns from a dataframe
    • The first entry is the row number, the second entry is the column number(s), and they are separated with a comma.

R

## Selects the element in the first row, second column
penguins[1, 2]

## Selects every element in the fourth row
penguins[4, ]

## Selects every element in the third column
penguins[, 3]
  • [[]] – selects a column from a dataframe
    • Inside the brackets you can pass either the number of the column or the name of the column (in quotations)

R

penguins[[1]]

penguins[["island"]]
  • $ – selects a column from a dataframe, where the name of the dataframe is on the left and the name of the column is on the right

R

penguins$body_mass_g

Working with Different Data Types


  • factor() – creates a categorical variable from a character or numeric variable, variable has a factor datatype
    • the values (level) of the factor levels is specified in the levels argument, where the levels must be specified in a vector (using c())
    • Note: the order you wish for the levels to appear is how you should list them in the levels argument, you can also specify ordered = TRUE to ensure the levels remain in this order

R

penguins$year_fct <- factor(penguins$year, 
                               levels = c("2007", "2008", "2009"), 
                            ordered = TRUE)
  • as.factor() – creates a categorical variable from a character or numeric variable, variable has a factor datatype
    • does not allow for you to specify the order of the levels
    • defaults to alphabetical ordering for factor levels

R

penguins$year_fct <- as.factor(penguins$year)
  • levels() – returns the levels of a variable with a factor datatype, in the order they were stored
    • Note: this function will not work for character datatypes

R

levels(penguins$year_fct)
  • nlevels() – returns the number of levels of a variable with a factor datatype
    • Note: this function will not work for character datatypes

R

nlevels(penguins$year_fct)
  • as.character() – creates a character variable from a numeric or factor variable

R

penguins$species_chr <- as.character(penguins$species)
  • ymd() – transforms dates stored as character or numeric variables to dates
    • Note: to use this function, dates must be stored in year-month-day format
    • The function does well with heterogeneous formats (as seen below), but formats where some of the entries are not in double digits may not be parsed correctly.

R

x <- c("2009-01-01", "2009-01-02", "2009-01-03")
ymd(x)
  • day() – extracts the day (number) of a date variable

R

day(x)
  • month() – extracts the month (number) of a date variable

R

month(x)
  • year() – extracts the year of a date variable

R

year(x)

Visualizing Data


  • plot() – a generic function for plotting R objects
    • In this lesson plot() was used to create bargraphs of categorical variables.

R

plot(penguins$species)