Learning Objectives

What are data frames?

The data.frame is the de facto data structure for most tabular data in R, and it is what we use for statistics and plotting.

A data.frame is a collection of vectors of identical lengths. Each vector represents a column, and each vector can be of a different data type (e.g., characters, integers, factors). The str() function is useful to inspect the data types of the columns.

A data.frame can be created by the functions read.csv() or read.table(), in other words, when importing spreadsheets from your hard drive (or the web).

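In versions of R before 4.0.0, data.frame() and the read functions convert (= coerce) columns that contain characters (i.e., text) into the factor data type by default; from R 4.0.0 onward, the default is stringsAsFactors=FALSE and such columns stay as character. Depending on what you want to do with the data, you may want to keep these columns as character. To control this explicitly, read.csv() and read.table() have an argument called stringsAsFactors, which can be set to FALSE: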

some_data <- read.csv("data/some_file.csv", stringsAsFactors=FALSE)

You can also create a data.frame manually with the function data.frame(). This function also takes the argument stringsAsFactors. Compare the output of these examples (the first output shown is what R versions before 4.0.0 print):

example_data <- data.frame(animal=c("dog", "cat", "sea cucumber", "sea urchin"),
                           feel=c("furry", "furry", "squishy", "spiny"),
                           weight=c(45, 8, 1.1, 0.8))
str(example_data)
## 'data.frame':    4 obs. of  3 variables:
##  $ animal: Factor w/ 4 levels "cat","dog","sea cucumber",..: 2 1 3 4
##  $ feel  : Factor w/ 3 levels "furry","spiny",..: 1 1 3 2
##  $ weight: num  45 8 1.1 0.8
example_data <- data.frame(animal=c("dog", "cat", "sea cucumber", "sea urchin"),
                           feel=c("furry", "furry", "squishy", "spiny"),
                           weight=c(45, 8, 1.1, 0.8), stringsAsFactors=FALSE)
str(example_data)
## 'data.frame':    4 obs. of  3 variables:
##  $ animal: chr  "dog" "cat" "sea cucumber" "sea urchin"
##  $ feel  : chr  "furry" "furry" "squishy" "spiny"
##  $ weight: num  45 8 1.1 0.8
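
If the columns were kept as character and you later decide that one of them is better treated as categorical, it can be converted afterwards. A minimal sketch, using a hypothetical data frame named pets so that example_data above is left untouched:

```r
# Hypothetical data frame for illustration (not part of the lesson data)
pets <- data.frame(animal=c("dog", "cat", "sea cucumber", "sea urchin"),
                   feel=c("furry", "furry", "squishy", "spiny"),
                   stringsAsFactors=FALSE)

class(pets$feel)                         # character before conversion
pets$feel <- factor(pets$feel)           # convert one column to a factor
levels(pets$feel)                        # levels are sorted alphabetically by default
feel_as_text <- as.character(pets$feel)  # and back to character
```

Converting column by column keeps the choice explicit, whichever default your version of R uses.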

Challenge

  1. There are a few mistakes in this hand-crafted data.frame. Can you spot and fix them? Don’t hesitate to experiment!
author_book <- data.frame(author_first=c("Charles", "Ernst", "Theodosius"),
                            author_last=c(Darwin, Mayr, Dobzhansky),
                            year=c(1942, 1970))
  2. Can you predict the class of each of the columns in the following example?
country_climate <- data.frame(country=c("Canada", "Panama", "South Africa", "Australia"),
                              climate=c("cold", "hot", "temperate", "hot/temperate"),
                              temperature=c(10, 30, 18, "15"),
                              north_hemisphere=c(TRUE, TRUE, FALSE, "FALSE"),
                              has_kangaroo=c(FALSE, FALSE, FALSE, 1))

Check your guesses using str(country_climate). Are they what you expected? Why? Why not?

R coerces (when possible) to the data type that is the least common denominator and the easiest to coerce to.

Can you fix the R code above so that the columns of country_climate have the types you would like?
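
The coercion rule can be seen on plain vectors. In this small sketch, mixing types inside c() makes R fall back to the most general type present, in the order logical, integer, numeric, character:

```r
class(c(TRUE, FALSE))     # "logical"
class(c(TRUE, 2L))        # "integer": TRUE becomes 1L
class(c(2L, 3.5))         # "numeric": integers become doubles
class(c(3.5, "a"))        # "character": everything becomes text

# Going the other way has to be requested explicitly. Text that does not
# look like a number becomes NA (R also warns, silenced here).
back_to_numbers <- suppressWarnings(as.numeric(c("10", "30", "abc")))
back_to_numbers
```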

Inspecting Data Frame objects

We already saw how the functions head() and str() can be useful to check the content and the structure of a data.frame. Here is a non-exhaustive list of functions to get a sense of the content/structure of the data.

Note: most of these functions are “generic”, meaning they can be used on other types of objects besides data.frame.
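
As a sketch, here are some base R functions commonly used for this purpose, applied to a small hypothetical data frame named inspect_me (with the lesson data you would pass surveys instead):

```r
# Hypothetical data frame for illustration
inspect_me <- data.frame(id=1:8,
                         weight=c(40, 48, 29, 46, 36, 35, 22, 42))

dim(inspect_me)       # number of rows, then number of columns
nrow(inspect_me)      # number of rows
ncol(inspect_me)      # number of columns
head(inspect_me)      # first 6 rows
tail(inspect_me)      # last 6 rows
names(inspect_me)     # column names (colnames() gives the same result here)
rownames(inspect_me)  # row names
str(inspect_me)       # structure: class, dimensions, and type of each column
summary(inspect_me)   # summary statistics for each column
```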

Indexing and sequences

If we want to extract one or several values from a vector, we must provide one or several indices in square brackets, just as we do in math. For instance:

animals <- c("mouse", "rat", "dog", "cat")
animals[2]
## [1] "rat"
animals[c(3, 2)]
## [1] "dog" "rat"
animals[2:4]
## [1] "rat" "dog" "cat"
more_animals <- animals[c(1:3, 2:4)]
more_animals
## [1] "mouse" "rat"   "dog"   "rat"   "dog"   "cat"

R indices start at 1. Programming languages like Fortran, MATLAB, and R start counting at 1, because that’s what human beings typically do. Languages in the C family (including C++, Java, Perl, and Python) count from 0 because that’s simpler for computers to do.

: is a special function that creates numeric vectors of integers in increasing or decreasing order. Test 1:10 and 10:1 for instance. The function seq() (for sequence) can be used to create more complex patterns:

seq(1, 10, by=2)
## [1] 1 3 5 7 9
seq(5, 10, length.out=3)
## [1]  5.0  7.5 10.0
seq(50, by=5, length.out=10)
##  [1] 50 55 60 65 70 75 80 85 90 95
seq(1, 8, by=3) # sequence stops to stay below upper limit
## [1] 1 4 7
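
The two : expressions suggested above behave as follows. The last two lines are a small addition, showing that : is evaluated before arithmetic, which is a common source of surprises:

```r
1:10                 # the integers 1 to 10, increasing
10:1                 # the integers 10 to 1, decreasing

n <- 5
shifted <- 1:n-1     # read as (1:n) - 1, so this is 0 1 2 3 4
shorter <- 1:(n-1)   # parentheses give 1 2 3 4
```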

Our surveys data frame has rows and columns (it has 2 dimensions). If we want to extract some specific data from it, we need to specify the “coordinates” we want from it. Row numbers come first, followed by column numbers.

surveys[1, 1]   # first element in the first column of the data frame
surveys[1, 6]   # first element in the 6th column
surveys[1:3, 7] # first three elements in the 7th column
surveys[3, ]    # the 3rd row, with all columns
surveys[, 8]    # the entire 8th column
head_surveys <- surveys[1:6, ] # surveys[1:6, ] is equivalent to head(surveys)
surveys$species # the species column
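
The surveys data frame is too large to check by eye, so the sketch below applies the same kinds of index to a small hypothetical data frame, mini, and notes what type of object each form returns:

```r
# Hypothetical stand-in for surveys, small enough to check by eye
mini <- data.frame(plot=c(2, 3, 2, 7),
                   species=c("NL", "DM", "PF", "DM"),
                   weight=c(NA, 40, 8, 44),
                   stringsAsFactors=FALSE)

mini[2, 3]             # row 2, column 3: 40
mini[1:2, "species"]   # columns can also be selected by name: "NL" "DM"
mini[3, ]              # a whole row is still a data.frame
mini[, 3]              # a single column comes back as a vector
mini$weight            # the same column, selected by name
mini[, c(1, 3)]        # several columns: a data.frame again
```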

Challenge

  1. Using the functions above:
  • How many rows and how many columns are in the surveys object?
  • How many species have been recorded during these surveys? You may want to think about factors.
  2. The function nrow() on a data.frame returns the number of rows. Use it, in conjunction with seq(), to create a new data.frame called surveys_by_10 that includes every 10th row of the surveys data frame, starting at row 10 (10, 20, 30, …).

Logical indexing

Logical values can be used to select a subset of the data.

colors <- c("red", "yellow", NA, "blue")
colors[c(TRUE, FALSE, FALSE, TRUE)]    # elements at the positions that are TRUE
## [1] "red"  "blue"

This is useful when a function produces TRUE/FALSE values that you can then use to subset your data. One example is removing NA (missing) values.

is.na(colors)             # logical vector TRUE if position is NA
## [1] FALSE FALSE  TRUE FALSE
!is.na(colors)            # not (!) missing, FALSE if position is NA
## [1]  TRUE  TRUE FALSE  TRUE
colors[!is.na(colors)]    # colours that are not missing (NA)
## [1] "red"    "yellow" "blue"
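
Two related idioms, sketched here on the same colors vector: logical values can be counted with sum(), and which() turns them into positions. The last two lines show why which() can matter when the vector contains NA.

```r
colors <- c("red", "yellow", NA, "blue")

sum(is.na(colors))              # TRUE counts as 1: the number of missing values (1)
which(is.na(colors))            # positions of the missing values (3)

colors[colors == "red"]         # comparing with NA gives NA, so an NA appears in the result
colors[which(colors == "red")]  # which() keeps only the TRUE positions
```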

Subsetting data frames based on a logical statement

example_data <- data.frame(fruit=c("apple", "mango", "banana",  "coconut"),
                           quantity=c(45, 8, 12, 2))

example_data$quantity > 10                # quantity is greater than 10 (TRUE/FALSE)
## [1]  TRUE FALSE  TRUE FALSE
example_data[example_data$quantity > 10,] # rows where quantity is greater than 10
##    fruit quantity
## 1  apple       45
## 3 banana       12
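
Conditions can also be combined. The following is a minimal sketch on a hypothetical copy of the data named fruit_stock, using & for "and" and | for "or":

```r
# Hypothetical copy so that example_data is left untouched
fruit_stock <- data.frame(fruit=c("apple", "mango", "banana", "coconut"),
                          quantity=c(45, 8, 12, 2))

# & means "and": both conditions must hold
mid_stock <- fruit_stock[fruit_stock$quantity > 5 & fruit_stock$quantity < 40, ]
# | means "or": at least one condition must hold
extremes <- fruit_stock[fruit_stock$quantity < 5 | fruit_stock$quantity > 40, ]
# the column position can be filled in to keep only some columns
big_names <- fruit_stock[fruit_stock$quantity > 10, "fruit"]
```

Here mid_stock holds the mango and banana rows, extremes holds the apple and coconut rows, and big_names holds the names of the fruits with a quantity above 10.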

Sorting data frames

fruit_info <- data.frame(fruit=c("apple", "mango", "banana", "coconut"),
                         quantity=c(45, 8, 12, 2))

order(fruit_info$quantity)               # The order of the quantity column
## [1] 4 2 3 1
correct_row_order <- order(fruit_info$quantity)  
fruit_info[correct_row_order,]  # sort fruit_info by quantity column
##     fruit quantity
## 4 coconut        2
## 2   mango        8
## 3  banana       12
## 1   apple       45
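
It may help to see why order() is used here rather than sort(). In this small sketch on a plain vector, sort() returns the sorted values themselves, while order() returns the positions, and positions are what is needed to rearrange whole rows of a data frame:

```r
quantity <- c(45, 8, 12, 2)

sort(quantity)              # the values, smallest first: 2 8 12 45
order(quantity)             # the positions that put them in order: 4 2 3 1
quantity[order(quantity)]   # indexing with those positions matches sort()
```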

Challenge

  1. From the surveys data frame, print a copy of the weights, without any NA values.

  2. Produce a copy of the surveys data frame with only rows that contain weights.

  3. Sort the entire fruit_info data frame by reverse (descending) alphabetical order of the fruit names. Hint: args(order).
