Introduction to Geospatial Raster and Vector Data with R

Intro to Raster Data

Overview

Teaching: 30 min
Exercises: 20 min
Questions
  • What is a raster dataset?

  • How do I work with and plot raster data in R?

  • How can I handle missing or bad data values for a raster?

Objectives
  • Describe the fundamental attributes of a raster dataset.

  • Explore raster attributes and metadata using R.

  • Import rasters into R using the raster package.

  • Plot a raster file in R using the ggplot2 package.

  • Describe the difference between single- and multi-band rasters.

Things You’ll Need To Complete This Episode

See the lesson homepage for detailed information about the software, data, and other prerequisites you will need to work through the examples in this episode.

In this episode, we will introduce the fundamental principles, packages and metadata/raster attributes that are needed to work with raster data in R. We will discuss some of the core metadata elements that we need to understand to work with rasters in R, including CRS and resolution. We will also explore missing and bad data values as stored in a raster and how R handles these elements.

We will continue to work with the dplyr and ggplot2 packages that were introduced in the Introduction to R for Geospatial Data lesson. We will use two additional packages in this episode to work with raster data - the raster and rgdal packages. Make sure that you have these packages loaded.

library(raster)
library(rgdal)
library(ggplot2)
library(dplyr)

Introduce the Data

If not already discussed, introduce the datasets that will be used in this lesson. A brief introduction to the datasets can be found on the Geospatial workshop homepage.

For more detailed information about the datasets, check out the Geospatial workshop data page.

View Raster File Attributes

We will be working with a series of GeoTIFF files in this lesson. The GeoTIFF format contains a set of embedded tags with metadata about the raster data. We can use the function GDALinfo() to get information about our raster data before we read that data into R. It is ideal to do this before importing your data.

GDALinfo("data/NEON-DS-Airborne-Remote-Sensing/HARV/DSM/HARV_dsmCrop.tif")
rows        1367 
columns     1697 
bands       1 
lower left origin.x        731453 
lower left origin.y        4712471 
res.x       1 
res.y       1 
ysign       -1 
oblique.x   0 
oblique.y   0 
driver      GTiff 
projection  +proj=utm +zone=18 +datum=WGS84 +units=m +no_defs 
file        data/NEON-DS-Airborne-Remote-Sensing/HARV/DSM/HARV_dsmCrop.tif 
apparent band summary:
   GDType hasNoDataValue NoDataValue blockSize1 blockSize2
1 Float64           TRUE       -9999          1       1697
apparent band statistics:
    Bmin   Bmax    Bmean      Bsd
1 305.07 416.07 359.8531 17.83169
Metadata:
AREA_OR_POINT=Area 

If you wish to store this information in R, you can do the following:

HARV_dsmCrop_info <- capture.output(
  GDALinfo("data/NEON-DS-Airborne-Remote-Sensing/HARV/DSM/HARV_dsmCrop.tif")
)

Each line of text that was printed to the console is now stored as an element of the character vector HARV_dsmCrop_info. We will be exploring this data throughout this episode. By the end of this episode, you will be able to explain and understand the output above.

Open a Raster in R

Now that we’ve previewed the metadata for our GeoTIFF, let’s import this raster dataset into R and explore its metadata more closely. We can use the raster() function to open a raster in R.

Data Tip - Object names

To improve code readability, file and object names should be used that make it clear what is in the file. The data for this episode were collected from Harvard Forest so we’ll use a naming convention of datatype_HARV.

First we will load our raster file into R and view the data structure.

DSM_HARV <- 
  raster("data/NEON-DS-Airborne-Remote-Sensing/HARV/DSM/HARV_dsmCrop.tif")

DSM_HARV
class      : RasterLayer 
dimensions : 1367, 1697, 2319799  (nrow, ncol, ncell)
resolution : 1, 1  (x, y)
extent     : 731453, 733150, 4712471, 4713838  (xmin, xmax, ymin, ymax)
crs        : +proj=utm +zone=18 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0 
source     : /home/runner/work/r-raster-vector-geospatial/r-raster-vector-geospatial/_episodes_rmd/data/NEON-DS-Airborne-Remote-Sensing/HARV/DSM/HARV_dsmCrop.tif 
names      : HARV_dsmCrop 
values     : 305.07, 416.07  (min, max)

The information above includes a report of min and max values, but no other data range statistics. Similar to other R data structures like vectors and data frame columns, descriptive statistics for raster data can be retrieved like

summary(DSM_HARV)
Warning in .local(object, ...): summary is an estimate based on a sample of 1e+05 cells (4.31% of all cells)
        HARV_dsmCrop
Min.        305.5500
1st Qu.     345.6500
Median      359.6450
3rd Qu.     374.2825
Max.        413.9000
NA's          0.0000

but note the warning - unless you force R to calculate these statistics using every cell in the raster, it will take a random sample of 100,000 cells and calculate from that instead. To force calculation on more, or even all values, you can use the parameter maxsamp:

summary(DSM_HARV, maxsamp = ncell(DSM_HARV))
        HARV_dsmCrop
Min.          305.07
1st Qu.       345.59
Median        359.67
3rd Qu.       374.28
Max.          416.07
NA's            0.00

You may not see major differences in summary stats as maxsamp increases, except with very large rasters.

To visualise this data in R using ggplot2, we need to convert it to a dataframe. We learned about dataframes in an earlier lesson. The raster package has an built-in function for conversion to a plotable dataframe.

DSM_HARV_df <- as.data.frame(DSM_HARV, xy = TRUE)

Now when we view the structure of our data, we will see a standard dataframe format.

str(DSM_HARV_df)
'data.frame':	2319799 obs. of  3 variables:
 $ x           : num  731454 731454 731456 731456 731458 ...
 $ y           : num  4713838 4713838 4713838 4713838 4713838 ...
 $ HARV_dsmCrop: num  409 408 407 407 409 ...

We can use ggplot() to plot this data. We will set the color scale to scale_fill_viridis_c which is a color-blindness friendly color scale. We will also use the coord_quickmap() function to use an approximate Mercator projection for our plots. This approximation is suitable for small areas that are not too close to the poles. Other coordinate systems are available in ggplot2 if needed, you can learn about them at their help page ?coord_map.

ggplot() +
    geom_raster(data = DSM_HARV_df , aes(x = x, y = y, fill = HARV_dsmCrop)) +
    scale_fill_viridis_c() +
    coord_quickmap()

Raster plot with ggplot2 using the viridis color scale

Plotting Tip

More information about the Viridis palette used above at R Viridis package documentation.

Plotting Tip

For faster, simpler plots, you can use the plot function from the raster package.

Show plot

See ?plot for more arguments to customize the plot

plot(DSM_HARV)

plot of chunk unnamed-chunk-5

This map shows the elevation of our study site in Harvard Forest. From the legend, we can see that the maximum elevation is ~400, but we can’t tell whether this is 400 feet or 400 meters because the legend doesn’t show us the units. We can look at the metadata of our object to see what the units are. Much of the metadata that we’re interested in is part of the CRS. We introduced the concept of a CRS in an earlier lesson.

Now we will see how features of the CRS appear in our data file and what meanings they have.

View Raster Coordinate Reference System (CRS) in R

We can view the CRS string associated with our R object using thecrs() function.

crs(DSM_HARV)
CRS arguments:
 +proj=utm +zone=18 +datum=WGS84 +units=m +no_defs +ellps=WGS84
+towgs84=0,0,0 

Challenge

What units are our data in?

Answers

+units=m tells us that our data is in meters.

Understanding CRS in Proj4 Format

The CRS for our data is given to us by R in proj4 format. Let’s break down the pieces of proj4 string. The string contains all of the individual CRS elements that R or another GIS might need. Each element is specified with a + sign, similar to how a .csv file is delimited or broken up by a ,. After each + we see the CRS element being defined. For example projection (proj=) and datum (datum=).

UTM Proj4 String

Our projection string for DSM_HARV specifies the UTM projection as follows:

+proj=utm +zone=18 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0

Note that the zone is unique to the UTM projection. Not all CRSs will have a zone. Image source: Chrismurf at English Wikipedia, via Wikimedia Commons (CC-BY).

The UTM zones across the continental United States. From: https://upload.wikimedia.org/wikipedia/commons/8/8d/Utm-zones-USA.svg

Calculate Raster Min and Max Values

It is useful to know the minimum or maximum values of a raster dataset. In this case, given we are working with elevation data, these values represent the min/max elevation range at our site.

Raster statistics are often calculated and embedded in a GeoTIFF for us. We can view these values:

minValue(DSM_HARV)
[1] 305.07
maxValue(DSM_HARV)
[1] 416.07

Data Tip - Set min and max values

If the minimum and maximum values haven’t already been calculated, we can calculate them using the setMinMax() function.

DSM_HARV <- setMinMax(DSM_HARV)

We can see that the elevation at our site ranges from 305.0700073m to 416.0699768m.

Raster Bands

The Digital Surface Model object (DSM_HARV) that we’ve been working with is a single band raster. This means that there is only one dataset stored in the raster: surface elevation in meters for one time period.

Multi-band raster image

A raster dataset can contain one or more bands. We can use the raster() function to import one single band from a single or multi-band raster. We can view the number of bands in a raster using the nlayers() function.

nlayers(DSM_HARV)
[1] 1

However, raster data can also be multi-band, meaning that one raster file contains data for more than one variable or time period for each cell. By default the raster() function only imports the first band in a raster regardless of whether it has one or more bands. Jump to a later episode in this series for information on working with multi-band rasters: Work with Multi-band Rasters in R.

Dealing with Missing Data

Raster data often has a NoDataValue associated with it. This is a value assigned to pixels where data is missing or no data were collected.

By default the shape of a raster is always rectangular. So if we have a dataset that has a shape that isn’t rectangular, some pixels at the edge of the raster will have NoDataValues. This often happens when the data were collected by an airplane which only flew over some part of a defined region.

In the image below, the pixels that are black have NoDataValues. The camera did not collect data in these areas.

plot of chunk demonstrate-no-data-black-ggplot

In the next image, the black edges have been assigned NoDataValue. R doesn’t render pixels that contain a specified NoDataValue. R assigns missing data with the NoDataValue as NA.

The difference here shows up as ragged edges on the plot, rather than black spaces where there is no data.

plot of chunk demonstrate-no-data-ggplot

If your raster already has NA values set correctly but you aren’t sure where they are, you can deliberately plot them in a particular colour. This can be useful when checking a dataset’s coverage. For instance, sometimes data can be missing where a sensor could not ‘see’ its target data, and you may wish to locate that missing data and fill it in.

To highlight NA values in ggplot, alter the scale_fill_*() layer to contain a colour instruction for NA values, like scale_fill_viridis_c(na.value = 'deeppink')

plot of chunk napink

The value that is conventionally used to take note of missing data (the NoDataValue value) varies by the raster data type. For floating-point rasters, the figure -3.4e+38 is a common default, and for integers, -9999 is common. Some disciplines have specific conventions that vary from these common values.

In some cases, other NA values may be more appropriate. An NA value should be a) outside the range of valid values, and b) a value that fits the data type in use. For instance, if your data ranges continuously from -20 to 100, 0 is not an acceptable NA value! Or, for categories that number 1-15, 0 might be fine for NA, but using -.000003 will force you to save the GeoTIFF on disk as a floating point raster, resulting in a bigger file.

If we are lucky, our GeoTIFF file has a tag that tells us what is the NoDataValue. If we are less lucky, we can find that information in the raster’s metadata. If a NoDataValue was stored in the GeoTIFF tag, when R opens up the raster, it will assign each instance of the value to NA. Values of NA will be ignored by R as demonstrated above.

Challenge

Use the output from the GDALinfo() function to find out what NoDataValue is used for our DSM_HARV dataset.

Answers

GDALinfo("data/NEON-DS-Airborne-Remote-Sensing/HARV/DSM/HARV_dsmCrop.tif")
rows        1367 
columns     1697 
bands       1 
lower left origin.x        731453 
lower left origin.y        4712471 
res.x       1 
res.y       1 
ysign       -1 
oblique.x   0 
oblique.y   0 
driver      GTiff 
projection  +proj=utm +zone=18 +datum=WGS84 +units=m +no_defs 
file        data/NEON-DS-Airborne-Remote-Sensing/HARV/DSM/HARV_dsmCrop.tif 
apparent band summary:
   GDType hasNoDataValue NoDataValue blockSize1 blockSize2
1 Float64           TRUE       -9999          1       1697
apparent band statistics:
    Bmin   Bmax    Bmean      Bsd
1 305.07 416.07 359.8531 17.83169
Metadata:
AREA_OR_POINT=Area 

NoDataValue are encoded as -9999.

Bad Data Values in Rasters

Bad data values are different from NoDataValues. Bad data values are values that fall outside of the applicable range of a dataset.

Examples of Bad Data Values:

Find Bad Data Values

Sometimes a raster’s metadata will tell us the range of expected values for a raster. Values outside of this range are suspect and we need to consider that when we analyze the data. Sometimes, we need to use some common sense and scientific insight as we examine the data - just as we would for field data to identify questionable values.

Plotting data with appropriate highlighting can help reveal patterns in bad values and may suggest a solution. Below, reclassification is used to highlight elevation values over 400m with a contrasting colour.

plot of chunk demo-bad-data-highlighting

Create A Histogram of Raster Values

We can explore the distribution of values contained within our raster using the geom_histogram() function which produces a histogram. Histograms are often useful in identifying outliers and bad data values in our raster data.

ggplot() +
    geom_histogram(data = DSM_HARV_df, aes(HARV_dsmCrop))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

plot of chunk view-raster-histogram

Notice that a warning message is thrown when R creates the histogram.

stat_bin() using bins = 30. Pick better value with binwidth.

This warning is caused by a default setting in geom_histogram enforcing that there are 30 bins for the data. We can define the number of bins we want in the histogram by using the bins value in the geom_histogram() function.

ggplot() +
    geom_histogram(data = DSM_HARV_df, aes(HARV_dsmCrop), bins = 40)

plot of chunk view-raster-histogram2

Note that the shape of this histogram looks similar to the previous one that was created using the default of 30 bins. The distribution of elevation values for our Digital Surface Model (DSM) looks reasonable. It is likely there are no bad data values in this particular raster.

Challenge: Explore Raster Metadata

Use GDALinfo() to determine the following about the NEON-DS-Airborne-Remote-Sensing/HARV/DSM/HARV_DSMhill.tif file:

  1. Does this file have the same CRS as DSM_HARV?
  2. What is the NoDataValue?
  3. What is resolution of the raster data?
  4. How large would a 5x5 pixel area be on the Earth’s surface?
  5. Is the file a multi- or single-band raster?

Notice: this file is a hillshade. We will learn about hillshades in the Working with Multi-band Rasters in R episode.

Answers

GDALinfo("data/NEON-DS-Airborne-Remote-Sensing/HARV/DSM/HARV_DSMhill.tif")
rows        1367 
columns     1697 
bands       1 
lower left origin.x        731453 
lower left origin.y        4712471 
res.x       1 
res.y       1 
ysign       -1 
oblique.x   0 
oblique.y   0 
driver      GTiff 
projection  +proj=utm +zone=18 +datum=WGS84 +units=m +no_defs 
file        data/NEON-DS-Airborne-Remote-Sensing/HARV/DSM/HARV_DSMhill.tif 
apparent band summary:
   GDType hasNoDataValue NoDataValue blockSize1 blockSize2
1 Float64           TRUE       -9999          1       1697
apparent band statistics:
        Bmin      Bmax     Bmean       Bsd
1 -0.7136298 0.9999997 0.3125525 0.4812939
Metadata:
AREA_OR_POINT=Area 
  1. If this file has the same CRS as DSM_HARV? Yes: UTM Zone 18, WGS84, meters.
  2. What format NoDataValues take? -9999
  3. The resolution of the raster data? 1x1
  4. How large a 5x5 pixel area would be? 5mx5m How? We are given resolution of 1x1 and units in meters, therefore resolution of 5x5 means 5x5m.
  5. Is the file a multi- or single-band raster? Single.

More Resources

Key Points

  • The GeoTIFF file format includes metadata about the raster data.

  • To plot raster data with the ggplot2 package, we need to convert it to a dataframe.

  • R stores CRS information in the Proj4 format.

  • Be careful when dealing with missing or bad data values.


Plot Raster Data

Overview

Teaching: 40 min
Exercises: 30 min
Questions
  • How can I create categorized or customized maps of raster data?

  • How can I customize the color scheme of a raster image?

  • How can I layer raster data in a single image?

Objectives
  • Build customized plots for a single band raster using the ggplot2 package.

  • Layer a raster dataset on top of a hillshade to create an elegant basemap.

Things You’ll Need To Complete This Episode

See the lesson homepage for detailed information about the software, data, and other prerequisites you will need to work through the examples in this episode.

Plot Raster Data in R

This episode covers how to plot a raster in R using the ggplot2 package with customized coloring schemes. It also covers how to layer a raster on top of a hillshade to produce an eloquent map. We will continue working with the Digital Surface Model (DSM) raster for the NEON Harvard Forest Field Site.

Plotting Data Using Breaks

In the previous episode, we viewed our data using a continuous color ramp. For clarity and visibility of the plot, we may prefer to view the data “symbolized” or colored according to ranges of values. This is comparable to a “classified” map. To do this, we need to tell ggplot how many groups to break our data into, and where those breaks should be. To make these decisions, it is useful to first explore the distribution of the data using a bar plot. To begin with, we will use dplyr’s mutate() function combined with cut() to split the data into 3 bins.

DSM_HARV_df <- DSM_HARV_df %>%
                mutate(fct_elevation = cut(HARV_dsmCrop, breaks = 3))

ggplot() +
    geom_bar(data = DSM_HARV_df, aes(fct_elevation))

plot of chunk histogram-breaks-ggplot

If we want to know the cutoff values for the groups, we can ask for the unique values of fct_elevation:

unique(DSM_HARV_df$fct_elevation)
[1] (379,416] (342,379] (305,342]
Levels: (305,342] (342,379] (379,416]

And we can get the count of values in each group using dplyr’s group_by() and count() functions:

DSM_HARV_df %>%
        group_by(fct_elevation) %>%
        count()
# A tibble: 3 x 2
# Groups:   fct_elevation [3]
  fct_elevation       n
  <fct>           <int>
1 (305,342]      418891
2 (342,379]     1530073
3 (379,416]      370835

We might prefer to customize the cutoff values for these groups. Lets round the cutoff values so that we have groups for the ranges of 301–350 m, 351–400 m, and 401–450 m. To implement this we will give mutate() a numeric vector of break points instead of the number of breaks we want.

custom_bins <- c(300, 350, 400, 450)

DSM_HARV_df <- DSM_HARV_df %>%
  mutate(fct_elevation_2 = cut(HARV_dsmCrop, breaks = custom_bins))

unique(DSM_HARV_df$fct_elevation_2)
[1] (400,450] (350,400] (300,350]
Levels: (300,350] (350,400] (400,450]

Data Tips

Note that when we assign break values a set of 4 values will result in 3 bins of data.

The bin intervals are shown using ( to mean exclusive and ] to mean inclusive. For example: (305, 342] means “from 306 through 342”.

And now we can plot our bar plot again, using the new groups:

ggplot() +
  geom_bar(data = DSM_HARV_df, aes(fct_elevation_2))

plot of chunk histogram-custom-breaks

And we can get the count of values in each group in the same way we did before:

DSM_HARV_df %>%
  group_by(fct_elevation_2) %>%
  count()
# A tibble: 3 x 2
# Groups:   fct_elevation_2 [3]
  fct_elevation_2       n
  <fct>             <int>
1 (300,350]        741815
2 (350,400]       1567316
3 (400,450]         10668

We can use those groups to plot our raster data, with each group being a different color:

ggplot() +
  geom_raster(data = DSM_HARV_df , aes(x = x, y = y, fill = fct_elevation_2)) + 
  coord_quickmap()

plot of chunk raster-with-breaks

The plot above uses the default colors inside ggplot for raster objects. We can specify our own colors to make the plot look a little nicer. R has a built in set of colors for plotting terrain, which are built in to the terrain.colors() function. Since we have three bins, we want to create a 3-color palette:

terrain.colors(3)
[1] "#00A600" "#ECB176" "#F2F2F2"

The terrain.colors() function returns hex colors - each of these character strings represents a color. To use these in our map, we pass them across using the scale_fill_manual() function.

ggplot() +
 geom_raster(data = DSM_HARV_df , aes(x = x, y = y,
                                      fill = fct_elevation_2)) + 
    scale_fill_manual(values = terrain.colors(3)) + 
    coord_quickmap()

plot of chunk ggplot-breaks-customcolors

More Plot Formatting

If we need to create multiple plots using the same color palette, we can create an R object (my_col) for the set of colors that we want to use. We can then quickly change the palette across all plots by modifying the my_col object, rather than each individual plot.

We can label the x- and y-axes of our plot too using xlab and ylab. We can also give the legend a more meaningful title by passing a value to the name argument of the scale_fill_manual() function.

my_col <- terrain.colors(3)

ggplot() +
 geom_raster(data = DSM_HARV_df , aes(x = x, y = y,
                                      fill = fct_elevation_2)) + 
    scale_fill_manual(values = my_col, name = "Elevation") + 
    coord_quickmap()

plot of chunk add-ggplot-labels

Or we can also turn off the labels of both axes by passing element_blank() to the relevant part of the theme() function.

ggplot() +
 geom_raster(data = DSM_HARV_df , aes(x = x, y = y,
                                      fill = fct_elevation_2)) + 
    scale_fill_manual(values = my_col, name = "Elevation") +
    theme(axis.title = element_blank()) + 
    coord_quickmap()

plot of chunk turn-off-axes

Challenge: Plot Using Custom Breaks

Create a plot of the Harvard Forest Digital Surface Model (DSM) that has:

  1. Six classified ranges of values (break points) that are evenly divided among the range of pixel values.
  2. Axis labels.
  3. A plot title.

Answers

DSM_HARV_df <- DSM_HARV_df  %>%
               mutate(fct_elevation_6 = cut(HARV_dsmCrop, breaks = 6)) 

 my_col <- terrain.colors(6)

ggplot() +
    geom_raster(data = DSM_HARV_df , aes(x = x, y = y,
                                      fill = fct_elevation_6)) + 
    scale_fill_manual(values = my_col, name = "Elevation") + 
    ggtitle("Classified Elevation Map - NEON Harvard Forest Field Site") +
    xlab("UTM Westing Coordinate (m)") +
    ylab("UTM Northing Coordinate (m)") + 
    coord_quickmap()

plot of chunk challenge-code-plotting

Layering Rasters

We can layer a raster on top of a hillshade raster for the same area, and use a transparency factor to create a 3-dimensional shaded effect. A hillshade is a raster that maps the shadows and texture that you would see from above when viewing terrain. We will add a custom color, making the plot grey.

First we need to read in our DSM hillshade data and view the structure:

DSM_hill_HARV <-
  raster("data/NEON-DS-Airborne-Remote-Sensing/HARV/DSM/HARV_DSMhill.tif")

DSM_hill_HARV
class      : RasterLayer 
dimensions : 1367, 1697, 2319799  (nrow, ncol, ncell)
resolution : 1, 1  (x, y)
extent     : 731453, 733150, 4712471, 4713838  (xmin, xmax, ymin, ymax)
crs        : +proj=utm +zone=18 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0 
source     : /home/runner/work/r-raster-vector-geospatial/r-raster-vector-geospatial/_episodes_rmd/data/NEON-DS-Airborne-Remote-Sensing/HARV/DSM/HARV_DSMhill.tif 
names      : HARV_DSMhill 
values     : -0.7136298, 0.9999997  (min, max)

Next we convert it to a dataframe, so that we can plot it using ggplot2:

DSM_hill_HARV_df <- as.data.frame(DSM_hill_HARV, xy = TRUE) 

str(DSM_hill_HARV_df)
'data.frame':	2319799 obs. of  3 variables:
 $ x           : num  731454 731454 731456 731456 731458 ...
 $ y           : num  4713838 4713838 4713838 4713838 4713838 ...
 $ HARV_DSMhill: num  NA NA NA NA NA NA NA NA NA NA ...

Now we can plot the hillshade data:

ggplot() +
  geom_raster(data = DSM_hill_HARV_df,
              aes(x = x, y = y, alpha = HARV_DSMhill)) + 
  scale_alpha(range =  c(0.15, 0.65), guide = "none") + 
  coord_quickmap()

plot of chunk raster-hillshade

Data Tips

Turn off, or hide, the legend on a plot by adding guide = "none" to a scale_something() function or by setting theme(legend.position = "none").

The alpha value determines how transparent the colors will be (0 being transparent, 1 being opaque).

We can layer another raster on top of our hillshade by adding another call to the geom_raster() function. Let’s overlay DSM_HARV on top of the hill_HARV.

ggplot() +
  geom_raster(data = DSM_HARV_df , 
              aes(x = x, y = y, 
                  fill = HARV_dsmCrop)) + 
  geom_raster(data = DSM_hill_HARV_df, 
              aes(x = x, y = y, 
                  alpha = HARV_DSMhill)) +  
  scale_fill_viridis_c() +  
  scale_alpha(range = c(0.15, 0.65), guide = "none") +  
  ggtitle("Elevation with hillshade") +
  coord_quickmap()

plot of chunk overlay-hillshade

Challenge: Create DTM & DSM for SJER

Use the files in the NEON_RemoteSensing/SJER/ directory to create a Digital Terrain Model map and Digital Surface Model map of the San Joaquin Experimental Range field site.

Make sure to:

  • include hillshade in the maps,
  • label axes on the DSM map and exclude them from the DTM map,
  • include a title for each map,
  • experiment with various alpha values and color palettes to represent the data.

Answers

# CREATE DSM MAPS

# import DSM data
DSM_SJER <- raster("data/NEON-DS-Airborne-Remote-Sensing/SJER/DSM/SJER_dsmCrop.tif")
# convert to a df for plotting
DSM_SJER_df <- as.data.frame(DSM_SJER, xy = TRUE)

# import DSM hillshade
DSM_hill_SJER <- raster("data/NEON-DS-Airborne-Remote-Sensing/SJER/DSM/SJER_dsmHill.tif")
# convert to a df for plotting
DSM_hill_SJER_df <- as.data.frame(DSM_hill_SJER, xy = TRUE)

# Build Plot
ggplot() +
    geom_raster(data = DSM_SJER_df , 
                aes(x = x, y = y, 
                     fill = SJER_dsmCrop,
                     alpha = 0.8)
                ) + 
    geom_raster(data = DSM_hill_SJER_df, 
                aes(x = x, y = y, 
                  alpha = SJER_dsmHill)
                ) +
    scale_fill_viridis_c() +
    guides(fill = guide_colorbar()) +
    scale_alpha(range = c(0.4, 0.7), guide = "none") +
    # remove grey background and grid lines
    theme_bw() + 
    theme(panel.grid.major = element_blank(), 
          panel.grid.minor = element_blank()) +
    xlab("UTM Westing Coordinate (m)") +
    ylab("UTM Northing Coordinate (m)") +
    ggtitle("DSM with Hillshade") +
    coord_quickmap()

plot of chunk challenge-hillshade-layering

# CREATE DTM MAP
# import DTM
DTM_SJER <- raster("data/NEON-DS-Airborne-Remote-Sensing/SJER/DTM/SJER_dtmCrop.tif")
DTM_SJER_df <- as.data.frame(DTM_SJER, xy = TRUE)

# DTM Hillshade
DTM_hill_SJER <- raster("data/NEON-DS-Airborne-Remote-Sensing/SJER/DTM/SJER_dtmHill.tif")
DTM_hill_SJER_df <- as.data.frame(DTM_hill_SJER, xy = TRUE)

ggplot() +
    geom_raster(data = DTM_SJER_df ,
                aes(x = x, y = y,
                     fill = SJER_dtmCrop,
                     alpha = 2.0)
                ) +
    geom_raster(data = DTM_hill_SJER_df,
                aes(x = x, y = y,
                  alpha = SJER_dtmHill)
                ) +
    scale_fill_viridis_c() +
    guides(fill = guide_colorbar()) +
    scale_alpha(range = c(0.4, 0.7), guide = "none") +
    theme_bw() +
    theme(panel.grid.major = element_blank(), 
          panel.grid.minor = element_blank()) +
    theme(axis.title.x = element_blank(),
          axis.title.y = element_blank()) +
    ggtitle("DTM with Hillshade") +
    coord_quickmap()

plot of chunk challenge-hillshade-layering

Key Points

  • Continuous data ranges can be grouped into categories using mutate() and cut().

  • Use built-in terrain.colors() or set your preferred color scheme manually.

  • Layer rasters on top of one another by using the alpha aesthetic.


Reproject Raster Data

Overview

Teaching: 40 min
Exercises: 20 min
Questions
  • How do I work with raster data sets that are in different projections?

Objectives
  • Reproject a raster in R.

Things You’ll Need To Complete This Episode

See the lesson homepage for detailed information about the software, data, and other prerequisites you will need to work through the examples in this episode.

Sometimes we encounter raster datasets that do not “line up” when plotted or analyzed. Rasters that don’t line up are most often in different Coordinate Reference Systems (CRS). This episode explains how to deal with rasters in different, known CRSs. It will walk though reprojecting rasters in R using the projectRaster() function in the raster package.

Raster Projection in R

In the Plot Raster Data in R episode, we learned how to layer a raster file on top of a hillshade for a nice looking basemap. In that episode, all of our data were in the same CRS. What happens when things don’t line up?

For this episode, we will be working with the Harvard Forest Digital Terrain Model data. This differs from the surface model data we’ve been working with so far in that the digital surface model (DSM) includes the tops of trees, while the digital terrain model (DTM) shows the ground level.

We’ll be looking at another model (the canopy height model) in a later episode and will see how to calculate the CHM from the DSM and DTM. Here, we will create a map of the Harvard Forest Digital Terrain Model (DTM_HARV) draped or layered on top of the hillshade (DTM_hill_HARV). The hillshade layer maps the terrain using light and shadow to create a 3D-looking image, based on a hypothetical illumination of the ground level.

Source: National Ecological Observatory Network (NEON).

First, we need to import the DTM and DTM hillshade data.

DTM_HARV <- raster("data/NEON-DS-Airborne-Remote-Sensing/HARV/DTM/HARV_dtmCrop.tif")

DTM_hill_HARV <- raster("data/NEON-DS-Airborne-Remote-Sensing/HARV/DTM/HARV_DTMhill_WGS84.tif")

Next, we will convert each of these datasets to a dataframe for plotting with ggplot.

DTM_HARV_df <- as.data.frame(DTM_HARV, xy = TRUE)

DTM_hill_HARV_df <- as.data.frame(DTM_hill_HARV, xy = TRUE)

Now we can create a map of the DTM layered over the hillshade.

ggplot() +
     geom_raster(data = DTM_HARV_df , 
                 aes(x = x, y = y, 
                  fill = HARV_dtmCrop)) + 
     geom_raster(data = DTM_hill_HARV_df, 
                 aes(x = x, y = y, 
                   alpha = HARV_DTMhill_WGS84)) +
     scale_fill_gradientn(name = "Elevation", colors = terrain.colors(10)) + 
     coord_quickmap()

plot of chunk unnamed-chunk-2

Our results are curious - neither the Digital Terrain Model (DTM_HARV_df) nor the DTM Hillshade (DTM_hill_HARV_df) plotted. Let’s try to plot the DTM on its own to make sure there are data there.

ggplot() +
geom_raster(data = DTM_HARV_df,
    aes(x = x, y = y,
    fill = HARV_dtmCrop)) +
scale_fill_gradientn(name = "Elevation", colors = terrain.colors(10)) + 
coord_quickmap()

plot of chunk plot-DTM

Our DTM seems to contain data and plots just fine.

Next we plot the DTM Hillshade on its own to see whether everything is OK.

ggplot() +
geom_raster(data = DTM_hill_HARV_df,
    aes(x = x, y = y,
    alpha = HARV_DTMhill_WGS84)) + 
    coord_quickmap()

plot of chunk plot-DTM-hill

If we look at the axes, we can see that the projections of the two rasters are different. When this is the case, ggplot won’t render the image. It won’t even throw an error message to tell you something has gone wrong. We can look at Coordinate Reference Systems (CRSs) of the DTM and the hillshade data to see how they differ.

Exercise

View the CRS for each of these two datasets. What projection does each use?

Solution

# view crs for DTM
crs(DTM_HARV)
CRS arguments:
 +proj=utm +zone=18 +datum=WGS84 +units=m +no_defs +ellps=WGS84
+towgs84=0,0,0 
# view crs for hillshade
crs(DTM_hill_HARV)
CRS arguments:
 +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0 

DTM_HARV is in the UTM projection, with units of meters. DTM_hill_HARV is in Geographic WGS84 - which is represented by latitude and longitude values.

Because the two rasters are in different CRSs, they don’t line up when plotted in R. We need to reproject (or change the projection of) DTM_hill_HARV into the UTM CRS. Alternatively, we could reproject DTM_HARV into WGS84.

Reproject Rasters

We can use the projectRaster() function to reproject a raster into a new CRS. Keep in mind that reprojection only works when you first have a defined CRS for the raster object that you want to reproject. It cannot be used if no CRS is defined. Lucky for us, the DTM_hill_HARV has a defined CRS.

Data Tip

When we reproject a raster, we move it from one “grid” to another. Thus, we are modifying the data! Keep this in mind as we work with raster data.

To use the projectRaster() function, we need to define two things:

  1. the object we want to reproject and
  2. the CRS that we want to reproject it to.

The syntax is projectRaster(RasterObject, crs = CRSToReprojectTo)

We want the CRS of our hillshade to match the DTM_HARV raster. We can thus assign the CRS of our DTM_HARV to our hillshade within the projectRaster() function as follows: crs = crs(DTM_HARV). Note that we are using the projectRaster() function on the raster object, not the data.frame() we use for plotting with ggplot.

First we will reproject our DTM_hill_HARV raster data to match the DTM_HARV raster CRS:

DTM_hill_UTMZ18N_HARV <- projectRaster(DTM_hill_HARV,
                                       crs = crs(DTM_HARV))

Now we can compare the CRS of our original DTM hillshade and our new DTM hillshade, to see how they are different.

crs(DTM_hill_UTMZ18N_HARV)
CRS arguments:
 +proj=utm +zone=18 +datum=WGS84 +units=m +no_defs +ellps=WGS84
+towgs84=0,0,0 
crs(DTM_hill_HARV)
CRS arguments:
 +proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0 

We can also compare the extent of the two objects.

extent(DTM_hill_UTMZ18N_HARV)
class      : Extent 
xmin       : 731397.3 
xmax       : 733205.3 
ymin       : 4712403 
ymax       : 4713907 
extent(DTM_hill_HARV)
class      : Extent 
xmin       : -72.18192 
xmax       : -72.16061 
ymin       : 42.52941 
ymax       : 42.54234 

Notice in the output above that the crs() of DTM_hill_UTMZ18N_HARV is now UTM. However, the extent values of DTM_hillUTMZ18N_HARV are different from DTM_hill_HARV.

Challenge: Extent Change with CRS Change

Why do you think the two extents differ?

Answers

The extent for DTM_hill_UTMZ18N_HARV is in UTMs so the extent is in meters. The extent for DTM_hill_HARV is in lat/long so the extent is expressed in decimal degrees.

Deal with Raster Resolution

Let’s next have a look at the resolution of our reprojected hillshade versus our original data.

res(DTM_hill_UTMZ18N_HARV)
[1] 1.000 0.998
res(DTM_HARV)
[1] 1 1

These two resolutions are different, but they’re representing the same data. We can tell R to force our newly reprojected raster to be 1m x 1m resolution by adding a line of code res=1 within the projectRaster() function. In the example below, we ensure a resolution match by using res(DTM_HARV) as a variable.

  DTM_hill_UTMZ18N_HARV <- projectRaster(DTM_hill_HARV,
                                         crs = crs(DTM_HARV),
                                         res = res(DTM_HARV)) 

Now both our resolutions and our CRSs match, so we can plot these two data sets together. Let’s double-check our resolution to be sure:

res(DTM_hill_UTMZ18N_HARV)
[1] 1 1
res(DTM_HARV)
[1] 1 1

For plotting with ggplot(), we will need to create a dataframe from our newly reprojected raster.

DTM_hill_HARV_2_df <- as.data.frame(DTM_hill_UTMZ18N_HARV, xy = TRUE)

We can now create a plot of this data.

ggplot() +
     geom_raster(data = DTM_HARV_df , 
                 aes(x = x, y = y, 
                  fill = HARV_dtmCrop)) + 
     geom_raster(data = DTM_hill_HARV_2_df, 
                 aes(x = x, y = y, 
                   alpha = HARV_DTMhill_WGS84)) +
     scale_fill_gradientn(name = "Elevation", colors = terrain.colors(10)) + 
     coord_quickmap()

plot of chunk plot-projected-raster

We have now successfully draped the Digital Terrain Model on top of our hillshade to produce a nice looking, textured map!

Challenge: Reproject, then Plot a Digital Terrain Model

Create a map of the San Joaquin Experimental Range field site using the SJER_DSMhill_WGS84.tif and SJER_dsmCrop.tif files.

Reproject the data as necessary to make things line up!

Answers

# import DSM
DSM_SJER <- raster("data/NEON-DS-Airborne-Remote-Sensing/SJER/DSM/SJER_dsmCrop.tif")
# import DSM hillshade
DSM_hill_SJER_WGS <-
raster("data/NEON-DS-Airborne-Remote-Sensing/SJER/DSM/SJER_DSMhill_WGS84.tif")

# reproject raster
DTM_hill_UTMZ18N_SJER <- projectRaster(DSM_hill_SJER_WGS,
                                  crs = crs(DSM_SJER),
                                  res = 1)

# convert to data.frames
DSM_SJER_df <- as.data.frame(DSM_SJER, xy = TRUE)

DSM_hill_SJER_df <- as.data.frame(DTM_hill_UTMZ18N_SJER, xy = TRUE)

ggplot() +
     geom_raster(data = DSM_hill_SJER_df, 
                 aes(x = x, y = y, 
                   alpha = SJER_DSMhill_WGS84)
                 ) +
     geom_raster(data = DSM_SJER_df, 
             aes(x = x, y = y, 
                  fill = SJER_dsmCrop,
                  alpha=0.8)
             ) + 
     scale_fill_gradientn(name = "Elevation", colors = terrain.colors(10)) + 
     coord_quickmap()

plot of chunk challenge-code-reprojection

If you completed the San Joaquin plotting challenge in the Plot Raster Data in R episode, how does the map you just created compare to that map?

Answers

The maps look identical. Which is what they should be as the only difference is this one was reprojected from WGS84 to UTM prior to plotting.

Key Points

  • In order to plot two raster data sets together, they must be in the same CRS.

  • Use the projectRaster() function to convert between CRSs.


Raster Calculations

Overview

Teaching: 40 min
Exercises: 20 min
Questions
  • How do I subtract one raster from another and extract pixel values for defined locations?

Objectives
  • Perform a subtraction between two rasters using raster math.

  • Perform a more efficient subtraction between two rasters using the raster overlay() function.

  • Export raster data as a GeoTIFF file.

Things You’ll Need To Complete This Episode

See the lesson homepage for detailed information about the software, data, and other prerequisites you will need to work through the examples in this episode.

We often want to combine values of and perform calculations on rasters to create a new output raster. This episode covers how to subtract one raster from another using basic raster math and the overlay() function. It also covers how to extract pixel values from a set of locations - for example a buffer region around plot locations at a field site.

Raster Calculations in R

We often want to perform calculations on two or more rasters to create a new output raster. For example, if we are interested in mapping the heights of trees across an entire field site, we might want to calculate the difference between the Digital Surface Model (DSM, tops of trees) and the Digital Terrain Model (DTM, ground level). The resulting dataset is referred to as a Canopy Height Model (CHM) and represents the actual height of trees, buildings, etc. with the influence of ground elevation removed.

Source: National Ecological Observatory Network (NEON)

More Resources

Load the Data

For this episode, we will use the DTM and DSM from the NEON Harvard Forest Field site and San Joaquin Experimental Range, which we already have loaded from previous episodes.

Exercise

Use the GDALinfo() function to view information about the DTM and DSM data files. Do the two rasters have the same or different CRSs and resolutions? Do they both have defined minimum and maximum values?

Solution

GDALinfo("data/NEON-DS-Airborne-Remote-Sensing/HARV/DTM/HARV_dtmCrop.tif")
rows        1367 
columns     1697 
bands       1 
lower left origin.x        731453 
lower left origin.y        4712471 
res.x       1 
res.y       1 
ysign       -1 
oblique.x   0 
oblique.y   0 
driver      GTiff 
projection  +proj=utm +zone=18 +datum=WGS84 +units=m +no_defs 
file        data/NEON-DS-Airborne-Remote-Sensing/HARV/DTM/HARV_dtmCrop.tif 
apparent band summary:
   GDType hasNoDataValue NoDataValue blockSize1 blockSize2
1 Float64           TRUE       -9999          1       1697
apparent band statistics:
    Bmin   Bmax    Bmean      Bsd
1 304.56 389.82 344.8979 15.86147
Metadata:
AREA_OR_POINT=Area 
GDALinfo("data/NEON-DS-Airborne-Remote-Sensing/HARV/DSM/HARV_dsmCrop.tif")
rows        1367 
columns     1697 
bands       1 
lower left origin.x        731453 
lower left origin.y        4712471 
res.x       1 
res.y       1 
ysign       -1 
oblique.x   0 
oblique.y   0 
driver      GTiff 
projection  +proj=utm +zone=18 +datum=WGS84 +units=m +no_defs 
file        data/NEON-DS-Airborne-Remote-Sensing/HARV/DSM/HARV_dsmCrop.tif 
apparent band summary:
   GDType hasNoDataValue NoDataValue blockSize1 blockSize2
1 Float64           TRUE       -9999          1       1697
apparent band statistics:
    Bmin   Bmax    Bmean      Bsd
1 305.07 416.07 359.8531 17.83169
Metadata:
AREA_OR_POINT=Area 

We’ve already loaded and worked with these two data files in earlier episodes. Let’s plot them each once more to remind ourselves what this data looks like. First we’ll plot the DTM elevation data:

 ggplot() +
      geom_raster(data = DTM_HARV_df , 
              aes(x = x, y = y, fill = HARV_dtmCrop)) +
     scale_fill_gradientn(name = "Elevation", colors = terrain.colors(10)) + 
     coord_quickmap()

plot of chunk harv-dtm-plot

And then the DSM elevation data:

 ggplot() +
      geom_raster(data = DSM_HARV_df , 
              aes(x = x, y = y, fill = HARV_dsmCrop)) +
     scale_fill_gradientn(name = "Elevation", colors = terrain.colors(10)) + 
     coord_quickmap()

plot of chunk harv-dsm-plot

Two Ways to Perform Raster Calculations

We can calculate the difference between two rasters in two different ways:

or for more efficient processing - particularly if our rasters are large and/or the calculations we are performing are complex:

Raster Math & Canopy Height Models

We can perform raster calculations by subtracting (or adding, multiplying, etc) two rasters. In the geospatial world, we call this “raster math”.

Let’s subtract the DTM from the DSM to create a Canopy Height Model. After subtracting, let’s create a dataframe so we can plot with ggplot.

CHM_HARV <- DSM_HARV - DTM_HARV

CHM_HARV_df <- as.data.frame(CHM_HARV, xy = TRUE)

We can now plot the output CHM.

 ggplot() +
   geom_raster(data = CHM_HARV_df , 
               aes(x = x, y = y, fill = layer)) + 
   scale_fill_gradientn(name = "Canopy Height", colors = terrain.colors(10)) + 
   coord_quickmap()

plot of chunk harv-chm-plot

Let’s have a look at the distribution of values in our newly created Canopy Height Model (CHM).

ggplot(CHM_HARV_df) +
    geom_histogram(aes(layer))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 1 rows containing non-finite values (stat_bin).

plot of chunk create-hist

Notice that the range of values for the output CHM is between 0 and 30 meters. Does this make sense for trees in Harvard Forest?

Challenge: Explore CHM Raster Values

It’s often a good idea to explore the range of values in a raster dataset just like we might explore a dataset that we collected in the field.

  1. What is the min and maximum value for the Harvard Forest Canopy Height Model (CHM_HARV) that we just created?
  2. What are two ways you can check this range of data for CHM_HARV?
  3. What is the distribution of all the pixel values in the CHM?
  4. Plot a histogram with 6 bins instead of the default and change the color of the histogram.
  5. Plot the CHM_HARV raster using breaks that make sense for the data. Include an appropriate color palette for the data, plot title and no axes ticks / labels.

Answers

1) There are missing values in our data, so we need to specify na.rm = TRUE.

min(CHM_HARV_df$layer, na.rm = TRUE)
[1] 0
max(CHM_HARV_df$layer, na.rm = TRUE)
[1] 38.16998

2) Possible ways include:

  • Create a histogram
  • Use the min() and max() functions.
  • Print the object and look at the values attribute. 3)
ggplot(CHM_HARV_df) +
    geom_histogram(aes(layer))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 1 rows containing non-finite values (stat_bin).

plot of chunk chm-harv-hist 4)

ggplot(CHM_HARV_df) +
    geom_histogram(aes(layer), colour="black", 
                   fill="darkgreen", bins = 6)
Warning: Removed 1 rows containing non-finite values (stat_bin).

plot of chunk chm-harv-hist-green 5)

custom_bins <- c(0, 10, 20, 30, 40)
CHM_HARV_df <- CHM_HARV_df %>%
                  mutate(canopy_discrete = cut(layer, breaks = custom_bins))

ggplot() +
  geom_raster(data = CHM_HARV_df , aes(x = x, y = y,
                                       fill = canopy_discrete)) + 
     scale_fill_manual(values = terrain.colors(4)) + 
     coord_quickmap()
Warning: Removed 51136 rows containing missing values (geom_raster).

plot of chunk chm-harv-raster

Efficient Raster Calculations: Overlay Function

Raster math, like we just did, is an appropriate approach to raster calculations if:

  1. The rasters we are using are small in size.
  2. The calculations we are performing are simple.

However, raster math is a less efficient approach as computation becomes more complex or as file sizes become large. The overlay() function is more efficient when:

  1. The rasters we are using are larger in size.
  2. The rasters are stored as individual files.
  3. The computations performed are complex.

The overlay() function takes two or more rasters and applies a function to them using efficient processing methods. The syntax is

outputRaster <- overlay(raster1, raster2, fun=functionName)

Data Tip

If the rasters are stacked and stored as RasterStack or RasterBrick objects in R, then we should use calc(). overlay() will not work on stacked rasters.

Let’s perform the same subtraction calculation that we calculated above using raster math, using the overlay() function.

Data Tip

A custom function consists of a defined set of commands performed on a input object. Custom functions are particularly useful for tasks that need to be repeated over and over in the code. A simplified syntax for writing a custom function in R is: function_name <- function(variable1, variable2) { WhatYouWantDone, WhatToReturn}

CHM_ov_HARV <- overlay(DSM_HARV,
                       DTM_HARV,
                       fun = function(r1, r2) { return( r1 - r2) })

Next we need to convert our new object to a data frame for plotting with ggplot.

CHM_ov_HARV_df <- as.data.frame(CHM_ov_HARV, xy = TRUE)

Now we can plot the CHM:

 ggplot() +
   geom_raster(data = CHM_ov_HARV_df, 
               aes(x = x, y = y, fill = layer)) + 
   scale_fill_gradientn(name = "Canopy Height", colors = terrain.colors(10)) + 
   coord_quickmap()

plot of chunk harv-chm-overlay

How do the plots of the CHM created with manual raster math and the overlay() function compare?

Export a GeoTIFF

Now that we’ve created a new raster, let’s export the data as a GeoTIFF file using the writeRaster() function.

When we write this raster object to a GeoTIFF file we’ll name it CHM_HARV.tiff. This name allows us to quickly remember both what the data contains (CHM data) and for where (HARVard Forest). The writeRaster() function by default writes the output file to your working directory unless you specify a full file path.

We will specify the output format (“GTiff”), the no data value (NAflag = -9999. We will also tell R to overwrite any data that is already in a file of the same name.

writeRaster(CHM_ov_HARV, "CHM_HARV.tiff",
            format="GTiff",
            overwrite=TRUE,
            NAflag=-9999)

writeRaster() Options

The function arguments that we used above include:

Challenge: Explore the NEON San Joaquin Experimental Range Field Site

Data are often more interesting and powerful when we compare them across various locations. Let’s compare some data collected over Harvard Forest to data collected in Southern California. The NEON San Joaquin Experimental Range (SJER) field site located in Southern California has a very different ecosystem and climate than the NEON Harvard Forest Field Site in Massachusetts.

Import the SJER DSM and DTM raster files and create a Canopy Height Model. Then compare the two sites. Be sure to name your R objects and outputs carefully, as follows: objectType_SJER (e.g. DSM_SJER). This will help you keep track of data from different sites!

  1. You should have the DSM and DTM data for the SJER site already loaded from the Plot Raster Data in R episode.) Don’t forget to check the CRSs and units of the data.
  2. Create a CHM from the two raster layers and check to make sure the data are what you expect.
  3. Plot the CHM from SJER.
  4. Export the SJER CHM as a GeoTIFF.
  5. Compare the vegetation structure of the Harvard Forest and San Joaquin Experimental Range.

Answers

1) Use the overlay() function to subtract the two rasters & create the CHM.

CHM_ov_SJER <- overlay(DSM_SJER,
                       DTM_SJER,
                       fun = function(r1, r2){ return(r1 - r2) })

Convert the output to a dataframe:

CHM_ov_SJER_df <- as.data.frame(CHM_ov_SJER, xy = TRUE)

Create a histogram to check that the data distribution makes sense:

ggplot(CHM_ov_SJER_df) +
    geom_histogram(aes(layer))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

plot of chunk sjer-chm-overlay-hist

2) Create a plot of the CHM:

 ggplot() +
      geom_raster(data = CHM_ov_SJER_df, 
              aes(x = x, y = y, 
                   fill = layer)
              ) + 
     scale_fill_gradientn(name = "Canopy Height", 
        colors = terrain.colors(10)) + 
     coord_quickmap()

plot of chunk sjer-chm-overlay-raster

3) Export the CHM object to a file:

writeRaster(CHM_ov_SJER, "chm_ov_SJER.tiff",
            format = "GTiff",
            overwrite = TRUE,
            NAflag = -9999)

4) Compare the SJER and HARV CHMs. Tree heights are much shorter in SJER. You can confirm this by looking at the histograms of the two CHMs.

ggplot(CHM_HARV_df) +
    geom_histogram(aes(layer))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Warning: Removed 1 rows containing non-finite values (stat_bin).

plot of chunk compare-chm-harv-sjer

ggplot(CHM_ov_SJER_df) +
    geom_histogram(aes(layer))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

plot of chunk compare-chm-harv-sjer

Key Points

  • Rasters can be computed on using mathematical functions.

  • The overlay() function provides an efficient way to do raster math.

  • The writeRaster() function can be used to write raster data to a file.


Work With Multi-Band Rasters

Overview

Teaching: 40 min
Exercises: 20 min
Questions
  • How can I visualize individual and multiple bands in a raster object?

Objectives
  • Identify a single vs. a multi-band raster file.

  • Import multi-band rasters into R using the raster package.

  • Plot multi-band color image rasters in R using the ggplot package.

Things You’ll Need To Complete This Episode

See the lesson homepage for detailed information about the software, data, and other prerequisites you will need to work through the examples in this episode.

We introduced multi-band raster data in an earlier lesson. This episode explores how to import and plot a multi-band raster in R.

Getting Started with Multi-Band Data in R

In this episode, the multi-band data that we are working with is imagery collected using the NEON Airborne Observation Platform high resolution camera over the NEON Harvard Forest field site. Each RGB image is a 3-band raster. The same steps would apply to working with a multi-spectral image with 4 or more bands - like Landsat imagery.

If we read a RasterStack object into R using the raster() function, it only reads in the first band.

RGB_band1_HARV <- raster("data/NEON-DS-Airborne-Remote-Sensing/HARV/RGB_Imagery/HARV_RGB_Ortho.tif")

We need to convert this data to a data frame in order to plot it with ggplot.

RGB_band1_HARV_df  <- as.data.frame(RGB_band1_HARV, xy = TRUE)
ggplot() +
  geom_raster(data = RGB_band1_HARV_df,
              aes(x = x, y = y, alpha = HARV_RGB_Ortho)) + 
  coord_quickmap()

plot of chunk harv-rgb-band1

Challenge

View the attributes of this band. What are its dimensions, CRS, resolution, min and max values, and band number?

Solution

RGB_band1_HARV
class      : RasterLayer 
band       : 1  (of  3  bands)
dimensions : 2317, 3073, 7120141  (nrow, ncol, ncell)
resolution : 0.25, 0.25  (x, y)
extent     : 731998.5, 732766.8, 4712956, 4713536  (xmin, xmax, ymin, ymax)
crs        : +proj=utm +zone=18 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0 
source     : /home/runner/work/r-raster-vector-geospatial/r-raster-vector-geospatial/_episodes_rmd/data/NEON-DS-Airborne-Remote-Sensing/HARV/RGB_Imagery/HARV_RGB_Ortho.tif 
names      : HARV_RGB_Ortho 
values     : 0, 255  (min, max)

Notice that when we look at the attributes of this band, we see: band: 1 (of 3 bands)

This is R telling us that this particular raster object has more bands (3) associated with it.

Data Tip

The number of bands associated with a raster object can also be determined using the nbands() function: syntax is nbands(RGB_band1_HARV).

Image Raster Data Values

As we saw in the previous exercise, this raster contains values between 0 and 255. These values represent degrees of brightness associated with the image band. In the case of a RGB image (red, green and blue), band 1 is the red band. When we plot the red band, larger numbers (towards 255) represent pixels with more red in them (a strong red reflection). Smaller numbers (towards 0) represent pixels with less red in them (less red was reflected). To plot an RGB image, we mix red + green + blue values into one single color to create a full color image - similar to the color image a digital camera creates.

Import A Specific Band

We can use the raster() function to import specific bands in our raster object by specifying which band we want with band = N (N represents the band number we want to work with). To import the green band, we would use band = 2.

RGB_band2_HARV <-  raster("data/NEON-DS-Airborne-Remote-Sensing/HARV/RGB_Imagery/HARV_RGB_Ortho.tif", band = 2)

We can convert this data to a data frame and plot the same way we plotted the red band:

RGB_band2_HARV_df <- as.data.frame(RGB_band2_HARV, xy = TRUE)
ggplot() +
  geom_raster(data = RGB_band2_HARV_df,
              aes(x = x, y = y, alpha = HARV_RGB_Ortho)) + 
  coord_equal()

plot of chunk rgb-harv-band2

Challenge: Making Sense of Single Band Images

Compare the plots of band 1 (red) and band 2 (green). Is the forested area darker or lighter in band 2 (the green band) compared to band 1 (the red band)?

Solution

We’d expect a brighter value for the forest in band 2 (green) than in band 1 (red) because the leaves on trees of most often appear “green” - healthy leaves reflect MORE green light than red light.

Raster Stacks in R

Next, we will work with all three image bands (red, green and blue) as an R RasterStack object. We will then plot a 3-band composite, or full color, image.

To bring in all bands of a multi-band raster, we use thestack() function.

RGB_stack_HARV <- stack("data/NEON-DS-Airborne-Remote-Sensing/HARV/RGB_Imagery/HARV_RGB_Ortho.tif")

Let’s preview the attributes of our stack object:

RGB_stack_HARV
class      : RasterStack 
dimensions : 2317, 3073, 7120141, 3  (nrow, ncol, ncell, nlayers)
resolution : 0.25, 0.25  (x, y)
extent     : 731998.5, 732766.8, 4712956, 4713536  (xmin, xmax, ymin, ymax)
crs        : +proj=utm +zone=18 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0 
names      : HARV_RGB_Ortho.1, HARV_RGB_Ortho.2, HARV_RGB_Ortho.3 
min values :                0,                0,                0 
max values :              255,              255,              255 

We can view the attributes of each band in the stack in a single output:

RGB_stack_HARV@layers
[[1]]
class      : RasterLayer 
band       : 1  (of  3  bands)
dimensions : 2317, 3073, 7120141  (nrow, ncol, ncell)
resolution : 0.25, 0.25  (x, y)
extent     : 731998.5, 732766.8, 4712956, 4713536  (xmin, xmax, ymin, ymax)
crs        : +proj=utm +zone=18 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0 
source     : /home/runner/work/r-raster-vector-geospatial/r-raster-vector-geospatial/_episodes_rmd/data/NEON-DS-Airborne-Remote-Sensing/HARV/RGB_Imagery/HARV_RGB_Ortho.tif 
names      : HARV_RGB_Ortho.1 
values     : 0, 255  (min, max)


[[2]]
class      : RasterLayer 
band       : 2  (of  3  bands)
dimensions : 2317, 3073, 7120141  (nrow, ncol, ncell)
resolution : 0.25, 0.25  (x, y)
extent     : 731998.5, 732766.8, 4712956, 4713536  (xmin, xmax, ymin, ymax)
crs        : +proj=utm +zone=18 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0 
source     : /home/runner/work/r-raster-vector-geospatial/r-raster-vector-geospatial/_episodes_rmd/data/NEON-DS-Airborne-Remote-Sensing/HARV/RGB_Imagery/HARV_RGB_Ortho.tif 
names      : HARV_RGB_Ortho.2 
values     : 0, 255  (min, max)


[[3]]
class      : RasterLayer 
band       : 3  (of  3  bands)
dimensions : 2317, 3073, 7120141  (nrow, ncol, ncell)
resolution : 0.25, 0.25  (x, y)
extent     : 731998.5, 732766.8, 4712956, 4713536  (xmin, xmax, ymin, ymax)
crs        : +proj=utm +zone=18 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0 
source     : /home/runner/work/r-raster-vector-geospatial/r-raster-vector-geospatial/_episodes_rmd/data/NEON-DS-Airborne-Remote-Sensing/HARV/RGB_Imagery/HARV_RGB_Ortho.tif 
names      : HARV_RGB_Ortho.3 
values     : 0, 255  (min, max)

If we have hundreds of bands, we can specify which band we’d like to view attributes for using an index value:

RGB_stack_HARV[[2]]
class      : RasterLayer 
band       : 2  (of  3  bands)
dimensions : 2317, 3073, 7120141  (nrow, ncol, ncell)
resolution : 0.25, 0.25  (x, y)
extent     : 731998.5, 732766.8, 4712956, 4713536  (xmin, xmax, ymin, ymax)
crs        : +proj=utm +zone=18 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0 
source     : /home/runner/work/r-raster-vector-geospatial/r-raster-vector-geospatial/_episodes_rmd/data/NEON-DS-Airborne-Remote-Sensing/HARV/RGB_Imagery/HARV_RGB_Ortho.tif 
names      : HARV_RGB_Ortho.2 
values     : 0, 255  (min, max)

We can also use the ggplot functions to plot the data in any layer of our RasterStack object. Remember, we need to convert to a data frame first.

RGB_stack_HARV_df  <- as.data.frame(RGB_stack_HARV, xy = TRUE)

Each band in our RasterStack gets its own column in the data frame. Thus we have:

str(RGB_stack_HARV_df)
'data.frame':	7120141 obs. of  5 variables:
 $ x               : num  731999 731999 731999 731999 732000 ...
 $ y               : num  4713535 4713535 4713535 4713535 4713535 ...
 $ HARV_RGB_Ortho.1: num  0 2 6 0 16 0 0 6 1 5 ...
 $ HARV_RGB_Ortho.2: num  1 0 9 0 5 0 4 2 1 0 ...
 $ HARV_RGB_Ortho.3: num  0 10 1 0 17 0 4 0 0 7 ...

Let’s create a histogram of the first band:

ggplot() +
  geom_histogram(data = RGB_stack_HARV_df, aes(HARV_RGB_Ortho.1))
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

plot of chunk rgb-harv-hist-band1

And a raster plot of the second band:

ggplot() +
  geom_raster(data = RGB_stack_HARV_df,
              aes(x = x, y = y, alpha = HARV_RGB_Ortho.2)) + 
  coord_quickmap()

plot of chunk rgb-harv-plot-band2

We can access any individual band in the same way.

Create A Three Band Image

To render a final three band, colored image in R, we use the plotRGB() function.

This function allows us to:

  1. Identify what bands we want to render in the red, green and blue regions. The plotRGB() function defaults to a 1=red, 2=green, and 3=blue band order. However, you can define what bands you’d like to plot manually. Manual definition of bands is useful if you have, for example a near-infrared band and want to create a color infrared image.
  2. Adjust the stretch of the image to increase or decrease contrast.

Let’s plot our 3-band image. Note that we can use the plotRGB() function directly with our RasterStack object (we don’t need a dataframe as this function isn’t part of the ggplot2 package).

plotRGB(RGB_stack_HARV,
        r = 1, g = 2, b = 3)

plot of chunk plot-rgb-image

The image above looks pretty good. We can explore whether applying a stretch to the image might improve clarity and contrast using stretch="lin" or stretch="hist".

Image Stretch

When the range of pixel brightness values is closer to 0, a darker image is rendered by default. We can stretch the values to extend to the full 0-255 range of potential values to increase the visual contrast of the image.

Image Stretch light

When the range of pixel brightness values is closer to 255, a lighter image is rendered by default. We can stretch the values to extend to the full 0-255 range of potential values to increase the visual contrast of the image.

plotRGB(RGB_stack_HARV,
        r = 1, g = 2, b = 3,
        scale = 800,
        stretch = "lin")

plot of chunk plot-rbg-image-linear

plotRGB(RGB_stack_HARV,
        r = 1, g = 2, b = 3,
        scale = 800,
        stretch = "hist")

plot of chunk plot-rgb-image-hist

In this case, the stretch doesn’t enhance the contrast our image significantly given the distribution of reflectance (or brightness) values is distributed well between 0 and 255.

Challenge - NoData Values

Let’s explore what happens with NoData values when working with RasterStack objects and using the plotRGB() function. We will use the HARV_Ortho_wNA.tif GeoTIFF file in the NEON-DS-Airborne-Remote-Sensing/HARVRGB_Imagery/ directory.

  1. View the files attributes. Are there NoData values assigned for this file?
  2. If so, what is the NoData Value?
  3. How many bands does it have?
  4. Load the multi-band raster file into R.
  5. Plot the object as a true color image.
  6. What happened to the black edges in the data?
  7. What does this tell us about the difference in the data structure between HARV_Ortho_wNA.tif and HARV_RGB_Ortho.tif (R object RGB_stack). How can you check?

Answers

1) First we use the GDALinfo() function to view the data attributes.

GDALinfo("data/NEON-DS-Airborne-Remote-Sensing/HARV/RGB_Imagery/HARV_Ortho_wNA.tif")
rows        2317 
columns     3073 
bands       3 
lower left origin.x        731998.5 
lower left origin.y        4712956 
res.x       0.25 
res.y       0.25 
ysign       -1 
oblique.x   0 
oblique.y   0 
driver      GTiff 
projection  +proj=utm +zone=18 +datum=WGS84 +units=m +no_defs 
file        data/NEON-DS-Airborne-Remote-Sensing/HARV/RGB_Imagery/HARV_Ortho_wNA.tif 
apparent band summary:
   GDType hasNoDataValue NoDataValue blockSize1 blockSize2
1 Float64           TRUE       -9999          1       3073
2 Float64           TRUE       -9999          1       3073
3 Float64           TRUE       -9999          1       3073
apparent band statistics:
  Bmin Bmax     Bmean      Bsd
1    0  255 107.83651 30.01918
2    0  255 130.09595 32.00168
3    0  255  95.75979 16.57704
Metadata:
AREA_OR_POINT=Area 

2) From the output above, we see that there are NoData values and they are assigned the value of -9999.

3) The data has three bands.

4) To read in the file, we will use the stack() function:

HARV_NA <- stack("data/NEON-DS-Airborne-Remote-Sensing/HARV/RGB_Imagery/HARV_Ortho_wNA.tif")

5) We can plot the data with the plotRGB() function:

plotRGB(HARV_NA,
        r = 1, g = 2, b = 3)

plot of chunk harv-na-rgb

6) The black edges are not plotted. 7) Both data sets have NoData values, however, in the RGB_stack the NoData value is not defined in the tiff tags, thus R renders them as black as the reflectance values are 0. The black edges in the other file are defined as -9999 and R renders them as NA.

GDALinfo("data/NEON-DS-Airborne-Remote-Sensing/HARV/RGB_Imagery/HARV_RGB_Ortho.tif")
rows        2317 
columns     3073 
bands       3 
lower left origin.x        731998.5 
lower left origin.y        4712956 
res.x       0.25 
res.y       0.25 
ysign       -1 
oblique.x   0 
oblique.y   0 
driver      GTiff 
projection  +proj=utm +zone=18 +datum=WGS84 +units=m +no_defs 
file        data/NEON-DS-Airborne-Remote-Sensing/HARV/RGB_Imagery/HARV_RGB_Ortho.tif 
apparent band summary:
   GDType hasNoDataValue NoDataValue blockSize1 blockSize2
1 Float64           TRUE   -1.7e+308          1       3073
2 Float64           TRUE   -1.7e+308          1       3073
3 Float64           TRUE   -1.7e+308          1       3073
apparent band statistics:
  Bmin Bmax Bmean Bsd
1    0  255   NaN NaN
2    0  255   NaN NaN
3    0  255   NaN NaN
Metadata:
AREA_OR_POINT=Area 

Data Tip

We can create a RasterStack from several, individual single-band GeoTIFFs too. We will do this in a later episode, Raster Time Series Data in R.

RasterStack vs RasterBrick in R

The R RasterStack and RasterBrick object types can both store multiple bands. However, how they store each band is different. The bands in a RasterStack are stored as links to raster data that is located somewhere on our computer. A RasterBrick contains all of the objects stored within the actual R object. In most cases, we can work with a RasterBrick in the same way we might work with a RasterStack. However a RasterBrick is often more efficient and faster to process - which is important when working with larger files.

More Resources

You can read the help for the brick() function by typing ?brick.

We can turn a RasterStack into a RasterBrick in R by using brick(StackName). Let’s use the object.size() function to compare RasterStack and RasterBrick objects. First we will check the size of our RasterStack object:

object.size(RGB_stack_HARV)
44344 bytes

Now we will create a RasterBrick object from our RasterStack data and view its size:

RGB_brick_HARV <- brick(RGB_stack_HARV)

object.size(RGB_brick_HARV)
170897168 bytes

Notice that in the RasterBrick, all of the bands are stored within the actual object. Thus, the RasterBrick object size is much larger than the RasterStack object.

You use the plotRGB() function to plot a RasterBrick too:

plotRGB(RGB_brick_HARV)

plot of chunk plot-brick

Challenge: What Functions Can Be Used on an R Object of a particular class?

We can view various functions (or methods) available to use on an R object with methods(class=class(objectNameHere)). Use this to figure out:

  1. What methods can be used on the RGB_stack_HARV object?
  2. What methods can be used on a single band within RGB_stack_HARV?
  3. Why do you think there is a difference?

Answers

1) We can see a list of all of the methods available for our RasterStack object:

methods(class=class(RGB_stack_HARV))
  [1] !                     !=                    [                    
  [4] [[                    [[<-                  [<-                  
  [7] %in%                  ==                    $                    
 [10] $<-                   addLayer              adjacent             
 [13] aggregate             all.equal             animate              
 [16] approxNA              area                  Arith                
 [19] as.array              as.character          as.data.frame        
 [22] as.integer            as.list               as.logical           
 [25] as.matrix             as.vector             atan2                
 [28] bbox                  boxplot               brick                
 [31] calc                  cellFromRowCol        cellFromRowColCombine
 [34] cellFromXY            cellStats             clamp                
 [37] click                 coerce                colFromCell          
 [40] colFromX              colSums               Compare              
 [43] coordinates           corLocal              couldBeLonLat        
 [46] cover                 crop                  crosstab             
 [49] crs<-                 cut                   cv                   
 [52] density               dim                   dim<-                
 [55] disaggregate          dropLayer             extend               
 [58] extent                extract               flip                 
 [61] freq                  getValues             getValuesBlock       
 [64] getValuesFocal        hasValues             head                 
 [67] hist                  image                 init                 
 [70] interpolate           intersect             is.factor            
 [73] is.finite             is.infinite           is.na                
 [76] is.nan                isLonLat              KML                  
 [79] labels                length                levels               
 [82] levels<-              log                   Logic                
 [85] mask                  match                 Math                 
 [88] Math2                 maxValue              mean                 
 [91] merge                 minValue              modal                
 [94] mosaic                names                 names<-              
 [97] ncell                 ncol                  ncol<-               
[100] nlayers               nrow                  nrow<-               
[103] origin                origin<-              overlay              
[106] pairs                 persp                 plot                 
[109] plotRGB               predict               print                
[112] proj4string           proj4string<-         quantile             
[115] raster                rasterize             ratify               
[118] readAll               readStart             readStop             
[121] reclassify            rectify               res                  
[124] res<-                 resample              rotate               
[127] rowColFromCell        rowFromCell           rowFromY             
[130] rowSums               sampleRandom          sampleRegular        
[133] scale                 select                setMinMax            
[136] setValues             shift                 show                 
[139] spplot                stack                 stackSelect          
[142] stretch               subs                  subset               
[145] Summary               summary               t                    
[148] tail                  text                  trim                 
[151] unique                unstack               values               
[154] values<-              weighted.mean         which.max            
[157] which.min             whiches.max           whiches.min          
[160] wkt                   writeRaster           xFromCell            
[163] xFromCol              xmax                  xmax<-               
[166] xmin                  xmin<-                xres                 
[169] xyFromCell            yFromCell             yFromRow             
[172] ymax                  ymax<-                ymin                 
[175] ymin<-                yres                  zonal                
[178] zoom                 
see '?methods' for accessing help and source code

2) And compare that with the methods available for a single band:

methods(class=class(RGB_stack_HARV[1]))
Warning in .S3methods(generic.function, class, envir): 'class' is of length > 1;
only the first element will be used
 [1] [             [<-           anyDuplicated as_tibble     as.data.frame
 [6] as.raster     bind          boxplot       brick         coerce       
[11] coordinates   determinant   duplicated    edit          extent       
[16] extract       head          initialize    isSymmetric   Math         
[21] Math2         Ops           raster        rasterize     relist       
[26] subset        summary       surfaceArea   tail          trim         
[31] unique        weighted.mean writeValues  
see '?methods' for accessing help and source code

3) There are far more things one could or want to ask of a full stack than of a single band.

Key Points

  • A single raster file can contain multiple bands or layers.

  • Use the stack() function to load all bands in a multi-layer raster file into R.

  • Individual bands within a stack can be accessed, analyzed, and visualized using the same functions as single bands.


Open and Plot Shapefiles

Overview

Teaching: 20 min
Exercises: 10 min
Questions
  • How can I distinguish between and visualize point, line and polygon vector data?

Objectives
  • Know the difference between point, line, and polygon vector elements.

  • Load point, line, and polygon shapefiles into R.

  • Access the attributes of a spatial object in R.

Things You’ll Need To Complete This Episode

See the lesson homepage for detailed information about the software, data, and other prerequisites you will need to work through the examples in this episode.

Starting with this episode, we will be moving from working with raster data to working with vector data. In this episode, we will open and plot point, line and polygon vector data stored in shapefile format in R. These data refer to the NEON Harvard Forest field site, which we have been working with in previous episodes. In later episodes, we will learn how to work with raster and vector data together and combine them into a single plot.

Import Shapefiles

We will use the sf package to work with vector data in R. Notice that the rgdal package automatically loads when sf is loaded. We will also use the raster package, which has been loaded in previous episodes, so we can explore raster and vector spatial metadata using similar commands. Make sure you have the sf library loaded.

library(sf)

The shapefiles that we will import are:

The first shapefile that we will open contains the boundary of our study area (or our Area Of Interest or AOI, hence the name aoiBoundary). To import shapefiles we use the sf function st_read(). st_read() requires the file path to the shapefile.

Let’s import our AOI:

aoi_boundary_HARV <- st_read(
  "data/NEON-DS-Site-Layout-Files/HARV/HarClip_UTMZ18.shp")
Reading layer `HarClip_UTMZ18' from data source `/home/runner/work/r-raster-vector-geospatial/r-raster-vector-geospatial/_episodes_rmd/data/NEON-DS-Site-Layout-Files/HARV/HarClip_UTMZ18.shp' using driver `ESRI Shapefile'
Simple feature collection with 1 feature and 1 field
geometry type:  POLYGON
dimension:      XY
bbox:           xmin: 732128 ymin: 4713209 xmax: 732251.1 ymax: 4713359
CRS:            32618

Shapefile Metadata & Attributes

When we import the HarClip_UTMZ18 shapefile layer into R (as our aoi_boundary_HARV object), the st_read() function automatically stores information about the data. We are particularly interested in the geospatial metadata, describing the format, CRS, extent, and other components of the vector data, and the attributes which describe properties associated with each individual vector object.

Data Tip

The Explore and Plot by Shapefile Attributes episode provides more information on both metadata and attributes and using attributes to subset and plot data.

Spatial Metadata

Key metadata for all shapefiles include:

  1. Object Type: the class of the imported object.
  2. Coordinate Reference System (CRS): the projection of the data.
  3. Extent: the spatial extent (i.e. geographic area that the shapefile covers) of the shapefile. Note that the spatial extent for a shapefile represents the combined extent for all spatial objects in the shapefile.

We can view shapefile metadata using the st_geometry_type(), st_crs() and st_bbox() functions. First, let’s view the geometry type for our AOI shapefile:

st_geometry_type(aoi_boundary_HARV)
[1] POLYGON
18 Levels: GEOMETRY POINT LINESTRING POLYGON MULTIPOINT ... TRIANGLE

Our aoi_boundary_HARV is a polygon object. The 18 levels shown below our output list the possible categories of the geometry type. Now let’s check what CRS this file data is in:

st_crs(aoi_boundary_HARV)
Coordinate Reference System:
  User input: 32618 
  wkt:
PROJCS["WGS_1984_UTM_Zone_18N",
    GEOGCS["GCS_WGS_1984",
        DATUM["WGS_1984",
            SPHEROID["WGS_84",6378137,298.257223563]],
        PRIMEM["Greenwich",0],
        UNIT["Degree",0.017453292519943295],
        AUTHORITY["EPSG","4326"]],
    PROJECTION["Transverse_Mercator"],
    PARAMETER["latitude_of_origin",0],
    PARAMETER["central_meridian",-75],
    PARAMETER["scale_factor",0.9996],
    PARAMETER["false_easting",500000],
    PARAMETER["false_northing",0],
    UNIT["Meter",1],
    AUTHORITY["EPSG","32618"]]

Our data in the CRS UTM zone 18N. The CRS is critical to interpreting the object’s extent values as it specifies units. To find the extent of our AOI, we can use the st_bbox() function:

st_bbox(aoi_boundary_HARV)
     xmin      ymin      xmax      ymax 
 732128.0 4713208.7  732251.1 4713359.2 

The spatial extent of a shapefile or R spatial object represents the geographic “edge” or location that is the furthest north, south east and west. Thus is represents the overall geographic coverage of the spatial object. Image Source: National Ecological Observatory Network (NEON).

Extent image

Lastly, we can view all of the metadata and attributes for this shapefile object by printing it to the screen:

aoi_boundary_HARV
Simple feature collection with 1 feature and 1 field
geometry type:  POLYGON
dimension:      XY
bbox:           xmin: 732128 ymin: 4713209 xmax: 732251.1 ymax: 4713359
CRS:            32618
  id                       geometry
1  1 POLYGON ((732128 4713359, 7...

Spatial Data Attributes

We introduced the idea of spatial data attributes in an earlier lesson. Now we will explore how to use spatial data attributes stored in our data to plot different features.

Plot a Shapefile

Next, let’s visualize the data in our sf object using the ggplot package. Unlike with raster data, we do not need to convert vector data to a dataframe before plotting with ggplot.

We’re going to customize our boundary plot by setting the size, color, and fill for our plot. When plotting sf objects with ggplot2, you need to use the coord_sf() coordinate system.

ggplot() + 
  geom_sf(data = aoi_boundary_HARV, size = 3, color = "black", fill = "cyan1") + 
  ggtitle("AOI Boundary Plot") + 
  coord_sf()

plot of chunk plot-shapefile

Challenge: Import Line and Point Shapefiles

Using the steps above, import the HARV_roads and HARVtower_UTM18N layers into R. Call the HARV_roads object lines_HARV and the HARVtower_UTM18N point_HARV.

Answer the following questions:

  1. What type of R spatial object is created when you import each layer?

  2. What is the CRS and extent for each object?

  3. Do the files contain points, lines, or polygons?

  4. How many spatial objects are in each file?

Answers

First we import the data:

lines_HARV <- st_read("data/NEON-DS-Site-Layout-Files/HARV/HARV_roads.shp")
Reading layer `HARV_roads' from data source `/home/runner/work/r-raster-vector-geospatial/r-raster-vector-geospatial/_episodes_rmd/data/NEON-DS-Site-Layout-Files/HARV/HARV_roads.shp' using driver `ESRI Shapefile'
Simple feature collection with 13 features and 15 fields
geometry type:  MULTILINESTRING
dimension:      XY
bbox:           xmin: 730741.2 ymin: 4711942 xmax: 733295.5 ymax: 4714260
CRS:            32618
point_HARV <- st_read("data/NEON-DS-Site-Layout-Files/HARV/HARVtower_UTM18N.shp")
Reading layer `HARVtower_UTM18N' from data source `/home/runner/work/r-raster-vector-geospatial/r-raster-vector-geospatial/_episodes_rmd/data/NEON-DS-Site-Layout-Files/HARV/HARVtower_UTM18N.shp' using driver `ESRI Shapefile'
Simple feature collection with 1 feature and 14 fields
geometry type:  POINT
dimension:      XY
bbox:           xmin: 732183.2 ymin: 4713265 xmax: 732183.2 ymax: 4713265
CRS:            32618

Then we check its class:

class(lines_HARV)
[1] "sf"         "data.frame"
class(point_HARV)
[1] "sf"         "data.frame"

We also check the CRS and extent of each object:

st_crs(lines_HARV)
Coordinate Reference System:
  User input: 32618 
  wkt:
PROJCS["WGS_1984_UTM_Zone_18N",
    GEOGCS["GCS_WGS_1984",
        DATUM["WGS_1984",
            SPHEROID["WGS_84",6378137,298.257223563]],
        PRIMEM["Greenwich",0],
        UNIT["Degree",0.017453292519943295],
        AUTHORITY["EPSG","4326"]],
    PROJECTION["Transverse_Mercator"],
    PARAMETER["latitude_of_origin",0],
    PARAMETER["central_meridian",-75],
    PARAMETER["scale_factor",0.9996],
    PARAMETER["false_easting",500000],
    PARAMETER["false_northing",0],
    UNIT["Meter",1],
    AUTHORITY["EPSG","32618"]]
st_bbox(lines_HARV)
     xmin      ymin      xmax      ymax 
 730741.2 4711942.0  733295.5 4714260.0 
st_crs(point_HARV)
Coordinate Reference System:
  User input: 32618 
  wkt:
PROJCS["WGS_1984_UTM_Zone_18N",
    GEOGCS["GCS_WGS_1984",
        DATUM["WGS_1984",
            SPHEROID["WGS_84",6378137,298.257223563]],
        PRIMEM["Greenwich",0],
        UNIT["Degree",0.017453292519943295],
        AUTHORITY["EPSG","4326"]],
    PROJECTION["Transverse_Mercator"],
    PARAMETER["latitude_of_origin",0],
    PARAMETER["central_meridian",-75],
    PARAMETER["scale_factor",0.9996],
    PARAMETER["false_easting",500000],
    PARAMETER["false_northing",0],
    UNIT["Meter",1],
    AUTHORITY["EPSG","32618"]]
st_bbox(point_HARV)
     xmin      ymin      xmax      ymax 
 732183.2 4713265.0  732183.2 4713265.0 

To see the number of objects in each file, we can look at the output from when we read these objects into R. lines_HARV contains 13 features (all lines) and point_HARV contains only one point.

Key Points

  • Shapefile metadata include geometry type, CRS, and extent.

  • Load spatial objects into R with the st_read() function.

  • Spatial objects can be plotted directly with ggplot using the geom_sf() function. No need to convert to a dataframe.


Explore and Plot by Vector Layer Attributes

Overview

Teaching: 40 min
Exercises: 20 min
Questions
  • How can I compute on the attributes of a spatial object?

Objectives
  • Query attributes of a spatial object.

  • Subset spatial objects using specific attribute values.

  • Plot a vector feature, colored by unique attribute values.

Things You’ll Need To Complete This Episode

See the lesson homepage for detailed information about the software, data, and other prerequisites you will need to work through the examples in this episode.

This episode continues our discussion of vector layer attributes and covers how to work with vector layer attributes in R. It covers how to identify and query layer attributes, as well as how to subset features by specific attribute values. Finally, we will learn how to plot a feature according to a set of attribute values.

Load the Data

We will continue using the sf, raster and ggplot2 packages in this episode. Make sure that you have these packages loaded. We will continue to work with the three shapefiles (vector layers) that we loaded in the Open and Plot Shapefiles in R episode.

Query Vector Feature Metadata

As we discussed in the Open and Plot Shapefiles in R episode, we can view metadata associated with an R object using:

We started to explore our point_HARV object in the previous episode. To see a summary of all of the metadata associated with our point_HARV object, we can view the object with View(point_HARV) or print a summary of the object itself to the console.

point_HARV
Simple feature collection with 1 feature and 14 fields
geometry type:  POINT
dimension:      XY
bbox:           xmin: 732183.2 ymin: 4713265 xmax: 732183.2 ymax: 4713265
CRS:            32618
  Un_ID Domain DomainName       SiteName Type       Sub_Type     Lat      Long
1     A      1  Northeast Harvard Forest Core Advanced Tower 42.5369 -72.17266
  Zone  Easting Northing                Ownership    County annotation
1   18 732183.2  4713265 Harvard University, LTER Worcester         C1
                  geometry
1 POINT (732183.2 4713265)

We can use the ncol function to count the number of attributes associated with a spatial object too. Note that the geometry is just another column and counts towards the total.

ncol(lines_HARV)
[1] 16

We can view the individual name of each attribute using the names() function in R:

names(lines_HARV)
 [1] "OBJECTID_1" "OBJECTID"   "TYPE"       "NOTES"      "MISCNOTES" 
 [6] "RULEID"     "MAPLABEL"   "SHAPE_LENG" "LABEL"      "BIKEHORSE" 
[11] "RESVEHICLE" "RECMAP"     "Shape_Le_1" "ResVehic_1" "BicyclesHo"
[16] "geometry"  

We could also view just the first 6 rows of attribute values using the head() function to get a preview of the data:

head(lines_HARV)
Simple feature collection with 6 features and 15 fields
geometry type:  MULTILINESTRING
dimension:      XY
bbox:           xmin: 730741.2 ymin: 4712685 xmax: 732232.3 ymax: 4713726
CRS:            32618
  OBJECTID_1 OBJECTID       TYPE             NOTES MISCNOTES RULEID
1         14       48 woods road Locust Opening Rd      <NA>      5
2         40       91   footpath              <NA>      <NA>      6
3         41      106   footpath              <NA>      <NA>      6
4        211      279 stone wall              <NA>      <NA>      1
5        212      280 stone wall              <NA>      <NA>      1
6        213      281 stone wall              <NA>      <NA>      1
           MAPLABEL SHAPE_LENG             LABEL BIKEHORSE RESVEHICLE RECMAP
1 Locust Opening Rd 1297.35706 Locust Opening Rd         Y         R1      Y
2              <NA>  146.29984              <NA>         Y         R1      Y
3              <NA>  676.71804              <NA>         Y         R2      Y
4              <NA>  231.78957              <NA>      <NA>       <NA>   <NA>
5              <NA>   45.50864              <NA>      <NA>       <NA>   <NA>
6              <NA>  198.39043              <NA>      <NA>       <NA>   <NA>
  Shape_Le_1                            ResVehic_1                  BicyclesHo
1 1297.10617    R1 - All Research Vehicles Allowed Bicycles and Horses Allowed
2  146.29983    R1 - All Research Vehicles Allowed Bicycles and Horses Allowed
3  676.71807 R2 - 4WD/High Clearance Vehicles Only Bicycles and Horses Allowed
4  231.78962                                  <NA>                        <NA>
5   45.50859                                  <NA>                        <NA>
6  198.39041                                  <NA>                        <NA>
                        geometry
1 MULTILINESTRING ((730819.2 ...
2 MULTILINESTRING ((732040.2 ...
3 MULTILINESTRING ((732057 47...
4 MULTILINESTRING ((731903.6 ...
5 MULTILINESTRING ((732039.1 ...
6 MULTILINESTRING ((732056.2 ...

Challenge: Attributes for Different Spatial Classes

Explore the attributes associated with the point_HARV and aoi_boundary_HARV spatial objects.

  1. How many attributes does each have?
  2. Who owns the site in the point_HARV data object?
  3. Which of the following is NOT an attribute of the point_HARV data object?

    A) Latitude B) County C) Country

Answers

1) To find the number of attributes, we use the ncol() function:

ncol(point_HARV)
[1] 15
ncol(aoi_boundary_HARV)
[1] 2

2) Ownership information is in a column named Ownership:

point_HARV$Ownership
[1] "Harvard University, LTER"

3) To see a list of all of the attributes, we can use the names() function:

names(point_HARV)
 [1] "Un_ID"      "Domain"     "DomainName" "SiteName"   "Type"      
 [6] "Sub_Type"   "Lat"        "Long"       "Zone"       "Easting"   
[11] "Northing"   "Ownership"  "County"     "annotation" "geometry"  

“Country” is not an attribute of this object.

Explore Values within One Attribute

We can explore individual values stored within a particular attribute. Comparing attributes to a spreadsheet or a data frame, this is similar to exploring values in a column. We did this with the gapminder dataframe in an earlier lesson. For spatial objects, we can use the same syntax: objectName$attributeName.

We can see the contents of the TYPE field of our lines feature:

lines_HARV$TYPE
 [1] "woods road" "footpath"   "footpath"   "stone wall" "stone wall"
 [6] "stone wall" "stone wall" "stone wall" "stone wall" "boardwalk" 
[11] "woods road" "woods road" "woods road"

To see only unique values within the TYPE field, we can use the levels() function for extracting the possible values of a categorical variable. The special term for categorical variables within R is factor. We worked with factors a little bit in an earlier lesson.

levels(lines_HARV$TYPE)
NULL

Subset Features

We can use the filter() function from dplyr that we worked with in an earlier lesson to select a subset of features from a spatial object in R, just like with data frames.

For example, we might be interested only in features that are of TYPE “footpath”. Once we subset out this data, we can use it as input to other code so that code only operates on the footpath lines.

footpath_HARV <- lines_HARV %>% 
  filter(TYPE == "footpath")
nrow(footpath_HARV)
[1] 2

Our subsetting operation reduces the features count to 2. This means that only two feature lines in our spatial object have the attribute TYPE == footpath. We can plot only the footpath lines:

ggplot() + 
  geom_sf(data = footpath_HARV) +
  ggtitle("NEON Harvard Forest Field Site", subtitle = "Footpaths") + 
  coord_sf()

plot of chunk plot-subset-shapefile

There are two features in our footpaths subset. Why does the plot look like there is only one feature? Let’s adjust the colors used in our plot. If we have 2 features in our vector object, we can plot each using a unique color by assigning a column name to the color aesthetic (color =). We use the syntax aes(color = ) to do this. We can also alter the default line thickness by using the size = parameter, as the default value of 0.5 can be hard to see. Note that size is placed outside of the aes() function, as we are not connecting line thickness to a data variable.

ggplot() + 
  geom_sf(data = footpath_HARV, aes(color = factor(OBJECTID)), size = 1.5) +
  labs(color = 'Footpath ID') +
  ggtitle("NEON Harvard Forest Field Site", subtitle = "Footpaths") + 
  coord_sf()

plot of chunk plot-subset-shapefile-unique-colors

Now, we see that there are in fact two features in our plot!

Challenge: Subset Spatial Line Objects Part 1

Subset out all boardwalk from the lines layer and plot it.

Answers

First we will save an object with only the boardwalk lines:

boardwalk_HARV <- lines_HARV %>% 
  filter(TYPE == "boardwalk")

Let’s check how many features there are in this subset:

nrow(boardwalk_HARV)
[1] 1

Now let’s plot that data:

ggplot() + 
  geom_sf(data = boardwalk_HARV, size = 1.5) +
  ggtitle("NEON Harvard Forest Field Site", subtitle = "Boardwalks") + 
  coord_sf()

plot of chunk harv-boardwalk-map

Challenge: Subset Spatial Line Objects Part 2

Subset out all stone wall features from the lines layer and plot it. For each plot, color each feature using a unique color.

Answer

First we will save an object with only the stone wall lines and check the number of features:

stoneWall_HARV <- lines_HARV %>% 
  filter(TYPE == "stone wall")
nrow(stoneWall_HARV)
[1] 6

Now we can plot the data:

ggplot() +
  geom_sf(data = stoneWall_HARV, aes(color = factor(OBJECTID)), size = 1.5) +
  labs(color = 'Wall ID') +
  ggtitle("NEON Harvard Forest Field Site", subtitle = "Stonewalls") + 
  coord_sf()

plot of chunk harv-stone-wall-map

Customize Plots

In the examples above, ggplot() automatically selected colors for each line based on a default color order. If we don’t like those default colors, we can create a vector of colors - one for each feature. To create this vector we can use the following syntax:

c("color_one", "color_two", "color_three")[object$factor]

Note in the above example we have

  1. a vector of colors - one for each factor value (unique attribute value)
  2. the attribute itself ([object$factor]) of class factor.

First we will check how many unique levels our factor has:

levels(lines_HARV$TYPE)
NULL

Then we can create a palette of four colors, one for each feature in our vector object.

road_colors <- c("blue", "green", "navy", "purple")

We can tell ggplot to use these colors when we plot the data.

ggplot() +
  geom_sf(data = lines_HARV, aes(color = TYPE)) + 
  scale_color_manual(values = road_colors) +
  labs(color = 'Road Type') +
  ggtitle("NEON Harvard Forest Field Site", subtitle = "Roads & Trails") + 
  coord_sf()

plot of chunk harv-paths-map

Adjust Line Width

We adjusted line width universally earlier. If we want a unique line width for each factor level or attribute category in our spatial object, we can use the same syntax that we used for colors, above.

We already know that we have four different TYPE levels in the lines_HARV object, so we will set four different line widths.

line_widths <- c(1, 2, 3, 4)

We can use those line widths when we plot the data.

ggplot() +
  geom_sf(data = lines_HARV, aes(color = TYPE, size = TYPE)) + 
  scale_color_manual(values = road_colors) +
  labs(color = 'Road Type') +
  scale_size_manual(values = line_widths) +
  ggtitle("NEON Harvard Forest Field Site", subtitle = "Roads & Trails - Line width varies") + 
  coord_sf()

plot of chunk harv-paths-map-wide

Note that we could also use aes(size = TYPE) to tie the line thickness to the TYPE variable, so long as we had been careful to set factor levels appropriately. ggplot prints a warning when you do this, because it is not considered a good practice to plot non-spatial data this way.

Challenge: Plot Line Width by Attribute

In the example above, we set the line widths to be 1, 2, 3, and 4. Because R orders factor levels alphabetically by default, this gave us a plot where woods roads (the last factor level) were the thickest and boardwalks were the thinnest.

Let’s create another plot where we show the different line types with the following thicknesses:

  1. woods road size = 6
  2. boardwalks size = 1
  3. footpath size = 3
  4. stone wall size = 2

Answers

First we need to look at the levels of our factor to see what order the road types are in:

levels(lines_HARV$TYPE)
NULL

We then can create our line_width vector setting each of the levels to the desired thickness.

line_width <- c(1, 3, 2, 6)

Now we can create our plot.

ggplot() +
  geom_sf(data = lines_HARV, aes(size = TYPE)) +
  scale_size_manual(values = line_width) +
  ggtitle("NEON Harvard Forest Field Site", subtitle = "Roads & Trails - Line width varies") + 
  coord_sf()

plot of chunk harv-path-line-types

Add Plot Legend

We can add a legend to our plot too. When we add a legend, we use the following elements to specify labels and colors:

Let’s add a legend to our plot. We will use the road_colors object that we created above to color the legend. We can customize the appearance of our legend by manually setting different parameters.

ggplot() + 
  geom_sf(data = lines_HARV, aes(color = TYPE), size = 1.5) +
  scale_color_manual(values = road_colors) +
  labs(color = 'Road Type') + 
  ggtitle("NEON Harvard Forest Field Site", 
          subtitle = "Roads & Trails - Default Legend") + 
  coord_sf()

plot of chunk add-legend-to-plot

We can change the appearance of our legend by manually setting different parameters.

ggplot() + 
  geom_sf(data = lines_HARV, aes(color = TYPE), size = 1.5) +
  scale_color_manual(values = road_colors) + 
  labs(color = 'Road Type') +
  theme(legend.text = element_text(size = 20), 
        legend.box.background = element_rect(size = 1)) + 
  ggtitle("NEON Harvard Forest Field Site", 
          subtitle = "Roads & Trails - Modified Legend") +
  coord_sf()

plot of chunk modify-legend-plot

new_colors <- c("springgreen", "blue", "magenta", "orange")

ggplot() + 
  geom_sf(data = lines_HARV, aes(color = TYPE), size = 1.5) + 
  scale_color_manual(values = new_colors) +
  labs(color = 'Road Type') +
  theme(legend.text = element_text(size = 20), 
        legend.box.background = element_rect(size = 1)) + 
  ggtitle("NEON Harvard Forest Field Site", 
          subtitle = "Roads & Trails - Pretty Colors") +
  coord_sf()

plot of chunk plot-different-colors

Data Tip

You can modify the default R color palette using the palette method. For example palette(rainbow(6)) or palette(terrain.colors(6)). You can reset the palette colors using palette("default")!

Challenge: Plot Lines by Attribute

Create a plot that emphasizes only roads where bicycles and horses are allowed. To emphasize this, make the lines where bicycles are not allowed THINNER than the roads where bicycles are allowed. NOTE: this attribute information is located in the lines_HARV$BicyclesHo attribute.

Be sure to add a title and legend to your map. You might consider a color palette that has all bike/horse-friendly roads displayed in a bright color. All other lines can be black.

Answers

First we need to make sure that the BicyclesHo attribute is a factor and check how many levels it has.

class(lines_HARV$BicyclesHo)
[1] "character"
levels(lines_HARV$BicyclesHo)
NULL

Next, we will create a new object lines_removeNA that removes missing values.

lines_removeNA <- lines_HARV[!is.na(lines_HARV$BicyclesHo),] 

In our plot, we will set colors so that only the allowed roads are magenta, and we will set line width so that the first factor level is thicker than the others.

# First, create a data frame with only those roads where bicycles and horses are allowed
lines_showHarv <- lines_removeNA %>% filter(BicyclesHo == "Bicycles and Horses Allowed")

# Next, visualise using ggplot
ggplot() + 
  geom_sf(data = lines_HARV) + 
  geom_sf(data = lines_showHarv, aes(color = BicyclesHo), size = 2) + 
  scale_color_manual(values = "magenta") +
  ggtitle("NEON Harvard Forest Field Site", subtitle = "Roads Where Bikes and Horses Are Allowed") + 
  coord_sf()

plot of chunk harv-paths-bike-horses

Challenge: Plot Polygon by Attribute

  1. Create a map of the state boundaries in the United States using the data located in your downloaded data folder: NEON-DS-Site-Layout-Files/US-Boundary-Layers\US-State-Boundaries-Census-2014. Apply a fill color to each state using its region value. Add a legend.

Answers

First we read in the data and check how many levels there are in the region column:

state_boundary_US <- 
st_read("data/NEON-DS-Site-Layout-Files/US-Boundary-Layers/US-State-Boundaries-Census-2014.shp")
Reading layer `US-State-Boundaries-Census-2014' from data source `/home/runner/work/r-raster-vector-geospatial/r-raster-vector-geospatial/_episodes_rmd/data/NEON-DS-Site-Layout-Files/US-Boundary-Layers/US-State-Boundaries-Census-2014.shp' using driver `ESRI Shapefile'
Simple feature collection with 58 features and 10 fields
geometry type:  MULTIPOLYGON
dimension:      XYZ
bbox:           xmin: -124.7258 ymin: 24.49813 xmax: -66.9499 ymax: 49.38436
z_range:        zmin: 0 zmax: 0
CRS:            4326
levels(state_boundary_US$region)
NULL

Next we set a color vector with that many items:

colors <- c("purple", "springgreen", "yellow", "brown", "navy")

Now we can create our plot:

ggplot() +
  geom_sf(data = state_boundary_US, aes(color = region), size = 1) +
  scale_color_manual(values = colors) +
  ggtitle("Contiguous U.S. State Boundaries") + 
  coord_sf()

plot of chunk colored-state-boundaries

Key Points

  • Spatial objects in sf are similar to standard data frames and can be manipulated using the same functions.

  • Almost any feature of a plot can be customized using the various functions and options in the ggplot2 package.


Plot Multiple Shapefiles

Overview

Teaching: 40 min
Exercises: 20 min
Questions
  • How can I create map compositions with custom legends using ggplot?

  • How can I plot raster and vector data together?

Objectives
  • Plot multiple shapefiles in the same plot.

  • Apply custom symbols to spatial objects in a plot.

  • Create a multi-layered plot with raster and vector data.

Things You’ll Need To Complete This Episode

See the lesson homepage for detailed information about the software, data, and other prerequisites you will need to work through the examples in this episode.

This episode builds upon the previous episode to work with shapefile attributes in R and explores how to plot multiple shapefiles. It also covers how to plot raster and vector data together on the same plot.

Load the Data

To work with vector data in R, we can use the sf library. The raster package also allows us to explore metadata using similar commands for both raster and vector files. Make sure that you have these packages loaded.

We will continue to work with the three shapefiles that we loaded in the Open and Plot Shapefiles in R episode.

Plotting Multiple Shapefiles

In the previous episode, we learned how to plot information from a single shapefile and do some plot customization including adding a custom legend. However, what if we want to create a more complex plot with many shapefiles and unique symbols that need to be represented clearly in a legend?

Now, let’s create a plot that combines our tower location (point_HARV), site boundary (aoi_boundary_HARV) and roads (lines_HARV) spatial objects. We will need to build a custom legend as well.

To begin, we will create a plot with the site boundary as the first layer. Then layer the tower location and road data on top using +.

ggplot() + 
  geom_sf(data = aoi_boundary_HARV, fill = "grey", color = "grey") +
  geom_sf(data = lines_HARV, aes(color = TYPE), size = 1) +
  geom_sf(data = point_HARV) +
  ggtitle("NEON Harvard Forest Field Site") + 
  coord_sf()

plot of chunk plot-many-shapefiles

Next, let’s build a custom legend using the symbology (the colors and symbols) that we used to create the plot above. For example, it might be good if the lines were symbolized as lines. In the previous episode, you may have noticed that the default legend behavior for geom_sf is to draw a ‘patch’ for each legend entry. If you want the legend to draw lines or points, you need to add an instruction to the geom_sf call - in this case, show.legend = 'line'.

ggplot() + 
  geom_sf(data = aoi_boundary_HARV, fill = "grey", color = "grey") +
  geom_sf(data = lines_HARV, aes(color = TYPE),
          show.legend = "line", size = 1) +
  geom_sf(data = point_HARV, aes(fill = Sub_Type), color = "black") +
  scale_color_manual(values = road_colors) +
  scale_fill_manual(values = "black") +
  ggtitle("NEON Harvard Forest Field Site") + 
  coord_sf()

plot of chunk plot-custom-shape

Now lets adjust the legend titles by passing a name to the respective color and fill palettes.

ggplot() + 
  geom_sf(data = aoi_boundary_HARV, fill = "grey", color = "grey") +
  geom_sf(data = point_HARV, aes(fill = Sub_Type)) +
  geom_sf(data = lines_HARV, aes(color = TYPE), show.legend = "line",
          size = 1) + 
  scale_color_manual(values = road_colors, name = "Line Type") + 
  scale_fill_manual(values = "black", name = "Tower Location") + 
  ggtitle("NEON Harvard Forest Field Site") + 
  coord_sf()

plot of chunk create-custom-legend

Finally, it might be better if the points were symbolized as a symbol. We can customize this using shape parameters in our call to geom_sf: 16 is a point symbol, 15 is a box.

Data Tip

To view a short list of shape symbols, type ?pch into the R console.

ggplot() +
  geom_sf(data = aoi_boundary_HARV, fill = "grey", color = "grey") +
  geom_sf(data = point_HARV, aes(fill = Sub_Type), shape = 15) +
  geom_sf(data = lines_HARV, aes(color = TYPE),
          show.legend = "line", size = 1) +
  scale_color_manual(values = road_colors, name = "Line Type") +
  scale_fill_manual(values = "black", name = "Tower Location") +
  ggtitle("NEON Harvard Forest Field Site") + 
  coord_sf()

plot of chunk custom-symbols

Challenge: Plot Polygon by Attribute

  1. Using the NEON-DS-Site-Layout-Files/HARV/PlotLocations_HARV.shp shapefile, create a map of study plot locations, with each point colored by the soil type (soilTypeOr). How many different soil types are there at this particular field site? Overlay this layer on top of the lines_HARV layer (the roads). Create a custom legend that applies line symbols to lines and point symbols to the points.

  2. Modify the plot above. Tell R to plot each point, using a different symbol of shape value.

Answers

First we need to read in the data and see how many unique soils are represented in the soilTypeOr attribute.

plot_locations <- st_read("data/NEON-DS-Site-Layout-Files/HARV/PlotLocations_HARV.shp")
Reading layer `PlotLocations_HARV' from data source `/home/runner/work/r-raster-vector-geospatial/r-raster-vector-geospatial/_episodes_rmd/data/NEON-DS-Site-Layout-Files/HARV/PlotLocations_HARV.shp' using driver `ESRI Shapefile'
Simple feature collection with 21 features and 25 fields
geometry type:  POINT
dimension:      XY
bbox:           xmin: 731405.3 ymin: 4712845 xmax: 732275.3 ymax: 4713846
CRS:            32618
levels(plot_locations$soilTypeOr)
NULL

Next we can create a new color palette with one color for each soil type.

blue_green <- c("blue", "darkgreen")

Finally, we will create our plot.

ggplot() + 
  geom_sf(data = lines_HARV, aes(color = TYPE), show.legend = "line") + 
  geom_sf(data = plot_locations, aes(fill = soilTypeOr), 
          shape = 21, color = NA, show.legend = 'point') + 
  scale_color_manual(name = "Line Type", values = road_colors,
     guide = guide_legend(override.aes = list(linetype = "solid", shape = NA))) + 
  scale_fill_manual(name = "Soil Type", values = blue_green,
     guide = guide_legend(override.aes = list(linetype = "blank", shape = 21, colour = NA))) + 
  ggtitle("NEON Harvard Forest Field Site") + 
  coord_sf()
Warning: Removed 21 rows containing missing values (geom_sf).
Error in check.length(gparname): 'gpar' element 'fontsize' must not be length 0

plot of chunk harv-plot-locations-bg

If we want each soil to be shown with a different symbol, we can give multiple values to the scale_shape_manual() argument.

ggplot() + 
  geom_sf(data = lines_HARV, aes(color = TYPE), show.legend = "line", size = 1) + 
  geom_sf(data = plot_locations, aes(fill = soilTypeOr, shape = soilTypeOr),
          show.legend = 'point', colour = NA, size = 3) + 
  scale_shape_manual(name = "Soil Type", values = c(21, 22)) +
  scale_color_manual(name = "Line Type", values = road_colors,
     guide = guide_legend(override.aes = list(linetype = "solid", shape = NA))) + 
  scale_fill_manual(name = "Soil Type", values = blue_green,
     guide = guide_legend(override.aes = list(linetype = "blank", shape = c(21, 22),
     color = blue_green))) + 
  ggtitle("NEON Harvard Forest Field Site") + 
  coord_sf()
Warning: Removed 21 rows containing missing values (geom_sf).
Error in check.length(gparname): 'gpar' element 'fontsize' must not be length 0

plot of chunk harv-plot-locations-pch

Challenge: Plot Raster & Vector Data Together

You can plot vector data layered on top of raster data using the + to add a layer in ggplot. Create a plot that uses the NEON AOI Canopy Height Model NEON_RemoteSensing/HARV/CHM/HARV_chmCrop.tif as a base layer. On top of the CHM, please add:

  • The study site AOI.
  • Roads.
  • The tower location.

Be sure to give your plot a meaningful title.

Answers

ggplot() +
  geom_raster(data = CHM_HARV_df, aes(x = x, y = y, fill = HARV_chmCrop)) +
  geom_sf(data = lines_HARV, color = "black") +
  geom_sf(data = aoi_boundary_HARV, color = "grey20", size = 1) +
  geom_sf(data = point_HARV, pch = 8) +
  ggtitle("NEON Harvard Forest Field Site w/ Canopy Height Model") + 
  coord_sf()

plot of chunk challenge-vector-raster-overlay

Key Points

  • Use the + operator to add multiple layers to a ggplot.

  • Multi-layered plots can combine raster and vector datasets.

  • Use the show.legend argument to set legend symbol types.

  • Use the scale_fill_manual() function to set legend colors.


Handling Spatial Projection & CRS

Overview

Teaching: 40 min
Exercises: 20 min
Questions
  • What do I do when vector data don’t line up?

Objectives
  • Plot vector objects with different CRSs in the same plot.

Things You’ll Need To Complete This Episode

See the lesson homepage for detailed information about the software, data, and other prerequisites you will need to work through the examples in this episode.

In an earlier episode we learned how to handle a situation where you have two different files with raster data in different projections. Now we will apply those same principles to working with vector data. We will create a base map of our study site using United States state and country boundary information accessed from the United States Census Bureau. We will learn how to map vector data that are in different CRSs and thus don’t line up on a map.

We will continue to work with the three shapefiles that we loaded in the Open and Plot Shapefiles in R episode.

Working With Spatial Data From Different Sources

We often need to gather spatial datasets from different sources and/or data that cover different spatial extents. These data are often in different Coordinate Reference Systems (CRSs).

Some reasons for data being in different CRSs include:

  1. The data are stored in a particular CRS convention used by the data provider (for example, a government agency).
  2. The data are stored in a particular CRS that is customized to a region. For instance, many states in the US prefer to use a State Plane projection customized for that state.

Maps of the United States using data in different projections. Source: opennews.org, from: https://media.opennews.org/cache/06/37/0637aa2541b31f526ad44f7cb2db7b6c.jpg

Notice the differences in shape associated with each different projection. These differences are a direct result of the calculations used to “flatten” the data onto a 2-dimensional map. Often data are stored purposefully in a particular projection that optimizes the relative shape and size of surrounding geographic boundaries (states, counties, countries, etc).

In this episode we will learn how to identify and manage spatial data in different projections. We will learn how to reproject the data so that they are in the same projection to support plotting / mapping. Note that these skills are also required for any geoprocessing / spatial analysis. Data need to be in the same CRS to ensure accurate results.

We will continue to use the sf and raster packages in this episode.

Import US Boundaries - Census Data

There are many good sources of boundary base layers that we can use to create a basemap. Some R packages even have these base layers built in to support quick and efficient mapping. In this episode, we will use boundary layers for the contiguous United States, provided by the United States Census Bureau. It is useful to have shapefiles to work with because we can add additional attributes to them if need be - for project specific mapping.

Read US Boundary File

We will use the st_read() function to import the /US-Boundary-Layers/US-State-Boundaries-Census-2014 layer into R. This layer contains the boundaries of all contiguous states in the U.S. Please note that these data have been modified and reprojected from the original data downloaded from the Census website to support the learning goals of this episode.

state_boundary_US <- st_read("data/NEON-DS-Site-Layout-Files/US-Boundary-Layers/US-State-Boundaries-Census-2014.shp")
Reading layer `US-State-Boundaries-Census-2014' from data source `/home/runner/work/r-raster-vector-geospatial/r-raster-vector-geospatial/_episodes_rmd/data/NEON-DS-Site-Layout-Files/US-Boundary-Layers/US-State-Boundaries-Census-2014.shp' using driver `ESRI Shapefile'
Simple feature collection with 58 features and 10 fields
geometry type:  MULTIPOLYGON
dimension:      XYZ
bbox:           xmin: -124.7258 ymin: 24.49813 xmax: -66.9499 ymax: 49.38436
z_range:        zmin: 0 zmax: 0
CRS:            4326

Next, let’s plot the U.S. states data:

ggplot() +
  geom_sf(data = state_boundary_US) +
  ggtitle("Map of Contiguous US State Boundaries") +
  coord_sf()

plot of chunk find-coordinates

U.S. Boundary Layer

We can add a boundary layer of the United States to our map - to make it look nicer. We will import NEON-DS-Site-Layout-Files/US-Boundary-Layers/US-Boundary-Dissolved-States.

country_boundary_US <- st_read("data/NEON-DS-Site-Layout-Files/US-Boundary-Layers/US-Boundary-Dissolved-States.shp")
Reading layer `US-Boundary-Dissolved-States' from data source `/home/runner/work/r-raster-vector-geospatial/r-raster-vector-geospatial/_episodes_rmd/data/NEON-DS-Site-Layout-Files/US-Boundary-Layers/US-Boundary-Dissolved-States.shp' using driver `ESRI Shapefile'
Simple feature collection with 1 feature and 9 fields
geometry type:  MULTIPOLYGON
dimension:      XYZ
bbox:           xmin: -124.7258 ymin: 24.49813 xmax: -66.9499 ymax: 49.38436
z_range:        zmin: 0 zmax: 0
CRS:            4326

If we specify a thicker line width using size = 2 for the border layer, it will make our map pop! We will also manually set the colors of the state boundaries and country boundaries.

ggplot() +
  geom_sf(data = country_boundary_US, color = "gray18", size = 2) +
  geom_sf(data = state_boundary_US, color = "gray40") +
  ggtitle("Map of Contiguous US State Boundaries") +
  coord_sf()

plot of chunk us-boundaries-thickness

Next, let’s add the location of a flux tower where our study area is. As we are adding these layers, take note of the CRS of each object. First let’s look at the CRS of our tower location object:

st_crs(point_HARV)
Coordinate Reference System:
  User input: 32618 
  wkt:
PROJCS["WGS_1984_UTM_Zone_18N",
    GEOGCS["GCS_WGS_1984",
        DATUM["WGS_1984",
            SPHEROID["WGS_84",6378137,298.257223563]],
        PRIMEM["Greenwich",0],
        UNIT["Degree",0.017453292519943295],
        AUTHORITY["EPSG","4326"]],
    PROJECTION["Transverse_Mercator"],
    PARAMETER["latitude_of_origin",0],
    PARAMETER["central_meridian",-75],
    PARAMETER["scale_factor",0.9996],
    PARAMETER["false_easting",500000],
    PARAMETER["false_northing",0],
    UNIT["Meter",1],
    AUTHORITY["EPSG","32618"]]

Our project string for DSM_HARV specifies the UTM projection as follows:

+proj=utm +zone=18 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0

Note that the zone is unique to the UTM projection. Not all CRSs will have a zone.

Let’s check the CRS of our state and country boundary objects:

st_crs(state_boundary_US)
Coordinate Reference System:
  User input: 4326 
  wkt:
GEOGCS["GCS_WGS_1984",
    DATUM["WGS_1984",
        SPHEROID["WGS_84",6378137,298.257223563]],
    PRIMEM["Greenwich",0],
    UNIT["Degree",0.017453292519943295],
    AUTHORITY["EPSG","4326"]]
st_crs(country_boundary_US)
Coordinate Reference System:
  User input: 4326 
  wkt:
GEOGCS["GCS_WGS_1984",
    DATUM["WGS_1984",
        SPHEROID["WGS_84",6378137,298.257223563]],
    PRIMEM["Greenwich",0],
    UNIT["Degree",0.017453292519943295],
    AUTHORITY["EPSG","4326"]]

Our project string for state_boundary_US and country_boundary_US specifies the lat/long projection as follows:

+proj=longlat +datum=WGS84 +no_defs +ellps=WGS84 +towgs84=0,0,0

Note that there are no specified units above. This is because this geographic coordinate reference system is in latitude and longitude which is most often recorded in decimal degrees.

Data Tip

the last portion of each proj4 string is +towgs84=0,0,0 . This is a conversion factor that is used if a datum conversion is required. We will not deal with datums in this episode series.

CRS Units - View Object Extent

Next, let’s view the extent or spatial coverage for the point_HARV spatial object compared to the state_boundary_US object.

First we’ll look at the extent for our study site:

st_bbox(point_HARV)
     xmin      ymin      xmax      ymax 
 732183.2 4713265.0  732183.2 4713265.0 

And then the extent for the state boundary data.

st_bbox(state_boundary_US)
      xmin       ymin       xmax       ymax 
-124.72584   24.49813  -66.94989   49.38436 

Note the difference in the units for each object. The extent for state_boundary_US is in latitude and longitude which yields smaller numbers representing decimal degree units. Our tower location point is in UTM, is represented in meters.

Proj4 & CRS Resources

Reproject Vector Data or No?

We saw in an earlier episode that when working with raster data in different CRSs, we needed to convert all objects to the same CRS. We can do the same thing with our vector data - however, we don’t need to! When using the ggplot2 package, ggplot automatically converts all objects to the same CRS before plotting. This means we can plot our three data sets together without doing any conversion:

ggplot() +
  geom_sf(data = country_boundary_US, size = 2, color = "gray18") +
  geom_sf(data = state_boundary_US, color = "gray40") +
  geom_sf(data = point_HARV, shape = 19, color = "purple") +
  ggtitle("Map of Contiguous US State Boundaries") +
  coord_sf()

plot of chunk layer-point-on-states

Challenge - Plot Multiple Layers of Spatial Data

Create a map of the North Eastern United States as follows:

  1. Import and plot Boundary-US-State-NEast.shp. Adjust line width as necessary.
  2. Layer the Fisher Tower (in the NEON Harvard Forest site) point location point_HARV onto the plot.
  3. Add a title.
  4. Add a legend that shows both the state boundary (as a line) and the Tower location point.

Answers

NE.States.Boundary.US <- st_read("data/NEON-DS-Site-Layout-Files/US-Boundary-Layers/Boundary-US-State-NEast.shp")
Reading layer `Boundary-US-State-NEast' from data source `/home/runner/work/r-raster-vector-geospatial/r-raster-vector-geospatial/_episodes_rmd/data/NEON-DS-Site-Layout-Files/US-Boundary-Layers/Boundary-US-State-NEast.shp' using driver `ESRI Shapefile'
Simple feature collection with 12 features and 9 fields
geometry type:  MULTIPOLYGON
dimension:      XYZ
bbox:           xmin: -80.51989 ymin: 37.91685 xmax: -66.9499 ymax: 47.45716
z_range:        zmin: 0 zmax: 0
CRS:            4326
ggplot() +
    geom_sf(data = NE.States.Boundary.US, aes(color ="color"), show.legend = "line") +
    scale_color_manual(name = "", labels = "State Boundary", values = c("color" = "gray18")) +
    geom_sf(data = point_HARV, aes(shape = "shape"), color = "purple") +
    scale_shape_manual(name = "", labels = "Fisher Tower", values = c("shape" = 19)) +
    ggtitle("Fisher Tower location") +
    theme(legend.background = element_rect(color = NA)) +
    coord_sf()

plot of chunk ne-states-harv

Key Points

  • ggplot2 automatically converts all objects in a plot to the same CRS.

  • Still be aware of the CRS and extent for each object.


Convert from .csv to a Shapefile

Overview

Teaching: 40 min
Exercises: 20 min
Questions
  • How can I import CSV files as shapefiles in R?

Objectives
  • Import .csv files containing x,y coordinate locations into R as a data frame.

  • Convert a data frame to a spatial object.

  • Export a spatial object to a text file.

Things You’ll Need To Complete This Episode

See the lesson homepage for detailed information about the software, data, and other prerequisites you will need to work through the examples in this episode.

This episode will review how to import spatial points stored in .csv (Comma Separated Value) format into R as an sf spatial object. We will also reproject data imported from a shapefile format, export this data as a shapefile, and plot raster and vector data as layers in the same plot.

Spatial Data in Text Format

The HARV_PlotLocations.csv file contains x, y (point) locations for study plot where NEON collects data on vegetation and other ecological metrics. We would like to:

Spatial data are sometimes stored in a text file format (.txt or .csv). If the text file has an associated x and y location column, then we can convert it into an sf spatial object. The sf object allows us to store both the x,y values that represent the coordinate location of each point and the associated attribute data - or columns describing each feature in the spatial object.

We will continue using the sf and raster packages in this episode.

Import .csv

To begin let’s import a .csv file that contains plot coordinate x, y locations at the NEON Harvard Forest Field Site (HARV_PlotLocations.csv) and look at the structure of that new object:

plot_locations_HARV <-
  read.csv("data/NEON-DS-Site-Layout-Files/HARV/HARV_PlotLocations.csv")

str(plot_locations_HARV)
'data.frame':	21 obs. of  16 variables:
 $ easting   : num  731405 731934 731754 731724 732125 ...
 $ northing  : num  4713456 4713415 4713115 4713595 4713846 ...
 $ geodeticDa: chr  "WGS84" "WGS84" "WGS84" "WGS84" ...
 $ utmZone   : chr  "18N" "18N" "18N" "18N" ...
 $ plotID    : chr  "HARV_015" "HARV_033" "HARV_034" "HARV_035" ...
 $ stateProvi: chr  "MA" "MA" "MA" "MA" ...
 $ county    : chr  "Worcester" "Worcester" "Worcester" "Worcester" ...
 $ domainName: chr  "Northeast" "Northeast" "Northeast" "Northeast" ...
 $ domainID  : chr  "D01" "D01" "D01" "D01" ...
 $ siteID    : chr  "HARV" "HARV" "HARV" "HARV" ...
 $ plotType  : chr  "distributed" "tower" "tower" "tower" ...
 $ subtype   : chr  "basePlot" "basePlot" "basePlot" "basePlot" ...
 $ plotSize  : int  1600 1600 1600 1600 1600 1600 1600 1600 1600 1600 ...
 $ elevation : num  332 342 348 334 353 ...
 $ soilTypeOr: chr  "Inceptisols" "Inceptisols" "Inceptisols" "Histosols" ...
 $ plotdim_m : int  40 40 40 40 40 40 40 40 40 40 ...

We now have a data frame that contains 21 locations (rows) and 16 variables (attributes). Note that all of our character data was imported into R as factor (categorical) data. Next, let’s explore the dataframe to determine whether it contains columns with coordinate values. If we are lucky, our .csv will contain columns labeled:

Let’s check out the column names of our dataframe.

names(plot_locations_HARV)
 [1] "easting"    "northing"   "geodeticDa" "utmZone"    "plotID"    
 [6] "stateProvi" "county"     "domainName" "domainID"   "siteID"    
[11] "plotType"   "subtype"    "plotSize"   "elevation"  "soilTypeOr"
[16] "plotdim_m" 

Identify X,Y Location Columns

Our column names include several fields that might contain spatial information. The plot_locations_HARV$easting and plot_locations_HARV$northing columns contain coordinate values. We can confirm this by looking at the first six rows of our data.

head(plot_locations_HARV$easting)
[1] 731405.3 731934.3 731754.3 731724.3 732125.3 731634.3
head(plot_locations_HARV$northing)
[1] 4713456 4713415 4713115 4713595 4713846 4713295

We have coordinate values in our data frame. In order to convert our data frame to an sf object, we also need to know the CRS associated with those coordinate values.

There are several ways to figure out the CRS of spatial data in text format.

  1. We can check the file metadata in hopes that the CRS was recorded in the data.
  2. We can explore the file itself to see if CRS information is embedded in the file header or somewhere in the data columns.

Following the easting and northing columns, there is a geodeticDa and a utmZone column. These appear to contain CRS information (datum and projection). Let’s view those next.

head(plot_locations_HARV$geodeticDa)
[1] "WGS84" "WGS84" "WGS84" "WGS84" "WGS84" "WGS84"
head(plot_locations_HARV$utmZone)
[1] "18N" "18N" "18N" "18N" "18N" "18N"

It is not typical to store CRS information in a column. But this particular file contains CRS information this way. The geodeticDa and utmZone columns contain the information that helps us determine the CRS:

In When Vector Data Don’t Line Up - Handling Spatial Projection & CRS in R we learned about the components of a proj4 string. We have everything we need to assign a CRS to our data frame.

To create the proj4 associated with UTM Zone 18 WGS84 we can look up the projection on the Spatial Reference website, which contains a list of CRS formats for each projection. From here, we can extract the proj4 string for UTM Zone 18N WGS84.

However, if we have other data in the UTM Zone 18N projection, it’s much easier to use the st_crs() function to extract the CRS in proj4 format from that object and assign it to our new spatial object. We’ve seen this CRS before with our Harvard Forest study site (point_HARV).

st_crs(point_HARV)
Coordinate Reference System:
  User input: 32618 
  wkt:
PROJCS["WGS_1984_UTM_Zone_18N",
    GEOGCS["GCS_WGS_1984",
        DATUM["WGS_1984",
            SPHEROID["WGS_84",6378137,298.257223563]],
        PRIMEM["Greenwich",0],
        UNIT["Degree",0.017453292519943295],
        AUTHORITY["EPSG","4326"]],
    PROJECTION["Transverse_Mercator"],
    PARAMETER["latitude_of_origin",0],
    PARAMETER["central_meridian",-75],
    PARAMETER["scale_factor",0.9996],
    PARAMETER["false_easting",500000],
    PARAMETER["false_northing",0],
    UNIT["Meter",1],
    AUTHORITY["EPSG","32618"]]

The output above shows that the points shapefile is in UTM zone 18N. We can thus use the CRS from that spatial object to convert our non-spatial dataframe into an sf object.

Next, let’s create a crs object that we can use to define the CRS of our sf object when we create it.

utm18nCRS <- st_crs(point_HARV)
utm18nCRS
Coordinate Reference System:
  User input: 32618 
  wkt:
PROJCS["WGS_1984_UTM_Zone_18N",
    GEOGCS["GCS_WGS_1984",
        DATUM["WGS_1984",
            SPHEROID["WGS_84",6378137,298.257223563]],
        PRIMEM["Greenwich",0],
        UNIT["Degree",0.017453292519943295],
        AUTHORITY["EPSG","4326"]],
    PROJECTION["Transverse_Mercator"],
    PARAMETER["latitude_of_origin",0],
    PARAMETER["central_meridian",-75],
    PARAMETER["scale_factor",0.9996],
    PARAMETER["false_easting",500000],
    PARAMETER["false_northing",0],
    UNIT["Meter",1],
    AUTHORITY["EPSG","32618"]]
class(utm18nCRS)
[1] "crs"

.csv to sf object

Next, let’s convert our dataframe into an sf object. To do this, we need to specify:

  1. The columns containing X (easting) and Y (northing) coordinate values
  2. The CRS that the column coordinate represent (units are included in the CRS) - stored in our utmCRS object.

We will use the st_as_sf() function to perform the conversion.

plot_locations_sp_HARV <- st_as_sf(plot_locations_HARV, coords = c("easting", "northing"), crs = utm18nCRS)

We should double check the CRS to make sure it is correct.

st_crs(plot_locations_sp_HARV)
Coordinate Reference System:
  User input: 32618 
  wkt:
PROJCS["WGS_1984_UTM_Zone_18N",
    GEOGCS["GCS_WGS_1984",
        DATUM["WGS_1984",
            SPHEROID["WGS_84",6378137,298.257223563]],
        PRIMEM["Greenwich",0],
        UNIT["Degree",0.017453292519943295],
        AUTHORITY["EPSG","4326"]],
    PROJECTION["Transverse_Mercator"],
    PARAMETER["latitude_of_origin",0],
    PARAMETER["central_meridian",-75],
    PARAMETER["scale_factor",0.9996],
    PARAMETER["false_easting",500000],
    PARAMETER["false_northing",0],
    UNIT["Meter",1],
    AUTHORITY["EPSG","32618"]]

Plot Spatial Object

We now have a spatial R object, we can plot our newly created spatial object.

ggplot() +
  geom_sf(data = plot_locations_sp_HARV) +
  ggtitle("Map of Plot Locations")

plot of chunk plot-data-points

Plot Extent

In Open and Plot Shapefiles in R we learned about spatial object extent. When we plot several spatial layers in R using ggplot, all of the layers of the plot are considered in setting the boundaries of the plot. To show this, let’s plot our aoi_boundary_HARV object with our vegetation plots.

ggplot() +
  geom_sf(data = aoi_boundary_HARV) +
  geom_sf(data = plot_locations_sp_HARV) +
  ggtitle("AOI Boundary Plot")

plot of chunk plot-data

When we plot the two layers together, ggplot sets the plot boundaries so that they are large enough to include all of the data included in all of the layers. That’s really handy!

Challenge - Import & Plot Additional Points

We want to add two phenology plots to our existing map of vegetation plot locations.

Import the .csv: HARV/HARV_2NewPhenPlots.csv into R and do the following:

  1. Find the X and Y coordinate locations. Which value is X and which value is Y?
  2. These data were collected in a geographic coordinate system (WGS84). Convert the dataframe into an sf object.
  3. Plot the new points with the plot location points from above. Be sure to add a legend. Use a different symbol for the 2 new points!

If you have extra time, feel free to add roads and other layers to your map!

Answers

1) First we will read in the new csv file and look at the data structure.

newplot_locations_HARV <-
  read.csv("data/NEON-DS-Site-Layout-Files/HARV/HARV_2NewPhenPlots.csv")
str(newplot_locations_HARV)
'data.frame':	2 obs. of  13 variables:
 $ decimalLat: num  42.5 42.5
 $ decimalLon: num  -72.2 -72.2
 $ country   : chr  "unitedStates" "unitedStates"
 $ stateProvi: chr  "MA" "MA"
 $ county    : chr  "Worcester" "Worcester"
 $ domainName: chr  "Northeast" "Northeast"
 $ domainID  : chr  "D01" "D01"
 $ siteID    : chr  "HARV" "HARV"
 $ plotType  : chr  "tower" "tower"
 $ subtype   : chr  "phenology" "phenology"
 $ plotSize  : int  40000 40000
 $ plotDimens: chr  "200m x 200m" "200m x 200m"
 $ elevation : num  358 346

2) The US boundary data we worked with previously is in a geographic WGS84 CRS. We can use that data to establish a CRS for this data. First we will extract the CRS from the country_boundary_US object and confirm that it is WGS84.

geogCRS <- st_crs(country_boundary_US)
geogCRS
Coordinate Reference System:
  User input: 4326 
  wkt:
GEOGCS["GCS_WGS_1984",
    DATUM["WGS_1984",
        SPHEROID["WGS_84",6378137,298.257223563]],
    PRIMEM["Greenwich",0],
    UNIT["Degree",0.017453292519943295],
    AUTHORITY["EPSG","4326"]]

Then we will convert our new data to a spatial dataframe, using the geogCRS object as our CRS.

newPlot.Sp.HARV <- st_as_sf(newplot_locations_HARV, coords = c("decimalLon", "decimalLat"), crs = geogCRS)

Next we’ll confirm that the CRS for our new object is correct.

st_crs(newPlot.Sp.HARV)
Coordinate Reference System:
  User input: 4326 
  wkt:
GEOGCS["GCS_WGS_1984",
    DATUM["WGS_1984",
        SPHEROID["WGS_84",6378137,298.257223563]],
    PRIMEM["Greenwich",0],
    UNIT["Degree",0.017453292519943295],
    AUTHORITY["EPSG","4326"]]

We will be adding these new data points to the plot we created before. The data for the earlier plot was in UTM. Since we’re using ggplot, it will reproject the data for us.

3) Now we can create our plot.

ggplot() +
  geom_sf(data = plot_locations_sp_HARV, color = "orange") +
  geom_sf(data = newPlot.Sp.HARV, color = "lightblue") +
  ggtitle("Map of All Plot Locations")

plot of chunk plot-locations-harv-orange

Export a Shapefile

We can write an R spatial object to a shapefile using the st_write function in sf. To do this we need the following arguments:

We can now export the spatial object as a shapefile.

st_write(plot_locations_sp_HARV,
         "data/PlotLocations_HARV.shp", driver = "ESRI Shapefile")

Key Points

  • Know the projection (if any) of your point data prior to converting to a spatial object.

  • Convert a data frame to an sf object using the st_as_sf() function.

  • Export an sf object as text using the st_write() function.


Manipulate Raster Data

Overview

Teaching: 40 min
Exercises: 20 min
Questions
  • How can I crop raster objects to vector objects, and extract the summary of raster pixels?

Objectives
  • Crop a raster to the extent of a vector layer.

  • Extract values from a raster that correspond to a vector file overlay.

Things You’ll Need To Complete This Episode

See the lesson homepage for detailed information about the software, data, and other prerequisites you will need to work through the examples in this episode.

This episode explains how to crop a raster using the extent of a vector shapefile. We will also cover how to extract values from a raster that occur within a set of polygons, or in a buffer (surrounding) region around a set of points.

Crop a Raster to Vector Extent

We often work with spatial layers that have different spatial extents. The spatial extent of a shapefile or R spatial object represents the geographic “edge” or location that is the furthest north, south east and west. Thus is represents the overall geographic coverage of the spatial object.

Extent illustration Image Source: National Ecological Observatory Network (NEON)

The graphic below illustrates the extent of several of the spatial layers that we have worked with in this workshop:

plot of chunk compare-data-extents

Frequent use cases of cropping a raster file include reducing file size and creating maps. Sometimes we have a raster file that is much larger than our study area or area of interest. It is often more efficient to crop the raster to the extent of our study area to reduce file sizes as we process our data. Cropping a raster can also be useful when creating pretty maps so that the raster layer matches the extent of the desired vector layers.

Crop a Raster Using Vector Extent

We can use the crop() function to crop a raster to the extent of another spatial object. To do this, we need to specify the raster to be cropped and the spatial object that will be used to crop the raster. R will use the extent of the spatial object as the cropping boundary.

To illustrate this, we will crop the Canopy Height Model (CHM) to only include the area of interest (AOI). Let’s start by plotting the full extent of the CHM data and overlay where the AOI falls within it. The boundaries of the AOI will be colored blue, and we use fill = NA to make the area transparent.

ggplot() +
  geom_raster(data = CHM_HARV_df, aes(x = x, y = y, fill = HARV_chmCrop)) + 
  scale_fill_gradientn(name = "Canopy Height", colors = terrain.colors(10)) +
  geom_sf(data = aoi_boundary_HARV, color = "blue", fill = NA) +
  coord_sf()

plot of chunk crop-by-vector-extent

Now that we have visualized the area of the CHM we want to subset, we can perform the cropping operation. We are going to crop() function from the raster package to create a new object with only the portion of the CHM data that falls within the boundaries of the AOI.

CHM_HARV_Cropped <- crop(x = CHM_HARV, y = aoi_boundary_HARV)

Now we can plot the cropped CHM data, along with a boundary box showing the full CHM extent. However, remember, since this is raster data, we need to convert to a data frame in order to plot using ggplot. To get the boundary box from CHM, the st_bbox() will extract the 4 corners of the rectangle that encompass all the features contained in this object. The st_as_sfc() converts these 4 coordinates into a polygon that we can plot:

CHM_HARV_Cropped_df <- as.data.frame(CHM_HARV_Cropped, xy = TRUE)

ggplot() +
  geom_sf(data = st_as_sfc(st_bbox(CHM_HARV)), fill = "green",
          color = "green", alpha = .2) +  
  geom_raster(data = CHM_HARV_Cropped_df,
              aes(x = x, y = y, fill = HARV_chmCrop)) + 
  scale_fill_gradientn(name = "Canopy Height", colors = terrain.colors(10)) + 
  coord_sf()

plot of chunk show-cropped-area

The plot above shows that the full CHM extent (plotted in green) is much larger than the resulting cropped raster. Our new cropped CHM now has the same extent as the aoi_boundary_HARV object that was used as a crop extent (blue border below).

ggplot() +
  geom_raster(data = CHM_HARV_Cropped_df,
              aes(x = x, y = y, fill = HARV_chmCrop)) + 
  geom_sf(data = aoi_boundary_HARV, color = "blue", fill = NA) + 
  scale_fill_gradientn(name = "Canopy Height", colors = terrain.colors(10)) + 
  coord_sf()

plot of chunk view-crop-extent

We can look at the extent of all of our other objects for this field site.

st_bbox(CHM_HARV)
   xmin    ymin    xmax    ymax 
 731453 4712471  733150 4713838 
st_bbox(CHM_HARV_Cropped)
   xmin    ymin    xmax    ymax 
 732128 4713209  732251 4713359 
st_bbox(aoi_boundary_HARV)
     xmin      ymin      xmax      ymax 
 732128.0 4713208.7  732251.1 4713359.2 
st_bbox(plot_locations_sp_HARV)
     xmin      ymin      xmax      ymax 
 731405.3 4712845.0  732275.3 4713846.3 

Our plot location extent is not the largest but is larger than the AOI Boundary. It would be nice to see our vegetation plot locations plotted on top of the Canopy Height Model information.

Challenge: Crop to Vector Points Extent

  1. Crop the Canopy Height Model to the extent of the study plot locations.
  2. Plot the vegetation plot location points on top of the Canopy Height Model.

Answers

CHM_plots_HARVcrop <- crop(x = CHM_HARV, y = plot_locations_sp_HARV)

CHM_plots_HARVcrop_df <- as.data.frame(CHM_plots_HARVcrop, xy = TRUE)

ggplot() + 
  geom_raster(data = CHM_plots_HARVcrop_df, aes(x = x, y = y, fill = HARV_chmCrop)) + 
  scale_fill_gradientn(name = "Canopy Height", colors = terrain.colors(10)) + 
  geom_sf(data = plot_locations_sp_HARV) + 
  coord_sf()

plot of chunk challenge-code-crop-raster-points

In the plot above, created in the challenge, all the vegetation plot locations (black dots) appear on the Canopy Height Model raster layer except for one. One is situated on the blank space to the left of the map. Why?

A modification of the first figure in this episode is below, showing the relative extents of all the spatial objects. Notice that the extent for our vegetation plot layer (black) extends further west than the extent of our CHM raster (bright green). The crop() function will make a raster extent smaller, it will not expand the extent in areas where there are no data. Thus, the extent of our vegetation plot layer will still extend further west than the extent of our (cropped) raster data (dark green).

plot of chunk repeat-compare-data-extents

Define an Extent

So far, we have used a shapefile to crop the extent of a raster dataset. Alternatively, we can also the extent() function to define an extent to be used as a cropping boundary. This creates a new object of class extent. Here we will provide the extent() function our xmin, xmax, ymin, and ymax (in that order).

new_extent <- extent(732161.2, 732238.7, 4713249, 4713333)
class(new_extent)
[1] "Extent"
attr(,"package")
[1] "raster"

Data Tip

The extent can be created from a numeric vector (as shown above), a matrix, or a list. For more details see the extent() function help file (?raster::extent).

Once we have defined our new extent, we can use the crop() function to crop our raster to this extent object.

CHM_HARV_manual_cropped <- crop(x = CHM_HARV, y = new_extent)

To plot this data using ggplot() we need to convert it to a dataframe.

CHM_HARV_manual_cropped_df <- as.data.frame(CHM_HARV_manual_cropped, xy = TRUE)

Now we can plot this cropped data. We will show the AOI boundary on the same plot for scale.

ggplot() + 
  geom_sf(data = aoi_boundary_HARV, color = "blue", fill = NA) +
  geom_raster(data = CHM_HARV_manual_cropped_df,
              aes(x = x, y = y, fill = HARV_chmCrop)) + 
  scale_fill_gradientn(name = "Canopy Height", colors = terrain.colors(10)) + 
  coord_sf()

plot of chunk show-manual-crop-area

Extract Raster Pixels Values Using Vector Polygons

Often we want to extract values from a raster layer for particular locations - for example, plot locations that we are sampling on the ground. We can extract all pixel values within 20m of our x,y point of interest. These can then be summarized into some value of interest (e.g. mean, maximum, total).

Extract raster information using a polygon boundary. From https://www.neonscience.org/sites/default/files/images/spatialData/BufferSquare.png

To do this in R, we use the extract() function. The extract() function requires:

We will begin by extracting all canopy height pixel values located within our aoi_boundary_HARV polygon which surrounds the tower located at the NEON Harvard Forest field site.

tree_height <- extract(x = CHM_HARV, y = aoi_boundary_HARV, df = TRUE)

str(tree_height)
'data.frame':	18450 obs. of  2 variables:
 $ ID          : num  1 1 1 1 1 1 1 1 1 1 ...
 $ HARV_chmCrop: num  21.2 23.9 23.8 22.4 23.9 ...

When we use the extract() function, R extracts the value for each pixel located within the boundary of the polygon being used to perform the extraction - in this case the aoi_boundary_HARV object (a single polygon). Here, the function extracted values from 18,450 pixels.

We can create a histogram of tree height values within the boundary to better understand the structure or height distribution of trees at our site. We will use the column layer from our data frame as our x values, as this column represents the tree heights for each pixel.

ggplot() + 
  geom_histogram(data = tree_height, aes(x = HARV_chmCrop)) +
  ggtitle("Histogram of CHM Height Values (m)") +
  xlab("Tree Height") + 
  ylab("Frequency of Pixels")
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

plot of chunk view-extract-histogram

We can also use the summary() function to view descriptive statistics including min, max, and mean height values. These values help us better understand vegetation at our field site.

summary(tree_height$HARV_chmCrop)
   Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
   2.03   21.36   22.81   22.43   23.97   38.17 

Summarize Extracted Raster Values

We often want to extract summary values from a raster. We can tell R the type of summary statistic we are interested in using the fun = argument. Let’s extract a mean height value for our AOI. Because we are extracting only a single number, we will not use the df = TRUE argument.

mean_tree_height_AOI <- extract(x = CHM_HARV, y = aoi_boundary_HARV, fun = mean)

mean_tree_height_AOI
         [,1]
[1,] 22.43018

It appears that the mean height value, extracted from our LiDAR data derived canopy height model is 22.43 meters.

Extract Data using x,y Locations

We can also extract pixel values from a raster by defining a buffer or area surrounding individual point locations using the extract() function. To do this we define the summary argument (fun = mean) and the buffer distance (buffer = 20) which represents the radius of a circular region around each point. By default, the units of the buffer are the same units as the data’s CRS. All pixels that are touched by the buffer region are included in the extract.

Extract raster information using a buffer region. From: https://www.neonscience.org/sites/default/files/images/spatialData/BufferCircular.png

Source: National Ecological Observatory Network (NEON).

Let’s put this into practice by figuring out the mean tree height in the 20m around the tower location (point_HARV). Because we are extracting only a single number, we will not use the df = TRUE argument.

mean_tree_height_tower <- extract(x = CHM_HARV,
                                  y = point_HARV,
                                  buffer = 20,
                                  fun = mean)

mean_tree_height_tower
[1] 22.38812

Challenge: Extract Raster Height Values For Plot Locations

1) Use the plot locations object (plot_locations_sp_HARV) to extract an average tree height for the area within 20m of each vegetation plot location in the study area. Because there are multiple plot locations, there will be multiple averages returned, so the df = TRUE argument should be used.

2) Create a plot showing the mean tree height of each area.

Answers

# extract data at each plot location
mean_tree_height_plots_HARV <- extract(x = CHM_HARV,
                                       y = plot_locations_sp_HARV,
                                       buffer = 20,
                                       fun = mean,
                                       df = TRUE)

# view data
mean_tree_height_plots_HARV
   ID HARV_chmCrop
1   1           NA
2   2     23.96708
3   3     22.35182
4   4     16.49719
5   5     21.55459
6   6     19.16891
7   7     20.61542
8   8     21.61490
9   9     12.23897
10 10     19.13231
11 11     21.36908
12 12     19.31904
13 13     17.25802
14 14     20.47314
15 15     12.68322
16 16     15.51574
17 17     18.90796
18 18     18.19454
19 19     19.67558
20 20     20.23258
21 21     20.44836
# plot data
ggplot(data = mean_tree_height_plots_HARV, aes(ID, HARV_chmCrop)) + 
  geom_col() + 
  ggtitle("Mean Tree Height at each Plot") + 
  xlab("Plot ID") + 
  ylab("Tree Height (m)")
Warning: Removed 1 rows containing missing values (position_stack).

plot of chunk hist-tree-height-veg-plot

Key Points

  • Use the crop() function to crop a raster object.

  • Use the extract() function to extract pixels from a raster object that fall within a particular extent boundary.

  • Use the extent() function to define an extent.


Raster Time Series Data

Overview

Teaching: 40 min
Exercises: 20 min
Questions
  • How can I view and and plot data for different times of the year?

Objectives
  • Understand the format of a time series raster dataset.

  • Work with time series rasters.

  • Import a set of rasters stored in a single directory.

  • Create a multi-paneled plot.

  • Convert character data to date format.

Things You’ll Need To Complete This Episode

See the lesson homepage for detailed information about the software, data, and other prerequisites you will need to work through the examples in this episode.

This episode covers how to work with and plot a raster time series, using an R RasterStack object. It also covers practical assessment of data quality in remote sensing derived imagery.

About Raster Time Series Data

A raster data file can contain one single band or many bands. If the raster data contains imagery data, each band may represent reflectance for a different wavelength (color or type of light) or set of wavelengths - for example red, green and blue. A multi-band raster may two or more bands or layers of data collected at different times for the same extent (region) and of the same resolution. For this episode, we will work with a time series of normalized difference vegetation index (NDVI) and RGB data from the Harvard Forest site. We introduced the concepts of NDVI and RGB data in an earlier lesson and worked with an RGB RasterStack in the Work with Multi-band Rasters in R episode.

In this episode, we will:

  1. Import NDVI data in GeoTIFF format.
  2. Import, explore and plot NDVI data derived for several dates throughout the year.
  3. View the RGB imagery used to derived the NDVI time series to better understand unusual / outlier values.

RGB Data

While the NDVI data is a single band product, the RGB images that contain the red band used to derive NDVI, contain 3 (of the 7) 30m resolution bands available from Landsat data. The RGB directory contains RGB images for each time period that NDVI is available.

Getting Started

In this episode, we will use the raster, rgdal, reshape, and scales packages. Make sure you have them loaded.

library(raster)
library(rgdal)
library(reshape)
library(scales)

To begin, we will create a list of raster files using the list.files() function. This list will be used to generate a RasterStack. We will only add files that have a .tif extension to our list. To do this, we will use the syntax pattern=".tif$". If we specify full.names = TRUE, the full path for each file will be added to the list.

Data Tip

In the pattern above, the $ character represents the end of a line. Using it ensures that our pattern will only match files that end in .tif. This pattern matching uses a language called “regular expressions”, which is beyond the scope of this workshop.

NDVI_HARV_path <- "data/NEON-DS-Landsat-NDVI/HARV/2011/NDVI"

all_NDVI_HARV <- list.files(NDVI_HARV_path,
                            full.names = TRUE,
                            pattern = ".tif$")

It’s a good idea to look at the file names that matched our search to make sure they meet our expectations.

all_NDVI_HARV
 [1] "data/NEON-DS-Landsat-NDVI/HARV/2011/NDVI/005_HARV_ndvi_crop.tif"
 [2] "data/NEON-DS-Landsat-NDVI/HARV/2011/NDVI/037_HARV_ndvi_crop.tif"
 [3] "data/NEON-DS-Landsat-NDVI/HARV/2011/NDVI/085_HARV_ndvi_crop.tif"
 [4] "data/NEON-DS-Landsat-NDVI/HARV/2011/NDVI/133_HARV_ndvi_crop.tif"
 [5] "data/NEON-DS-Landsat-NDVI/HARV/2011/NDVI/181_HARV_ndvi_crop.tif"
 [6] "data/NEON-DS-Landsat-NDVI/HARV/2011/NDVI/197_HARV_ndvi_crop.tif"
 [7] "data/NEON-DS-Landsat-NDVI/HARV/2011/NDVI/213_HARV_ndvi_crop.tif"
 [8] "data/NEON-DS-Landsat-NDVI/HARV/2011/NDVI/229_HARV_ndvi_crop.tif"
 [9] "data/NEON-DS-Landsat-NDVI/HARV/2011/NDVI/245_HARV_ndvi_crop.tif"
[10] "data/NEON-DS-Landsat-NDVI/HARV/2011/NDVI/261_HARV_ndvi_crop.tif"
[11] "data/NEON-DS-Landsat-NDVI/HARV/2011/NDVI/277_HARV_ndvi_crop.tif"
[12] "data/NEON-DS-Landsat-NDVI/HARV/2011/NDVI/293_HARV_ndvi_crop.tif"
[13] "data/NEON-DS-Landsat-NDVI/HARV/2011/NDVI/309_HARV_ndvi_crop.tif"

Now we have a list of all GeoTIFF files in the NDVI directory for Harvard Forest. Next, we will create a RasterStack from this list using the stack() function. We introduced the stack() function in an earlier episode.

NDVI_HARV_stack <- stack(all_NDVI_HARV)

We can explore the GeoTIFF tags (the embedded metadata) in a stack using the same syntax that we used on single-band raster objects in R including: crs() (coordinate reference system), extent() and res() (resolution; specifically yres() and xres()).

crs(NDVI_HARV_stack)
CRS arguments:
 +proj=utm +zone=19 +ellps=WGS84 +units=m +no_defs 

The CRS for our stack is +proj=utm +zone=19 +ellps=WGS84 +units=m +no_defs. The CRS is in UTM Zone 19. If you have completed the previous episodes in this workshop, you may have noticed that the UTM zone for the NEON collected remote sensing data was in Zone 18 rather than Zone 19. Why are the Landsat data in Zone 19?

Source: National Ecological Observatory Network (NEON).

A Landsat scene is extremely wide - spanning over 170km north to south and 180km east to west. This means that Landsat data often cover multiple UTM zones. When the data are processed, the zone in which the majority of the data cover is the zone which is used for the final CRS. Thus, our field site at Harvard Forest is located in UTM Zone 18, but the Landsat data is in a CRS of UTM Zone 19.

Challenge: Raster Metadata

Investigate the metadata for our RasterStack and answer the following questions.

  1. What are the x and y resolution of the data?
  2. What units are the above resolution in?

Answers

extent(NDVI_HARV_stack)
class      : Extent 
xmin       : 239415 
xmax       : 239535 
ymin       : 4714215 
ymax       : 4714365 
yres(NDVI_HARV_stack)
[1] 30
xres(NDVI_HARV_stack)
[1] 30

Plotting Time Series Data

Once we have created our RasterStack, we can visualize our data. We can use the ggplot() command to create a multi-panelled plot showing each band in our RasterStack. First we need to create a data frame object. Because there are multiple bands in our data, we will reshape (or “melt”) the data so that we have a single column with the NDVI observations. We will use the function melt() from the reshape package to do this:

NDVI_HARV_stack_df <- as.data.frame(NDVI_HARV_stack, xy = TRUE) %>%
    melt(id.vars = c('x','y'))

Now we can plot our data using ggplot(). We want to create a separate panel for each time point in our time series, so we will use the facet_wrap() function to create a multi-paneled plot:

ggplot() +
  geom_raster(data = NDVI_HARV_stack_df , aes(x = x, y = y, fill = value)) +
  facet_wrap(~ variable)

plot of chunk ndvi-wrap

Look at the range of NDVI values observed in the plot above. We know that the accepted values for NDVI range from 0-1. Why does our data range from 0 - 10,000?

Scale Factors

The metadata for this NDVI data specifies a scale factor: 10,000. A scale factor is sometimes used to maintain smaller file sizes by removing decimal places. Storing data in integer format keeps files sizes smaller.

Let’s apply the scale factor before we go any further. Conveniently, we can quickly apply this factor using raster math on the entire stack as follows:

NDVI_HARV_stack <- NDVI_HARV_stack/10000

After applying our scale factor, we can recreate our plot using the same code we used above.

NDVI_HARV_stack_df <- as.data.frame(NDVI_HARV_stack, xy = TRUE) %>%
    melt(id.vars = c('x','y'))

ggplot() +
  geom_raster(data = NDVI_HARV_stack_df , aes(x = x, y = y, fill = value)) +
  facet_wrap(~variable)

plot of chunk ndvi-stack-wrap

Take a Closer Look at Our Data

Let’s take a closer look at the plots of our data. Massachusetts, where the NEON Harvard Forest Field Site is located, has a fairly consistent fall, winter, spring, and summer season where vegetation turns green in the spring, continues to grow throughout the summer, and begins to change colors and senesce in the fall through winter. Do you notice anything that seems unusual about the patterns of greening and browning observed in the plots above?

Hint: the number after the “X” in each tile title is the Julian day which in this case represents the number of days into each year. If you are unfamiliar with Julian day, check out the NEON Data Skills Converting to Julian Day tutorial.

View Distribution of Raster Values

In the above exercise, we viewed plots of our NDVI time series and noticed a few images seem to be unusually light. However this was only a visual representation of potential issues in our data. What is another way we can look at these data that is quantitative?

Next we will use histograms to explore the distribution of NDVI values stored in each raster.

ggplot(NDVI_HARV_stack_df) +
  geom_histogram(aes(value)) +
    facet_wrap(~variable)
`stat_bin()` using `bins = 30`. Pick better value with `binwidth`.

plot of chunk view-stack-histogram

It seems like things get green in the spring and summer like we expect, but the data at Julian days 277 and 293 are unusual. It appears as if the vegetation got green in the spring, but then died back only to get green again towards the end of the year. Is this right?

Explore Unusual Data Patterns

The NDVI data that we are using comes from 2011, perhaps a strong freeze around Julian day 277 could cause a vegetation to senesce early, however in the eastern United States, it seems unusual that it would proceed to green up again shortly thereafter.

Let’s next view some temperature data for our field site to see whether there were some unusual fluctuations that may explain this pattern of greening and browning seen in the NDVI data. First we will read in the temperature data and preview the structure of that dataframe:

har_met_daily <-
  read.csv("data/NEON-DS-Met-Time-Series/HARV/FisherTower-Met/hf001-06-daily-m.csv")

str(har_met_daily)
'data.frame':	5345 obs. of  46 variables:
 $ date     : chr  "2001-02-11" "2001-02-12" "2001-02-13" "2001-02-14" ...
 $ jd       : int  42 43 44 45 46 47 48 49 50 51 ...
 $ airt     : num  -10.7 -9.8 -2 -0.5 -0.4 -3 -4.5 -9.9 -4.5 3.2 ...
 $ f.airt   : chr  "" "" "" "" ...
 $ airtmax  : num  -6.9 -2.4 5.7 1.9 2.4 1.3 -0.7 -3.3 0.7 8.9 ...
 $ f.airtmax: chr  "" "" "" "" ...
 $ airtmin  : num  -15.1 -17.4 -7.3 -5.7 -5.7 -9 -12.7 -17.1 -11.7 -1.3 ...
 $ f.airtmin: chr  "" "" "" "" ...
 $ rh       : int  40 45 70 78 69 82 66 51 57 62 ...
 $ f.rh     : chr  "" "" "" "" ...
 $ rhmax    : int  58 85 100 100 100 100 100 71 81 78 ...
 $ f.rhmax  : chr  "" "" "" "" ...
 $ rhmin    : int  22 14 34 59 37 46 30 34 37 42 ...
 $ f.rhmin  : chr  "" "" "" "" ...
 $ dewp     : num  -22.2 -20.7 -7.6 -4.1 -6 -5.9 -10.8 -18.5 -12 -3.5 ...
 $ f.dewp   : chr  "" "" "" "" ...
 $ dewpmax  : num  -16.8 -9.2 -4.6 1.9 2 -0.4 -0.7 -14.4 -4 0.6 ...
 $ f.dewpmax: chr  "" "" "" "" ...
 $ dewpmin  : num  -25.7 -27.9 -10.2 -10.2 -12.1 -10.6 -25.4 -25 -16.5 -5.7 ...
 $ f.dewpmin: chr  "" "" "" "" ...
 $ prec     : num  0 0 0 6.9 0 2.3 0 0 0 0 ...
 $ f.prec   : chr  "" "" "" "" ...
 $ slrt     : num  14.9 14.8 14.8 2.6 10.5 6.4 10.3 15.5 15 7.7 ...
 $ f.slrt   : chr  "" "" "" "" ...
 $ part     : num  NA NA NA NA NA NA NA NA NA NA ...
 $ f.part   : chr  "M" "M" "M" "M" ...
 $ netr     : num  NA NA NA NA NA NA NA NA NA NA ...
 $ f.netr   : chr  "M" "M" "M" "M" ...
 $ bar      : int  1025 1033 1024 1016 1010 1016 1008 1022 1022 1017 ...
 $ f.bar    : chr  "" "" "" "" ...
 $ wspd     : num  3.3 1.7 1.7 2.5 1.6 1.1 3.3 2 2.5 2 ...
 $ f.wspd   : chr  "" "" "" "" ...
 $ wres     : num  2.9 0.9 0.9 1.9 1.2 0.5 3 1.9 2.1 1.8 ...
 $ f.wres   : chr  "" "" "" "" ...
 $ wdir     : int  287 245 278 197 300 182 281 272 217 218 ...
 $ f.wdir   : chr  "" "" "" "" ...
 $ wdev     : int  27 55 53 38 40 56 24 24 31 27 ...
 $ f.wdev   : chr  "" "" "" "" ...
 $ gspd     : num  15.4 7.2 9.6 11.2 12.7 5.8 16.9 10.3 11.1 10.9 ...
 $ f.gspd   : chr  "" "" "" "" ...
 $ s10t     : num  NA NA NA NA NA NA NA NA NA NA ...
 $ f.s10t   : chr  "M" "M" "M" "M" ...
 $ s10tmax  : num  NA NA NA NA NA NA NA NA NA NA ...
 $ f.s10tmax: chr  "M" "M" "M" "M" ...
 $ s10tmin  : num  NA NA NA NA NA NA NA NA NA NA ...
 $ f.s10tmin: chr  "M" "M" "M" "M" ...

The date column is currently coded as a factor. We want to be able to treat it as a date, so we will use the as.Date() function to convert it. We need to tell R what format the data is in. Our dates are YYY-MM-DD, which is represented by R as %Y-%m-%d.

har_met_daily$date <- as.Date(har_met_daily$date, format = "%Y-%m-%d")

We only want to look at the data from 2011:

yr_11_daily_avg <- subset(har_met_daily,
                            date >= as.Date('2011-01-01') &
                            date <= as.Date('2011-12-31'))

Now we can plot the air temperature (the airt column) by Julian day (the jd column):

ggplot() +
  geom_point(data = yr_11_daily_avg, aes(jd, airt)) +
  ggtitle("Daily Mean Air Temperature",
          subtitle = "NEON Harvard Forest Field Site") +
  xlab("Julian Day 2011") +
  ylab("Mean Air Temperature (C)")

plot of chunk air-temperature

There are no significant peaks or dips in the temperature during the late summer or early fall time period that might account for patterns seen in the NDVI data. Let’s have a look at the source Landsat imagery that was partially used used to derive our NDVI rasters to try to understand what appear to be outlier NDVI values.

plot of chunk ndvi-plotsplot of chunk ndvi-plots

Challenge: Examine RGB Raster Files

Plot the RGB images for the Julian days 277 and 293. Compare those with the RGB plots for Julian days 133 and 197 (shown above). Does the RGB imagery from these two days explain the low NDVI values observed on these days?

Answers

First we need to load in the RGB data for Julian day 277 and look at its metadata.

RGB_277 <- stack("data/NEON-DS-Landsat-NDVI/HARV/2011/RGB/277_HARV_landRGB.tif")
RGB_277
class      : RasterStack 
dimensions : 652, 696, 453792, 3  (nrow, ncol, ncell, nlayers)
resolution : 30, 30  (x, y)
extent     : 230775, 251655, 4704825, 4724385  (xmin, xmax, ymin, ymax)
crs        : +proj=utm +zone=19 +datum=WGS84 +units=m +no_defs +ellps=WGS84 +towgs84=0,0,0 
names      : X277_HARV_landRGB.1, X277_HARV_landRGB.2, X277_HARV_landRGB.3 
min values :                  26,                  29,                  79 
max values :                 255,                 255,                 255 

The RGB data has a max value of 255, but we need our color intensity to be between 0 and 1, so we will divide our RasterStack object by 255.

RGB_277 <- RGB_277/255

Next we convert it to a dataframe.

RGB_277_df <- as.data.frame(RGB_277, xy = TRUE)

We create RGB colors from the three channels:

RGB_277_df$rgb <- with(RGB_277_df, rgb(X277_HARV_landRGB.1, X277_HARV_landRGB.2, X277_HARV_landRGB.3,1))

Finally, we can plot the RGB data for Julian day 277.

ggplot() +
  geom_raster(data=RGB_277_df, aes(x, y), fill=RGB_277_df$rgb) + 
  ggtitle("Julian day 277") 

plot of chunk rgb-277 We then do the same steps for Julian day 293

# Julian day 293
RGB_293 <- stack("data/NEON-DS-Landsat-NDVI/HARV/2011/RGB/293_HARV_landRGB.tif")
RGB_293 <- RGB_293/255
RGB_293_df <- as.data.frame(RGB_293, xy = TRUE)
RGB_293_df$rgb <- with(RGB_293_df, rgb(X293_HARV_landRGB.1, X293_HARV_landRGB.2, X293_HARV_landRGB.3,1))
ggplot() +
  geom_raster(data = RGB_293_df, aes(x, y), fill = RGB_293_df$rgb) +
  ggtitle("Julian day 293")

plot of chunk rgb-293 This example highlights the importance of exploring the source of a derived data product. In this case, the NDVI data product was created using Landsat imagery - specifically the red and near-infrared bands. When we look at the RGB collected at Julian days 277 and 293 we see that most of the image is filled with clouds. The very low NDVI values resulted from cloud cover — a common challenge that we encounter when working with satellite remote sensing imagery.

Key Points

  • Use the list.files() function to get a list of filenames matching a specific pattern.

  • Use the facet_wrap() function to create multi-paneled plots with ggplot2.

  • Use the as.Date() function to convert data to date format.


Create Publication-quality Graphics

Overview

Teaching: 40 min
Exercises: 20 min
Questions
  • How can I create a publication-quality graphic and customize plot parameters?

Objectives
  • Assign custom names to bands in a RasterStack.

  • Customize raster plots using the ggplot2 package.

Things You’ll Need To Complete This Episode

See the lesson homepage for detailed information about the software, data, and other prerequisites you will need to work through the examples in this episode.

This episode covers how to customize your raster plots using the ggplot2 package in R to create publication-quality plots.

Before and After

In the previous episode, we learned how to plot multi-band raster data in R using the facet_wrap() function. This created a separate panel in our plot for each raster band. The plot we created together is shown below:

plot of chunk levelplot-time-series-before

Although this plot is informative, it isn’t something we would expect to see in a journal publication. The x and y-axis labels aren’t informative. There is a lot of unnecessary gray background and the titles of each panel don’t clearly state that the number refers to the Julian day the data was collected. In this episode, we will customize this plot above to produce a publication quality graphic. We will go through these steps iteratively. When we’re done, we will have created the plot shown below.

plot of chunk levelplot-time-series-after

Adjust the Plot Theme

The first thing we will do to our plot remove the x and y-axis labels and axis ticks, as these are unnecessary and make our plot look messy. We can do this by setting the plot theme to void.

ggplot() +
  geom_raster(data = NDVI_HARV_stack_df , aes(x = x, y = y, fill = value)) +
  facet_wrap(~variable) +
  ggtitle("Landsat NDVI", subtitle = "NEON Harvard Forest") + 
  theme_void()

plot of chunk adjust-theme

Next we will center our plot title and subtitle. We need to do this after the theme_void() layer, because R interprets the ggplot layers in order. If we first tell R to center our plot title, and then set the theme to void, any adjustments we’ve made to the plot theme will be over-written by the theme_void() function. So first we make the theme void and then we center the title. We center both the title and subtitle by using the theme() function and setting the hjust parameter to 0.5. The hjust parameter stands for “horizontal justification” and takes any value between 0 and 1. A setting of 0 indicates left justification and a setting of 1 indicates right justification.

ggplot() +
  geom_raster(data = NDVI_HARV_stack_df , aes(x = x, y = y, fill = value)) +
  facet_wrap(~variable) +
  ggtitle("Landsat NDVI", subtitle = "NEON Harvard Forest") + 
  theme_void() + 
  theme(plot.title = element_text(hjust = 0.5),
        plot.subtitle = element_text(hjust = 0.5))

plot of chunk adjust-theme-2

Challenge

Change the plot title (but not the subtitle) to bold font. You can (and should!) use the help menu in RStudio or any internet resources to figure out how to change this setting.

Answers

Learners can find this information in the help files for the theme() function. The parameter to set is called face.

ggplot() +
  geom_raster(data = NDVI_HARV_stack_df,
              aes(x = x, y = y, fill = value)) +
  facet_wrap(~ variable) +
  ggtitle("Landsat NDVI", subtitle = "NEON Harvard Forest") + 
  theme_void() + 
  theme(plot.title = element_text(hjust = 0.5, face = "bold"), 
        plot.subtitle = element_text(hjust = 0.5))

plot of chunk use-bold-face

Adjust the Color Ramp

Next, let’s adjust the color ramp used to render the rasters. First, we can change the blue color ramp to a green one that is more visually suited to our NDVI (greenness) data using the colorRampPalette() function in combination with colorBrewer. Then we use scale_fill_gradientn to pass the list of colours (here 20 different colours) to ggplot.

First we need to create a set of colors to use. We will select a set of nine colors from the “YlGn” (yellow-green) color palette. This returns a set of hex color codes:

brewer.pal(9, "YlGn")
[1] "#FFFFE5" "#F7FCB9" "#D9F0A3" "#ADDD8E" "#78C679" "#41AB5D" "#238443"
[8] "#006837" "#004529"

Then we will pass those color codes to the colorRampPalette function, which will interpolate from those colors a more nuanced color range.

green_colors <- brewer.pal(9, "YlGn") %>%
  colorRampPalette()

We can tell the colorRampPalette() function how many discrete colors within this color range to create. In our case, we will use 20 colors when we plot our graphic.

ggplot() +
  geom_raster(data = NDVI_HARV_stack_df , aes(x = x, y = y, fill = value)) +
  facet_wrap(~variable) +
  ggtitle("Landsat NDVI", subtitle = "NEON Harvard Forest") + 
  theme_void() + 
  theme(plot.title = element_text(hjust = 0.5, face = "bold"), 
    plot.subtitle = element_text(hjust = 0.5)) + 
  scale_fill_gradientn(name = "NDVI", colours = green_colors(20))

plot of chunk change-color-ramp

The yellow to green color ramp visually represents NDVI well given it’s a measure of greenness. Someone looking at the plot can quickly understand that pixels that are more green have a higher NDVI value.

Data Tip

For all of the brewer.pal ramp names see the brewerpal page.

Data Tip

Cynthia Brewer, the creator of ColorBrewer, offers an online tool to help choose suitable color ramps, or to create your own. ColorBrewer 2.0; Color Advise for Cartography

Refine Plot & Tile Labels

Next, let’s label each panel in our plot with the Julian day that the raster data for that panel was collected. The current names come from the band “layer names”” stored in the RasterStack and the first part of each name is the Julian day.

To create a more meaningful label we can remove the “x” and replace it with “day” using the gsub() function in R. The syntax is as follows: gsub("StringToReplace", "TextToReplaceIt", object).

First let’s remove “_HARV_NDVI_crop” from each label to make the labels shorter and remove repetition. To illustrate how this works, we will first look at the names for our NDVI_HARV_stack object:

names(NDVI_HARV_stack)
 [1] "X005_HARV_ndvi_crop" "X037_HARV_ndvi_crop" "X085_HARV_ndvi_crop"
 [4] "X133_HARV_ndvi_crop" "X181_HARV_ndvi_crop" "X197_HARV_ndvi_crop"
 [7] "X213_HARV_ndvi_crop" "X229_HARV_ndvi_crop" "X245_HARV_ndvi_crop"
[10] "X261_HARV_ndvi_crop" "X277_HARV_ndvi_crop" "X293_HARV_ndvi_crop"
[13] "X309_HARV_ndvi_crop"

Now we will use the gsub() function to find the character string “_HARV_ndvi_crop” and replace it with a blank string (“”). We will assign this output to a new object (raster_names) and look at that object to make sure our code is doing what we want it to.

raster_names <- names(NDVI_HARV_stack)

raster_names <- gsub("_HARV_ndvi_crop", "", raster_names)
raster_names
 [1] "X005" "X037" "X085" "X133" "X181" "X197" "X213" "X229" "X245" "X261"
[11] "X277" "X293" "X309"

So far so good. Now we will use gsub() again to replace the “X” with the word “Day” followed by a space.

raster_names  <- gsub("X", "Day ", raster_names)
raster_names
 [1] "Day 005" "Day 037" "Day 085" "Day 133" "Day 181" "Day 197" "Day 213"
 [8] "Day 229" "Day 245" "Day 261" "Day 277" "Day 293" "Day 309"

Our labels look good now. Let’s reassign them to our all_NDVI_HARV object:

labels_names <- setNames(raster_names, unique(NDVI_HARV_stack_df$variable))

Once the names for each band have been reassigned, we can render our plot with the new labels using alabeller.

ggplot() +
  geom_raster(data = NDVI_HARV_stack_df , aes(x = x, y = y, fill = value)) +
  facet_wrap(~variable, labeller = labeller(variable = labels_names)) +
  ggtitle("Landsat NDVI", subtitle = "NEON Harvard Forest") + 
  theme_void() + 
  theme(plot.title = element_text(hjust = 0.5, face = "bold"), 
    plot.subtitle = element_text(hjust = 0.5)) + 
  scale_fill_gradientn(name = "NDVI", colours = green_colors(20))

plot of chunk create-levelplot

Change Layout of Panels

We can adjust the columns of our plot by setting the number of columns ncol and the number of rows nrow in facet_wrap. Let’s make our plot so that it has a width of five panels.

ggplot() +
  geom_raster(data = NDVI_HARV_stack_df , aes(x = x, y = y, fill = value)) +
  facet_wrap(~variable, ncol = 5, labeller = labeller(variable = labels_names)) +
  ggtitle("Landsat NDVI", subtitle = "NEON Harvard Forest") + 
  theme_void() + 
  theme(plot.title = element_text(hjust = 0.5, face = "bold"), 
    plot.subtitle = element_text(hjust = 0.5)) + 
  scale_fill_gradientn(name = "NDVI", colours = green_colors(20))

plot of chunk adjust-layout

Now we have a beautiful, publication quality plot!

Challenge: Divergent Color Ramps

When we used the gsub() function to modify the tile labels we replaced the beginning of each tile title with “Day”. A more descriptive name could be “Julian Day”. Update the plot above with the following changes:

  1. Label each tile “Julian Day” with the julian day value following.
  2. Change the color ramp to a divergent brown to green color ramp.

Questions: Does having a divergent color ramp represent the data better than a sequential color ramp (like “YlGn”)? Can you think of other data sets where a divergent color ramp may be best?

Answers

raster_names  <- gsub("Day","Julian Day ", raster_names)
labels_names <- setNames(raster_names, unique(NDVI_HARV_stack_df$variable))

brown_green_colors <- colorRampPalette(brewer.pal(9, "BrBG"))

ggplot() +
  geom_raster(data = NDVI_HARV_stack_df , aes(x = x, y = y, fill = value)) +
  facet_wrap(~variable, ncol = 5, labeller = labeller(variable = labels_names)) +
  ggtitle("Landsat NDVI - Julian Days", subtitle = "Harvard Forest 2011") +
  theme_void() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"), 
  plot.subtitle = element_text(hjust = 0.5)) +
  scale_fill_gradientn(name = "NDVI", colours = brown_green_colors(20))

plot of chunk final-figure

For NDVI data, the sequential color ramp is better than the divergent as it is more akin to the process of greening up, which starts off at one end and just keeps increasing.

Key Points

  • Use the theme_void() function for a clean background to your plot.

  • Use the element_text() function to adjust text size, font, and position.

  • Use the brewer.pal() function to create a custom color palette.

  • Use the gsub() function to do pattern matching and replacement in text.


Derive Values from Raster Time Series

Overview

Teaching: 40 min
Exercises: 20 min
Questions
  • How can I calculate, extract, and export summarized raster pixel data?

Objectives
  • Extract summary pixel values from a raster.

  • Save summary values to a .csv file.

  • Plot summary pixel values using ggplot().

  • Compare NDVI values between two different sites.

Things You’ll Need To Complete This Episode

See the lesson homepage for detailed information about the software, data, and other prerequisites you will need to work through the examples in this episode.

In this episode, we will extract NDVI values from a raster time series dataset and plot them using the ggplot2 package.

Extract Summary Statistics From Raster Data

We often want to extract summary values from raster data. For example, we might want to understand overall greeness across a field site or at each plot within a field site. These values can then be compared between different field sites and combined with other related metrics to support modeling and further analysis.

Calculate Average NDVI

Our goal in this episode is to create a dataframe that contains a single, mean NDVI value for each raster in our time series. This value represents the mean NDVI value for this area on a given day.

We can calculate the mean for each raster using the cellStats() function. The cellStats() function produces a named numeric vector, where each value is associated with the name of raster stack it was derived from.

avg_NDVI_HARV <- cellStats(NDVI_HARV_stack, mean)
avg_NDVI_HARV
X005_HARV_ndvi_crop X037_HARV_ndvi_crop X085_HARV_ndvi_crop X133_HARV_ndvi_crop 
           0.365150            0.242645            0.251390            0.599300 
X181_HARV_ndvi_crop X197_HARV_ndvi_crop X213_HARV_ndvi_crop X229_HARV_ndvi_crop 
           0.878725            0.893250            0.878395            0.881505 
X245_HARV_ndvi_crop X261_HARV_ndvi_crop X277_HARV_ndvi_crop X293_HARV_ndvi_crop 
           0.850120            0.796360            0.033050            0.056895 
X309_HARV_ndvi_crop 
           0.541130 

We can then convert our output to a data frame using as.data.frame(). It’s a good idea to view the first few rows of our data frame with head() to make sure the structure is what we expect.

avg_NDVI_HARV <- as.data.frame(avg_NDVI_HARV)
head(avg_NDVI_HARV)
                    avg_NDVI_HARV
X005_HARV_ndvi_crop      0.365150
X037_HARV_ndvi_crop      0.242645
X085_HARV_ndvi_crop      0.251390
X133_HARV_ndvi_crop      0.599300
X181_HARV_ndvi_crop      0.878725
X197_HARV_ndvi_crop      0.893250

We now have a data frame with row names that are based on the original file name and a mean NDVI value for each file. Next, let’s clean up the column names in our data frame to make it easier for colleagues to work with our code.

It is a bit confusing to have duplicate object & column names (avg_NDVI_HARV). Additionally the “avg” does not clearly indicate what the value in that particular column is. Let’s change the NDVI column name to MeanNDVI.

names(avg_NDVI_HARV) <- "meanNDVI"
head(avg_NDVI_HARV)
                    meanNDVI
X005_HARV_ndvi_crop 0.365150
X037_HARV_ndvi_crop 0.242645
X085_HARV_ndvi_crop 0.251390
X133_HARV_ndvi_crop 0.599300
X181_HARV_ndvi_crop 0.878725
X197_HARV_ndvi_crop 0.893250

By renaming the column, we lose the “HARV” in the header that reminds us what site our data are from. While we are only working with one site now, we might want to compare several sites worth of data in the future. Let’s add a column to our dataframe called “site”.

avg_NDVI_HARV$site <- "HARV"

We can populate this column with the site name - HARV. Let’s also create a year column and populate it with 2011 - the year our data were collected.

avg_NDVI_HARV$year <- "2011"
head(avg_NDVI_HARV)
                    meanNDVI site year
X005_HARV_ndvi_crop 0.365150 HARV 2011
X037_HARV_ndvi_crop 0.242645 HARV 2011
X085_HARV_ndvi_crop 0.251390 HARV 2011
X133_HARV_ndvi_crop 0.599300 HARV 2011
X181_HARV_ndvi_crop 0.878725 HARV 2011
X197_HARV_ndvi_crop 0.893250 HARV 2011

We now have a dataframe that contains a row for each raster file processed, and columns for meanNDVI, site, and year.

Extract Julian Day from row names

We’d like to produce a plot where Julian days (the numeric day of the year, 0 - 365/366) are on the x-axis and NDVI is on the y-axis. To create this plot, we’ll need a column that contains the Julian day value.

One way to create a Julian day column is to use gsub() on the file name in each row. We can replace both the X and the _HARV_NDVI_crop to extract the Julian Day value, just like we did in the previous episode.

This time we will use one additional trick to do both of these steps at the same time. The vertical bar character ( | ) is equivalent to the word “or”. Using this character in our search pattern allows us to search for more than one pattern in our text strings.

julianDays <- gsub("X|_HARV_ndvi_crop", "", row.names(avg_NDVI_HARV))
julianDays
 [1] "005" "037" "085" "133" "181" "197" "213" "229" "245" "261" "277" "293"
[13] "309"

Now that we’ve extracted the Julian days from our row names, we can add that data to the data frame as a column called “julianDay”.

avg_NDVI_HARV$julianDay <- julianDays

Let’s check the class of this new column:

class(avg_NDVI_HARV$julianDay)
[1] "character"

Convert Julian Day to Date Class

Currently, the values in the Julian day column are stored as class character. Storing this data as a date object is better - for plotting, data subsetting and working with our data. Let’s convert. We worked with data conversions in an earlier episode. For a more introduction to date-time classes, see the NEON Data Skills tutorial Convert Date & Time Data from Character Class to Date-Time Class (POSIX) in R.

To convert a Julian day number to a date class, we need to set the origin, which is the day that our Julian days start counting from. Our data is from 2011 and we know that the USGS Landsat Team created Julian day values for this year. Therefore, the first day or “origin” for our Julian day count is 01 January 2011.

origin <- as.Date("2011-01-01")

Next we convert the julianDay column from character to integer.

avg_NDVI_HARV$julianDay <- as.integer(avg_NDVI_HARV$julianDay)

Once we set the Julian day origin, we can add the Julian day value (as an integer) to the origin date.

Note that when we convert our integer class julianDay values to dates, we subtracted 1. This is because the origin day is 01 January 2011, so the extracted day is 01. The Julian Day (or year day) for this is also 01. When we convert from the integer 05 julianDay value (indicating 5th of January), we cannot simply add origin + julianDay because 01 + 05 = 06 or 06 January 2011. To correct, this error we then subtract 1 to get the correct day, January 05 2011.

avg_NDVI_HARV$Date<- origin + (avg_NDVI_HARV$julianDay - 1)
head(avg_NDVI_HARV$Date)
[1] "2011-01-05" "2011-02-06" "2011-03-26" "2011-05-13" "2011-06-30"
[6] "2011-07-16"

Since the origin date was originally set as a Date class object, the new Date column is also stored as class Date.

class(avg_NDVI_HARV$Date)
[1] "Date"

Challenge: NDVI for the San Joaquin Experimental Range

We often want to compare two different sites. The National Ecological Observatory Network (NEON) also has a field site in Southern California at the San Joaquin Experimental Range (SJER).

For this challenge, create a dataframe containing the mean NDVI values and the Julian days the data was collected (in date format) for the NEON San Joaquin Experimental Range field site. NDVI data for SJER are located in the NEON-DS-Landsat-NDVI/SJER/2011/NDVI directory.

Answers

First we will read in the NDVI data for the SJER field site.

NDVI_path_SJER <- "data/NEON-DS-Landsat-NDVI/SJER/2011/NDVI"

all_NDVI_SJER <- list.files(NDVI_path_SJER,
                            full.names = TRUE,
                            pattern = ".tif$")

NDVI_stack_SJER <- stack(all_NDVI_SJER)

NDVI_stack_SJER <- NDVI_stack_SJER/10000

Then we can calculate the mean values for each day and put that in a dataframe.

avg_NDVI_SJER <- as.data.frame(cellStats(NDVI_stack_SJER, mean))

Next we rename the NDVI column, and add site and year columns to our data.

names(avg_NDVI_SJER) <- "meanNDVI"
avg_NDVI_SJER$site <- "SJER"
avg_NDVI_SJER$year <- "2011"

Now we will create our Julian day column

julianDays_SJER <- gsub("X|_SJER_ndvi_crop", "", row.names(avg_NDVI_SJER))
origin <- as.Date("2011-01-01")
avg_NDVI_SJER$julianDay <- as.integer(julianDays_SJER)

avg_NDVI_SJER$Date <- origin + (avg_NDVI_SJER$julianDay - 1)

head(avg_NDVI_SJER)
                    meanNDVI site year julianDay       Date
X014_SJER_ndvi_crop 0.048216 SJER 2011        14 2011-01-14
X046_SJER_ndvi_crop 0.529780 SJER 2011        46 2011-02-15
X062_SJER_ndvi_crop 0.554368 SJER 2011        62 2011-03-03
X094_SJER_ndvi_crop 0.601096 SJER 2011        94 2011-04-04
X110_SJER_ndvi_crop 0.555836 SJER 2011       110 2011-04-20
X126_SJER_ndvi_crop 0.538336 SJER 2011       126 2011-05-06

Plot NDVI Using ggplot

We now have a clean dataframe with properly scaled NDVI and Julian days. Let’s plot our data.

ggplot(avg_NDVI_HARV, aes(julianDay, meanNDVI)) +
  geom_point() +
  ggtitle("Landsat Derived NDVI - 2011", subtitle = "NEON Harvard Forest Field Site") +
  xlab("Julian Days") + ylab("Mean NDVI")

plot of chunk ggplot-data

Challenge: Plot San Joaquin Experimental Range Data

Create a complementary plot for the SJER data. Plot the data points in a different color.

Answers

ggplot(avg_NDVI_SJER, aes(julianDay, meanNDVI)) +
  geom_point(colour = "SpringGreen4") +
  ggtitle("Landsat Derived NDVI - 2011", subtitle = "NEON SJER Field Site") +
  xlab("Julian Day") + ylab("Mean NDVI")

plot of chunk avg-ndvi-sjer

Compare NDVI from Two Different Sites in One Plot

Comparison of plots is often easiest when both plots are side by side. Or, even better, if both sets of data are plotted in the same plot. We can do this by merging the two data sets together. The date frames must have the same number of columns and exact same column names to be merged.

NDVI_HARV_SJER <- rbind(avg_NDVI_HARV, avg_NDVI_SJER)

Now we can plot both datasets on the same plot.

ggplot(NDVI_HARV_SJER, aes(x = julianDay, y = meanNDVI, colour = site)) +
  geom_point(aes(group = site)) +
  geom_line(aes(group = site)) +
  ggtitle("Landsat Derived NDVI - 2011", subtitle = "Harvard Forest vs San Joaquin") +
  xlab("Julian Day") + ylab("Mean NDVI")

plot of chunk ndvi-harv-sjer-comp

Challenge: Plot NDVI with date

Plot the SJER and HARV data in one plot but use date, rather than Julian day, on the x-axis.

Answers

ggplot(NDVI_HARV_SJER, aes(x = Date, y = meanNDVI, colour = site)) +
  geom_point(aes(group = site)) +
  geom_line(aes(group = site)) +
  ggtitle("Landsat Derived NDVI - 2011", subtitle = "Harvard Forest vs San Joaquin") +
  xlab("Date") + ylab("Mean NDVI")

plot of chunk ndvi-harv-sjer-date

Remove Outlier Data

As we look at these plots we see variation in greenness across the year. However, the pattern is interrupted by a few points where NDVI quickly drops towards 0 during a time period when we might expect the vegetation to have a higher greenness value. Is the vegetation truly senescent or gone or are these outlier values that should be removed from the data?

We’ve seen in an earlier episode that data points with very low NDVI values can be associated with images that are filled with clouds. Thus, we can attribute the low NDVI values to high levels of cloud cover. Is the same thing happening at SJER?

plot of chunk view-all-rgb-SJER

Without significant additional processing, we will not be able to retrieve a strong reflection from vegetation, from a remotely sensed image that is predominantly cloud covered. Thus, these points are likely bad data points. Let’s remove them.

First, we will identify the good data points that should be retained. One way to do this is by identifying a threshold value. All values below that threshold will be removed from our analysis. We will use 0.1 as an example for this episode. We can then use the subset function to remove outlier datapoints (below our identified threshold).

Data Tip

Thresholding, or removing outlier data, can be tricky business. In this case, we can be confident that some of our NDVI values are not valid due to cloud cover. However, a threshold value may not always be sufficient given that 0.1 could be a valid NDVI value in some areas. This is where decision-making should be fueled by practical scientific knowledge of the data and the desired outcomes!

avg_NDVI_HARV_clean <- subset(avg_NDVI_HARV, meanNDVI > 0.1)
avg_NDVI_HARV_clean$meanNDVI < 0.1
 [1] FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

Now we can create another plot without the suspect data.

ggplot(avg_NDVI_HARV_clean, aes(x = julianDay, y = meanNDVI)) +
  geom_point() +
  ggtitle("Landsat Derived NDVI - 2011", subtitle = "NEON Harvard Forest Field Site") +
  xlab("Julian Days") + ylab("Mean NDVI")

plot of chunk plot-clean-HARV

Now our outlier data points are removed and the pattern of “green-up” and “brown-down” makes more sense.

Write NDVI data to a .csv File

We can write our final NDVI dataframe out to a text format, to quickly share with a colleague or to reuse for analysis or visualization purposes. We will export in Comma Seperated Value (.csv) file format because it is usable in many different tools and across platforms (MAC, PC, etc).

We will use write.csv() to write a specified dataframe to a .csv file. Unless you designate a different directory, the output file will be saved in your working directory.

Before saving our file, let’s view the format to make sure it is what we want as an output format.

head(avg_NDVI_HARV_clean)
                    meanNDVI site year julianDay       Date
X005_HARV_ndvi_crop 0.365150 HARV 2011         5 2011-01-05
X037_HARV_ndvi_crop 0.242645 HARV 2011        37 2011-02-06
X085_HARV_ndvi_crop 0.251390 HARV 2011        85 2011-03-26
X133_HARV_ndvi_crop 0.599300 HARV 2011       133 2011-05-13
X181_HARV_ndvi_crop 0.878725 HARV 2011       181 2011-06-30
X197_HARV_ndvi_crop 0.893250 HARV 2011       197 2011-07-16

It looks like we have a series of row.names that we do not need because we have this information stored in individual columns in our data frame. Let’s remove the row names.

row.names(avg_NDVI_HARV_clean) <- NULL
head(avg_NDVI_HARV_clean)
  meanNDVI site year julianDay       Date
1 0.365150 HARV 2011         5 2011-01-05
2 0.242645 HARV 2011        37 2011-02-06
3 0.251390 HARV 2011        85 2011-03-26
4 0.599300 HARV 2011       133 2011-05-13
5 0.878725 HARV 2011       181 2011-06-30
6 0.893250 HARV 2011       197 2011-07-16
write.csv(avg_NDVI_HARV_clean, file="meanNDVI_HARV_2011.csv")

Challenge: Write to .csv

  1. Create a NDVI .csv file for the NEON SJER field site that is comparable with the one we just created for the Harvard Forest. Be sure to inspect for questionable values before writing any data to a .csv file.
  2. Create a NDVI .csv file that includes data from both field sites.

Answers

avg_NDVI_SJER_clean <- subset(avg_NDVI_SJER, meanNDVI > 0.1)
row.names(avg_NDVI_SJER_clean) <- NULL
head(avg_NDVI_SJER_clean)
  meanNDVI site year julianDay       Date
1 0.529780 SJER 2011        46 2011-02-15
2 0.554368 SJER 2011        62 2011-03-03
3 0.601096 SJER 2011        94 2011-04-04
4 0.555836 SJER 2011       110 2011-04-20
5 0.538336 SJER 2011       126 2011-05-06
6 0.400868 SJER 2011       142 2011-05-22
write.csv(avg_NDVI_SJER_clean, file = "meanNDVI_SJER_2011.csv")

Key Points

  • Use the cellStats() function to calculate summary statistics for cells in a raster object.

  • The pipe (|) operator means or.

  • Use the rbind() function to combine data frames that have the same column names.