intro_spatial_data_R.Rmd

---
title: "Intro to Spatial Analysis in R"
author: "Jamie Afflerbach"
output:
  html_document:
    code_folding: show
    toc: yes
    toc_depth: 3
    toc_float: yes
  pdf_document:
    toc: yes
---


#Introduction

This is an introduction to Spatial Analysis in R. I've developed this code based on some common questions from friends and colleagues or ones that I've asked myself. There is a lot here to help you get started, but there is also **a lot** more to learn!

The focus here will be on raster analysis, rather than vector (shapefiles, polygons, polylines, points, etc.).
  
  
I've drafted this informal introductory session in the form of answering a scientific question....  

**Where are optimal sites for Sea Monkey aquaculture off the west coast?**  

```{r ,echo=FALSE,warning=F,message=F}

knitr::opts_chunk$set(warning=FALSE, message=FALSE)
library(png)
library(grid)
img <- readPNG('./images/seamonkeys.png')
 grid.raster(img)
 
 options(scipen = 999)
```

We will answer this question by taking into consideration the following spatial data:  
**1. Sea Surface Temperature**
**2. Net Primary Productivity**
**3. Marine Protected Areas**  

Key information for optimal growth:  

- Sea surface temperatures between **12 and 18 degrees Celsius**  
- Net Primary Productiviy between **2.6 and 3 mgC/m2/day**


***

#Background

Raster or **gridded** data are stored as a grid of value which are rendered on a map as pixels. Each pixel value represents an area on the Earth's surface.


```{r ,echo=FALSE}
img <- readPNG('images/raster_concept.png')
 grid.raster(img)
```

##Examples

Some examples of raster data include oceanographic datasets such as Sea Surface Temperature, land use maps and digital elevation maps.


```{r ,echo=FALSE}
img <- readPNG('images/examples.png')
 grid.raster(img)
```


##What is a GeoTIFF??

Raster data can come in many different formats. In this tutorial, we will use the geotiff format which has the extension `.tif`. A `.tif` file stores metadata or attributes about the file as embedded tif tags. These tags can include the following raster metadata:

  1. A Coordinate Reference System (`CRS`)
  2. Spatial Extent (`extent`)
  3. Values that represent missing data (`NoDataValue`)
  4. The `resolution` of the data

*Information in this section is borrowed from [NEON's Intro to Raster Data in R](http://neondataskills.org/R/Introduction-to-Raster-Data-In-R/) tutorial, another great resource*

***

#Setup  

##Libraries & Settings

There are a lot of spatial packages for R, we will touch on some of them here but not all of them. Here is brief overview, taken from [this site](http://geoawesomeness.com/r-goes-spatial/):  

* **raster:** Reading, writing, manipulating, analyzing and modeling of gridded spatial data  
* **rgdal:** Provides the most important and basic spatial functionalities. Provides bindings to Frank Warmerdam’s Geospatial Data Abstraction Library (GDAL) (>= 1.6.3, < 2) and access to projection/transformation operations from the PROJ.4 library   
* **sp:** provides classes and functions for spatial data   
* **rgeos:** Provides spatial vector operations like buffer and intersect. Interface to Geometry Engine – Open Source (GEOS) using the C API for topology operations on geometries.  
* **maps**: This package has pre-loaded maps stored which can be added to your map plots. 
* **maptools:** tools for reading and writing spatial data (visualisation)  
* **ncdf4:** Use with NetCDF files. Note that the `raster` package is also able to read NetCDF files and I prefer to use Raster whenever possible.  
* **tmap:** ggplot 2 for maps! A new package that gives Arc-like functionality to creating publication ready maps.

Load all libraries:

```{r libraries}

library(raster)       #Main raster library with nearly all functions used in this analysis
library(rgdal)        #Spatial library - most functions used from rgdal are for vectors (shapefiles)
library(rasterVis)    #Useful for raster visualizations
library(maps)         #Has a database of maps. I use this to add a map to my raster to visualize land boundaries
library(rgeos)        #Need this library for topology operations on geometries
library(dplyr)        #NOT spatial - this is a data wrangling library
library(RColorBrewer) #Also not spatial - used to set the spectral color scheme 

```  


For raster data visualization I prefer the spectral color scheme rather than the base graphics package. I'm also setting the plotting margins much smaller so that the plots will show up larger in the viewing pane.
```{r settings}
# rainbow color scheme
cols = rev(colorRampPalette(brewer.pal(11, 'Spectral'))(255)) 

#setting margins for plotting
par(mar=c(2,2,1,1))
```

***

#Data Prep

My first step in a spatial analysis is prepping the data, which includes the following:

- Read in data 
- Pre-process the data is it "plays nicely", 
- Visualize the data


##Shapefiles


### Load a Shapefile  

Read in a shapefile of the US West Coast and northern Baja peninsula by using `readOGR` from the `rgdal` package.

```{r shapefile}
# dsn is the path name & layer is the name of the file. NOTE: you do not need to add an extension to the layer name

#named cc for california current
cc = readOGR(dsn='data',layer='ca_current')
```

### Plot a Shapefile  

Plotting a shapefile is just as easy as:
```{r plot shp}
plot(cc)
```

And to add land to the map, do the following (from the `maps` package)
```{r maps}
plot(cc)

#add a landmap to your shapefile
map('world',fill=T,add=T,col='gray')
```

The information in the summary of the shapefile is important if you need to understand what `projection` your **SpatialPolygonsDataFrame** is in, along with the `extent` and number of features. In this case there are just two features, the EEZs of the US West Coast and northern Baja Mexico. You can get this information by just typing in the name 'cc'

```{r shapefile summary}

cc

```

### Attribute Tables    

You can look at the data held within a shapefile by calling `cc@data`. The dataframe associated with the shapefile can be treated just as any other dataframe. Columns can be added, names can be changed, tables can be joined. 

```{r attribute table}

cc@data

#Add another column:

cc@data$short = c('USA','MEX')

cc@data
```

***  

## Raster Data  

Now that we have our boundary area defined by the shapefile, we can start prepping the raster data.

**Sea Surface Temperature**

In the Data folder, there are 5 `.tif` files with the naming pattern `average_annual_sst_[year].tif`, which are 5 annual average sea surface temperatures for our region (2008-2012). We want just **one** raster file of the average SST over that time period.

To create a single average Sea Surface Temperature layer, do the following:  

### Read in Raster Data  

```{r sst}

sst_files = list.files('data',pattern='average_',full.names = T) #We need full file paths

sst_files
```

### Visualize 

I like to visualize the raw data before running any calculation or analysis.

Create a raster of the first file by calling `raster()` and then `plot()` to visualize.  


```{r raster one sst}
#This function reads in raster files. Think of it as similar to read.csv()
r = raster(sst_files[1]) 

#remember cols was defined at the beginning to plot values in the spectral color scheme
plot(r,col=cols)
```
*Notice the data values are in Kelvin - we will change this to celsius later.*


You can plot rasters or shapefiles on top of each other
```{r plot one sst}
plot(r,col=cols, main = 'Sea Surface Temperature (K) 2008')
plot(cc,add=T)
```

I also like to look at the distribution of data. Using the `histogram()` function from `rasterVis` is my preference over `hist()` from the base package purely because of the visual output.

```{r histogram sst}
histogram(r)
```

### Stack rasters

To get a single layer of average SST in degrees Celsius we need to first `stack` all layers.

```{r stackImg,echo=F}

img <- readPNG('images/singletomulti.png')
grid.raster(img)

```


```{r calc avg SST}

#stack is a function from the raster package that puts all RasterLayers into a RasterStack
sstStack = stack(sst_files)

plot(sstStack,col=cols)
```

### Raster Calcuations

You can perform operations on a RasterStack by using the `calc()` function from the `raster` package. `calc()` lets you define a function to apply across all layers in the stack.

Calculate the mean value per cell and then convert to Celsius by subtracting 273.15.
```{r calc sst}
# By adding 'filename=' R will directly save the raster into the defined file rather than memory
sstAvg = calc(sstStack,fun=function(x){mean(x,na.rm=T)-273.15})#,filename='data/sstAvg.tif', overwrite=T) 

plot(sstAvg,col=cols, main = 'Mean Sea Surface Temperature (Celsius)');plot(cc,add=T)
```

A more compact way of doing multiple raster analysis is by using pipes...you can run `stack()` and `calc()` in one call!
```{r pipes sst}

sstAvg = stack(sst_files)%>%
          calc(.,fun=function(x){mean(x,na.rm=T)-273.15})

plot(sstAvg,col=cols, main = 'Mean Sea Surface Temperature (Celsius)');plot(cc,add=T)
```


**Net Primary Production (NPP)**

#### Read in raster data   

Read in this data the same way as the SST data, using `raster()`. This data is the net primary production (mgC/m2/day).   
```{r avg npp}

npp = raster('data/annual_npp.tif');npp

plot(npp,col=cols, main = 'Net Primary Production (mgC/m2/day)')

```

You'll see that this is in a different projection, extent and cell size from the SST data. It is really obvious when you look at the plot, but the summary above also gives clues as to what projection/resolution/extent this data is in.  

To do any sort of analysis using multiple rasters, they all need to be in the same extent, projection and cell resolution.

First look at the differences:
```{r}

sstAvg
npp

```

To get the primary productivity data in the same format as the SST data, we need to 

  1. `reproject`
  2. `crop`
  3. `resample`
  

### Reproject   

Use `projectRaster()` from the raster package to reproject a RasterLayer from one projection to another. You will need to define what the new projection should be by setting a coordinate reference system.  

Defining a **coordinate reference system (crs)** can be done in many ways. See [Melanie's great cheat sheet](https://www.nceas.ucsb.edu/~frazier/RSpatialGuides/OverviewCoordinateReferenceSystems.pdf) for more details about Coordinate Reference Systems.  

```{r crs,echo=F}

img <- readPNG('images/crs.png')
grid.raster(img)

```

Here, we want to project from *Mollweide* to *longlat*  

```{r reproject,warning=FALSE}

nppProj = projectRaster(npp,crs = ('+proj=longlat'))

#You will see a warning about finite points. I'm not entirely sure what this warning means but I suspect it has to do with warping of some cells when reprojecting. I've googled it and it seems like this is a harmless warning.

plot(nppProj,col=cols)
```

### Crop  

Now that the layer is in the right projection, we need to crop it to our study area and make sure all raster layers have the same extent. 

You can set the extent using `extent()`. You can also use another raster object to define the extent (in this case we will use sstAvg)

```{r crop}
nppCrop = crop(nppProj,sstAvg) #crop nppProj to the extent of sstAvg
plot(nppCrop,col=cols);plot(cc,add=T)
```

### Resample    

Just by plotting both the SST and OA data, you can tell right away that these two datasets have different cell resolutions. The NPP data needs to be resampled to the same cell size as SST in order to do any sort of analysis on these two. Use the *nearest neighbor* method to avoid interpolation between points. We want to keep the data values the same as the original data, but at a higher resolution.  

Here you can see the difference:

```{r resample,message=F,warning=F,quiet=T,verbose=F,results=F}

#include progress='text' in any raster function to see a progress bar as the function runs! This is one of my favorite R things!
npp_res = resample(nppCrop,sstAvg,method='ngb')#,progress='text')
npp_bil = resample(nppCrop,sstAvg,method='bilinear')#,progress='text')

par(mfrow=c(1,2))
plot(npp_res,col=cols);plot(npp_bil,col=cols)
dev.off()  # This turns off the graphic device to stop plotting in 2x1 format
```  

  
*NOTE: Typically you'll want to disaggregate cells to match data of a higher resolution. Otherwise, if we aggregate the cells from the SST data, we would lose data.*  

Again we can condense this script by using pipes!  
```{r pipes npp,warning=F}

npp = projectRaster(npp,crs=('+proj=longlat'))%>%
        crop(.,sstAvg)%>%
          resample(.,sstAvg,method='ngb')#,progress='text')

plot(npp,col=cols)
```

#### Check prepped data  

Check to see that we can use the SST and NPP data together now
```{r stack data}
stack(npp,sstAvg) #No error and the stack has two layers so it looks good!
```

***

# Analysis   

Now that our data is prepped and guaranteed to play nicely, we can move onto the fun stuff - **analyzing the data**. For this specific analysis, we need to use the SST and NPP data to find areas within the `cc` region that are suitable for growing seamonkeys. This requires removal of all cells from NPP and SST that are not within the ideal growth parameter range.


## Reclassify Cells

**Sea Surface Temperature**

Remove all cells from the Sea Surface Temperature layer that fall out of the species temperature range

Seamonkeys grow best in waters that are **between 12 and 18 degrees Celsius.**  

Remembering that `sstAvg` is our current SST layer, you can eliminate all cells with values outside of your range in a few different ways. 


A clunky way of getting our suitable cells is by using brackets to get a subset of the cells.

```{r}
plot(sstAvg,col=cols);plot(cc,add=T)
# trying something like this gives you all of the cell values that fit your condition. Not exactly what we want
#sstPref=sstAvg[sstAvg>=12 & sstAvg<=18]
```

You can set all values outside of our range to NA by subsetting - but don't overwrite your original data!
```{r}
sstPref = sstAvg #rename variable so we maintain the original sstAvg.
sstPref[sstPref<12 | sstPref>18]<-NA #set all cells outside of our range to NA using brackets to subset
plot(sstPref,col=cols, main = 'Mean Sea Surface Temperature (Celsius)');plot(cc,add=T)
```

Repeat the process with a new variable and set all suitable cells in SST to equal 1  
```{r}
sstBin = sstPref            #assigning new variable sstBin (binary) so that sstPref is not overwritten
sstBin[!is.na(sstBin)]<-1   #setting all cells that are not NA to 1
plot(sstBin, col='coral', main='Areas with suitable SST');plot(cc,add=T)
```

**Net Primary Production**

Instead of repeating the same subsetting process we did for SST, we can use `reclassify()` which allows us to define the value for cells that fall in a specific range.

Seamonkeys prefer water with a mean primary production of **between 2.6 and 3 mgC/m2/day**


```{r}
plot(npp,col=cols)

#define your range and the output value within the reclassify() function as the second argument 
nppBin <- reclassify(npp,c(-Inf,2.6,NA,
                            2.6,3,1,
                            3,Inf, NA))

plot(nppBin,col='darkorchid2',main='Areas with Suitable NPP');plot(cc,add=T)

```

### Combine rasters with overlay()

Now that we have these two binary layers, we can combine them using `overlay()` from the raster package and the resulting cells, equal to 1, are the cells that meet both SST and NPP requirements
```{r}

cells = overlay(sstBin,nppBin,fun=function(x,y){x*y})
plot(cells,col='lightblue',main='Suitable Aquaculture Areas');plot(cc,add=T)

```
***  

**SIDE NOTE:**  
You can perform mathematical operations on single or multiple raster layers using base R functions. Both `calc()` and `overlay()` are useful when you have complex functions working on these layers. Here are some examples of how you can use raster layers with base R functions:

```{r}
#sum
sum = sum(sstBin,nppBin,na.rm=T);plot(sum) #gives you cells equal to 0, 1 and 2. 
#power
power = sstAvg^4;plot(power,col=cols)
#average
avg = mean(sstAvg,npp);plot(avg,col=cols)
```
***  

#Additional Functions  

This could be all you need - but I want to show some additional steps to highlight more functionality in R.  
  
  
### Mask  

You can remove cells outside of your region by using the `mask()` function. Here, you only want to keep cells that are within the Mexican or US EEZs.
```{r mask}
plot(cells, col='lightblue');plot(cc,add=T)
cellsEEZ = mask(cells,cc)

plot(cellsEEZ,col='lightblue', main = 'Suitable Aquaculture Areas');plot(cc,add=T)
```

### Crop to region

You can then crop your raster to a smaller extent
```{r}

cellsCrop = crop(cellsEEZ,extent(-125,-110,25,35)) #setting extent by eyeing it
plot(cellsCrop,col='lightblue',main = 'Suitable Aquaculture Areas');plot(cc,add=T)
```

### Draw Extent

Another nifty thing - if you don't know the extent, you can draw it and R will tell you! You can then save this extent and crop to it
```{r,warning=F,message=F}

plot(cellsEEZ,col='lightblue');plot(cc,add=T)
ext <- drawExtent(show=TRUE,col='red')

#class       : Extent 
#xmin        : -126.729 
#xmax        : -113.3571 
#ymin        : 27.48709 
#ymax        : 35.00825 

ext=extent(-126.729,-113.3571,27.48709,35.00825) #I picked this one,
cropCells=crop(cellsEEZ,ext)
plot(cropCells,col='lightblue', main = 'Suitable Aquaculture Areas');plot(cc,add=T)
```

### Remove cells overlapping with polygons

There are two shapefiles in the **data** folder. One is 'MPA_State' which are state MPAs, and one is 'National Marine Sanctuaries'. To go along with the aquaculture analysis, lets say that cells within these regions must be excluded from the suitability analysis.

```{r mpas,message=F,warning=F}

#marine protected areas
mpa = readOGR(dsn='data',layer='MPA_State',verbose=F);mpa
plot(mpa, main = 'Marine Protected Areas along the West Coast')

# try to crop to our extent
mpa_c = crop(mpa,ext);mpa_c
```

Notice that when creaing `mpa_c` there is no error. But when you call it the output is NULL, indicating a NULL object. This is because the MPA shapefile is **not in the same projection**  

Using `spTransform()` from the `rgdal` package, you can project a shapefile in a similar manner as raster.

```{r}

mpa = spTransform(mpa,CRS("+proj=longlat")) #spTransform is part of rgdal package

#crop again
mpa_c = crop(mpa,ext);plot(mpa_c, main = "Marine Protected Areas");plot(cc,add=T)

```

The same can be done with the National Marine Sanctuary shapefile, using pipes!
```{r}

#national marine sanctuaries
nms = readOGR(dsn='data',layer='National Marine Sanctuaries',verbose=F)%>%
          spTransform(.,CRS("+proj=longlat"))%>%
            crop(.,ext)

plot(nms, main = 'National Marine Sanctuaries');plot(cc,add=T)

```

To remove cells in the MPAs and NMS just use the `mask()` function but set `inverse=T` so that all cells **outside** of the polygons are kept and those inside are set to NA.

```{r mask inverse}

cellsAQ = mask(cropCells,mpa_c,inverse=T)%>%
            mask(.,nms,inverse=T) #by setting inverse=T, all cells that do not overlap with the mask are kept and those overlapping the mask are set to NA

plot(cellsAQ,col='lightblue', main = 'Suitable Aquaculture Areas');plot(cc,add=T)
```
  

With any raster analysis, you likely aren't just creating a pretty map. Here is an example of running **zonal statistics** on a raster.  

We want to look at the total area (km2) in Mexico and California for seamonkey aquaculture.  

### Rasterize Shapefile

First you want to turn a shapefile (california current) into a raster of the same projection/extent/resolution as your cells.

```{r rasterize}

#let's take a look at what the CC dataframe looks like again
head(cc)

ccZones = rasterize(cc,cellsAQ)#,progress='text')
ccZones
plot(ccZones,col=cols, main='Shapefile rasterized into zones (US and Mexico)')

```  
R automatically assigned 1 and 2 as the cell values. You can set your own values too based on a field or another defined vector of ids.  

```{r rasterize by field}

#using a field from the shapefile

cc_ras_ID = rasterize(cc,cellsAQ,field="ID");cc_ras_ID
plot(cc_ras_ID, col=cols)

```

### Zonal Statistics

To get the total viable area for aquaculture of seamonkeys in the California Current, run `zonal()` using `ccZones`. The `zonal()` function is given any sort of function you define including sum, mean, max, etc.  

Since the current values of `prefRas` are all equal to 1, we can simply sum up the number of cells in each EEZ.

```{r zonal}
par(mfrow=c(1,2))
plot(cellsAQ,col='lightblue',main='Suitable Aquaculture Areas');plot(ccZones,col=cols,main='US and Mexico zones')

cellsSum = zonal(cellsAQ,ccZones,fun='sum') #total number of cells since they are all equal to 1
cellsSum
dev.off()
```

### Calculate area() per cell

But that isn't as useful as calculating the actual area in km2. Using the `area()` function from the `raster` package you can create a new raster with cell values equal to their area, and then run `zonal()`.  

```{r area zonal}

cellsArea = area(cellsAQ,na.rm=T);plot(cellsArea,col=cols, main='Cell area (km2)') #gives are in km2

area = zonal(cellsArea,ccZones,fun='sum');area
```

**There are `r area[1,2]` km2 of suitable area for seamonkey aquaculture in US waters and `r area[2,2]` km2 in Mexican waters.**

###Playing with the tmap package

```{r}
#install.packages('tmap')

library('tmap')
data(World)

extent(cellsAQ)<-extent(cc)


tm_shape(cellsAQ)+
  tm_raster(title='Suitable Area')+
tm_shape(cc)+  
  tm_borders()+
tm_shape(World)+
  tm_fill(col = 'beige')+
tm_shape(World)+
  tm_borders()


```