---
title: 'Activity 1 - R Primer'
author: "Michael Dietze"
output: html_document
---

## Objectives

The objective of today's hands-on activity is to provide a basic overview of R.  For each topic covered the first part will be directed – you will follow the prescribed sequence of R commands in order to familiarize yourself with what they do.  The second part of each topic will ask you to apply these R commands. 

## Assignment

For this activity you will turn in an Rmd file that contains both code and written answers. Tasks are labeled as tier [A] or [B]. [A] tasks are required. [B] tasks are recommended and can only increase your grade.

**Always check that the Rmd file "knits" before submitting!**

## Getting this exercise

This exercise is available from http://github.com/EcoForecast and assumes that you will be working from within the RStudio editor with git installed. This requires a few steps:

0. Before installing RStudio, you'll want to make sure you have [installed R](http://cran.us.r-project.org/)
1. RStudio can be downloaded for free from http://www.rstudio.com/
2. You will need to install Git, which you can do by following the instructions at [RStudio Support - Version Control with Git and SVN](https://support.rstudio.com/hc/en-us/articles/200532077-Version-Control-with-Git-and-SVN).
3. You will need to make sure RStudio knows where git is installed. 
  + In RStudio click on Tools > Global Options > Git/SVN
  + Make sure the "Git executable" path is set to where git was installed 
  + Make sure "Enable version control interface" is checked
4. You will need to introduce yourself to git.
  + From RStudio click Tools > Shell
  + To set your name, enter the following & hit Return: 
```
git config --global user.name "your name"
```
  + To set your email, enter the following & hit Return: 
```
git config --global user.email "your.email@gmail.com"
```
  + You can now close this Shell window
  
The best way to get a copy of this exercise, and all the other EcoForecast exercises, is to click on the project pull-down menu in the top-right corner of RStudio and select "New Project...".  

Next, click "Version Control" and then "Git".

Enter the following address in the Repository URL: git@github.com:EcoForecast/EF_Activities.git

You can optionally choose to name the folder something different than EF_Activities (though be aware that many activities will assume this is the name of the folder) or have the folder saved somewhere other than your Home directory.

When you hit "Create Project" this will clone a copy of the git project from the github.com website.

If you anticipate making changes to a git project on GitHub then, instead of cloning from https://github.com/EcoForecast/EF_Activities you should go to that website and click on the "Fork" button in the top right corner. This requires that you have an account on GitHub, which is free and easy to set up. Once you fork the repository, this will make a copy to git@github.com:username/EF_Activities where _username_ is your GitHub username, and you can use that as your Repository URL.

If you just can't get Git to work, you can download this activity from the same website, https://github.com/EcoForecast/EF_Activities, by clicking on the "Download ZIP" button in the bottom right.

Next, to be able to compile this document into HTML or any other format you will need to install the "knitr" library. From the "Packages" tab in the bottom right window click on "Install", enter knitr as the package name, and then click Install.

Finally, go to File > Open File and open this file, "Exercise_01_Rprimer.Rmd", and then click "Knit HTML" in the top-left window's menu bar.

## Knitting Rmarkdown documents

To be able to compile Rmd into HTML or any other format you will need to install the "knitr" library. 

If you do not already see a button at the top of the editor window that says "Knit" with an icon of a blue ball of yarn next to it, you will need to install this package. 

![](http://test-pecan.bu.edu/shiny/GE375/Knit_button.png){width=100px}

- From the "Packages" tab in the bottom right pane click on "Install"
- Enter knitr as the package name, and then click Install
- From the main menu bar, select Session > Restart R
- The Knit button should now be visible

If you do have the Knit button, now you should click it. This will create a new file that has the same name as the Rmd file, except that it ends in `.html`. RStudio should automatically preview this file for you. 
You will see that it contains exactly the same content as the Rmd file, but formatted nicely into an easy-to-read document. You are free to keep reading the assignment in either the Rmd file or the html file, but **you can only edit the Rmd file**. When you edit the Rmd file, your changes will not show up in the html file until you select Knit. **Knitting is the best way to check that all your code runs.**

## Formatting your assignments

For this activity you will turn in an Rmd file that shows your work.
All Rmd documents should start with a section that looks something like this:

```
---
title: "My Lab"
author: "My Name"
output: html_document
---
```
Make sure to give your file a title and put ***your name*** in the author field. Notice that the title and author need to be in quotes. 

In this document the `output` section is more complicated because it also contains the code for the table of contents. You don't need to edit it. 


### Code Answers

All your code should be placed into what we call "R chunks."

In the editor pane you can recognize them because: 

They start with: ` ```{r} ` (the `{r}` is very important because it's what makes it specifically an R chunk, as opposed to a different type of code chunk)

They end with ` ``` `

R code chunks also have in the top right hand corner three important icons (from left to right)
![](http://test-pecan.bu.edu/shiny/GE375/R_chunk.png){width=100px}

1. Optional Settings. Once you get more comfortable using R, they can be useful.
2. Run the code in all chunks up to this one.
3. Run the code in this chunk.

Below is an R chunk. Try hitting the play button. 
You will see that the output appears below the code chunk. 
```{r}
1 + 2
```

### Written Answers 

Any verbal answers should be written in the space between code chunks. You can also include some of your written responses as commented code within the R chunk. Comments start with `#` and are not evaluated by R. Putting comments in code is very useful for remembering what you did when you come back to a file later.

There are multiple ways to format written answers and Rmarkdown is very versatile. What you do comes down to personal preference. 

**Always check that the Rmd file "knits" before submitting!**

### Example of a properly answered question 

1. **If there are 7 cats and 5 dogs, how many animals are there in total?**

```{r}
cats = 7 # there are 7 cats
dogs = 5 # there are 5 dogs
animals = cats + dogs # the sum of cats and dogs gives us the total animals

animals # this will print out the number of animals
```
From the calculations above, we can see that there are 12 animals in total. 

You can also use `` `r `` to reference calculations within your text, which has the advantage that you don't need to update your text if specific numbers change. For example:

We can see that there are `r animals` animals in total.

# R basics

## About R

The R software we are using this semester is open-source statistical software that has gained rapid popularity because of its power and flexibility.  In addition, there are a large number of “packages” for R that have been written by users and are freely downloadable from [CRAN](http://cran.us.r-project.org/) (Comprehensive R Archive Network).  Individual packages do everything from allowing R to interface with supercomputers to solving sudoku puzzles.  They contain most every classical statistical test you're likely to come across as well as interfaces that allow R to interact with a large number of other programs and software libraries.  Unlike many pieces of software you may be familiar with, R is a scripting language.  Usually you will be using R “interactively” which means that the basic mode of operation is to type commands at a command prompt and have it spit back a result, which you'll often want to cut-and-paste elsewhere.  R can also be run in “batch” mode, whereby a file containing a list of R commands is run all at once.  This mode is particularly useful for large analyses that take a long time to run because batch jobs can be submitted to computer clusters.

## What R has to offer

Through your web-browser go to http://www.r-project.org.  

* Please briefly look over the “What is R?” section.

*	Next, go to the “Manuals” section.  This section gives an overview of some of the on-line documentation you may want to use from time to time to gain a greater understanding of how R works and to solve problems you come across.  Some of today's activity is borrowed from the “Introduction to R”.  You are encouraged to read this section later on your own, especially if you have no previous familiarity with R or any other programming language.
*	Next, go to the “Search” section and in the “Google” box type in “mantel test.”  You'll find this gives you a list of different R packages that have different forms of Mantel tests (a test of correlation between two matrices).  Press the back button and this time click on “Searchable mail archives” and again type “mantel test” in the query box.  This time you will see any discussions from the R email list about Mantel tests.  These searches are often very useful if you're looking for an example or have run into a problem (because often someone else has had the same problem).
*	Next, go to the “CRAN” section, and select one of the sites – it doesn't matter which one since they are identical “mirrors” of each other.
*	Click on “Task Views” and then select one of the “tasks” that you find interesting.  You will then be presented with a brief overview of major subtopics within that area and the R packages that are useful for those types of problems.  These summaries are often an efficient way to familiarize yourself with what R has to offer for a specific type of analysis but are not exhaustive because new packages are constantly being added to R and not all types of analysis have a “Task View”.
*	Click on “Packages” and then “Table of available packages” and you will see a long list of all of the submitted R packages.  Look around a bit and then click on one you find interesting.  Here you will see basic info on the package including the “Depends” which is the list of packages that you need in order to use this package.  Click on the PDF “Reference manual”.  You'll find that the R documentation for a package describes the inputs and outputs of all the functions in the package and often provides examples of what code calling this function might look like.  However, most packages don't describe how a test works, its assumptions, or why you might want to use it – you'll usually want to look up info about a test rather than apply it blindly.

# R basics

Let's see how R does some basic arithmetic. Note that in the examples below a pound sign (#) indicates a comment in the code. Everything after a # is not read by R and is just there for the benefit of the person reading the code (so you don’t have to type it all in). Within the Console window (bottom left) type the following at the command prompt symbol '>':

```
3 + 12  		  ## addition
5 - pi			  ## subtraction
2 * 8			    ## multiplication
14 / 5			  ## division
14 %/% 5		  ## integer division
14 %% 5		    ## modulus (a.k.a. remainder)
```
 
The window should show

```{r, echo=TRUE}
3 + 12    	  ## addition
5 - pi			  ## subtraction
2 * 8			    ## multiplication
14 / 5			  ## division
14 %/% 5		  ## integer division
14 %% 5		    ## modulus (a.k.a. remainder)
```

The number [1] before the answers just means that this item is the first element of a vector (vectors can be thought of as a collection of related values, such as a column in a data table). We can also see that R knows the value of common constants (e.g. π) and can use them like numbers. 

Make sure you understand what each operator (+, -, *, /, %%, %/%) does before you proceed. See this link https://www.tutorialspoint.com/r/r_operators.htm for more details. 
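
As a quick check on how integer division and modulus fit together, the sketch below rebuilds 14 from its quotient and remainder. The variable names `quotient` and `remainder` are made up for this illustration.

```r
quotient  = 14 %/% 5   # how many whole times 5 goes into 14
remainder = 14 %% 5    # what is left over
quotient               # 2
remainder              # 4

# Putting the two pieces back together recovers the original number
5 * quotient + remainder == 14   # TRUE
```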

## Functions

R functions can be thought of as small programs that take in input values, referred to as the function "arguments," and perform certain tasks. These tasks may be to calculate a statistical value or to create a plot. 

You will recognize R functions because they have a name that is followed by a set of parentheses that contain their arguments.  In R there are many functions that can take more than one argument, and these arguments are always separated by commas.

For example there is a function in R for taking a square root: `sqrt()` 

```{r}
sqrt(25)  ## square root
```

There is a function for taking the factorial of a number: `factorial()`
```{r}
factorial(10) ## factorial

# This is much easier than:
10*9*8*7*6*5*4*3*2*1
```

Some of the other common mathematical functions in R that you should try running are:

### Exponentials and Logarithms 

```
exp(10)   ## exponential function exp(1) == 2.718282... (Euler's number, e)
log(10)   ## natural logarithm (i.e. ln)
log10(10) ## log base 10
log2(10)  ## log base 2
```
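
One way to build intuition for these functions is to feed them values whose answers you can predict by hand. This is only a small sketch; the names `check_log10`, `check_log2`, and `check_inverse` are hypothetical.

```r
check_log10   = log10(1000)    # 10^3 = 1000, so this prints 3
check_log2    = log2(8)        # 2^3 = 8, so this prints 3
check_inverse = exp(log(10))   # exp() undoes log(), so this prints 10

check_log10
check_log2
check_inverse
```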

### Trigonometry 

```
sin(pi/2)           ## sine
cos(pi)             ## cosine
tan(pi/4)           ## tangent
asin(0.5)           ## arc-sine
acos(0.5)           ## arc-cosine
atan(1)*180/pi      ## arc-tangent
atan2(-1,-1)*180/pi ## arc-tangent (alternate version)
```
Note that the atan2 function takes TWO arguments. 

### Function Arguments

In later labs we will learn about writing functions. 

For now, it is important that you understand the concept of **function arguments**. 

If we specifically wanted to look up the arguments for `atan2()`, we can type

```{r}
args(atan2)
```

You may notice that the arguments have names, "y" and "x", but we can run the function `atan2(1,2)` without ever using these names! 

```{r}
atan2(1,2)
```

This is because R is smart enough to do **"argument matching"** by **position** and automatically knows that since the documentation says `atan2(y,x)`, `y = 1` and `x = 2`.

Argument matching by position is convenient, but sometimes is unclear. It requires that you always remember the order of all the arguments. This can become very difficult! It is good practice to be **explicit** and always name your arguments. Added bonus: when you name your arguments, order no longer matters! 

```{r}
atan2(y = 1, x = 2) 
# Here I change the ORDER of the arguments but they are named!
# So the result stays the same 
atan2(x = 2, y = 1)  
```
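
To see why naming matters, the sketch below compares a positional call, a call with the two values accidentally swapped, and a named call. The variable names are invented for this example.

```r
by_position = atan2(1, 2)          # matched by position: y = 1, x = 2
swapped     = atan2(2, 1)          # values swapped: y = 2, x = 1, a different angle
by_name     = atan2(x = 2, y = 1)  # names decide the match, so order is irrelevant

by_position == by_name   # TRUE
by_position == swapped   # FALSE
```

Swapping unnamed arguments does not raise an error, it silently changes the answer, which is why explicit names are the safer habit.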

## Function Help

To get even more information about a function in R you can precede the function name with a ?.  So if we wanted to learn more about `atan2`, we could type

```
?atan2
```
 
The rest of this activity will often list the names of functions you might find useful, and ? can be used to find out about the syntax of these functions.

If on the other hand you are looking for a command (e.g. because you've forgotten its name) you can use help.search. For example:

```
help.search("arctan")
```
 
This will return a list of relevant functions, with the parentheses after each name indicating what package it is in.  `help.search` will only search packages you have already installed, not all the ones that exist -- you'll want to use CRAN to find new packages.  

Within RStudio you can also search for help with commands in the **Help** tab in the bottom-right window. This search window combines the functionality of both ? and help.search.

To get new functions in packages that you do not currently have installed, you will have to look up the package you want to use and then install it. Packages are installed using the ‘Install Packages’ button under the “Packages” tab or by using the `install.packages` function on the command line. Packages are loaded by clicking the check box next to a package's name or by using the `library` function at the command line.  Packages only need to be installed once but need to be loaded every time you start R (hint: you'll probably want to list library commands near the top of your script files [described below]).
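
As a rough sketch of the install-once, load-every-session pattern: the install line is commented out because it needs an internet connection, and the `stats` package is used for loading only because it ships with R (substitute whichever package you actually need).

```r
# install.packages("knitr")    # run once per computer; needs internet access

library(stats)                 # load a package for the current session
"package:stats" %in% search()  # TRUE once the package is attached
```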

### Auto-complete is your friend!

RStudio has the capacity to auto-complete function names, function arguments, and file names. 

For example, if you type `read.t` and then hit TAB, RStudio will finish typing `read.table` and it will also show what information you can specify for the `read.table` function.  

![](http://test-pecan.bu.edu/shiny/GE375/auto_complete_1.png){width=500px}

If you type `read.table(` and then hit TAB, RStudio will allow you to select the function argument that you want to fill in. 

![](http://test-pecan.bu.edu/shiny/GE375/auto_complete_2.png){width=500px}

If you type `read.table("` and then hit TAB, RStudio will show you the files in your current working directory and allow you to select one. 

![](http://test-pecan.bu.edu/shiny/GE375/auto_complete_3.png){width=300px}

If there are a lot of files in the directory, you can start typing the file name you want and then hit TAB again and RStudio will limit what it shows to just those files that match what you’ve typed so far.  

![](http://test-pecan.bu.edu/shiny/GE375/auto_complete_4.png){width=300px}

## ★ Questions (1-3)

1. [A] **Evaluate the following:**
   - a.	ln(1)
   - b.	ln(0)
   - c.	ln(e)
   - d.	ln(-5)
   - e.	-ln(5)
   - f.	ln(1/5)
   - g.	How does R represent when the output of a function is not a number?

2.	[B] **What is the difference between log and log10?**  (Hint: use help!)

3.	[A] **Pythagorean theorem**
  - a.	Given a right triangle with sides `a` and `b`, write a few lines of code that will calculate the length of the hypotenuse. Make sure to use variables in this calculation, not hard-coded numbers.
  - b.	Try out your code with `a=5` and `b=13`.


## Sequences

One of the most common tasks in programming is generating a sequence of numbers. 

There are two different ways to make sequences in R and it is important not to get them confused:

```{r}
# The first way generates a vector of ten integer values from 1 to 10 using a :
x1 = 1:10
x1

# The second way generates a vector of ten values from 1 to 10 using the seq() function
x2 = seq(from = 1, to = 10, by = 1)
x2
```

Why have two different ways to do the same thing?

Because the `seq()` function is much more powerful than `:`. 

`:` can only take steps of 1 (or -1 when counting down, as in `10:1`), but with `seq()`, you can have steps of any size:

```{r}
# Here we set the step size to 0.5
seq(from = 1, to = 10, by = 0.5) 
# Instead of specifying step size, we specify the length of the sequence 
# and get the same thing!
seq(from = 1, to = 10, length.out = 19) 
```
 
In all cases you need to provide the first value in a sequence, and after that you need to provide some combination of step size (by), length (length.out), and finishing value (to). This is a good example of why using argument names is important! 
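
The different combinations can be sketched as follows. The numbers and the names `s_by_len`, `s_to_len`, and `s_down` are arbitrary choices for illustration.

```r
s_by_len = seq(from = 0, by = 2, length.out = 5)    # start + step + length
s_to_len = seq(from = 0, to = 1, length.out = 5)    # start + end + length
s_down   = seq(from = 10, to = 0, by = -2.5)        # start + end + negative step

s_by_len   # 0 2 4 6 8
s_to_len   # 0.00 0.25 0.50 0.75 1.00
s_down     # 10.0 7.5 5.0 2.5 0.0
```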

```{r}
# The function rep() repeats a value a specified number of times
rep(1,10)
```
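
`rep()` also accepts vectors, and its named arguments `times` and `each` control how the repetition is laid out. A small sketch, with made-up variable names:

```r
whole_vector = rep(c("a", "b"), times = 3)   # repeat the whole vector three times
each_element = rep(c("a", "b"), each = 3)    # repeat each element three times

whole_vector   # "a" "b" "a" "b" "a" "b"
each_element   # "a" "a" "a" "b" "b" "b"
```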

## Vector math

Imagine we had a vector x = 1:10 and we wanted to add 5 to every value. We could do this manually, one element at a time, but R also allows us to do this more simply as
 
```{r}
x = 1:10
x + 5
```

which adds 5 to each element. We can also multiply, divide, and subtract with vectors, and the operations are applied to each element

```{r}
5*x
x/5
x-5
```

Furthermore, if we have two or more vectors, and we perform mathematical operations on them, the operations are performed element by element

```{r}
y = 10:1
x*y
x-y
x/y
```

Most functions in R can also be applied to vectors of data, not just individual data points.  Indeed, many only make sense when applied to vectors, such as the following that calculate sums, first differences, and cumulative sums.
 
```{r}
sum(1:10)     ## sum up all values in a vector
diff(1:10)    ## calculate the differences between adjacent values in a vector
cumsum(1:10)  ## cumulative sum of values in a vector
prod(1:10)    ## product of values in a vector
```
```
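
To see how these functions relate to one another, here is a short worked example on a small vector. The vector `v_demo` and its values are invented for illustration.

```r
v_demo  = c(3, 1, 4, 1, 5)
running = cumsum(v_demo)   # running totals
running                    # 3 4 8 9 14

# diff() undoes cumsum(), except that the first element is lost
diff(running)              # 1 4 1 5

# The last running total equals the overall sum
sum(v_demo) == running[length(running)]   # TRUE
```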
 
## ★ Questions (4-5)

4.	[A] **Generate a sequence of even numbers from -6 to 6**
5.	[B] **Generate a sequence of values from -4.8 to -3.43 that is length 8** (show code)

## R Scripts

Now click on File > New File > R Script to open up a script window.  It is often useful to work on R from a script window because it provides a record of what you did in your analysis and can be reused for similar analyses.  It is particularly essential for more complicated analyses.  In the script window type
 
```
x = 1:10
x
#y
```
 
Unlike at the command prompt, nothing happens when you hit return at the end of a line. Highlight the code and hit the Run button.  At the command line you should see 
 
```{r}
x = 1:10
x
#y
```
 
This example also demonstrates that the comment character '#' also works in scripts.  Putting comments in scripts is very useful for remembering what you did when you come back to a file later.  For the rest of this activity we'll use boxes to indicate text to be typed in and run, and will use > to indicate that it should be typed on the command prompt and no prompt to indicate that you probably want to type it in a script instead.

Finally, it is useful to understand the difference between R scripts (.R files) and R Markdown (.Rmd files). Scripts are older and just contain code and '#' comments. They are very useful for defining functions and running automated analyses. R Markdown is a newer technology that combines code with text and rendered figures, tables, etc. R Markdown is particularly useful for reproducible research (e.g. analyses contained within a paper or report), for labs and tutorials -- basically any time you want to combine text with code to produce some sort of report or webpage.

# Data!

## R Data Types

There are many different types of data, and it is important that you not only understand the type of data you are using, but that you clearly define the data type in R so that R can properly process it. 

Below are all the basic data types in R:

- **Numeric** (numbers with decimals): `12.3, 5, 999`

- **Integers** (whole numbers): `0, 2, 100`

- **Character** (sometimes called strings, always surrounded in quotes - either single or double): `"hello", 'a', "23.4"` Notice that once numbers are inside quotation marks, they become characters and cannot be used for computation.

- **Logical** (represents true and false, yes and no, 1 and 0, also called boolean): `TRUE, FALSE` Notice that these are NOT in quotes. 

- **Factors** (a more complex data type that helps categorize data and store it as levels)
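
You can ask R which type it has assigned to a value with the `class()` function. The sketch below checks one example of each type from the list above. Note that a bare whole number such as `5` is stored as numeric unless you add the `L` suffix.

```r
class(12.3)     # "numeric"
class(5)        # "numeric": a bare whole number is still stored as a decimal
class(5L)       # "integer": the L suffix asks for a true integer
class("23.4")   # "character": quotes turn the number into text
class(TRUE)     # "logical"
class(factor(c("red", "green", "red")))   # "factor"
```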

## R Data Objects 

![](http://test-pecan.bu.edu/shiny/GE375/Data_Objects.png){width=600px}

We have already seen how to create scalars, as well as vectors that contain simple sequences. More generally you can create vectors using the `c( )` function that “combines” scalars of the same data type into a vector. You use it like this:
 
```{r}
x=c(1,7)
x
y=c(10:15,3,9)
y
z = c(TRUE,FALSE,TRUE)
z
```

You can also use `c` to combine vectors
```{r}
c(x,y)
```

But if you try to combine data of different types, R will try to convert them all to a single type (which isn't always what you want to happen)
```{r}
c(x,y,z)
```
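
The conversion follows a fixed order: logical values can become numbers, and anything can become character. A minimal sketch with made-up values:

```r
num_and_logical = c(1, TRUE, FALSE)   # logical becomes numeric
num_and_logical                       # 1 1 0

with_character = c(1, "a", TRUE)      # everything becomes character
with_character                        # "1" "a" "TRUE"
class(with_character)                 # "character"
```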
 
We can also combine vectors to build up matrices and data frames by “binding” them together either as rows or as columns

```{r}
p = 1:10
q = 10:1
cols = cbind(p,q)  	# bind as columns
cols
rows = rbind(q,p)		# bind as rows
rows
```

`cbind` and `rbind` can also be applied to existing data frames, for example to add another column to an existing data frame or to take two data sets with the same columns and bind them together by row to make a larger data set.
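
As a sketch of both uses, the example below stacks two small data frames by row and then adds a column. The data frames `site_a` and `site_b` and their contents are hypothetical, and they are built with the `data.frame` function.

```r
site_a = data.frame(plot = 1:2, height = c(10.5, 12.0))
site_b = data.frame(plot = 3:4, height = c(9.8, 11.1))

both = rbind(site_a, site_b)                           # stack rows: same columns required
both = cbind(both, wet = c(TRUE, FALSE, TRUE, TRUE))   # add a column: same number of rows required

dim(both)     # 4 3
names(both)   # "plot" "height" "wet"
```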

You can also build matrices and data frames using the `matrix` and `data.frame` functions
```{r}
x = matrix(1:25,nrow=5,ncol=5)
x
y = data.frame(a = 1:3, b = seq(5,by=2.5,length.out=3), z=z )
y
```
You can see that in a matrix R fills in the table by columns by default (set `byrow = TRUE` to fill by rows instead) and all the data is the same type. By contrast, with a data frame the component vectors can be different types but they need to be the same length. The `data.frame` function also lets you name the function arguments (almost) anything you want, and then uses those names as the column names in the table. Indeed, the syntax `z=z` meant "create a column named z and fill it with the contents of the variable z". The `list` function operates very similarly to `data.frame` but allows different elements to be different data objects and different sizes.

```{r}
mylist <- list(a = 1,
               b = c(TRUE,FALSE,TRUE),
               c = matrix(1:9,nrow = 3,ncol = 3),
               d = "Australia"
               )

mylist
```
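
Going one small step beyond the text above, individual elements of a list can be pulled back out by name using `$` or `[[ ]]`. The sketch below uses a separate, hypothetical list called `example_list` so that it does not overwrite `mylist`.

```r
example_list = list(a = 1,
                    b = c(TRUE, FALSE, TRUE),
                    c = matrix(1:9, nrow = 3, ncol = 3),
                    d = "Australia")

names(example_list)      # "a" "b" "c" "d"
example_list$d           # "Australia"
example_list[["b"]][2]   # FALSE: second value of element b
example_list$c[2, 3]     # 8: row 2, column 3 of the matrix
```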
   
## ★ Questions (6)

6.	[A] **Create a character vector that contains the names of 4 people you admire.**


## Loading and Saving Data

In practice, most of the time that we create vectors, matrices, data.frames, and lists we don't do it by hand, but rather we do it by loading in data that we want to manipulate, visualize, and analyze. This section focuses on how to get data into R.

### Environment 

When you assign a value to a variable you are saving that data into the **Environment.**

```{r}
new_v = 3
```

If you run the code in the R chunk above, you should see the `new_v` variable appear in the upper right window under the Environment tab. 

You can always see what variables you currently have defined in the Environment window or by using the command
  
```{r}
ls()
```
 
Within the Environment window, clicking on a variable will show you the contents in a spreadsheet-like format in the Scripts window. Or if the variable has an arrow by it, you can click on it to expand it and see a preview of what it looks like.

![](http://test-pecan.bu.edu/shiny/GE375/Environment.png){width=300px}

When we perform data analysis in R, **we do not open and edit the data files themselves.** This may seem counterintuitive if you are used to using programs like Excel where you start by opening your data.  

Instead, we **load the data** in to R, i.e. we assign the data to a variable in the environment and then we do our analysis with that variable. This is good scientific practice because it means that we can never accidentally change our raw data. 

Note that if you just click on the file from within the **Files** tab, or try to open the file from File > "Open File...", RStudio may automatically open the file in a new editor tab or give you the option to "View File", but it **doesn't load the data into R in any way we can use.** You will know that you did not successfully load anything because the Environment will not change!

![](http://test-pecan.bu.edu/shiny/GE375/View_Import_Data.png){width=300px}


### Loading ASCII Text Files

To properly load data, you need to know what format file you have. There are a number of ways to get information into and out of R, but the simplest is in ASCII text formats, such as tab-delimited (.txt) or comma-separated (.csv).  It's usually straightforward to export data in one of these formats from most any program (e.g. Excel).

Let's begin by opening the "Lab1_frogs.txt" file in the "data" folder. This data and some of the examples below come from Ben Bolker's handy book "Ecological Models and Data in R". 

Run the code below and then check for the dat variable in the Environment! Has it appeared?

```{r}
dat = read.table("data/Lab1_frogs.txt",header=TRUE)
```
 
The second argument, `header=TRUE`, informs R that your data file has column headers that should be read rather than treated as just another line of data.  For the above command to work **R has to be looking at the correct folder**.  You can find out what folder R is currently looking at (its “working directory”) using `getwd` and you can change that directory using `setwd` or within the “Files” window tab under **More > Set As Working Directory**.  
 
CODING TIP: Programmers often use the word **directory** instead of **folder**. The idea of "folders" didn't come around until computers had graphical user interfaces that used pictures to visualize where files were. 

```
getwd() # print out my current directory (or folder)
dir()   # what files are in my current directory?
```
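
If `read.table` complains that it cannot open the file, the working directory is the usual suspect. One way to check before loading is sketched below. The names `frog_path` and `frogs_demo` are made up for this example, and the message text is only a suggestion.

```r
frog_path = file.path("data", "Lab1_frogs.txt")   # builds "data/Lab1_frogs.txt"

if (file.exists(frog_path)) {
  frogs_demo = read.table(frog_path, header = TRUE)
} else {
  # Report where R is actually looking so the path can be corrected
  message("Cannot find ", frog_path, " from ", getwd())
}
```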

### Saving to ASCII Text Files
      
For saving ASCII data there is an equivalent command, `write.table`. Type `?read.table` and `?write.table` to learn more about these functions. While `read.table` and `write.table` can read and write CSV (comma separated value) files by just specifying the `sep` argument as `sep = ","` (i.e. a comma in quotes), there are also predefined functions `read.csv` and `write.csv`.
 
```{r}
write.table(dat,"my_frogs.csv",row.names=FALSE,sep=",")   	## save as CSV
```
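
A quick way to convince yourself that a file was written correctly is to read it straight back in. The sketch below does this with `write.csv` and `read.csv` on a small, invented table, and writes to a temporary file so that nothing in your project folder is touched.

```r
small_tbl = data.frame(id = 1:3, mass = c(2.5, 3.1, 4.0))
csv_path  = tempfile(fileext = ".csv")     # throwaway file location

write.csv(small_tbl, csv_path, row.names = FALSE)
round_trip = read.csv(csv_path)

all.equal(small_tbl, round_trip)   # TRUE if nothing changed on the way
```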
 
### The .RData File Format

R also has a built-in binary data format that is good for saving results for later use in R and can store any number of data types of different shapes and sizes (not just single tables). Thus, for example, the .RData format could be used to save multiple data sets associated with a particular project. 

The function `save` is used to make .RData files. This command saves any number of data objects, separated by commas, that come before the `file =` argument, which tells the function the name of the file you want to write to. Here we will save many of the variables we have created so far in this lab in one .RData object!

```{r}
save(dat,x,y,z,file="Lab1.RData")
```
 
This data can then be reloaded at a later time, or on a different computer, using “load”
 
```{r}
load("Lab1.RData")
```
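
The full save, remove, reload cycle can be sketched as follows. The objects `demo_a` and `demo_b` are invented for this illustration and are saved to a temporary file. `load` puts the objects back under their original names and returns those names.

```r
demo_a = 1:3
demo_b = "frog"
rdata_path = tempfile(fileext = ".RData")

save(demo_a, demo_b, file = rdata_path)
rm(demo_a, demo_b)            # remove them from the Environment

restored = load(rdata_path)   # brings both objects back
restored                      # "demo_a" "demo_b"
demo_a                        # 1 2 3
```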
 
There is also a “save.image” function that saves every variable you have defined so far

```{r}
save.image("Lab1_all.RData")
```

These commands will be very helpful if you don’t finish an activity by the end of the period and want to take your whole R workspace home, for saving work in progress, or archiving results of analyses. When you quit R you will be asked if you want to save your workspace and if you answer ‘y’, then `save.image` is called by default and will save to a file simply named “.RData” which is automatically loaded the next time you start R.  While this is convenient, I actually recommend against it and suggest using `save` or `save.image` explicitly instead because otherwise it is very easy to accidentally use variables and data sets defined in previous analyses, or to be unsure which version of an analysis you’re working with.

### Other File Formats 

Finally, while we won’t use these explicitly in this class, there are a large number of other options for getting data in and out of R in specific formats (e.g. GIS data, image data, etc) and ways to connect R to data sources more dynamically (e.g. SQL databases) that can be particularly useful when dealing with large data sets. The **R Data Import/Export** manual on the R website is a place to start to learn more about moving data in and out of R.

## ★ Questions (7-8)

7. [A] Use R commands (i.e., do not click "import dataset" and use the dialog there) to read in the file `met_hourly.csv` (which is in the `data` directory) and assign it the variable name `met`

8. [A] Save the matrix `x` to a csv file


## Data Exploration

One of the first things you’ll do with any data set when you first load it up is some basic checks to see what you are dealing with.  Typing the variable name will show you its contents, but if you just loaded up something with a million entries then you’ll sit for a long time as R lists every number on the screen. The function class will tell you the type of data you’ve just loaded.  
 
```{r}
class(dat)
```
 
In this case the data is in a “data.frame”, which is like a matrix but can also contain non-numeric data.  The basic data types in R are integer, numeric (decimal), logical (TRUE/FALSE), and character, along with factors, which store categorical data.  Character data in R is usually displayed in double quotes to indicate that it is character data (e.g. the character “1” rather than the number 1). When writing character data in R (e.g. file names in `read.csv`) it is necessary to put it in quotes as well (either single or double) so that R can tell the difference between character data and the names of variables and functions. By contrast, R usually reads character data from files correctly even if the data isn’t in quotes. In addition to the basic data types in R, there are a wide variety of derived data types built up from these basic types that are used for a wide range of purposes. A common example is the Date type, which can be useful for analyzing and plotting data through time, and which you can learn more about by looking at the help for **as.Date** and **strptime**.
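
As a brief illustration, `class` can be used to check each of the types described above, and `as.Date` builds a Date from a character string. The values and the name `demo_day` are made up for this sketch.

```r
# class() reports the type of each kind of value
class(3L)        # "integer" (the L suffix marks an integer)
class(3.14)      # "numeric"
class(TRUE)      # "logical"
class("1")       # "character"
class(factor(c("red", "blue")))   # "factor"

# A derived type: convert a character string to a Date
demo_day = as.Date("2013-07-04")
class(demo_day)                 # "Date"
demo_day + 30                   # dates support arithmetic in days
format(demo_day, "%B")          # month name (depends on your locale)
```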

At the most basic level R organizes data into vectors of data of a given data type and each column of a data.frame consists of a vector.  R also has a matrix data type, which must contain data of a single basic data type (usually numeric).  It is important to be aware of data types because certain operations can only be applied to certain data types (e.g. matrix multiplication with `%*%` works on two matrices but not on two data frames).
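
The following sketch shows why the distinction matters, using a made-up 2 x 2 example (`demo_mat` and `demo_df` are hypothetical names). The failing call is wrapped in `try` so that the code keeps running.

```r
# Hypothetical 2 x 2 example
demo_mat = matrix(1:4, nrow = 2)           # a matrix: one data type throughout
demo_df = as.data.frame(demo_mat)          # the same values as a data frame

demo_mat %*% demo_mat                      # matrix multiplication works on matrices

# The same operation on data frames fails, so we catch the error to show it
demo_try = try(demo_df %*% demo_df, silent = TRUE)
class(demo_try)                            # "try-error"

# Converting back to a matrix makes the operation available again
as.matrix(demo_df) %*% as.matrix(demo_df)
```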
 
```{r}
str(dat)
```
 
will tell you the basic structure of the data.  From this we learn that there are four columns of data named “frogs”, “tadpoles”, “color”, “spots” and that there are 20 rows of data, and that the data is numeric for the first two, a factor for the third, and logical for the fourth.  If we didn’t need all this information 
 
```{r}
names(dat)
```
 
will tell you the names of the columns in your data frame.
```{r}
dim(dat)
```
**dim** will tell you the dimensions of the data, in this case [1] 20 4 which means we have 20 rows and 4 columns.  Each of these pieces of information is accessible individually using **nrow** and **ncol**. 
```{r}
nrow(dat)
ncol(dat)
```
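Note that `dim` is not useful on a single vector (e.g. `dim(x)` returns `NULL`), but **length(x)** will tell you the length of a vector. Conversely, `length` should generally not be used to get the size of a matrix or data frame, because it returns the total number of elements of a matrix and the number of columns of a data frame. 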

```{r}
x = 1:10 
dim(x) # dim returns NULL for a vector
length(x) # length works for a vector
```
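
To make the difference between `dim` and `length` concrete, here is a small sketch on made-up objects (`demo_m` and `demo_d` are hypothetical names):

```r
demo_m = matrix(1:6, nrow = 2, ncol = 3)   # hypothetical 2 x 3 matrix
dim(demo_m)      # 2 3
length(demo_m)   # 6: the total number of elements, not rows or columns

demo_d = data.frame(a = 1:4, b = 5:8)      # hypothetical 4 x 2 data frame
dim(demo_d)      # 4 2
length(demo_d)   # 2: the number of columns
is.null(dim(1:10))   # TRUE: a plain vector has no dimensions
```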

The functions `head` and `tail` show just the first and last few lines of a data set, respectively.

```{r}
head(dat)
tail(dat)
```

The `summary` function will give you basic summary statistics on a data set.

```{r}
summary(dat)
```

## ★ Questions (9)

9. [A] Use the commands above to explore the contents of `met`. 
   [B] Describe in words what this file contains and some of its characteristics.

## Subsetting Data

Data within vectors, matrices, and data frames can be accessed using [ ] notation.  Subsets of data can also be accessed by specifying just rows, just columns, or ranges within either.  These are often referred to as subscripts or indices; for matrices and data frames the first is the row number while the second is the column.
 
```{r}
x = 1:10
x[5]  	   # select the 5th element only
dat		     # select the entire data frame
dat[5,1]	 # select the entry in the 5th row, 1st column
dat[,2]	 	 # select all rows of the second column
dat[1:5,]	 # select rows 1 through 5, all columns
dat[6:10,2:3] 	 # select rows 6 through 10, columns 2 and 3
```
 
We can also refer to specific columns of data by name using the $ syntax

```{r}
dat$frogs  	# show just the ‘frogs’ column
dat$color[6:10]		# show the 6th through 10th elements of the color column
met = read.csv('data/met_hourly.csv')
met$AirTemp[1:10] # show the first 10 values of air temperature
```

In general, it is better to **access data by name**, rather than using the row and column numbers, because this makes your code easier to understand and debug, making the process of coding less error prone. It also makes it much easier to adapt your code to new situations or data sets, where the columns might not come in the same order, or the data might not have the same number of rows or columns. This highlights a more general point, that you should use variables to represent names and numbers, especially if those names and numbers are reused, rather than ‘hard coding’ numeric values into the code.
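
As a small sketch of why names are safer than positions, the block below stores the same made-up data with its columns in two different orders. The names `demo_a`, `demo_b`, and `count_col` are hypothetical.

```r
# Hypothetical data with the same columns stored in two different orders
demo_a = data.frame(frogs = c(2, 5), tadpoles = c(7, 9))
demo_b = demo_a[, c("tadpoles", "frogs")]   # same data, columns reordered

# Store the column name once in a variable instead of hard coding a position
count_col = "frogs"

demo_a[, 1]            # position 1 is frogs here...
demo_b[, 1]            # ...but tadpoles here
demo_a[, count_col]    # selecting by name returns frogs in both cases
demo_b[, count_col]
```

Code written with `count_col` keeps working if the column order changes, and switching the analysis to a different column only requires changing one line.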

Another useful feature of R is that in addition to using positive indices to display specific rows, we can also use *negative* indices to remove specific rows. For example, if I wanted to drop the first 3 rows from the frog data:
```{r}
dat[-(1:3),]
```

Finally, our ability to access data is not restricted to consecutive rows and columns, and vectors can also be used for indexing other vectors. For example:
 
```{r}
x = c(1,10,100,1000)
met$time[x]  		## return specific elements of a vector
```
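
The same ideas can be seen on a short made-up vector (`demo_v` is a hypothetical name), where it is easy to check the results by eye:

```r
demo_v = c(10, 20, 30, 40, 50)   # hypothetical values
demo_v[c(1, 3, 5)]               # elements 1, 3 and 5: 10 30 50
demo_v[-c(1, 3, 5)]              # everything except those: 20 40
demo_v[c(5, 1)]                  # indices need not be in order: 50 10
```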

## ★ Questions (10-12)

10. [A] **Show the 1033rd through 1056th row of weather data**

11. [B] **Show the 3rd through 8th rows of the 1st through 3rd columns of the frog data**

12. [A] **What was the _total_ amount of rain?** Hint: look back at Vector Math


## Converting Data Types

Sometimes we need to convert data from one data type to another. Most often this occurs when R reads data in as a different type than what we need. For example, if R has some numbers represented as character data, we can't actually use them in calculations.

```{r}
x = c("3.14","2.10","42")
```
If, for example, we tried to multiply `x*2` R would return an error message
```
Error in x * 2 : non-numeric argument to binary operator
```
We can fix this by asking R to convert `x` to numeric first
```{r}
x = as.numeric(x)
x * 2
```
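
One thing to watch for, sketched below on a made-up vector (`demo_chr` is a hypothetical name): entries that cannot be read as numbers are converted to `NA`, and R issues a warning rather than an error.

```r
demo_chr = c("3.14", "n/a", "42")             # hypothetical input with one bad entry
demo_num = suppressWarnings(as.numeric(demo_chr))  # without suppressWarnings, R warns about NAs
demo_num                                      # 3.14 NA 42
is.na(demo_num)                               # flags which entries failed to convert
sum(demo_num, na.rm = TRUE)                   # skip the NA when summing
```

Checking `is.na` after a conversion is a quick way to find the entries that need cleaning.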

R has conversion functions for all of its basic data types, as well as many more advanced derived data types.
```{r}
as.character(dat$color)  ## make sure colors are character
as.numeric(dat$spots)    ## from logical to numeric
as.logical(0:1)          ## from numeric to logical
as.POSIXlt(met$time[1:24],tz = "GMT") ## date time conversion
```

The last example is more complicated and reflects that R has some fairly sophisticated tools for handling dates and times. In this case we're converting the _character_ column `time` to one of R's datetime representations. In general it is much more handy to convert dates and times to datetime variables, rather than leaving them as character strings, as there's a lot more we can do with them that way (e.g. extract specific days, times, months, etc).
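
As a sketch of what a datetime variable makes possible, the block below uses two made-up timestamps (`demo_time` is a hypothetical name, and the "YYYY-MM-DD HH:MM:SS" layout is an assumption about the input).

```r
# Hypothetical timestamps in the "YYYY-MM-DD HH:MM:SS" layout
demo_time = as.POSIXct(c("2013-01-15 06:00:00", "2013-07-04 18:30:00"), tz = "GMT")

format(demo_time, "%m")     # month as a two-digit string: "01" "07"
format(demo_time, "%H")     # hour of day: "06" "18"
as.Date(demo_time)          # drop the time of day, keep the date
diff(demo_time)             # time elapsed between the two timestamps
```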

Furthermore, R has conversion functions for *data structures* as well, if you ever need to convert one data structure into another:
```{r}
as.matrix(x)
as.data.frame(x)
as.list(x)
```

## ★ Questions (13)

13. [A] **Using the frog data, show just the spots column as characters**

# Logical operators and indexing

## Logical operators

R can perform standard logical comparisons, which can be very useful for comparing and selecting data. It's important to know the syntax for the different logical operators, some of which are odd:

```
>	  # greater than
<	  # less than
>=	# greater than or equal to
<=	# less than or equal to
==	# equal to (TWO equals signs...you were very close!)
!=	# not equal
```

As a simple example you could compare individual numbers:
 
```{r}
1 > 3
5 < 7
4 >= 4
-11 <= pi
log(1) == 0
exp(0) != 1
```

You can also combine multiple logical operators using the symbols for ‘and’ (&) and ‘or’ ( | )

```{r}
w = 4
w > 0 & w < 10
w < 0 | w > 10
```

You can also apply logical operators to vectors and matrices. When you type a "logical" expression like "y > x" in R you get a TRUE/FALSE answer of the same shape as the inputs. For example, if we wanted to find all the ponds where there was a high density of tadpoles (which we'll define as greater than 5) we could do this as:

```{r}
z = dat$tadpoles > 5
z
```

You will notice that by default logical operations are performed element-by-element. If you want to apply a logical test to a whole vector at a time you can use the function **any** to test if any of the values are true and **all** to test if all values are true

```{r}
any(dat$tadpoles > 5) # At least one of the values of tadpoles is larger than 5 so this will be TRUE
all(dat$tadpoles > 5) # Not all of the values of tadpoles are larger than 5 so this will be FALSE
```

Note that when your data are character strings you'll need quotes in your comparison, e.g. 

```{r}
a = c("north","south","east","west")
a == "east"
```

## Logical vectors

A "logical vector" (i.e. a vector made up of logical values) can be a particularly useful tool when selecting and subsetting data. 

Take for example the logical vector that we created earlier:

```{r}
z = dat$tadpoles > 5
z
```

For each value of tadpoles, we are checking if it is larger than 5.

But which values in tadpoles are larger than 5?

By looking at the data we can see that the 9th through 20th rows are `TRUE` and the other values are `FALSE`. But if the data were much longer, it would **not** be reasonable to look at it to find the answer. For example, in our met data, we might want to know how many hours had temperatures > 25C, which is impractical to do by hand as there are `r nrow(met)` rows of data.

Instead we can use the function `which` for this very purpose: the `which` function returns the *indices* of the `TRUE` values in a logical vector. 

```{r}
z = which(met$AirTemp > 25)
length(z)  ## how many values meet this criterion
head(z)    ## show the first few examples
```

You can also sometimes treat `TRUE` and `FALSE` just like 1 and 0, which can be very useful. If you want to count the number of values of AirTemp that are larger than 25, you can use the `sum` function:


```{r}
hot = sum(met$AirTemp > 25)
hot
```
Recall that our met data is hourly for the year 2013, so this indicates that a total of `r hot` hours were over 25C at this site (Lake Sunapee, NH) in this year.

Logicals don't always work exactly like 0 and 1 in some situations, so be careful. You can always convert them explicitly with `as.numeric` if need be.
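
A short sketch on made-up temperatures (`demo_temp` and `demo_hot` are hypothetical names) shows the most common ways logical values are used as numbers:

```r
demo_temp = c(12, 27, 31, 18, 26)   # hypothetical temperatures (Celsius)
demo_hot = demo_temp > 25           # FALSE TRUE TRUE FALSE TRUE

sum(demo_hot)          # 3: number of TRUE values
mean(demo_hot)         # 0.6: proportion of TRUE values
as.numeric(demo_hot)   # explicit conversion: 0 1 1 0 1
which(demo_hot)        # positions of the TRUE values: 2 3 5
```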
 
## Subsetting using logical vectors

So far we've seen that we can use logical vectors and their indices to identify conditions of interest and to count things, but frequently we need the ability to subset data based on criteria. For example, we might want to pull all the data for 'hot' days into a separate data frame for further analyses (were those days sunnier than average? less rainy?)

You also need to know that logical vectors can be used as indices for other vectors of the same length. Commonly, you'll use them as indices to one of the vectors that produced them.

For example, if we want the rows of `met` where the air temperature is larger than 25:

```{r}
hot = met$AirTemp > 25 # create a logical vector 

hotData = met[hot,] # use the logical vector to select the `hot` rows in met
```

Or, skipping the middleman:

```{r}
hotData = met[met$AirTemp > 25,]
summary(hotData)
```

These simple comparisons can provide a powerful means for subsetting data.
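
As a sketch of how comparisons can be combined when subsetting, the block below uses a miniature made-up weather table. `demo_met` is a hypothetical stand-in, not the real `met` data.

```r
# Hypothetical miniature weather table
demo_met = data.frame(AirTemp = c(12, 27, 31, 18, 26),
                      Rain    = c(0, 0, 2.5, 1, 0))

# Rows that are both hot and dry: combine two logical vectors with &
demo_met[demo_met$AirTemp > 25 & demo_met$Rain == 0, ]

# Keep only one column of the hot rows by also naming the column
demo_met[demo_met$AirTemp > 25, "Rain"]      # 0.0 2.5 0.0
```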

## dplyr: The filter and select functions

In addition to data manipulation functions provided within the base R libraries, there are a lot of external libraries developed to make data management more efficient. One of the most popular such libraries is `dplyr`.

Here we load the library dplyr. If you haven't installed dplyr yet this line of code will throw an error, which you can correct by either using `install.packages("dplyr")` or the Packages > Install menu in the bottom right window of RStudio
```{r}
library(dplyr)
```

dplyr includes a function `filter` that also does the sort of subsetting we saw in the previous tasks. It takes the data set as the first argument and the condition used for subsetting as the second argument. So the previous example could also be rewritten as
 
```{r}
hotData = dplyr::filter(met, AirTemp > 25)
```

`dplyr` also has a function `select` for just selecting specific columns. So if you wanted to subset just the columns `ShortWave` and `Rain` you could run

```{r}
foo = dplyr::select(met, c("ShortWave","Rain"))
head(foo)
```

We'll see more dplyr commands below, but a quick "cheat sheet" can be found [here](https://www.rstudio.com/wp-content/uploads/2015/02/data-wrangling-cheatsheet.pdf)

## Pipes

When cleaning and organizing data it is common to have to string together multiple functions sequentially. For example, if we had data set `x` we might need to first run x through `function1()` and then run the output through `function2()`. In the examples above we did this either by using temporary variables,

```
y = function1(x)
z = function2(y)
```

or by embedding one function call within another

```
z = function2(function1(x))
```

Many consider both options cumbersome, as the first creates a lot of temporary variables we don't need and the latter can be confusing, particularly when multiple functions are involved or the functions have additional arguments, because one ends up reading the functions in the reverse of the order in which they are called.

Pipes, `|>` (built into R version 4.1 and later), aim to address this issue by providing a way to pass the output of one function to another, in order, without creating a temporary variable.

```
z = x |> function1() |> function2()
```

As another example, if we wanted to both filter the met data by temperature AND select a subset of columns, we could pipe together multiple dplyr commands

```{r}
foo = met |> filter(AirTemp > 25) |> select(c("ShortWave","Rain"))
head(foo)
```

In this instance, when used alone both functions took the data set as the first argument and then additional information as the second argument. Here, the first argument is implicit and one only needs to specify the second. An important thing to remember about pipes is that they _always_ pass the value on their left-hand side in as the *first* argument of the function on their right.
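
The three styles discussed in this section can be compared directly on a made-up vector. This sketch uses only base R functions, assumes R 4.1 or later for `|>`, and the names `demo_x`, `demo_tmp`, and `out_*` are hypothetical.

```r
demo_x = c(16, 4, 9)   # hypothetical values

# Three equivalent ways to take square roots and then sum them
demo_tmp = sqrt(demo_x)                 # 1. temporary variable
out_tmp = sum(demo_tmp)
out_nested = sum(sqrt(demo_x))          # 2. nested calls, read inside-out
out_piped = demo_x |> sqrt() |> sum()   # 3. pipe, read left to right

# The piped value always fills the first argument; other arguments are named as usual
demo_x |> round(digits = 1) |> paste(collapse = ", ")
```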

## ★ Questions (14)
14.	**Using the frog data set and either base or dplyr functions:** 
*  a.	[A] display just the rows where frogs have spots
*  b.	[B] display just the rows where frogs are blue
*  c.	[A] create a new object containing just the rows where there are between 3 and 5 tadpoles
*  d.	[B] display just the rows where there are less than 2.5 red frogs
*  e.	[A] display where either frogs do not have spots or there are more than 5 frogs

# Summary tables and statistics

## Tables

Often understanding our data requires more than just being able to subset the raw data, but also the ability to summarize that data. The **table** command can do basic tabulation and cross tabulation of logical and categorical data (i.e. counting up the different cases)

```{r}
table(dat$color)           ## basic table
table(dat$color,dat$spots) ## cross table
```

## Summary statistics

There are also a number of commands for calculating basic statistical measures.

*min* and *max* return the smallest and largest values respectively
```{r}
min(met$AirTemp)				        ## smallest value
max(met$AirTemp)				        ## largest value
```
while the *range* function returns both the min and max at the same time
```{r}
range(met$AirTemp)
```

*median* returns the median value, which is the one in the middle of a sorted list, such that half the values are larger and half are smaller
```{r}
median(met$AirTemp)
```

More generally, we can use the *quantile* function to find any percentile of a data set, not just the median (50%)
```{r}
quantile(met$AirTemp,c(0.25,0.75))		## 25% and 75% quantiles
```
and the *IQR* function returns the inter-quartile range (the difference between the 75th and 25th percentiles)
```{r}
IQR(met$AirTemp)
```

By far the most common summary statistic you'll encounter is the arithmetic *mean*

$$mean(x) = \bar{x} = {{1}\over{n}} \sum x_i$$
```{r}
mean(met$AirTemp)
```

The other thing we'll encounter a lot this semester are statistics that measure uncertainty and variability. The most common measure of variability is *variance* which is calculated using the `var` function
$$var(x) = {1 \over {n-1}} \sum (x_i-\bar{x})^2$$
```{r}
var(met$AirTemp)  		 	        ## variance (Celsius^2)
```
Variance is a measure of the mean squared difference from the mean, and as such its units are the square of whatever units the data have (in this example Celsius squared). For this reason, many people find it hard to interpret variances directly. If we take the square root of a variance we get a measure, the *standard deviation*, that's in the same units as the data and thus much easier to interpret.
```{r}
sd(met$AirTemp)				          ## standard deviation (Celsius)
```

In addition to being able to calculate the variance of a single variable, we can also calculate a *covariance* to measure how two variables vary together

$$cov(x,y) = {1 \over {n-1}} \sum (x_i-\bar{x})(y_i-\bar{y}) $$

```{r}
cov(met$AirTemp,met$LongWave)		## covariance between air temp and long wave (thermal) radiation
```
Looking at these equations you'll also notice that *var(x) = cov(x,x)* (i.e. that variance is just the covariance of a variable with itself). 

The other thing you'll notice is that the units of covariance are the product of the two variables, and thus are even harder to interpret than variance (in this case, the units are `Celsius*Watts`). Because of this, it is also common to normalize the cov by the standard deviations of the two variables to give us a unitless measure of how two variables are related known as *correlation*

$$cor(x,y) = {{cov(x,y)} \over {sd(x) sd(y)}}$$
```{r}
cor(met$AirTemp,met$LongWave)		## correlation
```
Correlation coefficients vary from 1 to -1, with 1 indicating a perfect positive correlation (positive slope), 0 indicating no correlation, and -1 indicating a perfect negative correlation (negative slope).
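
The relationships among these statistics can be checked numerically. The sketch below writes the variance and covariance out from the formulas above on made-up data (`demo_x`, `demo_y`, and `demo_n` are hypothetical names) and compares them with R's built-in functions using `all.equal`, which tests equality up to a small numerical tolerance.

```r
# Hypothetical paired measurements
demo_x = c(2, 4, 6, 8, 10)
demo_y = c(1, 3, 2, 5, 4)
demo_n = length(demo_x)

# Variance and covariance written out from the formulas
var_manual = sum((demo_x - mean(demo_x))^2) / (demo_n - 1)
cov_manual = sum((demo_x - mean(demo_x)) * (demo_y - mean(demo_y))) / (demo_n - 1)

all.equal(var_manual, var(demo_x))                 # formula matches var()
all.equal(cov_manual, cov(demo_x, demo_y))         # formula matches cov()
all.equal(var(demo_x), cov(demo_x, demo_x))        # var(x) = cov(x, x)
all.equal(sd(demo_x), sqrt(var(demo_x)))           # sd is the square root of var
all.equal(cor(demo_x, demo_y),
          cov(demo_x, demo_y) / (sd(demo_x) * sd(demo_y)))  # cor = cov / (sd * sd)
```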


## Apply

R also has a set of `apply` functions for applying any function to sets of values within a data structure.

The function `apply` will apply a function to either every row (dimension 1) or every column (dimension 2) of a matrix or data.frame. In this example the commands apply the `sum` function to the first two columns of the data (frogs & tadpoles) first calculated by row (the total number of individuals in each population) and second by column (the total number of frogs and tadpoles)

![](http://test-pecan.bu.edu/shiny/GE375/apply_1.png){width=600px}

```{r}
# calculate sum of frogs & tadpoles by row (1st dimension)
apply(dat[1:5,1:2], MARGIN = 1, FUN = sum)  
```
![](http://test-pecan.bu.edu/shiny/GE375/apply_2.png){width=600px}

```{r}
# calculate sum of frogs & tadpoles by column (2nd dimension)
apply(dat[1:5,1:2], MARGIN = 2, FUN = sum)  
```
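
Because the totals are easy to verify by hand, a tiny made-up table can help confirm which margin is which. In this sketch `demo_counts`, `by_row`, and `by_col` are hypothetical names.

```r
# Hypothetical counts for three ponds
demo_counts = data.frame(frogs = c(2, 5, 1), tadpoles = c(7, 9, 0))

by_row = apply(demo_counts, MARGIN = 1, FUN = sum)   # one total per pond: 9 14 1
by_col = apply(demo_counts, MARGIN = 2, FUN = sum)   # one total per column: frogs 8, tadpoles 16

# Any function can be supplied, including ones with extra arguments
apply(demo_counts, MARGIN = 2, FUN = quantile, probs = 0.5)   # column medians

# For sums specifically, rowSums() and colSums() give the same answers
all(by_row == rowSums(demo_counts))
all(by_col == colSums(demo_counts))
```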

The function `tapply` will apply a function to an R data object, grouping data according to a second variable or set of variables. The first example applies the `mean` function to frogs grouping them by color. The second shows that `tapply` can be used to apply a function over multiple groups, in this case color and spots. 

```{r}
# calculate mean of frogs by color
tapply(dat$frogs, INDEX = dat$color, FUN = mean)  
```


![](http://test-pecan.bu.edu/shiny/GE375/tapply.png){width=200px}

```{r}
# calculate mean of frogs by color & spots
tapply(dat$frogs, INDEX =  dat[,c("color","spots")], FUN = mean)  
```

`dplyr` also provides an alternative to `tapply` using the `group_by` and `summarize` commands (`summarise` is an equivalent spelling). For example, the previous example could be written as

```{r}

dat |> group_by(color,spots) |> summarise(avg = mean(frogs))

```

Note that the `summarize` function creates a new column to store the output of the operation being performed. In this way `summarize` also makes it easy to calculate multiple summary statistics at once for the same groups. For example:

```{r}
dat |> group_by(color,spots) |> 
  summarise(avgFrogs = mean(frogs), 
            sdFrogs = sd(frogs), 
            avgTadpoles = mean(tadpoles),
            sdTadpoles = sd(tadpoles))
```
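
For comparison only, base R's `aggregate` function can produce a similar grouped summary without dplyr. This is a minimal sketch on made-up data (`demo_frogs` and `demo_means` are hypothetical names), not a replacement for the dplyr approach shown above.

```r
# Hypothetical frog counts
demo_frogs = data.frame(color = c("red", "blue", "red", "blue"),
                        frogs = c(2, 4, 6, 8))

# Mean number of frogs for each color, using the formula interface of aggregate()
demo_means = aggregate(frogs ~ color, data = demo_frogs, FUN = mean)
demo_means
```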

Another handy dplyr function is `mutate` which provides the ability to add new columns to a dataframe, for example based on computations involving other columns. For example, if we wanted to add a column to `met` that converted the datetime column `time` to just the calendar date we could do this as

```{r}
met = met |> 
  mutate(time = as.POSIXct(time,tz="GMT")) |> ## convert from character to datetime
  mutate(date = as.Date(time))                    ## convert from datetime to date
head(met)
```


## ★ Questions (15)
15. 
  a. [A] **Use dplyr to calculate the daily mean air temperature**
  b. [B] **Use apply to calculate the across-population standard deviations in the numbers of frogs and tadpoles**
  

# Plotting & visualization

## Base plots

There are a *lot* of options for plotting data in R, far more than we could cover in one lab, so the aim here is to provide a basic introduction to the most common plotting functions. 

For example, you can make a simple *histogram* of data using the function `hist`
```{r}
hist(met$AirTemp) ## histogram
abline(v = mean(met$AirTemp),col="red") ## add a (v)ertical line at the mean, set the (col)or to red
```

You can also make a *scatterplot* of two variables against each other using the `plot` function

```{r}
plot(met$AirTemp, met$LongWave)  ## x-y scatter plot
```

There are a multitude of optional arguments to plotting functions that can be used to control all aspects of the plot aesthetics -- you'll end up learning most of these through examples used in future labs. For example, if I wanted to add labels to the previous scatterplot and change the plotting character (pch) I could do that as:

```{r}
plot(met$AirTemp, met$LongWave,
     cex = 0.5,      ## decrease the symbol size 50%
     col = "purple", ## change the point color
     pch = "+",      ## change the point symbol
     xlab = "Air Temperature (Celsius)",  ## label the x-axis with name and units
     ylab = "Longwave Radiation (Watts)", ## label the y-axis with name and units
     cex.lab = 1.3,			# increase the axis label font size by 30%
     main = "2013 weather data, Lake Sunapee, NH", # title
     cex.main = 1.5			# increase title font size 50%
     )
```
Throughout the semester, you'll be making a *LOT* of plots, and it's important to get in the habit early of producing figures with proper axis labels, units, and (where needed) legends.

R's plotting functions also allow us to vary things like symbols, color, and size depending on attributes of the data, which can be extremely helpful for teasing out patterns in data
```{r}
plot(dat$frogs,dat$tadpoles,
     cex = 1.5,    		        # increase the symbol size
     col = as.character(dat$color),	# change the symbol color by name
     pch = dat$spots + 1,			# change the symbol (by number)
     cex.axis = 1.3,			    # increase the font size on the axis 
     xlab = "Frog Density",		# label the x axis
     ylab = "Tadpole Density",# label the y axis
     cex.lab = 1.3,			      # increase the axis label font size
     main = "Frog Reproductive Effort", # title
     cex.main = 2			        # increase title font size
     )
legend("topleft",  ## draw legend in top corner
       c("Red no spot","Blue no spot","Red spots","Blue Spots"), ## legend text (vector)
       pch = c(1,1,2,2),                      ## matching vectors of plot characters and color
       col = c("red","blue","red","blue"),
       cex = 1.3
       )
```

R plots can also handle dates and times intelligently if they're in the right format
```{r}
time <- as.POSIXlt(met$time,tz="GMT")
plot(time,met$AirTemp,
     type = 'l',   ## switch to plotting lines instead of points
     ylab = "Air Temperature (Celsius)"   ## label the y-axis with name and units
     )
```

The functions `lines` and `points` are also frequently used to add additional lines and points (respectively) to an existing plot.
```{r}
met$time <- as.POSIXct(met$time,tz="GMT")  ## coerce time from character to datetime
plot(time,met$AirTemp,
     type = 'l',   ## switch to plotting lines instead of points
     ylab = "Air Temperature (Celsius)"   ## label the y-axis with name and units
     )

## calculate monthly mean temperature
mmet = met |> mutate(month = months(time)) |> 
  group_by(month) |> 
  summarize(Tbar = mean(AirTemp), mbar = mean(time))

points(mmet$mbar,mmet$Tbar,col="red",pch=18,cex=2)              ## add monthly means to plot as solid diamonds (pch=18)
```

For those accustomed to making figures with software that allows you to manipulate plots by hand by clicking on them, the process of making visualizations with code can initially feel difficult and frustrating. For example, you might want to nudge a label over by a little and you can't just drag it, you have to look up a lot of details in the `plot` or `par` functions to understand how to modify your code. However, what producing figures by code provides, which is ultimately time saving, is reliable *REPRODUCIBILITY*. It doesn't take many experiences of having to re-tweak a manual figure by hand when the data is updated, or when you need to produce a very similar plot for a similar dataset, to realize that manual, interactive figure drawing is imprecise, inefficient, and not scalable when you need to keep making figures (e.g. dashboards, regular reports).

In addition to rendering plots directly within Markdown, within the Plot window, graphs can be cut-and-pasted into other documents or saved to file fairly simply by using Export. If you want to automate the process of exporting graphics, for example when you generate a whole bunch of figures at once and don’t want to Export each one by hand, you'll want to use the graphical functions such as 'postscript', 'pdf', or 'tiff'. For all of these plot functions there are numerous additional (optional) arguments that control the formatting of the plots. The help for `par` (i.e. `?par`) gives a fairly detailed list of these options, some of which you will see in further examples below.
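
As a minimal sketch of automated export, the block below opens a `pdf` graphics device, draws a figure, and closes the device with `dev.off`. The file name (`demo_plot_file`), figure size, and plotted values are made up for illustration.

```r
# Hypothetical output file in a temporary location
demo_plot_file = tempfile(fileext = ".pdf")

pdf(demo_plot_file, width = 6, height = 4)   # open a PDF graphics device (size in inches)
plot(1:10, (1:10)^2,
     xlab = "x (unitless)",
     ylab = "x squared (unitless)",
     main = "Example figure written to file")
dev.off()                                    # close the device so the file is written

file.exists(demo_plot_file)                  # TRUE if the figure was saved
```

Everything drawn between opening the device and calling `dev.off` goes to the file rather than the Plot window, so forgetting `dev.off` is a common reason for empty or unreadable output files.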

## ★ Questions (16-18)
16. [A] **Plot a histogram of blue frogs**
17. [A] **Plot shortwave (solar) radiation against time**
18. [B] **Plot shortwave (solar) radiation against relative humidity**

## Tidyverse

Beyond `plot` and `hist`, the course readings cover additional information on data visualization, and there's even more information provided in the "Additional Resources" section of the syllabus. Visualization also brings us to the concept of the `tidyverse`. Over the last few years Hadley Wickham and colleagues have introduced a set of new packages for data manipulation and visualization that "share an underlying design philosophy, grammar, and data structures" around what they call "tidy" data. The `dplyr` package, which we saw above, is part of the tidyverse and as we saw provides a number of alternative functions for subsetting and summarizing data (see, for example, the data wrangling chapters in Wickham's [R for Data Science](https://r4ds.had.co.nz/)). Another popular part of the tidyverse is its set of data visualization tools, which are anchored around the `ggplot2` package and which numerous other packages build upon (see the Reverse Dependencies listing for the package at https://cran.r-project.org/web/packages/ggplot2/index.html).

The syntax for ggplot is a bit less intuitive than `plot`, but the following tool `esquisse` provides a graphical interface around ggplot that then returns the underlying code used to generate the plot.

Outside of knitr, run the following code to install and launch `esquisse`
```
install.packages("esquisse")  ## only needs to be run once
esquisse::esquisser()         ## launch tool
```
Once the tool launches use it to answer the following questions. Note that you won't be able to type here while the tool is running, so you may want to copy the questions, and your answers, to another doc temporarily.


## ★ Questions (19-24)

19. [A] **Histogram**
* Select “met” as the data.frame
* Click “validate imported data”
* Drag AirTemp into the Y box (which will initially cause the tool to draw a histogram)
* Click on “Labels and Title” to add axes labels, units, and title
* Click on “Plot options” to change the color, theme, and number of bins
* Click on “Data” and filter the dates down to just the summer (approximately)
* Click on Export & code and hit “insert code into script” to add the code for this figure
* When you do this you’ll see a lot of new syntax. First, the `%>%` is known as the _pipe operator_ and is used to string together multiple functions; it works much like the `|>` pipe used above. So here we see that we start with the met data, then we filter it based upon time, then we call ggplot(). After we call ggplot we see that the different parts of the figure are added together using a `+`. First, we use `aes` (aesthetic) to select the data being used. Then `geom_histogram` is called to plot a histogram, with additional arguments for the number of bins and the fill color (in this case specified using a hexadecimal color code). Next, we see axis labels being set using the `labs` command. Finally we see the theme that’s been applied.
* Turn in the code for drawing your histogram

20. [B] **Density plot**
* With the same data loaded, click on the “Histogram” icon in the top left and you will see that you can also change the type of plot drawn. Click “density” to see how the plot changes and how the underlying code changes.
* Play with the options to get a plot you like
* Using Export, turn in the code for drawing your density plot 
 
21. [A] **Line plot**
* Returning to the ggplot builder, next drag “time” into the X box, which should turn the plot into a time series line plot.
* As before you can play with label, plot options, data filtering, etc
* Export this code to your Rmd as well and briefly explain (in text) what has changed in your code from the histogram plot

22. [B] **Colored line plot**
* Drag the “ShortWave” data into the “Color” box. This should now cause the color of the line to change as a function of the incoming solar radiation
* After playing with options, export this code as well
* Describe the impact that shortwave radiation has on air temperature. Note both the diurnal (daily) cycle (How does temperature change as shortwave increases and decreases throughout the day? When is temperature at its max and min?) and across days (How does shortwave radiation impact the amplitude of the diurnal cycle?). Note: to be able to see diurnal cycles clearly you’ll need to have zoomed in to a few months or smaller, not the whole time series.

23. [A] **Scatterplot**
* At the top left click on “Data” and change to the frog data “dat”
* Plot tadpoles vs frogs and then drag the color variable to the Color box and the spots variable to the Size box
* Set labels, select plot options that provide a good color scheme, theme, etc
* Export the code and describe the new functions and arguments being used.

24. [B] **Facet and group plots**
* Drag spots from Size to Facet. This will cause ggplot to now draw separate plots for the spots=TRUE data and the spots=FALSE data. Export this plot code.
* Similarly, if you change the plot to type ‘line’ in the top left and then drag spots from “Facet” to “Group” it will now draw separate lines for each case of spots. This feature can be particularly handy when you have data from many different discrete classes (states, plots, year, etc).

Hit the Close button on the ggplot2 builder and check to make sure that it put the code chunks in the right place with each question, that the R code is within executable R code blocks, and that the figures still render correctly when you hit ‘Knit’


# Linear Models

Since R evolved out of the statistical programming language S, it can easily perform a wide variety of statistical tests and analyses. In R linear regression is done with the function “lm” (linear models)
 
```{r}
reg1 = lm(tadpoles ~ frogs,data=dat)   # model syntax: y ~ x
reg1                                   # default return from lm
summary(reg1)                          # detailed summary of results
anova(reg1)                            # ANOVA table of results
plot(reg1)                             # diagnostic plots (4 panels)
plot(residuals(reg1))                  # residuals by row
coef(reg1)                             # parameter coefficients
vcov(reg1)                             # parameter covariance matrix
plot(dat$frogs,dat$tadpoles)
abline(reg1)                           # adding regression line to the scatter plot
```


The “equation” syntax for models in R often confuses people because, while the order of the variables is the one you would use when writing down the equation [y = f(x)], it is the opposite of the order used for the scatterplot (x,y). The equation syntax allows one to add additional variables to the regression model (e.g. y ~ x1 + x2). This syntax also makes it easy to specify interaction terms: in y ~ x1 + x2 + x1:x2 the term x1:x2 is the interaction, and y ~ x1*x2 is shorthand for the same model (writing y ~ x1 + x2 + x1*x2 is also accepted and fits the same model).
Note that the linear model is returning an object and that all the other functions are acting on this object. You can use all the functions you used to explore data objects (e.g. class, names, str, summary) to explore the objects returned by functions. Similar to 'lm', ANOVA models are done with 'aov'.

```{r}
anov1 = aov(tadpoles ~ color + spots + color*spots,data=dat)
summary(anov1)
```
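
As a quick check on the formula syntax, the sketch below fits the same model twice, once with the `*` shorthand and once with the interaction written out using `:`. It uses R's built-in `mtcars` data set rather than the frog data, and the object names `fit.star` and `fit.colon` are just illustrative.

```r
## Two equivalent ways of writing a model with an interaction.
## mtcars ships with R; the names fit.star and fit.colon are arbitrary.
fit.star  = lm(mpg ~ wt * hp, data = mtcars)          # shorthand
fit.colon = lm(mpg ~ wt + hp + wt:hp, data = mtcars)  # written out

all.equal(coef(fit.star), coef(fit.colon))   # compare the two sets of coefficients

## The fitted model is an object that can be explored like any other
class(fit.star)
names(fit.star)
names(coef(fit.star))
```

Both calls estimate an intercept, the two main effects, and one interaction coefficient, so the comparison should report that the coefficients match.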

# IF statements

Logical operators are not just used for subsetting data, but can be used to control the flow of an analysis and make decisions. The idea is that we want to tell the computer a set of rules, such as “if X happens, then do Y, otherwise do Z”.  The syntax for this in R is 

```
if(condition){
  ## Do Y
} else {
  ## Do Z
}
```

The “condition” part of this syntax is always a logical comparison: R does the first part (Y) if the condition is TRUE and the second part (Z) if it is FALSE. The “else{ }” part of the syntax is optional; leaving it out corresponds to telling the computer, “if X do Y, otherwise just keep going”. For example, if we wanted to do integer division on integers but normal division otherwise, we could write:

```{r}
if(is.integer(x) & is.integer(y)){
  z = x %/% y    ## Do integer division
} else {
  z = x/y        ## Do normal division
}
z
```


It is also possible to string together multiple if statements sequentially to deal with multiple possible cases and outcomes. For example, we might want the above code to give us a warning if we try to do division on non-numeric data, rather than failing with an error:

```{r}
if(!is.numeric(x) | !is.numeric(y)){
  warning("Cannot perform division on non-numeric data")
} else if(is.integer(x) & is.integer(y)){
  z = x %/% y    ## Do integer division
} else {
  z = x/y        ## Do normal division
}
z
```
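
One detail affects which branch runs in the examples above: a number typed as `7` is stored as a double, so `is.integer(7)` is FALSE. Integers are created with the `L` suffix or by sequences such as `1:10`. The following is a small self-contained sketch; the variable names `a`, `b`, `q.int`, and `q.dbl` are illustrative and not part of the exercise.

```r
a = 7L    # the L suffix stores the value as an integer
b = 2L
if(is.integer(a) & is.integer(b)){
  q.int = a %/% b    # integer division
} else {
  q.int = a / b      # normal division
}
q.int

a = 7     # without the L suffix the value is a double
if(is.integer(a) & is.integer(b)){
  q.dbl = a %/% b    # integer division
} else {
  q.dbl = a / b      # normal division
}
q.dbl
```

The first block takes the integer-division branch and the second takes the normal-division branch, even though the numbers involved are the same.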

# Defining custom functions

One of the powerful things about computer languages is that they allow us to encapsulate repetitive tasks into functions, making it easier to abstract a problem. In R you are not limited to the pre-defined functions but you can define your own functions as well. For example, if we found that we were repeating the previous block of ‘if’ code in multiple places in our code, we might want to convert it to a function so that we could save on retyping the code again and again. Putting the code in one place also means that if we change the code we only need to change it once and it applies everywhere. At the extreme, it's often argued that anything you do more than once in a piece of code should be converted to a function.  So how do we define a function in R?  

```
name = function(arguments){
  # do some calculations
  return(z)
}
```

We need to give it a name (for example, we could call the previous if statement ‘my.division’), and we need to define the arguments to the function. We also need to be explicit in defining what data we want the function to return, since in many cases the outside user doesn’t need to know everything that goes on inside the function but is only interested in the result. Putting these together would give us the following:

```{r}
my.division = function(x,y){
  z = NA    ## default result, returned if the division cannot be performed
  if(!is.numeric(x) | !is.numeric(y)){
    warning("Cannot perform division on non-numeric data")
  } else if(is.integer(x) & is.integer(y)){
    z = x %/% y    ## Do integer division
  } else {
    z = x/y        ## Do normal division
  }
  return(z)
}

my.division(x,y)
my.division(y,x)
my.division(x,"5")
```
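
To illustrate the point that a caller only sees what a function returns, here is a minimal sketch of a second, hypothetical function, `std.err`, that computes the standard error of the mean. The intermediate variables `n` and `s` exist only while the function runs, and the input values in `obs` are made up for illustration.

```r
## Hypothetical example function: standard error of the mean
std.err = function(v){
  n = length(v)          # sample size, local to the function
  s = sd(v)              # sample standard deviation, also local
  return(s / sqrt(n))    # only this value is passed back to the caller
}

obs = c(2, 4, 4, 4, 5, 5, 7, 9)   # made-up values for illustration
se = std.err(obs)
se
```

If the calculation ever needs to change, only the body of `std.err` has to be edited, and every call to it picks up the change.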

# For loops

Another powerful aspect of computers is their ability to easily repeat the same task time and time again. In fact, one of the major reasons many people learn to code is that they’ve figured out how to do some analysis once, but they want to apply the same analysis hundreds or thousands of times to different data sets, sites, individuals, pictures, etc. Doing so by clicking through a typical graphical user interface thousands of times is at best mind-numbing, if not outright impossible. Loops allow us to easily repeat an analysis over and over. The most common type of loop we will encounter is the for loop, which will repeat a chunk of code one time for each value specified by some sequence:
 
```
for( variable in sequence){
  ## do something
}
```

As a very simple example, we might want to print the numbers 1:10

```{r}
for( i in 1:10){
  print(i)
}
```
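
Loops are usually more useful when each pass stores a result. The following is a minimal sketch, with illustrative names `n.steps` and `squares`, that creates the output vector first and then fills it in one element at a time:

```r
n.steps = 5
squares = rep(NA, n.steps)    # create storage before the loop
for(i in 1:n.steps){
  squares[i] = i^2            # fill in element i
}
squares
```

Creating the storage vector before the loop, rather than growing it inside the loop, is the same pattern used in the simulation example further on.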

A more complicated, but common, example might be to loop over all rows in a data set, or to loop over all files in a directory. We also commonly use for loops to do simple simulations. For example, if we want to simulate logistic growth, we might code it as follows:

```{r}
NT = 100    ## number of time steps
N0 = 1      ## initial population size
r = 0.2     ## population growth rate
K = 10      ## carrying capacity
N = rep(N0,NT)
for(t in 2:NT){
  N[t] = N[t-1] + r*N[t-1]*(1-N[t-1]/K)    ## discrete logistic growth
}
plot(N)
```
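
The other use case mentioned above, looping over the rows of a data set, follows the same pattern and combines naturally with an if statement. The sketch below uses a small made-up data frame (`plots`, with hypothetical columns `site` and `count`) rather than any data from this exercise.

```r
## Made-up data for illustration only
plots = data.frame(site = c("A", "B", "C"), count = c(12, 0, 7))

status = rep(NA, nrow(plots))     # storage for one result per row
for(i in 1:nrow(plots)){
  if(plots$count[i] > 0){
    status[i] = "occupied"
  } else {
    status[i] = "empty"
  }
}
cbind(plots, status)
```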

## ★ Questions (25)
25. [B] **Modify the logistic growth code to simulate discrete-time exponential growth and plot the output.**

## Matrix Math

While loops are a powerful tool, there are many places where a calculation we want to perform can be done directly on a vector or a matrix as a whole rather than having to loop through each value. 

In addition to the vector math covered earlier, R allows us to define matrices and perform element-wise math on them as well:

```{r}
z = matrix(1:25,5,5)
z
z + 5
z*5
z - 1:5
```


The last example shows that we can also perform element-wise math between a matrix and a vector. Similarly, matrix-to-matrix math is also allowed, though for these cases you need to pay close attention to the dimensions of the matrices and the order in which vectors are applied.

In addition to element-wise math, R can also perform standard matrix math. If you have not seen this math before, you don’t need to worry about what it is doing at this point; any matrix operations will be explained on a case-by-case basis later in the semester.
 
```{r}
t(z)                 ## transpose
diag(z)              ## diagonal of an existing matrix
diag(1,5)            ## create a diagonal matrix
w = z + diag(1,5)    ## matrix addition
w
solve(w)             ## inversion
crossprod(z,z)       ## matrix cross product, t(z) %*% z
z %*% z              ## matrix multiplication
```
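
Two of the points in this section can be checked directly. In element-wise math between a matrix and a vector, R matches the vector to the matrix going down each column; and `crossprod(a,a)` gives the same result as `t(a) %*% a`. The sketch below uses its own small matrices (`m` and `a`, illustrative names) so that it does not depend on the objects defined above.

```r
m = matrix(1:6, 2, 3)    # filled column by column: row 1 is 1,3,5 and row 2 is 2,4,6
m - 1:2                  # 1 is subtracted from every value in row 1, 2 from row 2

a = matrix(c(2, 1, 1, 3), 2, 2)
all.equal(crossprod(a, a), t(a) %*% a)    # cross product equals t(a) %*% a
all.equal(solve(a) %*% a, diag(1, 2))     # a matrix times its inverse gives the identity
```

Because the vector is matched down the columns, a vector whose length equals the number of rows is applied row by row, which is the behavior to keep in mind when the dimensions are less obvious.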

## Not covered…

While this summary is really just the tip of the iceberg relative to the full depth of available functions and packages in R, there are a few major topics that will be introduced in later exercises that you are encouraged to look ahead to. These include how R treats probability distributions and random numbers, missing values (NA), and a very flexible data structure called a list. As the course progresses you will also be exposed to more examples of R code, the structure of R packages, command line execution, how to read and write a larger suite of data types, and how R interfaces with a few other software systems.

## Synopsis

Most of this document is meant to familiarize you with R, so that when you encounter a problem you need to solve you will have a vague memory of something that might work and can use this as a reference. Over time you will come to memorize a larger fraction of this and through exploration you will find many additional functions and coding techniques that are useful. However, here are some parting thoughts on what you should take home at the end of the day.

If you have never programmed before:

*	Focus first on the core concepts that define almost all programming languages
 + Mathematical operators
 + Logical operators
 + Indexing of vectors, matrices, and data frames
 + If statements
 + For loops
 + Functions
*	Programming, at its essence, is a process of breaking down a complicated problem into a series of very simple steps, and then translating those steps into a symbolic language (code). This document is mostly about the vocabulary and syntax of one such language (R); the fun and creative process of using this language for problem solving will come through examples and experience.

If you know another language and are picking up R:

*	The hardest transition will be SYNTAX

Everyone:

*	Save code early and often
  + Even better is to commit your code in a version control system like git
  + Commit code frequently, typically one new feature per commit
*	Keep code well documented
*	Use meaningful variable names
*	Develop a habit of actively searching and exploring R. 
 +	Read the help documents for a function
 +	Search for new functions and techniques
 +	Scour the web when debugging. 
* The key to good programming is being able to teach yourself


<!-- Ignore code below this line, it's for formatting the table of contents. -->
<!-- ----------------------------------------------------------------------- -->
<div class="tocify-extend-page" data-unique="tocify-extend-page" style="height: 0;"></div>