---
title: "Model selection and model averaging"
author: "James Santangelo"
date: '2018-10-03'
output: html_document
---
## Lesson preamble
> ### Learning objectives
>
> - Learn the benefits of model selection and how it differs from traditional inferential statistics
> - Understand the use of AIC and AIC~c~
> - Use AIC~c~ to perform model selection on the RIKZ data
> - Use AIC~c~ to perform model selection and model averaging on a more complicated ecological dataset
>
> ### Lesson outline
>
> Total lesson time: 2 hours
>
> - Brief intro to model selection (10 min)
> - Understanding AIC and AIC~c~ (20 mins)
> - Model selection of RIKZ dataset (30 mins)
> - Model selection and model averaging of more complicated ecological data (60 mins)
>
> ### Setup
>
> - `install.packages('dplyr')` (or `tidyverse`)
> - `install.packages('ggplot2')` (or `tidyverse`)
> - `install.packages("lme4")`
> - `install.packages("lmerTest")`
> - `install.packages("MuMIn")`
```{r include = FALSE}
knitr::opts_chunk$set(warning = FALSE, message = FALSE)
```
## Introduction to model selection
Up to now, when faced with a biological question, we have formulated a null hypothesis, generated a model to test the null hypothesis, summarized the model to get the value of the test-statistic (e.g. _t_-statistic, _F_-value, etc.), and rejected the null hypothesis when the observed test statistic falls outside the test statistic distribution with some arbitrarily low probability (e.g. _P_ < 0.05). This low probability then allows us to reject the null hypothesis in favour of the more biologically interesting alternative hypothesis. This is an **inferential** or **frequentist** approach to statistics.
An alternative approach is to simultaneously test multiple competing hypotheses, with each hypothesis being represented in a separate model. This is what model selection allows and it is becoming increasingly used in ecology and evolutionary biology. It has a number of advantages:
1. It does not rely on a single model.
2. Models can be ranked and weighted according to their fit to the observed data.
3. The best supported models can be averaged to obtain parameter estimates.
The most challenging part of model selection is coming up with a series of hypothesis-driven models that adequately capture the processes and patterns you are interested in representing. In an ideal world, this would be based on detailed knowledge of the system you are working in and on prior work or literature reviews. However, detailed knowledge is often unavailable for many ecological systems, and alternative approaches exist. For example, we can compare models with all possible combinations of the predictors of interest (aka the **all-subsets** approach) rather than constructing models with only particular combinations of those predictors. Note that this approach has been criticized as "data-dredging" or "fishing" (Burnham and Anderson 2002, but see Symonds and Moussalli 2011) and is nicely summarized by this often-quoted line from Burnham and Anderson (2002):
_"“Let the computer find out” is a poor strategy and usually reflects the fact that the researcher did not bother to think clearly about the problem of interest and its scientific setting (Burnham and Anderson, 2002)."_
The next step is to decide how we select the "best" model or set of best models. One approach would be to use a measure of model fit or explanatory power. For example, we could use the model R^2^, which we covered in lecture 9 and which represents the amount of variation in our response variable that is explained by the predictor variables in the model. However, this is not a parsimonious solution since it inevitably favours more complex models (i.e. models with more predictors). Thus, what we really need is an approach that examines model fit while simultaneously penalizing model complexity.
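As a quick illustration of why R^2^ alone is not parsimonious, here is a small simulated example (not part of the RIKZ analysis): adding a predictor never decreases R^2^, even when that predictor is pure noise.

```{r}
# Simulated data: the response y is unrelated to either predictor
set.seed(42)
sim_data <- data.frame(y = rnorm(50), x1 = rnorm(50), x2 = rnorm(50))

# R^2 with one noise predictor vs. two noise predictors
summary(lm(y ~ x1, data = sim_data))$r.squared
summary(lm(y ~ x1 + x2, data = sim_data))$r.squared  # always >= the simpler model
```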
## Enter Information Theory and model selection criteria
There are numerous model selection criteria based on mathematical information theory that we can use to select models among a set of candidate models. They additionally allow the relative weights of different models to be compared and allow multiple models to be used for inference. The most commonly used information criteria in ecology and evolution are: Akaike's Information Criterion (AIC), the corrected AIC~c~ (corrected for small sample sizes), and the Bayesian Information Criterion (BIC, also known as the Schwarz Criterion) (Johnson and Omland, 2004). Here we will focus on AIC and AIC~c~. AIC is calculated as follows:
$$ AIC = -2\log(\mathcal{L}) + 2p $$
The lower the AIC value, the better the model. $-2\log(\mathcal{L})$ is twice the **negative log-likelihood** of the model and measures the model's fit (or lack thereof) to the observed data: **lower** negative log-likelihood values indicate a better fit of the model to the observed data. $2p$ is a _bias-correcting factor_ that penalizes the model's AIC based on the number of parameters ($p$) in the model.
Similar to AIC is AIC~c~, which corrects for small sample sizes. It is recommended to use AIC~c~ when $n/k$ is less than 40, with $n$ being the sample size (i.e. total number of observations) and $k$ being the number of parameters in the most saturated model (i.e. the model with the most parameters) (Symonds and Moussalli 2011). AIC~c~ is calculated as follows:
$$ AIC_c = AIC+\frac{2k(k+1)}{n-k-1} $$
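To make these formulas concrete, here is a minimal sketch that computes AIC and AIC~c~ by hand for a toy linear model (using R's built-in `mtcars` data, which is not part of this lesson) and checks them against `AIC()` and `MuMIn::AICc()`.

```{r}
# Toy model on built-in data, just to illustrate the formulas
toy_model <- lm(mpg ~ wt + hp, data = mtcars)

logL <- as.numeric(logLik(toy_model))  # log-likelihood of the model
k <- attr(logLik(toy_model), "df")     # number of estimated parameters
n <- nobs(toy_model)                   # number of observations

aic_by_hand <- -2 * logL + 2 * k
aicc_by_hand <- aic_by_hand + (2 * k * (k + 1)) / (n - k - 1)

# These should match the built-in AIC() and MuMIn::AICc()
c(by_hand = aic_by_hand, builtin = AIC(toy_model))
c(by_hand = aicc_by_hand, MuMIn = MuMIn::AICc(toy_model))
```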
## Example of model selection
In lecture 8, we used mixed-effects models to determine whether species richness was influenced by NAP (i.e. the height of a sampling location relative to mean tidal level) across 9 beaches. We generated 3 models: (1) A random intercept model with NAP as a fixed effect and a random effect allowing the intercept (i.e. species richness) to vary by beach; (2) A random intercept and slope model with NAP as a fixed effect and a random effect allowing both the intercept (i.e. richness) and the slope (i.e. the response of richness to NAP) to vary across beaches; and (3) An intercept-only model with no fixed effects but allowing for variation in richness across beaches. In this lecture, we'll create similar models with random intercepts and random intercepts + slopes, but we'll also include Exposure and the NAP by Exposure interaction as additional fixed effects. Let's load in the RIKZ data.
```{r}
# Load in packages used throughout lesson
library(tidyverse)
library(lme4)
library(lmerTest)
library(MuMIn)
```
```{r, eval=FALSE}
# Load in data
rikz_data <- "https://uoftcoders.github.io/rcourse/data/rikz_data.txt"
download.file(rikz_data, "rikz_data.txt")
rikz_data <- read_delim("rikz_data.txt",
col_names = TRUE,
delim = "\t")
rikz_data$Beach <- as.factor(rikz_data$Beach)
```
```{r, echo=FALSE}
rikz_data <- read_delim("data/rikz_data.txt",
col_names = TRUE,
delim = "\t")
```
```{r}
rikz_data$Beach <- as.factor(rikz_data$Beach)
```
Given these 3 models, we may be interested in knowing which one best fits our observed data so that we can interpret this model to draw inferences about our population. To do this, we can follow the guidelines laid out in Zuur _et al._ (2009):
1. Create a saturated model that includes all fixed effects (and their interactions) and random effects. If you can't include all fixed effects, you should select those that you think are most likely to be important based on your knowledge of the system at hand.
2. Using the saturated model, optimize the random-effect structure of the model. Compare models with the saturated fixed-effect structure but differing random-effect structures. These models should be fit using Restricted Maximum Likelihood (i.e. `REML = TRUE`). The optimal random-effect structure is the one that provides the lowest AIC. Note that some common sense is needed here: you should not remove random effects if they are included specifically to account for non-independence in your data (i.e. nestedness, see lecture 8).
3. Optimize the fixed-effect structure by comparing models with the optimized random-effect structure but differing fixed-effect structures. These models should be fit with Maximum Likelihood (i.e. `REML = FALSE`), because REML likelihoods are not comparable between models with different fixed effects. Models can be selected on the basis of AIC (lowest is best) or by comparing nested models using Likelihood Ratio Tests (LRTs). **Important**: You cannot include models that contain an interaction if the main effects involved in that interaction are not also present in the model.
4. Run the final model with optimized fixed and random effects using REML.
Note that this approach to model selection can also be applied to models that lack random effects (e.g. simple linear regression). In such cases, you don't need to worry about random effects and can go ahead and just optimize the fixed effects. You also don't need to worry about ML vs. REML.
Let's try this with some real data. We'll again create two models with different random-effect structures, but this time we'll include Exposure and the interaction between NAP and Exposure as additional fixed effects.
```{r}
# Define 2 models. Fit both with REML.
mixed_model_IntOnly <- lmer(Richness ~ NAP*Exposure + (1|Beach), REML = TRUE,
data = rikz_data)
mixed_model_IntSlope <- lmer(Richness ~ NAP*Exposure + (1 + NAP|Beach), REML = TRUE,
data = rikz_data)
```
#### Step 1: Create saturated model
This is already done. The saturated model is `mixed_model_IntSlope` created in the code chunk above.
#### Step 2: Optimize random-effect structure
To optimize the random effects, we compare the `mixed_model_IntSlope` with the `mixed_model_IntOnly`. This will determine whether including a random slope for each beach improves the fit of the model to the observed data. **Note:** We are not testing the `mixed_model_IntOnly` model against one in which there is no random effect, since including a random intercept for each beach is required to account for the non-independence in our data. Let's get the AIC~c~ for these two models below. We will use AIC~c~ since $n/k$ is equal to `nrow(rikz_data)/3 = 15`, which is below 40.
```{r}
AICc(mixed_model_IntOnly, mixed_model_IntSlope)
```
Based on the output above, it seems including only a random intercept provides a better fit to the data (i.e. lower AIC~c~). The optimal random-effect structure is thus one that includes a random intercept for each beach but does **not** include a random slope.
#### Step 3: Optimize the fixed effect structure
We now need to refit the model with the optimal random-effect structure using ML and compare different fixed effect structures. Let's fit these models below and check their AIC~c~s.
```{r}
# Full model with both fixed effects and their interaction
mixed_model_IntOnly_Full <- lmer(Richness ~ NAP*Exposure + (1|Beach), REML = FALSE,
data = rikz_data)
# No interaction
mixed_model_IntOnly_NoInter <- lmer(Richness ~ NAP + Exposure + (1|Beach),
REML = FALSE,
data = rikz_data)
# No interaction or main effect of exposure
mixed_model_IntOnly_NAP <- lmer(Richness ~ NAP + (1|Beach),
REML = FALSE,
data = rikz_data)
# No interaction or main effect of NAP
mixed_model_IntOnly_Exp <- lmer(Richness ~ Exposure + (1|Beach),
REML = FALSE,
data = rikz_data)
# No fixed effects
mixed_model_IntOnly_NoFix <- lmer(Richness ~ 1 + (1|Beach),
REML = FALSE,
data = rikz_data)
AICc(mixed_model_IntOnly_Full, mixed_model_IntOnly_NoInter,
mixed_model_IntOnly_NAP, mixed_model_IntOnly_Exp,
mixed_model_IntOnly_NoFix)
```
Based on the output above, it looks like the model that includes NAP, Exposure, and their interaction provides the best fit to the data.
#### Step 4: Interpret model output
Summarizing the output, we see that increasing both NAP and Exposure results in a decrease in species richness (_P_ < 0.05). There is also a nearly significant interaction between NAP and Exposure (I wouldn't interpret this since _P_ > 0.05). Finally, while Beach is included in our model as a random effect, notice how little variation is attributed to differences between beaches relative to the model we ran in lecture 8. The only difference is that our current model includes Exposure as a fixed effect. This suggests that much of the variation between beaches in lecture 8 was likely attributable to differences in exposure, which is now being captured by the fixed effects.
```{r}
# Summarize best-fit model
summary(update(mixed_model_IntOnly_Full, REML = TRUE))
```
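If you want to look at just the variance components (e.g. to compare the Beach variance against the model from lecture 8), `VarCorr()` pulls them out directly. A small optional sketch:

```{r}
# Extract only the random-effect variance components from the final model
final_model <- update(mixed_model_IntOnly_Full, REML = TRUE)
VarCorr(final_model)                 # standard deviations of the random effects
as.data.frame(VarCorr(final_model))  # variances and SDs as a data frame
```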
## A more realistic example
In Assignment 5, you used data from Santangelo _et al._ (In Press) who were interested in understanding how insect herbivores and plant defenses influence the expression of plant floral traits. While that was one component of the study, the main question was whether herbivores, pollinators, and plant defenses alter the shape and strength of _natural selection_ on plant floral traits. In other words, which of these 3 agents of selection (plant defenses, herbivores, or pollinators) are most important in driving the evolution of floral traits in plants?
The motivation for that experiment actually came a few years earlier, in 2016, when Thompson and Johnson (2016) published an experiment examining how plant defenses alter natural selection on plant floral traits. They found some interesting patterns, but it was unclear whether these were being driven by the plant's interactions with herbivores, pollinators, or both. This was because they didn't directly manipulate these agents: pollination was not quantified in their experiment and herbivore damage was measured observationally, so their results were correlative. However, their experimental data provide a prime use case for model selection in ecology and evolution.
The data consist of 140 observations (rows). Each row in the dataset corresponds to the mean trait value of one plant genotype (they had replicates for each genotype but took the average across these replicates) grown in a common garden. They measured 8 traits and quantified the total mass of seeds produced by plants as a measure of absolute fitness. Genotypes were either "cyanogenic" (i.e. containing plant defenses) or "acyanogenic" (i.e. lacking plant defenses). In addition, they quantified the amount of herbivore damage (i.e. percent leaf area lost) on each plant twice throughout the growing season, although here we will focus only on the effects of plant defenses and avoid their herbivore damage measurements. We are going to conduct a **genotypic selection analysis** to quantify natural selection acting on each trait (while controlling for other traits) and assess whether plant defenses alter the strength or direction of natural selection imposed on these traits. While this may sound complicated, it turns out that a single multiple regression is all that's required: relative fitness is the response variable and _standardized_ traits (i.e. mean of 0 and standard deviation of 1), treatments, and their interactions are the predictors. This multiple regression approach is a common way of measuring natural selection in nature (see Lande and Arnold 1983, Rausher 1992, Stinchcombe 2002).
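As an aside, the Thompson and Johnson data already contain relative fitness (`RFSeed`) and standardized traits (the `.S` columns). If your own data did not, they are typically computed along these lines (a sketch only, with hypothetical column and data frame names `seed_mass`, `trait`, and `selection_data`):

```{r, eval=FALSE}
# Sketch only (hypothetical names): relativize fitness by the mean and
# standardize a trait to mean 0 and standard deviation 1
selection_data <- selection_data %>%
    mutate(rel_fitness = seed_mass / mean(seed_mass),      # mean relative fitness = 1
           trait_std = (trait - mean(trait)) / sd(trait))  # mean 0, sd 1
```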
Let's start by loading in the data.
```{r, eval=FALSE}
# Load in Thompson and Johnson (2016) data
Thompson_data <- "https://uoftcoders.github.io/rcourse/data/Thompson-Johnson_2016_Evol.csv"
download.file(Thompson_data, "Thompson-Johnson_2016_Evol.csv")
Thompson_data <- read_csv("Thompson-Johnson_2016_Evol.csv",
col_names = TRUE)
```
```{r, echo=FALSE}
Thompson_data <- read_csv("data/Thompson-Johnson_2016_Evol.csv",
col_names = TRUE)
```
```{r}
glimpse(Thompson_data)
```
We will now generate the global model. Remember, this should be a saturated model with all of the fixed effects and their interactions. We are including the presence of hydrogen cyanide (HCN, `cyanide` in the model below), all standardized traits, and the trait by HCN interactions as fixed effects in this model. There are no random effects in this model so we can go ahead and use `lm()`.
```{r}
# Create saturated model
GTSelnModel.HCN <- lm(RFSeed ~ VegGrowth.S*cyanide + BnrLgth.S*cyanide +
BnrWdt.S*cyanide + FrstFlwr.S*cyanide +
InflNum.S*cyanide + FlwrCt.S*cyanide +
InflWdt.S*cyanide + InflHt.S*cyanide,
data = Thompson_data)
```
Next, we will perform our model selection based on AIC~c~ (due to low sample sizes). We automate this process using the `dredge()` function from the `MuMIn` package. `dredge()` offers a **ton** of flexibility in how model selection is done. You can customize the criterion used (i.e. AIC, BIC, etc.), how the output is reported, what's included in the output (e.g. do you want F-stats and R^2^ to be included?), whether some terms should appear in all models, and even whether some terms should only be included when other terms are present (aka a dependency chain); a sketch of a few of these options appears after the next code chunk. For our purposes, we will perform an **all-subsets** model selection, comparing models with all combinations of predictors (but not those where main effects are absent despite the presence of an interaction!). I warned earlier that this approach has been criticized. However, in this case it's reasonable: we know from work in other systems that all of these traits could conceivably experience selection, and we know that selection could vary due to plant defenses. In other words, all terms in this model represent biologically real hypotheses. Let's go ahead and dredge.
```{r}
options(na.action = "na.fail") # Required for dredge to run
GTmodel_dredge <- dredge(GTSelnModel.HCN, beta = F, evaluate = T, rank = AICc)
options(na.action = "na.omit") # set back to default
```
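(As an aside, a few of the `dredge()` options mentioned above look roughly like this. This is a sketch only and is not evaluated here; the particular choices are illustrative rather than part of this analysis.)

```{r, eval=FALSE}
# Sketch of some dredge() options (not run)
dredge(GTSelnModel.HCN,
       rank = "BIC",                # rank models by a different criterion
       fixed = ~ cyanide,           # force cyanide to appear in every model
       extra = c("R^2", "adjR^2"))  # also report R^2 and adjusted R^2 for each model
```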
Let's have a look at the first few lines returned by `dredge()`. Let's also print out how many models were compared.
```{r}
head(GTmodel_dredge)
nrow(GTmodel_dredge)
```
The output tells us the original model and then provides a rather large table with many rows and columns. The rows are different models with different combinations of predictors (n = 6,817 models). The columns are the different terms from our model, which `dredge()` has abbreviated. The numbers in the cells are the estimates (i.e. beta coefficients) for each term that is present in the model; blank cells mean that term was not included in the model. The last 5 columns are important: they give us the degrees of freedom for the model (a function of the number of terms in the model), the log-likelihood of the model, the AIC~c~ score, the delta AIC~c~, and the AIC~c~ weight. The delta AIC~c~ is the difference between the AIC~c~ score of a model and the AIC~c~ score of the top model. The weight can be thought of as the probability that the model is the best model, given the candidate set included in the model selection procedure.
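To demystify the last two columns, here is a minimal sketch (with made-up AIC~c~ values for four hypothetical models) of how the delta and weight columns are computed from the AIC~c~ scores:

```{r}
# Made-up AICc scores for four hypothetical models
example_aicc <- c(100.0, 100.1, 102.5, 110.3)
delta <- example_aicc - min(example_aicc)         # difference from the best model
weight <- exp(-delta / 2) / sum(exp(-delta / 2))  # Akaike weights (sum to 1)
round(cbind(AICc = example_aicc, delta = delta, weight = weight), 3)
```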
Given this output, we may be interested in retrieving the top model and interpreting it. Let's go ahead and do this. We can retrieve the top model using the `get.models()` function, specifying that we want the top model using the `subset` argument. We need to further subset this output since it returns a list.
```{r}
top_model <- get.models(GTmodel_dredge, subset = 1)[[1]]
top_model
```
This output above shows us the top model from our dredging. What if we want to interpret this model? No problem!
```{r}
# Summarize top model
summary(top_model)
```
But how much evidence do we actually have that this is the **best** model? We have over 6,000 models, so it's unlikely that only one model explains the data. From the `dredge()` output we can see there is little difference in the AIC~c~ and weights of the first few models. Is there really much of a difference between two models whose AIC~c~ scores differ by only 0.14 units? How do we decide which model(s) to interpret? Statisticians have thought about this problem, and models with a delta AIC~c~ (or other criterion) less than 2 are generally considered to be just as good as the top model, so we shouldn't simply discount them. Alternatively, we could use the weights: if a model has a weight greater than or equal to 0.95, it can be treated as the single best model. Otherwise, we can generate a "credibility" set (also called a 95% confidence set) consisting of all models whose cumulative AIC~c~ weights sum to 0.95. In any case, the point is that we have no good reason to interpret only the top model when the models just below it are likely to be nearly as good. To get around this, we can perform what's called **model averaging** (aka multi-model inference), which averages the parameter estimates across multiple models and thereby accounts for model uncertainty. Let's do this below by averaging all models with a delta AIC~c~ <= 2.
```{r}
summary(model.avg(GTmodel_dredge, subset = delta <= 2))
```
The first part of the output breaks down which terms are part of which models and gives some nice descriptive statistics for these models. The next part is the important bit: the actual parameter estimates from the model averaging. The estimates are those that were averaged across all models with a delta AIC~c~ <= 2. Note there are two sets of estimates: the "full" coefficients set a term's estimate to 0 in models where that term does not appear before averaging, whereas the "conditional" coefficients average only over the models in which that term appears. The "full" coefficients are thus more conservative, and it is best practice to interpret these. Finally, the last part of the output tells us in how many models each of the terms was included.
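If you would rather average over the 95% confidence ("credibility") set mentioned above instead of the delta <= 2 set, the same call can be subset by cumulative weight. This is a sketch only and is not evaluated here, since it may average over a large number of models:

```{r, eval=FALSE}
# Sketch: average over models whose cumulative AICc weights sum to 0.95
summary(model.avg(GTmodel_dredge, subset = cumsum(weight) <= 0.95))
```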
### Caveats to model selection
1. The results depend on the models included in the candidate set. You can't identify a model as being the "best" fit to the data if you didn't include that model to begin with!
2. The parameter estimates and predictions arising from the "best" model or set of best models should be biologically meaningful.
3. You need to decide whether to use model selection or traditional inferential statistics (e.g. based on _P_-values). Techniques that draw on both approaches are possible (e.g. backward variable selection followed by averaging of the top models), similar in spirit to the example provided above.
## Additional reading
1. Johnson, J. and Omland, K. 2004. Model selection in ecology and evolution. _Trends in Ecology and Evolution_ **19**: 101-108.
2. Burnham K.P., Anderson D.R. 2002. Model selection and multimodel inference, 2nd edn. *Springer*, New York.
3. Symonds, M. and Moussalli, A. 2011. A brief guide to model selection, multimodel inference and model averaging in behavioural ecology using Akaike's information criterion. _Behavioural Ecology and Sociobiology_ **65**: 13-21.
4. Zuur, A. *et al.* 2009. Mixed effects models and extensions in ecology with R. *Springer*.
5. Thompson, K.A. and Johnson, M.T.J. 2016. Antiherbivore defenses alter natural selection on plant reproductive traits. _Evolution_ **70**: 796-810.
6. Lande, R. and Arnold, S. 1983. The measurement of selection on correlated characters. _Evolution_ **37**: 1210-1226.
7. Rausher, M. 1992. The measurement of selection on quantitative traits: biases due to environmental covariances between traits and fitness. _Evolution_ **46**: 616-626.
8. Stinchcombe, J.R. _et al._ 2002. Testing for environmentally induced bias in phenotypic estimates of natural selection: theory and practice. _Am. Nat._ **160**: 511-523.