assignment-09-challenge.Rmd

---
title: "Challenge assignment (14.5 marks)"
output:
    html_document:
        toc: false
---

*To submit this assignment, upload the full document on Quercus,
including the original questions, your code, and the output. Submit
your assignment as a knitted `.pdf` (prefered) or `.html` file.*


Part of being an effective scientist involves being able to solve problems you
have not encountered before. This is certainly true of programming as well,
where problems are typically solved by furious bouts of Googling, reading
documentation, and trial and error of proposed solutions. In this assignment,
like previous ones, you will be evaluated on your ability to solve data
manipulation and analysis tasks. However, unlike previous assignments, some of
the solutions to the problems will require more research and effort on your
part. It may require the use of packages and techniques not explored in class,
but all problems are solveable, often with only a few lines of code. By now, you
should all have the terminology required to search for solutions to the problems
below. As with all programming problems, there are many possible ways to get the
answer to the problems below.

1. Simpson's diversity index and bootstrapping (6 marks)

    In lecture 12, we used data from the National Ecological Observatory Networkd on
the abundance and percent cover of plant species across sites in the Harvard
Forest from the year 2017. Run the code chunck below to read in the data if you
do not already have it.

```{r message=FALSE}
library(tidyverse)
neon_data <- "https://uoftcoders.github.io/rcourse/data/NEON_PlantPA_HARV_201707.csv"
download.file(neon_data, "NEON_PlantPA_HARV_201707.csv")
neon_data <- read_csv("NEON_PlantPA_HARV_201707.csv", 
                                col_names = TRUE)
```

a. Using the raw NEON data, create a matrix with plant species as rows and sites
as columns. The cell values should represent the abundance of each species at a
given site. (1 mark)

b. Write a function that takes a single numeric vector as an argument and
returns Simpson's Index of Diversity for the numeric vector. Test your
function by computing Simpson's diversity index on the `test_vector` in the
code chunk below. Be sure to report the Simpson's index for `test_vector` in
your final assignment. As a reminder, I have included the formula for
Simpson's index below:

```{r}
test_vector <- c(0, 1, 0, 5, 0, 1, 0, 4, 3, 0, 0, 0, 0, 1, 4, 0, 5, 0, 
                 3, 0, 11, 2, 19, 0, 11, 0, 0, 0, 0, 0)
```

    
$$D = 1 - \sum_{i = 1}^{s}(p_i)^2$$

where `s` is the species richness (i.e. number of species) and `p_i` is th
relative abundance of species _i_. A `D` value of 0 represents no diversity
and a D value of 1 represent infinite diversity. (1 mark)

c. In lecture 12, we discussed how resampling techniques can be used to
conduct hypothesis tests (i.e., permutation tests) by comparing an observed
population parameter (e.g., mean, median) to a null distribution of that
parameter generated by resampling your observed data _without_ replacement
many times. Resampling can additionally be used to generate confidence
intervals around a population parameter (e.g., mean median, slope) or other
statistic (e.g., C-score) using a technique known as _bootstrapping_.
Bootstrapping allows us to estimate the true distribution of a statistic in
cases where it is unknown, which we can use to estimate our uncertainty
(e.g., Standard Error, Confidence Intervals) in a population parameter.
Importantly, this applies _regardless of the shape of the distribution_,
although some adjustments often have to be made for strongly skewed
distributions. While there are a few different types of bootstrapped
confidence intervals (e.g., bias-corrected and accelerated, _t_ with
bootstrap), in this exercise, you will write a function that calculates the
simple 95% percentile bootstrapped confidence interval from a given numeric
vector. Your function should have the following properties:

    1. It should take in two arguments: A numeric vector and an integer
    representing the number of iterations 
    2. It should return a data frame with two columns: The lower and upper
    quantiles of the distribution, representing the 2.5 and 97.5
    percentiles, respectively (**hint:** The `quantile()` function).
        
Write the function as described above. How does the bootstrap differ from a
permutation test (note: you don't need code here, just tell me in words).
Set a seed of 42 and test your function on the `test_vector` from part (b).
(1.5 marks)

d. Use your functions defined in (b) and (c) to estimate the Simpson's diversity
index and corresponding 95% bootstrapped confidence intervals around the
Simpson's index for each of the sites in your dataframe from (a). Use 1000
iterations for the bootstrapping procedure. Your answer should be a single
dataframe with four columns: `site`, `simpson`, `lower`, and `upper`. Points are
awarded for conciseness of the code. (**Hint:** The most concise answer will
likely make use of the `purrr` package, which is part of the `tidyverse`). Set a
seed of 43. (1.5 marks)

e. Using the dataframe from (d), plot the Simpson's index (y-axis) for each
of the sites (x-axis) as a single point surrounded by its lower and upper
95% CIs. Order the x-axis from lowest to highest Simpson's index. (1 mark)

2. Recreating a figure 

    In assignment 3, we explored a dataset containing changes in yearly plant
    biomass in Abisko National Park, Sweden. In this question, we will use `ggplot2` to
    reproduce a figure in the original paper (Olofsson et al, 2013;
    [link](https://royalsocietypublishing.org/doi/full/10.1098/rstb.2012.0486)). 

    You should still have the dataset from when you completed assignment 3, but
    if not, run the code chunk below to download it.

```{r message=FALSE}
plant_biomass_url <- 'https://raw.githubusercontent.com/UofTCoders/rcourse/master/data/plant-biomass-preprocess.csv'
download.file(plant_biomass_url, 'plant-biomass-preprocess.csv')
plant_biomass <- read_csv('plant-biomass-preprocess.csv')
```

Reproduce figure 4 from the paper using `ggplot2`. Pay close attention to the
overall structure of the figure, the scale on the axes, and so on. The colors of
the points do not need to be the exact same colors as those in the figure, but
they should be sufficiently close. _(Note: you can ignore the SEM points, and
the 'look' of your axes does not have to match the figure exactly. The species
names also not have to be included in the body of the plot (like in the paper)
as long as they are visible in some form_ (1 mark)

3. Use the built-in R dataset `iris`, for this question. (2 marks) 

     a. Test the relationship between sepal length and sepal width _for each
     species_ using linear models. Set sepal width as the response and sepal
     length as the predictor. Output a single data frame containing the beta
     estimate and p-value associated with the predictor for each of the three
     models, and also make sure to include a column called 'species' in the
     final data frame. Your final data frame should only have species, term,
     estimate, and p.value as columns. Do not include the intercepts. _(Hint:
     You want to simultaneously perform the same model on different subsets of
     the data with only a few lines of code)_ (1 mark)
     
    b. Use `ggplot2` to plot a scatter plot of sepal width by sepal length,
    coloured by species. Plot the three linear fits as well, also coloured by
    species. Below your code, comment on how the estimate values from your
    linear models above correspond to the plotted fits. (1 mark)
    
4. The Canadian lynx population cycle (3.5 marks)

    The Canadian lynx experiences large periodic changes in its population 
    size over a timescale of several years. This is thought to be driven by 
    oscillations in the population size of the snowshoe hare, the primary food 
    source for the lynx. Read more about the lynx population cycle on this 
    [Northwest Territories website](https://www.enr.gov.nt.ca/en/services/lynx/lynx-snowshoe-hare-cycle).
    
    R has a built-in dataset called `lynx` which contains annual population 
    measurements for the Canadian lynx as a time series. 

    (a) Plot the abundance of `lynx` vs. time in years using either `ggplot` or
    `qplot`. Plot points and a connect them by a line . Create a time series that
    starts at 0 and ends at the total number of years in the dataset (total years $=
    1934-1821$). By eye, estimate the time between peaks in the population. (0.5
    marks)

    (b) Define a function called `sine_model` that takes 5 arguments: a vector of years
    for the x-axis and four parameters (amplitude, period, phase, and offset). 
    Recall the general formula for a sine wave:
    $$y = A \text{sin}(kx - b) + c$$
    where $k = 2\pi / T$, $T$ is the period or length of time between peaks,
    $A$ is the amplitude, $b$ is the phase, and $c$ is the offset.
    Using a value of $A = c = 1700$ for both the amplitude and offset and a value of 
    $b = 2.75$ for the phase, plot the lynx data as before and add a sine curve 
    using your guess of the timescale from part (a) for the period.
    Use a colour other than black to plot the sine wave. 
    Note that the x axis must start at 0 in order for the offset of $2.75$
    to match the data. (1 mark)

    (c) Use least-squares fitting to refine your estimate of the lynx cycle length. (1.5 marks)
        - Create a numeric vector with a range of values for the period that
        span your guess from part (a).
        - Write a function to calculate the Residual Sums of Squares (RSS) from
        a model fitted to the `lynx` data. 
            - This function should calculate the sum of the difference
            (*residuals*) between the lynx data and your prediction, then return
            the sum of the residuals squared.
        - Apply this function over the numeric vector of period values you
        created. Essentially, you should be striving to obtain an RSS value for
        models fitted using all of the different period values in your numeric
        vector.
        - Plot the sum of the residuals squared vs. the range of period values. By eye,
        what is the minimum of this curve? 
        - Identify the period value that provides the best fit to the `lynx`
        data. Models with lowest RSS fit the data best. What is your calculated
        length of the lynx population cycle?

    (d) Plot the lynx data again and plot your best fit curve on top. 
    (0.5 marks)

5. In class, we worked with a simple one-locus haploid model
of evolution and investigated how different forces of evolution (e.g., selection
vs. drift) affect allele frequencies. In this challenge assignment, we will
naturally upgrade to a diploid model to see how evolution work in this kind of
system. 

    We will once again be considering the case of malaria (of course, what else does
Amber think about anyway). In assignment 7, we (you) worked on ways to
incorporate various facets of biological detail into a baseline, super simple,
disease model. We will incorporate another level of detail in this exercise:
host genotype. The trait that we will be working with is the sickle cell trait.
You can read more about the sickle-cell trait on Wikipedia. Basically, this is a
blood-cell-shape trait that is controlled at a single locus by two alleles: $A$
for wildtype "normal" blood shape, and $S$ for sickle-shaped blood cells. There
are thus three possible host genotypes with respect to the sickle cell trait:
homozygous "normal" blood ($AA$), heterozygous ($AS$), and homozygous sickle
blood ($SS$). People with homozygous $SS$ suffer from debilitating illness
(sickle-cell disease) and often result in mortality at a young age if untreated.
Heterozygotes ($AS$), while still suffering minor illness, receives partial
immunity against malaria. It was thus hypothesized that the deleterious $S$
allele is maintained in the population due to **balancing selection** -- the
strong selective pressure exerted by malaria gave heterozygotes an advantage
over the homozygotes, thus saving the $S$ allele from being purged from the
population.

    We will be using mathematical models to help us test this hypothesis. A good
start is to first extend the framework we used in lecture and think about how to
represent allele frequency change in a diploid population.

    In population genetics it is often easier to work with
allele frequencies (proportion of a certain allele in a population) rather than
changes to their absolute numbers. Here, we will use $p$ and $q$ to denote the
proportions of the $A$ and $S$ alleles in the population, respectively. The
frequencies of the three genotypes in the population are thus given by $p^2$,
$2pq$, and $q^2$ for AA, AS, and SS individuals, respectively. $p^2$, $2pq$, and
$q^2$ are known as Hardy-Weinberg proportions and they have the convenient
property of summing to 1.

    Applying selection to this population alters the frequency of each _genotype_ by
an amount that is proportional to their fitness. We will denote the fitness of
each host genotype group as $W_{AA}$, $W_{AS}$, and $W_{SS}$. The frequencies of
each genotype following selection (i.e., as weighted by its fitness) is
therefore $p^2W_{AA}$, $2pqW_{AS}$, and $q^2W_{SS}$. Dividing these quantities
by their sum (the mean fitness of the population, often denoted with
${\bar{W}}$) allows us to retrieve the frequency of the three genotypes after
selection (e.g., $p^2 \frac{W_{AA}}{\bar{W}}$). (2 marks)

    a) Derive a recursion equation that describe the change in **allele**
    frequencies of $A$ and $S$ in one time step. (Hint: How do you go from genotype
    frequency to allele frequency?) (1 marks)

    b) Simulate the trajetory of change of the A allele. Use the parameter set
    provided, and include a graph in your answer. Explain what you see, and refer to
    the parameter values provide in your explanation. (1 mark)


```{r}

# Time
times <- 500
timevec <- seq(1, times, 1)

# Parameters
N <- 10000 # Total population size (assume constant)
WAA <- 0.5
WAS <- 0.7
WSS <- 0.2

# Intitial condition
p <- 0.5 # proportion of A allele

```