x10.outgroup_analysis.Rmd

---
title: "Placing Western Australian Acropora digitifera in context"
output: 
  github_document
bibliography: bibliography.bib
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE, fig.retina = 2)
# loading required packages:
library(tidyverse)
library(ggpubr)
source("scripts/color_scheme.R")

location_colors <- c(myCol,"#7C878EFF")
names(location_colors) <- c("Inshore","North Offshore","South Offshore","Japan")
```

To understand the relationship between *A. digitifera* from WA and a population in Japan we combined our data with publicly available data from [@Shinzato2015-rz].  This combined dataset included all of our samples as well as 16 samples from the Ryukyu Islands in Japan, randomly chosen from the Shinzato et al dataset. Note that the Ryukyu Islands is also the region of origin for the A. digitifera reference genome. Japanese samples were downloaded from NCBI and subjected to the same read quality checks and alignment process as we applied for our own samples.  The resulting bam files had variable coverage (9x - 32x) but were generally lower coverage than our samples.  This difference as well as other potential differences due to sequencing platform could introduce biases in our results due to variable call rates for different types of genotypes.  To minimise the effects we performed this analysis using a genotype likelihood framework (ANGSD) rather than the hard-called genotype method used for most of our other analyses.

```{r}
# Read sample information and construct a sample table

samples <- read_tsv("data/hpc/pcangsd/samples.txt",col_names = c("sample_id","location","mapping_rate","mean_mapping_depth","genome_cov"))

sample_table <- read_tsv("data/hpc/pcangsd/all_bam.txt",col_names = "filename") %>%
  mutate(country=str_match(filename,pattern = "/fast/shared/Acropora_digitifera_wgs_bamfile/([A-Z]+)/")[,2]) %>% 
  mutate(sample = str_match(filename,pattern = "/fast/shared/Acropora_digitifera_wgs_bamfile/[A-Z]+/(.*)_aligned")[,2]) %>% 
  mutate(sample = str_replace(sample,"_merged","")) %>% 
    mutate(sample_id = str_replace(sample,"_L004","")) %>% 
    mutate(sample_id = str_replace(sample_id,"_S[0-9]+$","")) %>% 
  left_join(samples) %>% 
  mutate(location = ifelse(is.na(location),"Japan",location)) %>% 
  rownames_to_column("number")
```

### SNP selection and genotype likelihood calculation

As a precursor to running PCAngsd (see below) we first called SNPs, filtered sites for quality and calculated genotype likelihoods using angsd (v 0.933) with the command. 

```bash
angsd -b all_bam.txt -out all -GL 2 -nThreads 8 -doGlf 2 -SNP_pval 1e-6 -doMajorMinor 1 -doMaf 2 -doCounts 1 -minMaf 0.05 -minInd 82 -minMapQ 20 -minQ 20 -setMinDepth 910 -setMaxDepth 3000 -setMinDepthInd 3
```

The options used in this command were chosen to mimic GATK genotype calling and SNP filtering in our main analysis as closely as possible.  Specifically, `-GL 2` uses a GATK-like genotype likelihood calculation, `-minInd 82` selects only sites with data for at least 90% of individuals. 

### PCAngsd

We then used PCAngsd version 1.0 obtained from [github commit cd7feb5a8a76946acde8beaceb919e59dc8d0298](https://github.com/Rosemeis/pcangsd.git) to perform a PCA analysis.  Note that PCAngsd is designed to handle datasets with differing sequencing depths as is the case here. 

```bash
pcangsd.py -beagle all.beagle.gz -selection -snp_weights -pcadapt -admix -tree -maf_save -pi_save -dosage_save -sites_save -threads 32 -out jpwa
```

Using the resulting covariance matrix we visualise population structure using a PCA as follows

```{r pcangsd-pca-2, fig.height=3.5}
covmat <- read_table2("data/hpc/pcangsd/jpwa.cov",col_names = FALSE) %>% 
  as.matrix()

pop_eigen <- eigen(covmat)

eigenvalues <- round(pop_eigen$values,1)

pop_pca <- data.frame(e1=pop_eigen$vectors[,1],e2=pop_eigen$vectors[,2],e3=pop_eigen$vectors[,3]) %>% 
  cbind(sample_table) %>% 
  filter(!grepl(pattern = "BR_5_121",x = sample))

pclabel <- function(pcnum,eigenvalues){
  paste("PC",pcnum," (",eigenvalues[pcnum],"%)",sep = "")
}

p1 <- ggplot(pop_pca ,aes(x=e1,y=e2)) + 
  geom_point(aes(color=location),size=1) + 
#  geom_point(color="black",size=0.1) +
  theme_pubr(base_size = 12) + xlab(pclabel(1,eigenvalues)) + ylab(pclabel(2,eigenvalues)) +
  scale_color_manual(values = location_colors)

p2 <- ggplot(pop_pca ,aes(x=e2,y=e3)) + 
  geom_point(aes(color=location),size=1) + 
  theme_pubr(base_size = 12) + xlab(pclabel(2,eigenvalues)) + ylab(pclabel(3,eigenvalues))+
  scale_color_manual(values = location_colors)

legend_b <- get_legend(
  p1 + 
    guides(color = guide_legend(nrow = 1)) +
    theme(legend.position = "bottom", legend.title = element_blank(),legend.text = element_text(size=12))
)

library(cowplot)
prow <- plot_grid(
  p1 + theme(legend.position = "none"),
  p2 + theme(legend.position = "none"), labels = c("A","B"), label_size = 12)


plot_grid(prow,legend_b,ncol=1,rel_heights = c(1, .1))
#ggsave("fig-s5.jpg",width = 5.6, height = 3)
```

```{r, include=FALSE}
# For the publication figure only
legend_r <- get_legend(
  p1 + 
    geom_point(aes(color=location),size=5) + 
    guides(color = guide_legend(nrow = 4)) +
    theme(legend.position = "right", legend.title = element_blank(), legend.key.size = unit(7,"mm"))
)

ggdraw(legend_r)

#ggsave("figures/point_colors_4pop_legend.png", width = 2,height = 3)

p1 + theme(legend.position = "none") 
#ggsave(filename = "figures/pca_4pop.png", width = 3.5,height = 3)
```


**Figure 1:** PCA showing population structure for Japanese and Western Australian A. digitifera samples. 

The PCA clearly shows that WA offshore populations are more similar to each other than to the other samples.  The plot also suggests that WA inshore samples are intermediate between WA offshore and Japanese samples.  

```{r pcangsd-nj-tree, include=FALSE}
library(ggtree)
library(treeio)
library(ape)

t <- read.tree("data/hpc/pcangsd/jpwa.tree")

t_clean <- drop.tip(as.phylo(t),tip="46")
st_clean <- sample_table %>% filter(!grepl(sample,pattern="BR_5_121"))

jp_in_mrca <- getMRCA(t_clean,
                      st_clean %>% filter(location %in% c("Inshore","Japan")) %>% pull(number) %>% as.character()
                      )


tr <- root(t_clean,node=jp_in_mrca)
tr <- root(t_clean,outgroup = st_clean %>% filter(location %in% c("Japan")) %>% pull(number) %>% as.character())

tr_plot <- ggtree(tr) %<+% st_clean + 
  geom_tippoint(aes(color=location)) +
#  geom_nodelab(label=id) +
#  geom_tiplab(aes(label=sample)) +
  theme(legend.title = element_blank()) +
  scale_color_manual(values = location_colors) +
  theme(legend.position = "bottom")


#ggsave(tr_plot + theme(legend.position = "none"),filename =  "figures/pcangst_tree.png",width = 3,height = 4)

tree_leg <- get_legend(tr_plot)
ggdraw(tree_leg)
#ggsave("figures/fig1_col_legend.png",height = 1,width = 4)
```


```{r}
# ggtree(t,layout = "circular") %<+% sample_table +
#   geom_tippoint(aes(color=location)) +
# #  geom_nodelab(label=id) +
# #  geom_tiplab(aes(label=sample)) +
#   theme(legend.title = element_blank())
##**Figure 2:** Neighbour joining tree based on distances inferred by PCAngsd and placing Japan at the root. 
```


### Phylogenetic Analysis based on UCE and Exon capture probesets

While the analysis above provides information on relative differentiation between populations of A. digitifera it is useful to consider the distances between these populations in the context of broader phylogenetic differences between species within the Acroporidae. To do this we make use of methodology developed by [@Cowman2020-mo] to examine phylogenetic relationships based on UCE and Exon sequences. Steps in this process include;

1. [Map](data/hpc/iqtree/02_map_probes.sh) the hexa-v2 probeset to genomes of all species (A. digitifera, A. millepora and A. tenuis) and filter mapping using samclip.
2. [Create a bed](data/hpc/iqtree/03_sam2bed.sh) file representing an approximately 1000bp region around each mapped probe
3. [Merge overlapping regions](data/hpc/iqtree/04_extract_bams.sh) resulting from closely spaced probes and extract these from bams (in the case of A. digitifera)
4. [Call consensus sequences](data/hpc/iqtree/05_call_consensus.sh) from bamfiles
5. [Extract fasta files](data/hpc/iqtree/06_extract_uces.sh) corresponding to merged regions
6. [Gather and align homologous sequences](data/hpc/iqtree/07_gather_uces.sh) 
7. [Create a partition file for phylogenomic analysis](data/hpc/iqtree/08_make_partitions.sh)
8. [Run iqtree to find the best models and generate a tree](data/hpc/iqtree/09_iqtree.sh)

The resulting tree (rooted at Acropora tenuis) recovers the expected relative divergences between species (Aten vs Amil > Amil vs Adi).  These divergences (Adi vs Amil ~ 25Mya) are clearly much larger than between population samples of Acropora digitifera, which agrees with our assessment that these are recently diverged.  Notably, samples from Japan, Inshore and Offshore locations do not follow the population structure observed in PCA-based methods which may reflect poor resolution of UCEs at this level. 

```{r}
#ucetree <- read.iqtree("data/hpc/iqtree/partition.nex.best_scheme.nex.treefile")

ucetree <- read.iqtree("data/hpc/iqtree/pomo_partition.nex.best_scheme.nex.treefile")

rucetree <- root.phylo(ucetree@phylo,"aten")#root(ucetree,node="aten")
rucetree <- midpoint.root(ucetree@phylo)#root(ucetree,node="aten")

#rucetree <- phytools::reroot(ucetree@phylo,getMRCA(ucetree@phylo,c("aten","amil")))

uce_st <- data.frame(tiplab=rucetree$tip.label) %>% 
  mutate(newlab = case_when(
    grepl("^adi$",tiplab) ~ "Acropora digitifera : Genome",
    grepl("^aten$",tiplab) ~ "Acropora tenuis : Genome",
        grepl("^D",tiplab) ~ "Acropora digitifera : Japan",
    grepl("^R", tiplab) ~ "Acropora digitifera : SO",
    grepl("^AR", tiplab) ~ "Acropora digitifera : NO",    
    grepl("^AI", tiplab) ~ "Acropora digitifera : IN",    
    grepl("^BR", tiplab) ~ "Acropora digitifera : IN",    
    grepl("^amil", tiplab) ~ "Acropora millepora : Genome",    
  )) %>% 
  separate(newlab,into = c("species","location"), sep=" : ", remove = FALSE)

#ptree <- drop.tip(as.phylo(ucetree),c("amil"))
#ptree <- ucetree
location_colors2 <- c(myCol,"#7C878EFF","#FFCD00FF")
names(location_colors2) <- c("IN","NO","SO","Japan","Genome")
ggtree(rucetree) %<+% uce_st + 
  geom_tippoint(aes(color=location)) + scale_color_manual(values = location_colors2) +
  geom_tiplab(aes(label=newlab),align = F) +
  theme(legend.position = "bottom", legend.title = element_blank()) +
  xlim(NA,0.55) + geom_treescale()

ggsave(filename = "figures/fig-s7.jpg", width = 6.3,height = 5.6)
#ggsave(filename = "figures/iqtree.png")
```

**Figure 2:** Phylogenetic tree based on 1023 UCE and 658 Exon loci from the Hexa-v2 probeset. Tree is a consensus tree generated by iqtree based on a partitioned analysis in which each locus was allowed a different model. 


### Mitochondrial genotyping

In addition to the PCAngsd and UCE analyses above we created a haplotype network based on the mitochondrial genomes of all host samples. To support this we obtained mitochondrial sequences for individual low coverage samples as follows;

- All reads (except duplicates) for each sample were mapped to the *A. digitifera* mitochondrial reference sequence (NC_022830.1) using bwa mem (version 0.7.17).  This produced a small bam file containing only mito reads for each sample. See the script [02_map_reads.sh](data/hpc/mito_mapping/02_map_reads.sh) for the exact commands used.  
- Mitochondrial reads were then used to call variants against the reference sequence using samtools mpileup (version 1.7), followed by bcftools (version 1.9) to call a consensus sequence for the sample. (See script [03_call_consensus.sh](data/hpc/mito_mapping/03_call_consensus.sh))
- Since consensus sequences were all the same length no alignment was necessary and they were simply concatenated to produce an alignment. 
- This set of aligned mitochondrial sequences was used as a set of haplotypes for network visualisation with [PopArt](http://popart.otago.ac.nz/index.shtml)

The resulting haplotype network is shown below. Note how the vast majority of samples share a haplotype and how those haplotypes that differ are (a) spread among locations and (b) mostly attributable to a single mutation from the common haplotype.

![host_mito_network](figures/host_mitohap.jpg)