10.identify_selective_genomic_windows.Rmd

---
title: "Identify outlying regions in selection analysis"
output: 
  github_document:
bibliography: bibliography.bib
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE, fig.retina = 2)
require(tidyverse)
require(bedr)
require(cowplot)
source("scripts/selscan.R")
require(ggsci)
```

## Windowed EHH tests

[Previously](08.ehh_stats.md), we calculated iHS, XP-EHH and XP-nSL scores for all three populations and visualised their distribution across the genome via Manhattan plots. Since selective sweeps should have consistently outlying values of EHH statistics across a large linked region we calculated the fraction of SNPs in each non-overlapping 50kb window that contained  outlying, (|z-score|>2) scores. This follows the analysis of [@Voight2006] and was performed using the program `norm` included with selscan.  See [08.ehh_stats.md](08.ehh_stats.md) for details of the z-score calculation.

Both the XP-EHH and XP-nSL statistics are designed to compare two populations.  We designed our contrasts so that for each comparison there was always one focal population while the remaining two populations were used as a reference.  This was based on the reasoning that it was unlikely a selective sweep would occur independently at the same locus in two separate populations.  In the selscan outputs populations are coded as A or B.  Our focal population was always "A" so that positive value of the statistic is associated with selection in the focal population.

Regions in the top 1st percentile of extreme scores were picked as candidate selective regions. Note that, while we picked windows with a high proportion of extreme scores, the number of SNPs in each window was also taken into consideration to avoid bias towards windows with less SNPs. This is illustrated in the QC plot below which shows a relatively even spread of selected windows across regions with different snp densities.

```{r windowed-stats}
if ( !file.exists("data/r_essentials/10_windowed_ehh.rds")){
  windowed_ihs <- list.files(path = "data/hpc/selection2/ihs", pattern = "windows$", full.names = TRUE) %>% 
    map_dfr(read_windowed_ihs) %>% 
    add_column(stat = "ihs") %>% 
    dplyr::select(-sd)

  windowed_xpstats <- list.files("data/hpc/selection2",pattern = ".xp[a-z]{3}.norm.50kb.windows$",recursive = TRUE,full.names = TRUE,include.dirs = TRUE) %>% 
      map_dfr(read_windowed_xpstats) %>% 
    dplyr::select(chr,start,end,nsnp,frac=fracA,percentile=percentileA,pop,stat)


  windowed_ehh <- rbind(windowed_ihs,windowed_xpstats)

  write_rds(windowed_ehh,"data/r_essentials/10_windowed_ehh.rds")
} else {
  windowed_ehh <- read_rds("data/r_essentials/10_windowed_ehh.rds")
}

windowed_ihs <- windowed_ehh %>% filter(stat=="ihs")
windowed_xpstats <- windowed_ehh %>% filter(stat %in% c("xpehh","xpnsl"))

all_regions <- windowed_ehh %>% 
  dplyr::select(chr,start,end) %>% 
  distinct() %>% 
  mutate(region_id = paste(chr,":",start,"-",end,sep = "")) %>% 
  column_to_rownames("region_id") %>% 
  add_column(pop="all")
```


```{r, fig.width=12}
windowed_ehh %>% ggplot(aes(x=nsnp,y=frac)) + geom_point() + 
  geom_point(data=windowed_ehh %>% filter( percentile==1), aes(x=nsnp,y=frac), color="red")  + 
  labs(x="Number of sites in window", y= "Proportion of scores with z > 2 in window") + 
  facet_grid(stat~pop)
```

**Figure 1:** Plots display the number of sites in window and the proportion of extreme scores in window. The red dots depict the windows in the 1st percentile where SNP density in each window been controlled.

## Overlap of windows between populations

Our main interest in this study is in selective sweeps that occurred after the split between populations.  For this reason we excluded regions where the iHS test was also significant at 1% level in the other two populations. This step was not necessary for the XP-EHH and XP-nSL tests because these work by comparing levels of EHH in the focal population against the reference and will therefore identify sweeps that exist only in the focal population by design. 

After excluding shared iHS significant signals, less than 1% of genomic regions were identified as candidates in three populations.

```{r bed-conversion, cache=TRUE,results='hide'}
inshore_bed <- windowed_ihs %>% 
  filter(pop=="inshore" & percentile==1) %>% 
  mutate(start=start-1) %>% # Required to convert to 0 base coords (BED format)
  dplyr::select(chr,start,end) %>% 
  as.data.frame 

northoffshore_bed <- windowed_ihs %>% 
  filter(pop=="northoffshore" & percentile==1) %>% 
  mutate(start=start-1) %>% 
    dplyr::select(chr,start,end) %>% 
  as.data.frame

southoffshore_bed <- windowed_ihs %>% 
  filter(pop=="southoffshore" & percentile==1) %>% 
  mutate(start=start-1) %>% 
    dplyr::select(chr,start,end) %>% 
  as.data.frame
```


```{r bedtools-subtract, cache=TRUE,results='hide'}
inshore_unique <- bedr.subtract.region(inshore_bed, northoffshore_bed,check.chr = FALSE) %>%
  bedr.subtract.region(southoffshore_bed,check.chr=FALSE) %>% 
  bedr.merge.region(check.chr = FALSE) %>% 
  add_column(pop="inshore")

northoffshore_unique <- bedr.subtract.region(northoffshore_bed,inshore_bed,check.chr = FALSE) %>%
  bedr.subtract.region(southoffshore_bed,check.chr=FALSE) %>% 
  bedr.merge.region(check.chr = FALSE) %>%
  add_column(pop="northoffshore")

southoffshore_unique <- bedr.subtract.region(southoffshore_bed,northoffshore_bed,check.chr = FALSE) %>%
  bedr.subtract.region(inshore_bed,check.chr=FALSE) %>% 
  bedr.merge.region(check.chr = FALSE) %>%
  add_column(pop="southoffshore")
```

**Table 1:** The proportion of genome regions that are uniquely under selective sweep in three populations
```{r ihs_genomics_region-table}
all_ihs_unique <- rbind(inshore_unique,northoffshore_unique,southoffshore_unique) %>% 
  mutate(pop=case_when(
    pop=="inshore"~"Inshore", 
    pop=="northoffshore"~"North Offshore", 
    pop=="southoffshore"~"South Offshore")
    ) %>% 
  mutate(length=end-start) 

all_ihs_unique %>% 
  group_by(pop) %>% 
  dplyr::summarise(total=sum(length/1e6), `Number of Regions` = n()) %>% 
  mutate("Percentage of Total Genome (%)"=total/416*100) %>% 
  rename_all(recode,pop="Population", total="Total length (Mb) of genomic regions") %>% 
  knitr::kable(escape = F)
```


## Consolidated list of all selected regions

For each population we then created a list of all candidate regions under selection by merging intervals that are adjacent, or separated by no more than 100kb across all of the EHH statistics.

```{r, results='hide'}
consolidate_intervals <- function(focal_pop){
  ihs_bed <- rbind(inshore_unique,northoffshore_unique,southoffshore_unique) %>% 
    filter(pop==focal_pop)
  
  xp_bed <- windowed_xpstats %>% 
    filter(percentile == 1 & pop==focal_pop) %>% 
    mutate(start=start-1) %>% # Required to convert to 0 base coords (BED format)
    dplyr::select(chr,start,end) %>% 
    as.data.frame

  all_bed <- rbind(ihs_bed %>% dplyr::select(chr,start,end),xp_bed) %>% 
    bedr.sort.region(check.chr = FALSE)


  bedr.merge.region(all_bed, check.chr = FALSE, distance = 100001) %>% 
    bedr.sort.region(check.chr = FALSE) %>% 
    add_column(pop=focal_pop)
}

inshore_candidate_regions <- consolidate_intervals("inshore")
northoffshore_candidate_regions <- consolidate_intervals("northoffshore")
southoffshore_candidate_regions <- consolidate_intervals("southoffshore")
```

### Annotating candidate regions

To facilitate downstream analyses we annotated candidate regions by

  - Identifying overlapping genes
  - Information from `selscan` used to assess the region (fraction of extreme values, stats in which the region was in the 1st percentile)
  - Gene names from Uniprot
  
The complete list of these annotated candidate regions is available as a supplementary table in the paper

These analyses rely on a BED formatted file with the genomic coordinates of all genes.

```bash
cat ../../genome/adig-v2-ncbi.gff | awk 'BEGIN{OFS="\t"}$3=="gene"{print $1,$4,$5,$9}' | sed 's/ID=//' | sort -u  > allgenes.bed
```


```{r results='hide'}
# Annotate regions with gene information

allgenes <- read_tsv("data/hpc/annotation/allgenes.bed",col_names = c("chr","start","end","geneid")) %>% 
  as.data.frame %>% 
  bedr.sort.region(check.chr = FALSE)


genes_in_region <- function(regions){
  regions %>% 
#  mutate(chr=str_remove(chr,"\\.1")) %>% 
  bedr.join.region(allgenes, check.chr = FALSE) %>% 
  filter(geneid!=".") %>% 
  dplyr::select(chr,start,end,pop,geneid) %>% 
  right_join(regions) %>% mutate(chr=str_remove(chr,"\\.1"))  
}

inshore_genes <- genes_in_region(inshore_candidate_regions)
northoffshore_genes <- genes_in_region(northoffshore_candidate_regions)
southoffshore_genes <- genes_in_region(southoffshore_candidate_regions)

# This is so we can construct GO terms by region in later analyses
#
all_regions %>% bedr.sort.region(check.chr = FALSE) %>% 
  genes_in_region() %>% 
  write_rds("cache/all_regions_genes.rds")

candidate_regions_genes <- rbind(inshore_genes,northoffshore_genes,southoffshore_genes) %>%  
  group_by(chr, start, end, pop) %>% 
  summarise(genes=paste(geneid %>% unique,collapse = ";")) %>% 
  arrange(pop) %>% 
  ungroup
```

```{r ,results='hide'}
add_ehh_window_annotation <- function(focal_pop){

  sorted_ehh_windows <- windowed_ehh %>%
    filter(percentile==1 & pop==focal_pop) %>% 
    dplyr::select(-pop) %>% 
    mutate(chr=str_remove(chr,"\\.1")) %>% 
    as.data.frame() %>% 
    bedr.sort.region(check.chr = FALSE)

  candidate_regions_genes %>% 
    filter(pop==focal_pop) %>% 
    as.data.frame() %>% 
    bedr.sort.region(check.chr = FALSE) %>% 
    bedr.join.region(sorted_ehh_windows,check.chr = FALSE, check.merge = FALSE) %>% 
    mutate(frac=as.numeric(frac)) %>% 
    group_by(chr,start,end,pop,genes) %>% 
    summarise(frac = mean(frac), stats = paste(unique(stat),collapse = ";"))
}

candidate_regions_genes_ehhwin <- c("inshore","northoffshore","southoffshore") %>% map_dfr(add_ehh_window_annotation)
```

```{r, cache=TRUE, cache.lazy = FALSE, results='hide'}
# Annotate each region with the maximum score from each stat and the position of the corresponding snp

add_ehh_raw_annotation <- function(focal_pop,raw_stats,regions){
  ihs_stats <- raw_stats %>%  
    filter(stat=="ihs")
  
  xpehh_stats <- raw_stats %>% 
    filter(stat=="xpehh") %>%
    filter(startsWith(pop,focal_pop)) %>% 
    dplyr::select(-pop)
  
  xpnsl_stats <- raw_stats %>% 
    filter(stat=="xpnsl") %>%
    filter(startsWith(pop,focal_pop)) %>% 
    dplyr::select(-pop)
  
  tmp <- regions %>% 
    filter(pop==focal_pop)
  
  tmp_ihs <- regions %>% 
    filter(pop==focal_pop) %>% 
    left_join(ihs_stats) %>% 
    filter(pos>=start & pos<=end) %>% 
    group_by(chr,start,end,pop,genes,frac,stats) %>% 
    mutate(max_ihs_pos=pos[which.max(norm_value)], max_ihs=max(norm_value)) %>% 
    dplyr::select(-pos,-norm_value,-pval,-stat) %>% 
    unique() %>% 
    ungroup
  
  tmp_xpehh <- regions %>% 
    filter(pop==focal_pop) %>% 
    left_join(xpehh_stats) %>% 
    filter(pos>=start & pos<=end) %>% 
    group_by(chr,start,end,pop,genes,frac,stats) %>% 
    mutate(max_xpehh_pos=pos[which.max(norm_value)], max_xpehh=max(norm_value)) %>% 
    dplyr::select(-pos,-norm_value,-pval,-stat) %>% 
    unique() %>% 
    ungroup
  
  tmp_xpnsl <- regions %>% 
    filter(pop==focal_pop) %>% 
    left_join(xpnsl_stats) %>% 
    filter(pos>=start & pos<=end) %>% 
    group_by(chr,start,end,pop,genes,frac,stats) %>% 
    mutate(max_xpnsl_pos=pos[which.max(norm_value)], max_xpnsl=max(norm_value)) %>% 
    dplyr::select(-pos,-norm_value,-pval,-stat) %>% 
    unique() %>% 
    ungroup
  
  left_join(tmp,tmp_ihs) %>% left_join(tmp_xpehh) %>% left_join(tmp_xpnsl)
}

if ( !file.exists("data/r_essentials/candidate_regions_genes_ehh.rds")){
  raw_ehh_stats <- list.files("data/hpc/selection2",pattern = "*.norm$",recursive = TRUE,full.names = TRUE) %>% 
    map_dfr(read_ehh_stat) %>% 
    mutate(chr=str_remove(chr,"\\.1")) %>% 
    dplyr::rename(pos=chr_pos)

  candidate_regions_genes_ehh <- c("inshore","northoffshore","southoffshore") %>% 
    map_dfr(add_ehh_raw_annotation,raw_ehh_stats,candidate_regions_genes_ehhwin)

  candidate_regions_genes_ehh %>% write_rds("data/r_essentials/candidate_regions_genes_ehh.rds")
} else {
  candidate_regions_genes_ehh <- read_rds("data/r_essentials/candidate_regions_genes_ehh.rds")
}
```

```{r, eval=FALSE}
# Write candidate regions for mapping to pseudochromosome coords with ragtag
candidate_regions_genes_ehh %>% mutate(chr=paste(chr,".1",sep="")) %>% write_tsv("data/hpc/ragtag/ehh_sweeps.tsv",col_names = FALSE)

# Also write as a gff track
candidate_regions_genes_ehh %>% 
  add_column(source="selscan") %>% 
  add_column(feature="cds") %>% 
  add_column(strand="+") %>% 
  add_column(phase=".") %>% 
  mutate(start=start+1) %>% 
  mutate(chr=paste(chr,".1",collapse = "",sep="")) %>% 
  mutate(ID = paste(chr,":",start,"-",end,collapse="",sep="")) %>% 
  mutate(attributes = paste("ID=",ID,";genes=",genes,";pop=",pop,collapse = "",sep = "")) %>% 
  ungroup() %>% 
  dplyr::select(chr,source,feature,start,end,frac,strand,phase,attributes) %>% 
    write_tsv("data/hpc/tracks/sweeps.gff3",col_names = FALSE)
```

For the purposes of making a human viewable table in supp info we also attach gene names from uniprot. The final table is provided as Supp Table 5 in the paper

```{r ehh_genes_table}
# You must run 09_annotate_genes first
uniprot_gene_annot <- read_tsv("data/hpc/annotation/uniprot_gene_annot.tsv")

supp_table4 <- candidate_regions_genes_ehh %>% 
  separate_rows(genes,sep=";") %>% 
  left_join(uniprot_gene_annot, by = c("genes"="geneid") ) %>% 
  group_by(chr,start,end,pop,frac,stats,max_ihs_pos,max_ihs,max_xpehh_pos,max_xpehh,max_xpnsl_pos,max_xpnsl) %>% 
  dplyr::summarise(protein_names = paste(unique(protein), collapse = ";"), gene_ids = paste(unique(genes),collapse = ";")) %>% 
  mutate(pop=case_when(pop=="inshore"~"Inshore", pop=="northoffshore"~"North Offshore", pop=="southoffshore"~"South Offshore")) %>% 
  dplyr::rename_all(recode,chr="Scaffold",start="Start", end="End",pop="Population",protein_names="Gene names in this window",gene_ids="Gene ids in this window",frac="Fraction with z-score > 2") %>% 
  ungroup()

write_tsv(supp_table4,"cache/supp_table4.tsv") # For upload to google sheets  

supp_table5 <- candidate_regions_genes_ehh %>% 
  separate_rows(genes,sep=";") %>% 
  left_join(uniprot_gene_annot, by = c("genes"="geneid") ) %>%
  rename_all(recode,chr="Scaffold",start="Start", end="End",pop="Population", genes="Gene ID",frac="Fraction with z-score > 2", protein="Protein name",go="go_swissprot") %>% 
  dplyr::select(-genename) 

write_tsv(supp_table5,"cache/supp_table5.tsv")
```