-
Notifications
You must be signed in to change notification settings - Fork 2
/
10.identify_selective_genomic_windows.Rmd
343 lines (268 loc) · 14.4 KB
/
10.identify_selective_genomic_windows.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
---
title: "Identify outlying regions in selection analysis"
output:
github_document:
bibliography: bibliography.bib
---
```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = FALSE, warning = FALSE, message = FALSE, fig.retina = 2)
require(tidyverse)
require(bedr)
require(cowplot)
source("scripts/selscan.R")
require(ggsci)
```
## Windowed EHH tests
[Previously](08.ehh_stats.md), we calculated iHS, XP-EHH and XP-nSL scores for all three populations and visualised their distribution across the genome via Manhattan plots. Since selective sweeps should have consistently outlying values of EHH statistics across a large linked region we calculated the fraction of SNPs in each non-overlapping 50kb window that contained outlying, (|z-score|>2) scores. This follows the analysis of [@Voight2006] and was performed using the program `norm` included with selscan. See [08.ehh_stats.md](08.ehh_stats.md) for details of the z-score calculation.
Both the XP-EHH and XP-nSL statistics are designed to compare two populations. We designed our contrasts so that for each comparison there was always one focal population while the remaining two populations were used as a reference. This was based on the reasoning that it was unlikely a selective sweep would occur independently at the same locus in two separate populations. In the selscan outputs populations are coded as A or B. Our focal population was always "A" so that positive value of the statistic is associated with selection in the focal population.
Regions in the top 1st percentile of extreme scores were picked as candidate selective regions. Note that, while we picked windows with a high proportion of extreme scores, the number of SNPs in each window was also taken into consideration to avoid bias towards windows with less SNPs. This is illustrated in the QC plot below which shows a relatively even spread of selected windows across regions with different snp densities.
```{r windowed-stats}
if ( !file.exists("data/r_essentials/10_windowed_ehh.rds")){
windowed_ihs <- list.files(path = "data/hpc/selection2/ihs", pattern = "windows$", full.names = TRUE) %>%
map_dfr(read_windowed_ihs) %>%
add_column(stat = "ihs") %>%
dplyr::select(-sd)
windowed_xpstats <- list.files("data/hpc/selection2",pattern = ".xp[a-z]{3}.norm.50kb.windows$",recursive = TRUE,full.names = TRUE,include.dirs = TRUE) %>%
map_dfr(read_windowed_xpstats) %>%
dplyr::select(chr,start,end,nsnp,frac=fracA,percentile=percentileA,pop,stat)
windowed_ehh <- rbind(windowed_ihs,windowed_xpstats)
write_rds(windowed_ehh,"data/r_essentials/10_windowed_ehh.rds")
} else {
windowed_ehh <- read_rds("data/r_essentials/10_windowed_ehh.rds")
}
windowed_ihs <- windowed_ehh %>% filter(stat=="ihs")
windowed_xpstats <- windowed_ehh %>% filter(stat %in% c("xpehh","xpnsl"))
all_regions <- windowed_ehh %>%
dplyr::select(chr,start,end) %>%
distinct() %>%
mutate(region_id = paste(chr,":",start,"-",end,sep = "")) %>%
column_to_rownames("region_id") %>%
add_column(pop="all")
```
```{r, fig.width=12}
windowed_ehh %>% ggplot(aes(x=nsnp,y=frac)) + geom_point() +
geom_point(data=windowed_ehh %>% filter( percentile==1), aes(x=nsnp,y=frac), color="red") +
labs(x="Number of sites in window", y= "Proportion of scores with z > 2 in window") +
facet_grid(stat~pop)
```
**Figure 1:** Plots display the number of sites in window and the proportion of extreme scores in window. The red dots depict the windows in the 1st percentile where SNP density in each window been controlled.
## Overlap of windows between populations
Our main interest in this study is in selective sweeps that occurred after the split between populations. For this reason we excluded regions where the iHS test was also significant at 1% level in the other two populations. This step was not necessary for the XP-EHH and XP-nSL tests because these work by comparing levels of EHH in the focal population against the reference and will therefore identify sweeps that exist only in the focal population by design.
After excluding shared iHS significant signals, less than 1% of genomic regions were identified as candidates in three populations.
```{r bed-conversion, cache=TRUE,results='hide'}
inshore_bed <- windowed_ihs %>%
filter(pop=="inshore" & percentile==1) %>%
mutate(start=start-1) %>% # Required to convert to 0 base coords (BED format)
dplyr::select(chr,start,end) %>%
as.data.frame
northoffshore_bed <- windowed_ihs %>%
filter(pop=="northoffshore" & percentile==1) %>%
mutate(start=start-1) %>%
dplyr::select(chr,start,end) %>%
as.data.frame
southoffshore_bed <- windowed_ihs %>%
filter(pop=="southoffshore" & percentile==1) %>%
mutate(start=start-1) %>%
dplyr::select(chr,start,end) %>%
as.data.frame
```
```{r bedtools-subtract, cache=TRUE,results='hide'}
inshore_unique <- bedr.subtract.region(inshore_bed, northoffshore_bed,check.chr = FALSE) %>%
bedr.subtract.region(southoffshore_bed,check.chr=FALSE) %>%
bedr.merge.region(check.chr = FALSE) %>%
add_column(pop="inshore")
northoffshore_unique <- bedr.subtract.region(northoffshore_bed,inshore_bed,check.chr = FALSE) %>%
bedr.subtract.region(southoffshore_bed,check.chr=FALSE) %>%
bedr.merge.region(check.chr = FALSE) %>%
add_column(pop="northoffshore")
southoffshore_unique <- bedr.subtract.region(southoffshore_bed,northoffshore_bed,check.chr = FALSE) %>%
bedr.subtract.region(inshore_bed,check.chr=FALSE) %>%
bedr.merge.region(check.chr = FALSE) %>%
add_column(pop="southoffshore")
```
**Table 1:** The proportion of genome regions that are uniquely under selective sweep in three populations
```{r ihs_genomics_region-table}
all_ihs_unique <- rbind(inshore_unique,northoffshore_unique,southoffshore_unique) %>%
mutate(pop=case_when(
pop=="inshore"~"Inshore",
pop=="northoffshore"~"North Offshore",
pop=="southoffshore"~"South Offshore")
) %>%
mutate(length=end-start)
all_ihs_unique %>%
group_by(pop) %>%
dplyr::summarise(total=sum(length/1e6), `Number of Regions` = n()) %>%
mutate("Percentage of Total Genome (%)"=total/416*100) %>%
rename_all(recode,pop="Population", total="Total length (Mb) of genomic regions") %>%
knitr::kable(escape = F)
```
## Consolidated list of all selected regions
For each population we then created a list of all candidate regions under selection by merging intervals that are adjacent, or separated by no more than 100kb across all of the EHH statistics.
```{r, results='hide'}
consolidate_intervals <- function(focal_pop){
ihs_bed <- rbind(inshore_unique,northoffshore_unique,southoffshore_unique) %>%
filter(pop==focal_pop)
xp_bed <- windowed_xpstats %>%
filter(percentile == 1 & pop==focal_pop) %>%
mutate(start=start-1) %>% # Required to convert to 0 base coords (BED format)
dplyr::select(chr,start,end) %>%
as.data.frame
all_bed <- rbind(ihs_bed %>% dplyr::select(chr,start,end),xp_bed) %>%
bedr.sort.region(check.chr = FALSE)
bedr.merge.region(all_bed, check.chr = FALSE, distance = 100001) %>%
bedr.sort.region(check.chr = FALSE) %>%
add_column(pop=focal_pop)
}
inshore_candidate_regions <- consolidate_intervals("inshore")
northoffshore_candidate_regions <- consolidate_intervals("northoffshore")
southoffshore_candidate_regions <- consolidate_intervals("southoffshore")
```
### Annotating candidate regions
To facilitate downstream analyses we annotated candidate regions by
- Identifying overlapping genes
- Information from `selscan` used to assess the region (fraction of extreme values, stats in which the region was in the 1st percentile)
- Gene names from Uniprot
The complete list of these annotated candidate regions is available as a supplementary table in the paper
These analyses rely on a BED formatted file with the genomic coordinates of all genes.
```bash
cat ../../genome/adig-v2-ncbi.gff | awk 'BEGIN{OFS="\t"}$3=="gene"{print $1,$4,$5,$9}' | sed 's/ID=//' | sort -u > allgenes.bed
```
```{r results='hide'}
# Annotate regions with gene information
allgenes <- read_tsv("data/hpc/annotation/allgenes.bed",col_names = c("chr","start","end","geneid")) %>%
as.data.frame %>%
bedr.sort.region(check.chr = FALSE)
genes_in_region <- function(regions){
regions %>%
# mutate(chr=str_remove(chr,"\\.1")) %>%
bedr.join.region(allgenes, check.chr = FALSE) %>%
filter(geneid!=".") %>%
dplyr::select(chr,start,end,pop,geneid) %>%
right_join(regions) %>% mutate(chr=str_remove(chr,"\\.1"))
}
inshore_genes <- genes_in_region(inshore_candidate_regions)
northoffshore_genes <- genes_in_region(northoffshore_candidate_regions)
southoffshore_genes <- genes_in_region(southoffshore_candidate_regions)
# This is so we can construct GO terms by region in later analyses
#
all_regions %>% bedr.sort.region(check.chr = FALSE) %>%
genes_in_region() %>%
write_rds("cache/all_regions_genes.rds")
candidate_regions_genes <- rbind(inshore_genes,northoffshore_genes,southoffshore_genes) %>%
group_by(chr, start, end, pop) %>%
summarise(genes=paste(geneid %>% unique,collapse = ";")) %>%
arrange(pop) %>%
ungroup
```
```{r ,results='hide'}
add_ehh_window_annotation <- function(focal_pop){
sorted_ehh_windows <- windowed_ehh %>%
filter(percentile==1 & pop==focal_pop) %>%
dplyr::select(-pop) %>%
mutate(chr=str_remove(chr,"\\.1")) %>%
as.data.frame() %>%
bedr.sort.region(check.chr = FALSE)
candidate_regions_genes %>%
filter(pop==focal_pop) %>%
as.data.frame() %>%
bedr.sort.region(check.chr = FALSE) %>%
bedr.join.region(sorted_ehh_windows,check.chr = FALSE, check.merge = FALSE) %>%
mutate(frac=as.numeric(frac)) %>%
group_by(chr,start,end,pop,genes) %>%
summarise(frac = mean(frac), stats = paste(unique(stat),collapse = ";"))
}
candidate_regions_genes_ehhwin <- c("inshore","northoffshore","southoffshore") %>% map_dfr(add_ehh_window_annotation)
```
```{r, cache=TRUE, cache.lazy = FALSE, results='hide'}
# Annotate each region with the maximum score from each stat and the position of the corresponding snp
add_ehh_raw_annotation <- function(focal_pop,raw_stats,regions){
ihs_stats <- raw_stats %>%
filter(stat=="ihs")
xpehh_stats <- raw_stats %>%
filter(stat=="xpehh") %>%
filter(startsWith(pop,focal_pop)) %>%
dplyr::select(-pop)
xpnsl_stats <- raw_stats %>%
filter(stat=="xpnsl") %>%
filter(startsWith(pop,focal_pop)) %>%
dplyr::select(-pop)
tmp <- regions %>%
filter(pop==focal_pop)
tmp_ihs <- regions %>%
filter(pop==focal_pop) %>%
left_join(ihs_stats) %>%
filter(pos>=start & pos<=end) %>%
group_by(chr,start,end,pop,genes,frac,stats) %>%
mutate(max_ihs_pos=pos[which.max(norm_value)], max_ihs=max(norm_value)) %>%
dplyr::select(-pos,-norm_value,-pval,-stat) %>%
unique() %>%
ungroup
tmp_xpehh <- regions %>%
filter(pop==focal_pop) %>%
left_join(xpehh_stats) %>%
filter(pos>=start & pos<=end) %>%
group_by(chr,start,end,pop,genes,frac,stats) %>%
mutate(max_xpehh_pos=pos[which.max(norm_value)], max_xpehh=max(norm_value)) %>%
dplyr::select(-pos,-norm_value,-pval,-stat) %>%
unique() %>%
ungroup
tmp_xpnsl <- regions %>%
filter(pop==focal_pop) %>%
left_join(xpnsl_stats) %>%
filter(pos>=start & pos<=end) %>%
group_by(chr,start,end,pop,genes,frac,stats) %>%
mutate(max_xpnsl_pos=pos[which.max(norm_value)], max_xpnsl=max(norm_value)) %>%
dplyr::select(-pos,-norm_value,-pval,-stat) %>%
unique() %>%
ungroup
left_join(tmp,tmp_ihs) %>% left_join(tmp_xpehh) %>% left_join(tmp_xpnsl)
}
if ( !file.exists("data/r_essentials/candidate_regions_genes_ehh.rds")){
raw_ehh_stats <- list.files("data/hpc/selection2",pattern = "*.norm$",recursive = TRUE,full.names = TRUE) %>%
map_dfr(read_ehh_stat) %>%
mutate(chr=str_remove(chr,"\\.1")) %>%
dplyr::rename(pos=chr_pos)
candidate_regions_genes_ehh <- c("inshore","northoffshore","southoffshore") %>%
map_dfr(add_ehh_raw_annotation,raw_ehh_stats,candidate_regions_genes_ehhwin)
candidate_regions_genes_ehh %>% write_rds("data/r_essentials/candidate_regions_genes_ehh.rds")
} else {
candidate_regions_genes_ehh <- read_rds("data/r_essentials/candidate_regions_genes_ehh.rds")
}
```
```{r, eval=FALSE}
# Write candidate regions for mapping to pseudochromosome coords with ragtag
candidate_regions_genes_ehh %>% mutate(chr=paste(chr,".1",sep="")) %>% write_tsv("data/hpc/ragtag/ehh_sweeps.tsv",col_names = FALSE)
# Also write as a gff track
candidate_regions_genes_ehh %>%
add_column(source="selscan") %>%
add_column(feature="cds") %>%
add_column(strand="+") %>%
add_column(phase=".") %>%
mutate(start=start+1) %>%
mutate(chr=paste(chr,".1",collapse = "",sep="")) %>%
mutate(ID = paste(chr,":",start,"-",end,collapse="",sep="")) %>%
mutate(attributes = paste("ID=",ID,";genes=",genes,";pop=",pop,collapse = "",sep = "")) %>%
ungroup() %>%
dplyr::select(chr,source,feature,start,end,frac,strand,phase,attributes) %>%
write_tsv("data/hpc/tracks/sweeps.gff3",col_names = FALSE)
```
For the purposes of making a human viewable table in supp info we also attach gene names from uniprot. The final table is provided as Supp Table 5 in the paper
```{r ehh_genes_table}
# You must run 09_annotate_genes first
uniprot_gene_annot <- read_tsv("data/hpc/annotation/uniprot_gene_annot.tsv")
supp_table4 <- candidate_regions_genes_ehh %>%
separate_rows(genes,sep=";") %>%
left_join(uniprot_gene_annot, by = c("genes"="geneid") ) %>%
group_by(chr,start,end,pop,frac,stats,max_ihs_pos,max_ihs,max_xpehh_pos,max_xpehh,max_xpnsl_pos,max_xpnsl) %>%
dplyr::summarise(protein_names = paste(unique(protein), collapse = ";"), gene_ids = paste(unique(genes),collapse = ";")) %>%
mutate(pop=case_when(pop=="inshore"~"Inshore", pop=="northoffshore"~"North Offshore", pop=="southoffshore"~"South Offshore")) %>%
dplyr::rename_all(recode,chr="Scaffold",start="Start", end="End",pop="Population",protein_names="Gene names in this window",gene_ids="Gene ids in this window",frac="Fraction with z-score > 2") %>%
ungroup()
write_tsv(supp_table4,"cache/supp_table4.tsv") # For upload to google sheets
supp_table5 <- candidate_regions_genes_ehh %>%
separate_rows(genes,sep=";") %>%
left_join(uniprot_gene_annot, by = c("genes"="geneid") ) %>%
rename_all(recode,chr="Scaffold",start="Start", end="End",pop="Population", genes="Gene ID",frac="Fraction with z-score > 2", protein="Protein name",go="go_swissprot") %>%
dplyr::select(-genename)
write_tsv(supp_table5,"cache/supp_table5.tsv")
```