Skip to content

Latest commit

 

History

History
149 lines (116 loc) · 6.16 KB

25.geva.md

File metadata and controls

149 lines (116 loc) · 6.16 KB

Dating allele emergence using GEVA

To investigate the timing of selection in relation to the emergence of variants at selective sweeps we used GEVA (Albers and McVean 2020) (Genealogical Estimation of Variant Age) to estimate the age of variants at selected loci.

One requirement for GEVA is that the ancestral and derived alleles are identified for each SNP. We used est-sfs (Keightley and Jackson 2018) for this task as follows;

First we created a whole-genome alignment of A. digitifera with two outgroup species, A. millepora and A. tenuis using Cactus (v2.0.5). We then exported the alleles at all snps using halSnps and filtered these to only include those overlapping our SNP callset. These were then used to generate an input file for est-sfs.

After running est-sfs we used bcftools and custom awk scripts to update our phased variant callset by assigning the ancestral allele inferred by est-sfs to the reference allele. This updating process takes care to update genotypes in instances where ref and alt alleles are swapped from their original values. See bash and shell scripts in data/hpc/ancestral_allele/ for details.

The phased vcf file with ref updated to the ancestral allele was then used as input to GEVA. See data/hpc/geva/ for details

Variant consequence calling

We used bcftools csq to predict variance consequences for all SNPs overlapping the haem peroxidase gene s0150.g24. Since we are interested in the consequences of derived alleles we first created a version of the A. digitifera genome in which the bases were altered at all SNP positions to be that of the ancestral allele.

# This uses bcftools to identify ref mismatches in the vcf with aa=ref
bcftools norm -c=w BLFC01000154.1_aaref.vcf.gz -f BLFC01000154.1.fasta 2> ref_mismatches.txt

# This creates a version of the ref with ancestral allele at SNP positions
cat ref_mismatches.txt | awk -f aaref.awk | bioawk -c fastx '{printf(">%s\n%s\n",$name,$seq)}' > BLFC01000154.1_aa.fasta

# Check that the new ref is correct
bcftools norm -c=x BLFC01000154.1_aaref.vcf.gz -f BLFC01000154.1_new.fasta

Consequence calling was then done using the vcf and reference sequences where the AA is encoded as REF.

bcftools csq -f BLFC01000154.1_aa.fasta -g s0150.g24.gff BLFC01000154.1_aaref.vcf.gz -O t > s0150.g24.csq.tsv 

Haem Peroxidase locus on BLFC01000154.1

We used results from GEVA in combination with allele frequency and variant consequence information to investigate variants overlapping s0150.g24 in detail.
A plot of these allele frequencies by population indicates a large number of high-frequency derived alleles in the inshore population compared with the two offshore populations. This is expected since the inshore population was under selection. Interestingly, when we break this down further we see that low frequency alleles in the inshore population are far younger than high frequency ones with the switch occurring between 2000 and 10000 generations ago. This strong dichotomy between young, rare alleles and old frequent ones is not present in the reference (offshore) populations. It suggests that most rare alleles in the inshore population reflect variants that have arisen since the onset of strong selection at this locus.

Combining all this information we create the plot below

As a quality check on the GEVA estimates we examine the number of concordant and discordant pairs available for each allele. We found that in general all alleles had sufficient Discordant pairs, however for low frequency alleles the number of concordant pairs was often quite low, sometimes as low as 1. GEVA did not produce age estimates for alleles with no concordant pairs (ie present on a single haplotype). In general, alleles with few concordant pairs and low frequency were also young, however among background haplotypes we found many alleles at low frequency that were also old.

## # A tibble: 4 × 3
## # Groups:   age_class [2]
##   age_class     haptype    count
##   <fct>         <chr>      <int>
## 1 (0,1.5e+04]   Background    28
## 2 (0,1.5e+04]   Selected      28
## 3 (1.5e+04,Inf] Background   129
## 4 (1.5e+04,Inf] Selected     129

An interesting feature of this plot is the very large number of missense variants in the second exon. Let’s count how many of these there are compared with the background

## # A tibble: 4 × 3
## # Groups:   selection_status [2]
##   selection_status Category   count
##   <chr>            <chr>      <int>
## 1 Background       Missense       1
## 2 Background       Synonymous     3
## 3 Selected         Missense      10
## 4 Selected         Synonymous     7

There are quite a few missense variants in the inshore copy. We can export the resulting protein sequence to see if there are any consequences of this change

bcftools view -S selected_indvs.txt BLFC01000154.1_aaref.vcf.gz > BLFC01000154.1_aaref_inshore.vcf
bgzip BLFC01000154.1_aaref_inshore.vcf 
tabix BLFC01000154.1_aaref_inshore.vcf.gz
bcftools consensus  -f BLFC01000154.1_aa.fasta BLFC01000154.1_aaref_inshore.vcf.gz > BLFC01000154.1_inshore_consensus.fasta
gffread -g BLFC01000154.1_inshore_consensus.fasta -y s0150.g24.protein_inshore.fa s0150.g24.gff
samtools faidx ../annotation/protein.fa adig_s0150.g24.t1 > s0150.g24.protein.fa 

Albers, Patrick K, and Gil McVean. 2020. “Dating Genomic Variants and Shared Ancestry in Population-Scale Sequencing Data.” PLoS Biol. 18 (1): e3000586.

Keightley, Peter D, and Benjamin C Jackson. 2018. “Inferring the Probability of the Derived Vs. The Ancestral Allelic State at a Polymorphic Site.” Genetics 209 (3): 897–906.