This repository contains the code and command lines for the re-sequencing calling pipeline and DFE analysis presented in Verta et al. (2021) (https://academic.oup.com/gbe/advance-article/doi/10.1093/gbe/evab059/6179807). The different analysis and preparation steps are contained in their own subdirectories which are outlined below in rough sequential order.
annotation/: containes the code for downloading the reference genome and coordinate files from NCBI, as well as code to calculate coordinates for sites without coordinates in the GFF file.
genome_alignment/: pipeline for 9 species whole genome alignment - used for polarising SNPs and calculating divergence in order to calculate alpha.
read_mapping/: steps for downloading the raw reads from Barson et al. (2015) and mapping to the reference genome.
training_set/: initial SAMtools and GATK calling to produce a training set for the main GATK calling.
variant_calling/: GATK SNP calling pipeline and generation of callable sites FASTA file.
sfs/: Pipeline to extract frequency data and number of callable sites for different sequence contexts per gene.
summary_stats/: Calculating nucleotide diversity, theta and Tajima's D from SFS data.
dfe/: Estimating the distribution of fitness effects using anavar.
divergence/: Estimating divergence from the genome alignment using APE and estimating alpha with the DFE and divergence estimates.
A lot of the pipeline scripts write and submit jobs on SLURM based CSC computer cluster Puhti (https://docs.csc.fi/computing/overview/).
This behaviour can be changed from submitting the job to only writing the batch file by substituting the line
from qsub import q_sub
with from qsub import q_write as q_sub
at the top of the offending python scripts.
The scripts also make use of a number of python modules:
anavar_utils
: https://github.com/henryjuho/anavar_utils - some code to make working with anavar control files easier.
python_qsub_wrapper
: https://github.com/henryjuho/python_qsub_wrapper - some code to write and submit batch jobs from within the python scripts.
sfs_utils
: https://github.com/henryjuho/sfs_utils - code for extracting frequency data from VCF files
pysam
: https://pysam.readthedocs.io/en/latest/api.html - module for working with VCF, BED, FASTA and other files in python.
There are also the following directories that contain pipelines that did not end up in the paper:
homeoblock_alignments/: Pipeline for aligning homeoblocks in the salmon, not used for any analysis.
coverage_impact/: A sanity check to see if including low coverage individuals to maximise sample size and number of SNPs was skewing the estimated DFE.