Snakemake Partial Genome Sequencing Pipeline

Pipeline for processing Illumina sequencing data generated by target enrichment via hybrid capture experiments. It closely follows the Phyluce methodology outlined in Tutorial I: UCE Phylogenomics.

  1. Trims Illumina adapters and merges reads (BBDuk, BBMerge)
  2. Assembles trimmed and merged reads (Abyss, SPAdes, rnaSPAdes)
  3. Detects and extracts target contigs (Phyluce)
  4. Computes summary statistics on targets and assemblies (BBTools Stats)
  5. Provides optional scripts and starting points for phylogenetic inference

Prerequisites

wget https://repo.continuum.io/miniconda/Miniconda3-latest-Linux-x86_64.sh
bash Miniconda3-latest-Linux-x86_64.sh
conda install -c bioconda -c conda-forge snakemake
  • Git

Getting Started

Within a working directory:

git clone https://github.com/AAFC-BICoE/snakemake-partial-genome-pipeline.git .
  • Create a folder named "fastq" that contains Illumina-based raw reads in fastq.gz format. Fastq file names should not begin with numbers or contain a mix of "_" and "-" characters.
  • Create a folder named "probes" that contains a probe fasta file with fasta headers in Phyluce UCE format:
>uce-1_p1
GCTGGTTATC...
>uce-1_p2
TAACAATA....
>uce-2_p1
AAGCATCT...
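
The naming rules above can be checked before launching Snakemake. The following is a minimal sketch, not part of the pipeline: the function names and the header pattern are assumptions inferred from the example headers shown above, so adjust them if your probe set uses a different naming scheme.

```python
import re

# Assumed header pattern, inferred from the '>uce-1_p1' examples above.
# Anything after the first whitespace in a header is ignored.
UCE_HEADER = re.compile(r">uce-\d+_p\d+(\s|$)")


def fastq_name_problems(filename):
    """Return a list of reasons a fastq file name may break the pipeline."""
    problems = []
    if not filename.endswith(".fastq.gz"):
        problems.append("not in fastq.gz format")
    if filename[:1].isdigit():
        problems.append("begins with a number")
    if "_" in filename and "-" in filename:
        problems.append("mixes '_' and '-' characters")
    return problems


def bad_probe_headers(fasta_lines):
    """Return fasta header lines that do not look like '>uce-1_p1'."""
    return [
        line.strip()
        for line in fasta_lines
        if line.startswith(">") and not UCE_HEADER.match(line.strip())
    ]
```

Both helpers return an empty list when nothing looks wrong, so they can be run over the contents of the "fastq" folder and the lines of the probe file before the dry run.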

Perform a dry run to check that everything is prepared correctly:

snakemake --use-conda -n

To run pipeline with 32 cores and continue if some samples fail:

snakemake --use-conda -k --cores 32 

To save time on future runs, a central folder of conda environments can be specified so that the environments do not need to be rebuilt each time. This feature has a path length limit, so ensure the central folder is located in the home directory:

snakemake --use-conda --conda-prefix <Path To Snakemake Conda Envs> --cores 32

Pipeline Overview

[Figure: pipeline overview diagram]

Pipeline Summary

This pipeline was heavily inspired by, and closely follows, protocols developed by Dr. Brant Faircloth and described in Tutorial I: UCE Phylogenomics. Software versions are listed in the Conda yml environment files, and the specific parameters and commands are in the Snakefile.

Illumina paired-end reads from target enrichment sequencing are trimmed of adapters using BBDuk. A copy of the trimmed fastq reads is merged using BBMerge. The unmerged reads are assembled using SPAdes, rnaSPAdes and Abyss. Merging paired-end reads prior to assembly with Abyss had a noticeable impact on the number of targets detected by Phyluce, whereas merging had negligible impact with SPAdes and rnaSPAdes. Therefore, the merged reads are assembled using Abyss only, giving four assemblies per sample.

Phyluce, along with the corresponding probe set used in the target enrichment experiment, is used to process each assembly independently. This generates four separate Phyluce databases of probe hits and UCE target contigs. Due to the heavy variation in target detection depending on assembly method, we opted to combine all detected targets into a unique set per sample. The custom script merge_uces.py examines each sample and all UCEs detected across the four assemblies. It combines all targets and, for any target found in multiple assemblies, keeps only the longest. This unique set of merged targets dramatically increases the amount of data available for phylogeny. However, the original per-assembly results remain available for processing if required.
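
The "keep the longest copy" rule can be illustrated with a short sketch. This is not the code in merge_uces.py; the function name, the input layout (a dict of targets per assembly) and the assembly labels are assumptions made for illustration only.

```python
def merge_longest_targets(assemblies):
    """Combine UCE targets from several assemblies of one sample.

    `assemblies` maps an assembly label to a dict of {uce_name: sequence}.
    Returns {uce_name: (assembly_label, sequence)}, keeping the longest
    sequence whenever a UCE was recovered by more than one assembly.
    """
    merged = {}
    for label, targets in assemblies.items():
        for uce, seq in targets.items():
            # Only a strictly longer sequence replaces the current one,
            # so ties keep the first assembly encountered.
            if uce not in merged or len(seq) > len(merged[uce][1]):
                merged[uce] = (label, seq)
    return merged


# Hypothetical example data for a single sample.
sample = {
    "spades": {"uce-1": "ACGTACGT", "uce-2": "ACG"},
    "abyss_merged": {"uce-2": "ACGTAC", "uce-3": "TTGA"},
}
merged = merge_longest_targets(sample)
```

In this example the merged set contains three targets, with uce-2 taken from the assembly that produced the longer contig.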

The merged targets are concatenated into a single file that substitutes for the Phyluce-generated all-taxa-incomplete.fasta file, which is the entry point for the Phyluce phylogeny workflow. A rapid phylogeny is generated for quality control examination. Example commands are provided in the script phylogeny.sh. Phyluce aligns all UCE targets using Mafft, trims the alignments using Gblocks, and removes any targets present in fewer than 50% of samples. The generated phylip file serves as the entry point for RAxML or IQTree, which produces a rapid phylogeny for the purposes of quality control and detecting sample or sequencing errors.
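
The 50% occupancy filter is performed by Phyluce itself; the sketch below only illustrates the rule so the effect of the threshold is easy to reason about. The function name and input layout are hypothetical.

```python
def filter_by_occupancy(locus_to_samples, n_samples, min_fraction=0.5):
    """Keep loci found in at least `min_fraction` of all samples.

    `locus_to_samples` maps a locus name to the set of samples in which
    that locus was recovered.
    """
    return {
        locus: samples
        for locus, samples in locus_to_samples.items()
        if len(samples) / n_samples >= min_fraction
    }


# Hypothetical example with four samples (A-D).
loci = {
    "uce-1": {"A", "B", "C", "D"},
    "uce-2": {"A", "B"},
    "uce-3": {"A"},
}
kept = filter_by_occupancy(loci, n_samples=4)
```

Here uce-3 is dropped because it was recovered in only one of four samples, while uce-2 sits exactly on the 50% threshold and is kept.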

Author

Jackson Eyres
Bioinformatics Programmer
Agriculture & Agri-Food Canada
jackson.eyres@canada.ca

Copyright

Government of Canada, Agriculture & Agri-Food Canada

License

This project is licensed under the MIT License - see the LICENSE file for details

Publications & Additional Resources

  1. Brunke, A. J., Hansen, A. K., Salnitska, M., Kypke, J. L., Escalona, H., Chapados, J. T., Eyres, J., Richter, R., Smetana, A., Ślipiński, A., Zwick, A., Hájek, J., Leschen, R., Solodovnikov, A. and Dettman, J. R. The limits of Quediini at last (Coleoptera: Staphylinidae: Staphylininae): a rove beetle mega-radiation resolved by comprehensive sampling and anchored phylogenomics. Systematic Entomology. Accepted. 1–36.
  2. Dr. Adam Brunke provides some further custom phylogeny instructions

Known Issues

  • Fastq files that start with numbers fail with Phyluce

  • rnaSPAdes 3.13.1 sometimes randomly fails to generate a transcripts.fasta for a sample after completing K127. A workaround is to choose one of the K*** assemblies, then copy and rename it to transcripts.fasta in the higher-level directory. Snakemake requires a transcripts.fasta for each rnaSPAdes assembly to progress to Phyluce.

  • AAFC specific: Due to an incorrect and challenging-to-fix server-wide implementation of OpenMPI, qsub commands should be run with "qsub -pe smp 1", which prevents Abyss from starting in parallel mode and crashing. However, SPAdes and rnaSPAdes still appear to use multiple cores as assigned via Snakemake jobs.
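
The rnaSPAdes workaround listed above (copying a K*** assembly to transcripts.fasta) could be scripted along the following lines. This is a sketch under assumptions: the function name is hypothetical, and the path to the chosen K*** assembly file must be supplied by the user because its name is not specified here.

```python
import shutil
from pathlib import Path


def restore_transcripts(assembly_dir, k_assembly):
    """Copy a chosen K*** assembly to <assembly_dir>/transcripts.fasta.

    Does nothing if transcripts.fasta already exists, so the output of a
    successful rnaSPAdes run is never overwritten.
    """
    target = Path(assembly_dir) / "transcripts.fasta"
    if target.exists():
        return target
    shutil.copyfile(k_assembly, target)
    return target
```

The function returns the path of transcripts.fasta in either case, which makes it easy to loop over every rnaSPAdes output folder after a run with failed samples.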

Citations

  • BioPython - Tools for biological computation
    Cock, P.J.A. et al. Biopython: freely available Python tools for computational molecular biology and bioinformatics. Bioinformatics 2009 Jun 1; 25(11) 1422-3 http://dx.doi.org/10.1093/bioinformatics/btp163 pmid:19304878

  • Snakemake - Workflow management system
    Köster, Johannes and Rahmann, Sven. “Snakemake - A scalable bioinformatics workflow engine”. Bioinformatics 2012.

  • SPAdes
    Nurk S. et al. (2013) Assembling Genomes and Mini-metagenomes from Highly Chimeric Reads. In: Deng M., Jiang R., Sun F., Zhang X. (eds) Research in Computational Molecular Biology. RECOMB 2013. Lecture Notes in Computer Science, vol 7821. Springer, Berlin, Heidelberg

  • BBTools
    Brian-JGI (2018) BBTools is a suite of fast, multithreaded bioinformatics tools designed for analysis of DNA and RNA sequence data.

  • FASTQC
    Andrews S. (2018). FastQC: a quality control tool for high throughput sequence data. Available online at:

  • Phyluce - Target enrichment data analysis
    Faircloth BC. 2016. PHYLUCE is a software package for the analysis of conserved genomic loci. Bioinformatics 32:786-788. doi:10.1093/bioinformatics/btv646.

  • Ultraconserved elements
    Faircloth BC, McCormack JE, Crawford NG, Harvey MG, Brumfield RT, Glenn TC. 2012. Ultraconserved elements anchor thousands of genetic markers spanning multiple evolutionary timescales. Systematic Biology 61: 717–726. doi:10.1093/sysbio/SYS004.

  • Abyss
    Shaun D Jackman, Benjamin P Vandervalk, Hamid Mohamadi, Justin Chu, Sarah Yeo, S Austin Hammond, Golnaz Jahesh, Hamza Khan, Lauren Coombe, René L Warren, and Inanc Birol (2017). ABySS 2.0: Resource-efficient assembly of large genomes using a Bloom filter. Genome research, 27(5), 768-777. doi:10.1101/gr.214346.116