This repo contains workflows for computational pathogen discovery using PathSeq, a pipeline in the Genome Analysis Toolkit (GATK) for detecting microbial organisms in short-read deep sequencing samples taken from a host organism.
Additional Resources:
- How to Run the PathSeq pipeline (manually)
- GATK PathSeq: a customizable computational tool for the discovery and identification of microbial sequences in libraries from eukaryotic hosts
Runs the PathSeq pipeline
- File must pass validation by ValidateSamFile
- All reads must have an RG tag
- One or more read groups all belong to a single sample (SM)
- Host and microbe reference files available in the GATK Resource Bundle
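The read-group requirements above can be checked before launching the workflow. The following is a minimal sketch, not part of this repo: it assumes the input has been converted to SAM text (for example with `samtools view -h`), and the function name `check_read_groups` is hypothetical. It does not replace ValidateSamFile, which performs a much broader validation.

```python
def check_read_groups(sam_lines):
    """Return a list of problems found; an empty list means the checks passed.

    Checks two of the stated requirements on SAM-formatted text lines:
      * every read carries an RG tag that is declared in the header
      * all @RG header lines share a single sample (SM) value
    """
    samples = set()    # distinct SM values seen on @RG header lines
    known_rgs = set()  # read-group IDs declared in the header
    problems = []

    for n, line in enumerate(sam_lines, start=1):
        line = line.rstrip("\n")
        if not line:
            continue
        fields = line.split("\t")

        if line.startswith("@"):
            # Header line; only @RG lines matter for these checks.
            if fields[0] == "@RG":
                tags = dict(f.split(":", 1) for f in fields[1:] if ":" in f)
                if "ID" in tags:
                    known_rgs.add(tags["ID"])
                if "SM" in tags:
                    samples.add(tags["SM"])
                else:
                    problems.append(f"line {n}: @RG line has no SM field")
            continue

        # Alignment record: optional tags start at the 12th column.
        rg = [f[5:] for f in fields[11:] if f.startswith("RG:Z:")]
        if not rg:
            problems.append(f"line {n}: read {fields[0]} has no RG tag")
        elif rg[0] not in known_rgs:
            problems.append(
                f"line {n}: read {fields[0]} uses undeclared read group {rg[0]}"
            )

    if not known_rgs:
        problems.append("header declares no read groups")
    if len(samples) > 1:
        problems.append(f"read groups span multiple samples: {sorted(samples)}")
    return problems
```

An empty return value means the file met both read-group requirements; any other result lists what to fix before running the pipeline.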
Outputs:
- BAM file containing microbe-mapped reads and reads of unknown sequence
- Tab-separated value (.tsv) file of taxonomic abundance scores
- Picard-style metrics files for the filter and scoring phases of the pipeline
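As a sketch of how the taxonomic abundance scores file might be inspected downstream, the snippet below ranks taxa by score using only the Python standard library. The column names `name` and `score` are assumptions about the TSV header and should be checked against an actual output file; the function name `top_taxa` is hypothetical.

```python
import csv
import io


def top_taxa(tsv_text, n=5, score_column="score", name_column="name"):
    """Return the n highest-scoring (name, score) pairs from scores TSV text.

    The column names are assumptions; pass different ones if the header differs.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    # Skip rows with a missing or empty score so float() cannot fail on them.
    rows = [r for r in reader if r.get(score_column) not in (None, "")]
    rows.sort(key=lambda r: float(r[score_column]), reverse=True)
    return [(r[name_column], float(r[score_column])) for r in rows[:n]]
```

To use it on a real file, read the file contents into a string first, for example `top_taxa(open(path).read(), n=10)`.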
Builds a microbe reference for use with PathSeq
- FASTA file containing microbe sequences from NCBI RefSeq
Outputs:
- FASTA index and dictionary files
- GATK BWA-MEM index image
- PathSeq taxonomy file
Builds a host reference for use with PathSeq
- FASTA file containing host sequences
Outputs:
- FASTA index and dictionary files
- GATK BWA-MEM index image
- PathSeq Kmer file
Software version requirements:
- GATK 4 or later
- Cromwell version support
  - Successfully tested on v36
  - Does not work on versions < v23 due to output syntax
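The Cromwell constraints above can be encoded as a small pre-flight check. This is only an illustrative sketch: the function name `cromwell_status` and the three status labels are hypothetical, and versions other than v36 that are at least v23 are reported as untested because the notes above make no claim about them.

```python
import re

MIN_SUPPORTED = 23  # versions below this do not work (output syntax)
TESTED = 36         # version the workflows were successfully tested on


def cromwell_status(version):
    """Classify a Cromwell version string such as "36", "v36" or "36.1"."""
    match = re.match(r"v?(\d+)", version.strip())
    if match is None:
        raise ValueError(f"unrecognized Cromwell version: {version!r}")
    major = int(match.group(1))
    if major < MIN_SUPPORTED:
        return "unsupported"
    if major == TESTED:
        return "tested"
    return "untested"
```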
Important notes:
- Runtime parameters are optimized for Broad's Google Cloud Platform implementation.
- The provided JSON is a ready-to-use example template for the workflow. Users are responsible for reviewing the GATK tool and tutorial documentation to properly set the reference and resource variables.
- For help running workflows on the Google Cloud Platform or locally, please see the tutorial "(How to) Execute Workflows from the gatk-workflows Git Organization".
- Please visit the User Guide site for further documentation on our workflows and tools.
- Relevant reference and resource bundles can be accessed in the Resource Bundle.
- The following material is provided by the Data Science Platform group at the Broad Institute. Please direct any questions or concerns to one of our forum sites: GATK or Terra.
This script is released under the WDL source code license (BSD-3) (see LICENSE). Note, however, that the programs it calls may be subject to different licenses. Users are responsible for checking that they are authorized to run all programs before running this script.