PIMA: Plasmid, Integrations, Mutations, and Antibiotic resistance annotation pipeline [MARKDOWN ENABLED]
- PIMA: Plasmid, Integrations, Mutations, and Antibiotic resistance annotation pipeline
- Installation
- All available arguments
- Literature Citations
- Software
PIMA (Plasmid, Integrations, Mutations, and Antibiotic resistance annotation pipeline) is a high-throughput sequence analysis pipeline. PIMA is an end-to-end solution encompassing all the steps required to transform raw ONT output into high-quality annotated assemblies. PIMA supports providing optional Illumina paired end reads for polishing and error correcting assemblies.
The PIMA pipeline performs the following actions [Analysis program options in brackets, default first]:
- Performs basecalling from FAST5 files [guppy or albacore]
- Demultiplexes basecalled ONT data (FASTQ files) [qcat or porechop]
- Performs basic quality control [Porechop]
- Optional base quality correction [LorDEC]
- Assemble reads [Flye, miniasm, or wtdbg2]
- Polish and improve assembly [Optional, racon, Medaka, nanopolish, Pilon]
- Annotate plasmids
- Annotate plasmid incompatiblity (INC) groups
- Annotate antimicrobial resistance (AMR) genes
- Annotation of other user-supplied genomic features (from FASTA)
- Comparison to reference genome, including variant calling and annotation
Codebase stage: development
Developers and maintainers, Testers: Andrew Conley, Lavanya Rishishwar, Nolan English, Vasanta Chivukula
- PiMA pipeline requires Python 3.6 or higher version.
- PiMA relies on GPU acceleration and parallezation for parts of its pipeline. Therefore a graphics card with a CUDA Compute Capability of >=6.0. Handy Reference linking GPUs to Compatibility
- Pilon version 1.24
A docker image of this pipeline, prebuilt with the dependencies, is available at https://hub.docker.com/r/appliedbioinformaticslab/pimadocker2 with steps detailing the installation and running the pipeline. This is the easiest way to load and run the pipeline.
All the configurations have been performed here for the user and is a lot easier to setup than installing each dependency.
PiMA pipeline can also be installed using conda environment. We have observed that conda doesn't always faithfully install the dependencies due to version conflicts of dependencies. Please pay attention to any dependency that may be failing during the process.
# Install the PIMA base environment (Python 3.6)
conda env create -f pima.yml
If you want to setup conda environment yourself and install all the dependencies, here are the steps:
# You can also install the dependencies while creating the PiMA base environment.
conda install mamba
conda create -n pima
# Activate the conda environement using
conda activa pima
mamba install -c bioconda -c conda-forge medaka=1.4.3 varscan r flye blast circos minimap2 bwa samtools \
bedtools pandas pathos joblib pylatex tectonic mummer qcat -y
# Some more dependencies (si_prefix, dna_features_viewer, and quast) installed using pip installer; conda version do not exactly work
pip install si_prefix dna_features_viewer
pip install git+https://github.com/ablab/quast
# Once the dependencies have been installed, proceed to cloning PiMA
git clone https://github.com/abconley/pima.git
# If you want to download the reference file, kraken database, and other required folders, run
pima.py --download
Here is an example of a basic run using fast5 files as input.
# basic command
./pima.py --output output_folder --ont-fast5 fast5_files --genome-size 5M --threads 12
Consider an example scenario where you want to assemble Bacillus anthracis ont reads. If the reference file is named ref.fasta and the query fast5 files are in the folder named barcodes_folder, the mutation regions bed file is named mutation_regions.bed and the output folder you named is ont_output then your pima command will look as follows:
pima.py --out ont_output --ont-fast5 barcodes_folder --threads 16 --overwrite --genome-size 5m \
--verb 3 --reference-genome ref.fasta --mutation-regions mutation_regions.bed
Please make sure that the first column in the mutation_regions.bed file (#contig) is the same as the header in the fasta file. Example: if the mutation_regions.bed file has 'chromosome' in the first column, then the contig header must include 'chromosome'.
> chromosome NC.03921
or
> NC.053521 chromosome
I prefer to run PiMA using the verbose flag and my hardware can support upto 20 threads. Feel free to turn off the verbose flag if you so desire, and change the number of threads in accordance with your hardware capability. This command will create the following outputs listed in no particular order:
- ont_assembly: Folder with the assembly FASTA file (assembly.fasta) and assembly metrics.
- ont_fastq: Folder with the raw FASTQ file that is used to generate the assembly and the guppy run.
- quast: Folder with the quality metrics and the quast report.
- downsample: folder with a FASTQ with a subset of a high-coverage data set to a given average coverage.
- medaka: folder with the consensus sequence (consensus.fasta).
- drawing: provides the contig plot.
- insertions: provides information on the alignment with the reference, alignment coordinates, any snps and report generated with this information.
- circos: provides svg and png files with visuals of the chromosome and plasmids (if present).
- mutations: includes files with information on mapping with the reference and the snps identified.
- report: A final report in a pdf format describing the assemblies and other information pertaining to the assembled genome (such as mutations, amr genes identified).
- info: provides the coverage information based on mapping to a reference (ont_coverage.tsv).
- features: provides information on the presence of amr genes and any inserions, if present in the assembled genome. If there are no insertions or amr genes then you will get an empty file.
PiMA is a multi-step process that can be run as a single command - base calling, checking for contamination, assembly, generating a quast report, identifying any mutations based on the reference file provided.
usage: pima.py [--help]
pima.py - Plasmid, Integrations, Mutations, and Antibiotic resistance annotation pipeline
optional arguments:
--help, -h, --h
pima.py command: pima.py --output <your output folder name> --ont-fast5 <where the input fast5 files are located> \
--genome-size <5M for Ba; genome size in Mb> --threads <number of parallel threads; 12 is the maximum>
The input file can be in fast5 format (use the --ont-fast5 flag for fast5 files) or fastq format (the flag would be --ont-fastq).
The full description of each commandline option is provided below.
usage: pima.py [--help] [--version] [--ont-watch <ONT_DIR>] [--ont-watch-min-reads <INT>] [--ont-watch-max-time <HOURS>]
[--ont-watch-between-time <MINUTES>] [--ont-watch-min-coverage <X>] [--ont-fast5 <ONT_DIR>]
[--basecaller {guppy}] [--ont-fastq <FASTQ|GZ>] [--multiplexed] [--error-correct] [--contamination]
[--only-basecall] [--illumina-fastq <R1> [R2] [<R1> [R2] ...]] [--genome <GENOME_FASTA>]
[--output <OUTPUT_DIR>] [--overwrite] [--assembler {wtdbg2,flye}] [--genome-size <GENOME_SIZE>]
[--assembly-coverage <X>] [--racon] [--racon-rounds <NUM_ROUNDS>] [--no-medaka] [--only-assemble]
[--no-assembly] [--download] [--plasmids] [--plasmid-database <PLASMID_FASTA>]
[--amr-database <AMR_FASTA>] [--no-amr] [--inc-database <INC_FASTA>] [--no-inc]
[--feature <FEATURE_FASTA>] [--no-drawing] [--reference-dir <REFERNCE_DIR>] [--organism <ORGANISM>]
[--list-organisms] [--auto-reference] [--reference-genome <GENOME_FASTA>]
[--mutation-regions <REGION_BED>] [--name <NAME>] [--threads <NUM_THREADS>] [--verbosity <INT>]
[--bundle <PATH>] [--fake-run]
P.I.M.A. bacterial genome analysis pipeline
Help and version:
--help Print this help and exit.
--version Print the software version.
Input and basecalilng options:
--ont-watch <ONT_DIR> Directory from an ONT run.
--ont-watch-min-reads <INT> Minimum number of ONT reads to watch for.
--ont-watch-max-time <HOURS> Maxmium time to watch for new reads.
--ont-watch-between-time <MINUTES> Maximum to to wait between new FAST5 files
--ont-watch-min-coverage <X> Minimum genome coverage achieved before starting analysis.
--ont-fast5 <ONT_DIR> Directory containing ONT FAST5 files
--basecaller {guppy} The basecaller for ONT FAST5 data (default : guppy)
--ont-fastq <FASTQ|GZ> File containing basecalled ONT reads
--multiplexed The ONT data are multiplexed (default : False)
--error-correct Use LORMA to error-correct ONT reads (default : False)
--contamination Use Kraken2 to look for contamination in the input read (default : False)
--only-basecall Just basecall and demultiplex/trim/correct. No downstream analysis.
--illumina-fastq <R1> [R2] [<R1> [R2] ...] Files containing R1 & R2 Illumina reads
--genome <GENOME_FASTA> A genome FASTA file to be used in place of assembly
Output options:
--output <OUTPUT_DIR> Output directory for the analysis
--overwrite Overwrite an existing output directory (default : False)
Assembly options:
--assembler {wtdbg2,flye} Assembler to use (default : flye)
--genome-size <GENOME_SIZE> Genome size estimate for the assembly & downsampling (default : None))
--assembly-coverage <X> Downsample the provided reads to this coverage (default : 200X)
--racon Force the generation a racon consenus (default : False)
--racon-rounds <NUM_ROUNDS> Number of RACON rounds used to generate a consensus (default : 4)
--no-medaka Skip Medaka polising of the ONT assembly (faster) (default : False)
--only-assemble Only carry out assembly steps with the given data; no downstream analysis.
--no-assembly Don't attempt to assembly/polish a given genome/set of reads (default :
False)
Database downloading arguments:
--download Attempt to download Kraken/Plasmid databases if not found locally.Use
witout other options.
Plasmid and vector search options:
--plasmids Do a plasmid search (default : False)
--plasmid-database <PLASMID_FASTA> Path to a FASTA file with reference plasmid sequences
AMR gene search options:
--amr-database <AMR_FASTA> Path to a FASTA file with AMR gene sequences (default :
pima/data/amr.fasta)
--no-amr Skip AMR search (default : False)
Incompatibility group search options:
--inc-database <INC_FASTA> Path to a FASTA file with incompatibility group sequences (default :
pima/data/inc.fasta)
--no-inc Skip incompatibility group search (default : False)
Other feature search options:
--feature <FEATURE_FASTA> Path to a FASTA file with feature sequences
Drawing options:
--no-drawing Skip drawing of contigs & Features (default : False)
Reference options:
--reference-dir <REFERNCE_DIR> Directory containing refrence organisms (default :
pima/data/reference_sequences)
--organism <ORGANISM> Reference organism to compare against
--list-organisms List the reference organisms available to this pipeline
--auto-reference Automatically choose an appropriate reference if available.
--reference-genome <GENOME_FASTA> Reference genome to compare against (default : None)
--mutation-regions <REGION_BED> Regions in the reference genome to screen for mutations (default : None)
Other options:
--name <NAME> Name of this analysis for reporting.
--threads <NUM_THREADS> Number of worker threads to use (default : 1)
--verbosity <INT> How much information to print as PiMA runs (default : 1)
--bundle <PATH> Local Tectonic bundle (default : None)
--fake-run Don't actually run the pipeline, just pretend to (default : False)
-
The --genome-size option where the genome size estimate for the assembly must be provided is recommended but not required.
-
For error correcting purposes, if illumina reads are used along with ONT reads, this pipeline will map the illumina reads against the ONT assembly. It will then pass the resulting BAM file, and the ONT assembly into Pilon. This will give us a new assembly with errors corrected.
If you are using PIMA, please cite the following literature: McLaughlin HP, Bugrysheva JV, Conley AB, Gulvik CA, Kolton CB, Marston C, Swaney E, Lonsway DR, Cherney B, Gargis AS, Kongphet-Tran T, Lascols C, Michel P, Villanueva J, Hoffmaster ER, Gee JE, Sue D. 2020. When minutes matter: rapid nanopore whole genome sequencing for anthrax emergency preparedness. Emerging Infectious Diseases
This error occurs in executing pima in a headless environment with no display environment for wkhtmltopdf to operate in. To solve it, install xvfb and then use xvfb-run to execute pima.
PIMA utilizes the following programs internally:
- BCFtools & Samtools: http://www.htslib.org/ Citation
- bedtools: https://bedtools.readthedocs.io/en/latest/ Citation
- NCBI BLAST+ (v 2.8+): https://blast.ncbi.nlm.nih.gov/Blast.cgi Citation
- bwa: https://github.com/lh3/bwa
- flye: https://github.com/fenderglass/Flye Citation
- GNU Parallel: https://www.usenix.org/publications/login/february-2011-volume-36-number-1/gnu-parallel-command-line-power-tool
- Medaka: https://github.com/nanoporetech/medaka
- miniasm: https://github.com/lh3/miniasm Citation
- minimap2: https://github.com/lh3/minimap2
- MUMmer: https://mummer4.github.io/ Citation
- nanopolish: https://github.com/jts/nanopolish
- parallel: https://www.gnu.org/software/parallel/
- Pilon: https://github.com/broadinstitute/pilon/wiki
- Porechop: https://github.com/rrwick/Porechop
- qcat: https://github.com/nanoporetech/qcat
- racon: https://github.com/isovic/racon Citation
- SAMtools: http://htslib.org/
- SPAdes: http://cab.spbu.ru/software/spades/ Citation
- wtdbg2: https://github.com/ruanjue/wtdbg2 Citation