
Running PHoeNIx

Jill V. Hagey, PhD edited this page Dec 6, 2024 · 70 revisions

You should have already set up your config file to make sure Nextflow knows how to run the programs within PHoeNIx. If you haven't already, please review the config set up portion of the install page.

Input Parameters

The following are the possible parameters you can pass to PHoeNIx. You can get this screen by running:

nextflow run cdcgov/phoenix -r v2.0.0 --help

Pipeline Workflow

Note that for PHX v1.0.0 the output argument --outdir CANNOT be a relative path. If you want to put the output in the directory you are in then append $PWD to the directory name like: --outdir $PWD/results. PHX >=1.1.0 allows relative paths for inputs and outputs.

Input: -entry PHOENIX or -entry CDC_PHOENIX

The full PHoeNIx pipeline (-entry PHOENIX or -entry CDC_PHOENIX) only runs on Illumina paired-end reads. Multiple samples can be run using a samplesheet.csv file:

nextflow run cdcgov/phoenix -profile <docker/singularity/custom> -entry PHOENIX --input samplesheet.csv --kraken2db $PATH_TO_DB

Samplesheet Input

You will need to create a samplesheet with information about the samples you would like to analyze before running the pipeline. Use the --input parameter to specify its location. It must be a comma-separated file (csv) with at least 3 columns and a header row, as shown in the example below. DO NOT HAVE ANY SPACES IN THIS FILE. Do make sure the paths are full paths and not relative. For best results use the automated samplesheet creation scripts described in the automated section below.

--input '[path to samplesheet file]'

Reads Samplesheet

The samplesheet can have as many columns as you desire; however, there is a strict requirement for the first 3 columns to match those defined in the table below.

A final samplesheet file consisting of paired-end data may look something like the one below.

sample,fastq_1,fastq_2
SAMPLE_1,$PATH/AEG588A1_S1_L002_R1_001.fastq.gz,$PATH/AEG588A1_S1_L002_R2_001.fastq.gz
SAMPLE_2,$PATH/AEG588A2_S2_L002_R1_001.fastq.gz,$PATH/AEG588A2_S2_L002_R2_001.fastq.gz
SAMPLE_3,$PATH/AEG588A3_S3_L002_R1_001.fastq.gz,$PATH/AEG588A3_S3_L002_R2_001.fastq.gz

Column descriptions:
  • sample: Custom sample name. This entry will be identical for multiple sequencing libraries/runs from the same sample. Spaces in sample names are automatically converted to underscores (_).
  • fastq_1: Full path to FastQ file for Illumina short reads 1. File has to be gzipped and have the extension ".fastq.gz" or ".fq.gz".
  • fastq_2: Full path to FastQ file for Illumina short reads 2. File has to be gzipped and have the extension ".fastq.gz" or ".fq.gz".

An example samplesheet has been provided with the pipeline and can be used for testing.
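
The samplesheet rules above (three leading columns, no spaces, full paths, gzipped FastQ extensions) can be checked before launching a run. The following is a minimal sketch and not part of PHoeNIx: the function name check_reads_samplesheet is hypothetical, and it assumes POSIX-style paths where a full path starts with "/".

```python
import csv
import io

REQUIRED_HEADER = ["sample", "fastq_1", "fastq_2"]
VALID_EXTENSIONS = (".fastq.gz", ".fq.gz")


def check_reads_samplesheet(text):
    """Return a list of problems found in the text of a reads samplesheet."""
    problems = []
    if " " in text:
        problems.append("file contains spaces")
    rows = [row for row in csv.reader(io.StringIO(text)) if row]
    if not rows or rows[0][:3] != REQUIRED_HEADER:
        problems.append("first 3 header columns must be sample,fastq_1,fastq_2")
        return problems
    # Header is line 1, so data rows start at line 2 (assumes no blank lines).
    for line_no, row in enumerate(rows[1:], start=2):
        if len(row) < 3:
            problems.append(f"line {line_no}: fewer than 3 columns")
            continue
        for path in row[1:3]:
            if not path.startswith("/"):
                problems.append(f"line {line_no}: {path} is not a full path")
            if not path.endswith(VALID_EXTENSIONS):
                problems.append(f"line {line_no}: {path} is not .fastq.gz or .fq.gz")
    return problems
```

An empty list means none of these checks failed. It does not confirm that the files exist or that the pipeline's own validation will pass.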

Automated Samplesheet Creation

A script is available to create a samplesheet from a directory of fastq files. The script will search 1 directory deep and attempt to determine sample id names and pairing/multilane information and will automatically create a samplesheet.

- Please review the samplesheet for accuracy before using it in the pipeline.
phoenix/bin/create_samplesheet.sh <directory of fastq files> > samplesheet.csv

You can change the name of the samplesheet.csv above to anything you want.
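
To give a feel for what "determining sample id names and pairing" involves, here is a rough Python sketch. It is not the logic of create_samplesheet.sh: the regular expression, the function name build_samplesheet and the file naming patterns it recognizes are all assumptions made for illustration, and it does not merge multilane samples.

```python
import re
from collections import defaultdict

# Assumed naming patterns: <sample>[_S#][_L###]_R1[_001].fastq.gz or <sample>_1.fq.gz
READ_FILE = re.compile(
    r"^(?P<sample>.+?)_(?:S\d+_)?(?:L\d+_)?R?(?P<read>[12])(?:_001)?\.(?:fastq|fq)\.gz$"
)


def build_samplesheet(directory, filenames):
    """Pair forward and reverse read files and return samplesheet text."""
    pairs = defaultdict(dict)
    for name in sorted(filenames):
        match = READ_FILE.match(name)
        if match:
            pairs[match["sample"]][match["read"]] = f"{directory}/{name}"
    lines = ["sample,fastq_1,fastq_2"]
    for sample, reads in sorted(pairs.items()):
        # Only samples with both read files make it into the sheet.
        if "1" in reads and "2" in reads:
            lines.append(f"{sample},{reads['1']},{reads['2']}")
    return "\n".join(lines) + "\n"
```

Files that do not match the pattern, or that lack a mate, are silently skipped here, which is one more reason to review any generated samplesheet by eye.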

Input: -entry SCAFFOLDS or -entry CDC_SCAFFOLDS in versions >=2.0.0

Supports scaffold files from shovill, SPAdes, SKESA and unicycler.

If you already have scaffolds from PHoeNIx or another pipeline and want to run everything in the pipeline after the SPAdes step, you can use -entry SCAFFOLDS or -entry CDC_SCAFFOLDS. To do this, either pass a samplesheet using --input or pass a directory using --indir, which searches the given directory for assemblies matching the pattern *.scaffolds.fa.gz. If you need to change the pattern for a different directory output structure or file extension, use the argument --scaffolds_ext. Assemblies must end in '.fa.gz' or '.fasta.gz'.

For example:

  • If your file names are <sample_id>.fa.gz then run
nextflow run cdcgov/phoenix -r v2.0.0 -profile <docker/singularity/custom> -entry SCAFFOLDS --indir <path_to_dir> --scaffolds_ext ".fa.gz" --kraken2db $PATH_TO_DB
  • If you want to look through subdirectories for files with the name <sample_id>.fa.gz then run
nextflow run cdcgov/phoenix -r v2.0.0 -profile <docker/singularity/custom> -entry SCAFFOLDS --indir <path_to_dir> --scaffolds_ext "/*.fa.gz" --kraken2db <path_to_db>

NOTE: The pipeline does not allow "renamed" or "filtered" to be in the name of your scaffold files as this will cause file naming duplicates & confuse PHoeNIx. 😢
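
The two naming rules above (required extension, no "renamed" or "filtered" in the name) are easy to check ahead of time. A minimal sketch, assuming the check is case-sensitive; the function name scaffold_name_problems is hypothetical and not part of PHoeNIx:

```python
VALID_SUFFIXES = (".fa.gz", ".fasta.gz")
FORBIDDEN_WORDS = ("renamed", "filtered")


def scaffold_name_problems(filename):
    """Return the naming rules that a scaffold file name breaks."""
    problems = []
    if not filename.endswith(VALID_SUFFIXES):
        problems.append("must end in .fa.gz or .fasta.gz")
    for word in FORBIDDEN_WORDS:
        if word in filename:
            problems.append(f"must not contain '{word}'")
    return problems
```

For example, scaffold_name_problems("sample_1.filtered.scaffolds.fa.gz") flags the word "filtered", so that file would need to be renamed before it is passed to the SCAFFOLDS entry point.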

Assembly Samplesheet

The samplesheet can have as many columns as you desire; however, there is a strict requirement for the first 2 columns to match those defined in the table below. The final samplesheet is a comma-separated (csv) file that has a sample name and the location of the assembly. DO NOT HAVE ANY SPACES IN THIS FILE. Do make sure the paths are full paths and not relative. For example:

sample,assembly
SAMPLE_1,/FULL_PATH/assembly_1.fa.gz
SAMPLE_2,/FULL_PATH/assembly_2.fa.gz
SAMPLE_3,/FULL_PATH/assembly_3.fa.gz

Column descriptions:
  • sample: Custom sample name. Spaces in sample names are automatically converted to underscores (_).
  • assembly: Full path to a scaffolds fasta file. File has to be gzipped and have the extension ".fasta.gz" or ".fa.gz".

Input: -entry SRA or -entry CDC_SRA in versions >=2.0.0

This entry point will do the following for each SRR number in the input file.

  1. Download the raw fastq from the sequence read archive (SRA)
  2. Separate the fastq file into forward and reverse reads
  3. Use esearch to get the metadata for the isolate
  4. From the metadata, the forward and reverse reads will be renamed with the sample name rather than the SRR number (unless you pass --use_sra). NOTE: this argument is NOT available when running PHoeNIx on Terra.bio.
  5. Run samples through PHoeNIx

By default the pipeline names samples according to their sample name on NCBI. However, if there are duplicate sample names in your run, PHoeNIx will use the SRA number instead. If you prefer to keep samples named after their SRA number, pass the --use_sra argument. The file sra_samplesheet.csv will contain each sample name and its matching SRA number should you need to cross-reference them.

SRA Samplesheet

The samplesheet must have a single column with one SRR number per line. No header is necessary. DO NOT HAVE ANY SPACES IN THIS FILE. For example:

SRR23709130
SRR23709131

The SRA entry points are different in that you need to use --input_sra rather than --input to pass the samplesheet. This is because the structure of the samplesheet is quite different and requires different checks. For example:

nextflow run cdcgov/phoenix -r v2.0.0 -profile <docker/singularity/custom> -entry SRA --input_sra <path_to_samplesheet> --use_sra --kraken2db $PATH_TO_DB
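
Because the SRA samplesheet is just one accession per line, a quick sanity check is straightforward. The sketch below is not PHoeNIx code: the function name invalid_sra_lines is hypothetical, and it assumes every accession has the form "SRR" followed by digits, as in the example above.

```python
import re

SRR_PATTERN = re.compile(r"SRR\d+")


def invalid_sra_lines(text):
    """Return (line number, content) for every line that is not a bare SRR number."""
    bad = []
    for line_no, line in enumerate(text.splitlines(), start=1):
        # fullmatch rejects leading or trailing spaces and header-like text
        if not SRR_PATTERN.fullmatch(line):
            bad.append((line_no, line))
    return bad
```

An empty result means every line looks like an SRR number. It does not check that the accession exists in the SRA.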

Input: -entry UPDATE_PHOENIX in versions >=2.2.0

🚨🚧🚧🚨This entry point is under development and will be released in the v2.2.0 release. Stay Tuned

There are two ways to run this entry point.

  1. Pass a directory and all samples in this directory will be updated. This directory needs to have a phx output folder structure and expects that all files from phx output are present. Samples that failed QC in the initial run will be excluded from the update.
nextflow run cdcgov/phoenix -r v2.2.0 -profile <docker/singularity/custom> -entry UPDATE_PHOENIX --indir <path_to_phx_output_folder> --kraken2db $PATH_TO_DB
  2. Pass a csv "sample sheet" that has 2 columns. The first should be the sample name, the second should be the full path to the sample level phx output folder. The directory should be the full path only to the sample output folder like the other phx entry points.
sample,directory
SAMPLE_1,/FULL_PATH/PROJECT_DIR/SAMPLE_1
SAMPLE_2,/FULL_PATH/PROJECT_DIR/SAMPLE_2
SAMPLE_3,/FULL_PATH/PROJECT_DIR/SAMPLE_3

Column descriptions:
  • sample: Custom sample name. Spaces in sample names are automatically converted to underscores (_).
  • directory: Full path to the sample level directory inside a project folder that has a phoenix style output structure. The entry expects that all phx output files are present.

nextflow run cdcgov/phoenix -r v2.2.0 -profile <docker/singularity/custom> -entry UPDATE_PHOENIX --input <path_to_samplesheet> --kraken2db $PATH_TO_DB

Output options:

  • By default, when --indir is passed as the input this will be viewed as the project level directory and output will be directed there. However, this can be overridden by also passing --outdir, in which case it will be used as the project level output folder.
  • If you pass --input AND all samples in the samplesheet have the SAME project level directory this is equivalent to running --indir on that folder without passing --outdir.
  • If you pass --input AND samples in the samplesheet have DIFFERENT project level directories the sample output files will be written to their respective project and sample directories.

The following files will be created with this pipeline:

📦phx_output
┣ 📂<sample_id>
┃ ┣ 📂gamma_ar
┃ ┃ ┣ 📜<sample_id>ResGANNCBI_srst2.gamma*
┃ ┃ ┗ 📜<sample_id>ResGANNCBI_srst2.psl*
┃ ┣ 📂mlst
┃ ┃ ┣ 📜<sample_id>_combined.tsv
┃ ┃ ┗ 📜<sample_id>.tsv
┃ ┣ 📂AMRFinder
┃ ┃ ┣ 📜<sample_id>_AMRFinder_Organism.csv
┃ ┃ ┣ 📜<sample_id>_all_mutations.tsv
┃ ┃ ┗ 📜<sample_id>_all_genes.tsv
┃ ┣ 📜<sample_id>_updater_log.tsv
┃ ┗ 📜<sample_id>_summaryline.tsv
┣ 📂pipeline_info
┣ 📜Phoenix_Summary.tsv*
┣ 📜<project_folder>_GRiPHin_Summary.xlsx
┗ 📜<project_folder>_GRiPHin_Summary.tsv

Files produced by GAMMA are added as new files, while all other files are created anew and overwrite the old files. Information on the date of the update, changes in databases and program versions will be in the <sample_id>_updater_log.tsv file. This file is created the first time the pipeline runs. Any time you run the entry point on the same sample or project folder again, the file is updated, giving a running log of the changes that have been made.

Input: -entry CENTAR in versions >=2.2.0

🚨🚧🚧🚨This entry point is under development and will be released in the v2.2.0 release. Stay Tuned

For species specific pipelines, like -entry CENTAR, the expectation is that additional species specific analysis is required after PHX was already run on these samples (either you forgot to add the species specific flag, i.e. --centar, or the samples were run with phx <=v2.1.1).

There are two ways to run this entry point.

  1. Pass a directory and all samples in this directory will run through CENTAR. Isolates that are not C. diff will be removed from analysis. This directory needs to have a phx output folder structure and expects that all files from phx output are present. Samples that failed QC in the initial run will be excluded from the update.
nextflow run cdcgov/phoenix -r v2.2.0 -profile <docker/singularity/custom> -entry CENTAR --indir <path_to_phx_output_folder> --kraken2db $PATH_TO_DB
  2. Pass a csv "sample sheet" that has 2 columns. The first should be the sample name, the second should be the full path to the sample level phx output folder. The directory should be the full path only to the sample output folder like the other phx entry points.
sample,directory
SAMPLE_1,/FULL_PATH/PROJECT_DIR/SAMPLE_1
SAMPLE_2,/FULL_PATH/PROJECT_DIR/SAMPLE_2
SAMPLE_3,/FULL_PATH/PROJECT_DIR/SAMPLE_3

Column descriptions:
  • sample: Custom sample name. Spaces in sample names are automatically converted to underscores (_).
  • directory: Full path to the sample level directory inside a project folder that has a phoenix style output structure. The entry expects that all phx output files are present.

nextflow run cdcgov/phoenix -r v2.2.0 -profile <docker/singularity/custom> -entry CENTAR --input <path_to_samplesheet> --kraken2db $PATH_TO_DB

Output options:

  • By default, when --indir is passed as the input this will be viewed as the project level directory and output will be directed there. However, this can be overridden by also passing --outdir, in which case it will be used as the project level output folder. Output will be found at project_folder --> sample_id --> CENTAR
  • If you pass --input AND all samples in the samplesheet have the SAME project level directory this is equivalent to running --indir on that folder without passing --outdir.
  • If you pass --input AND samples in the samplesheet have DIFFERENT project level directories the sample output files will be written to the respective sample directory listed in the samplesheet. Likewise, the summary files will be split by their phx project folder and will be written there. If you want to have all sample CENTAR files and the samples in the same summary files then you will need to provide an --outdir.
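
The output rules above for a samplesheet run can be summarized in a few lines of Python. This is only a sketch of the described behaviour, not the pipeline's code: the function name resolve_project_dirs is hypothetical, and it assumes that each samplesheet directory is a sample folder whose parent is the project folder.

```python
import posixpath


def resolve_project_dirs(sample_dirs, outdir=None):
    """Map each sample directory to the project folder that receives its summary files."""
    if outdir is not None:
        # --outdir overrides everything: all summaries land in one place.
        return {d: outdir for d in sample_dirs}
    # Otherwise summaries are split by the project folder of each sample.
    return {d: posixpath.dirname(d.rstrip("/")) for d in sample_dirs}
```

With samples from /p/A and /p/B and no outdir, the result contains two distinct project folders, which corresponds to the case where the summary files are split.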

The following files will be created with this pipeline:

📦<project_folder>
┣ 📂<sample_id>
┃ ┣ 📂CENTAR
┃ ┃ ┗ 📜<sample_id>_centar_output.tsv
┃ ┃ ┣ 📂clade
┃ ┃ ┃ ┗ 📜<sample_id>_cdifficile_clade.tsv
┃ ┃ ┣ 📂plasmids
┃ ┃ ┃ ┗ 📜<sample_id>_plasmids.tsv
┃ ┃ ┣ 📂gamma_ar
┃ ┃ ┃ ┣ 📜<sample_id>_centar_ar_db_wt_NT_20240910.gamma
┃ ┃ ┃ ┣ 📜<sample_id>_centar_ar_db_wt_NT_20240910.psl
┃ ┃ ┃ ┣ 📜<sample_id>_centar_ar_db_wt_AA_20240910.gamma
┃ ┃ ┃ ┗ 📜<sample_id>_centar_ar_db_wt_AA_20240910.psl
┃ ┃ ┣ 📂gamma_tox
┃ ┃ ┃ ┣ 📜<sample_id>_Cdiff_toxins_srst2_20240909.gamma
┃ ┃ ┃ ┗ 📜<sample_id>_Cdiff_toxins_srst2_20240909.psl
┃ ┃ ┣ 📂ML_ribotype
┃ ┃ ┃ ┗ 📜<sample_id>_ribotype.tsv
┣ 📂centar_pipeline_info
┣ 📜Phoenix_Summary.tsv*
┣ 📜<project_folder>_GRiPHin_Summary.xlsx
┗ 📜<project_folder>_GRiPHin_Summary.tsv

Files produced by GAMMA are added as new files, while all other files are created anew and overwrite the old files. Information on the date of the update, changes in databases and program versions will be in the <sample_id>_updater_log.tsv file. This file is created the first time the pipeline runs. Any time you run the entry point on the same sample or project folder again, the file is updated, giving a running log of the changes that have been made.

Outputs

Output file structure

The project level output (phx_output is the project folder) of PHoeNIx is structured like the following:

📦phx_output
┣ 📂SRR17250615
┃ ┣ 📂AMRFinder
┃ ┃ ┣ 📜SRR17250615_AMRFinder_Organism.csv
┃ ┃ ┣ 📜SRR17250615_all_mutations.tsv
┃ ┃ ┗ 📜SRR17250615_all_genes.tsv
┃ ┣ 📂ANI
┃ ┃ ┣ 📂mash_dist
┃ ┃ ┃ ┣ 📜SRR17250615_REFSEQ_[date].txt
┃ ┃ ┃ ┗ 📜SRR17250615_REFSEQ_[date]_best_MASH_hits.txt
┃ ┃ ┣ 📜SRR17250615_REFSEQ_[date].fastANI.txt
┃ ┃ ┗ 📜SRR17250615_REFSEQ_[date].ani.txt
┃ ┣ 📂annotation
┃ ┃ ┣ 📜SRR17250615.faa
┃ ┃ ┣ 📜SRR17250615.fna
┃ ┃ ┗ 📜SRR17250615.gff
┃ ┣ 📂assembly‡
┃ ┃ ┣ 📜SRR17250615.assembly.gfa.gz
┃ ┃ ┣ 📜SRR17250615.bbmap_filtered.log
┃ ┃ ┣ 📜SRR17250615.contigs.fa.gz
┃ ┃ ┣ 📜SRR17250615.filtered.scaffolds.fa.gz
┃ ┃ ┣ 📜SRR17250615.renamed.scaffolds.fa.gz
┃ ┃ ┣ 📜SRR17250615.scaffolds.fa.gz
┃ ┃ ┗ 📜SRR17250615.spades.log
┃ ┣ 📂BUSCO*
┃ ┃ ┣ 📜SRR17250615-auto-busco.batch_summary.txt
┃ ┃ ┣ 📜short_summary.generic.bacteria_odb10.SRR17250615.filtered.scaffolds.fa.json
┃ ┃ ┣ 📜short_summary.generic.bacteria_odb10.SRR17250615.filtered.scaffolds.fa.txt
┃ ┃ ┣ 📜short_summary.specific.enterobacterales_odb10.SRR17250615.filtered.scaffolds.fa.json
┃ ┃ ┗ 📜short_summary.specific.enterobacterales_odb10.SRR17250615.filtered.scaffolds.fa.txt
┃ ┣ 📂fastp_trimd‡
┃ ┃ ┣ 📜SRR17250615.fastp.html
┃ ┃ ┣ 📜SRR17250615.fastp.json
┃ ┃ ┣ 📜SRR17250615.singles.fastq.gz
┃ ┃ ┣ 📜SRR17250615_1.trim.fastq.gz
┃ ┃ ┣ 📜SRR17250615_2.trim.fastq.gz
┃ ┃ ┣ 📜SRR17250615_raw_read_counts.txt
┃ ┃ ┣ 📜SRR17250615_singles.fastp.html
┃ ┃ ┗ 📜SRR17250615_singles.fastp.json
┃ ┣ 📂gamma_ar
┃ ┃ ┣ 📜SRR17250615_ResGANNCBI_[date]_srst2.gamma
┃ ┃ ┗ 📜SRR17250615_ResGANNCBI_[date]_srst2.psl
┃ ┣ 📂gamma_hv
┃ ┃ ┣ 📜SRR17250615_HyperVirulence_[date].gamma
┃ ┃ ┗ 📜SRR17250615_HyperVirulence_[date].psl
┃ ┣ 📂gamma_pf
┃ ┃ ┣ 📜SRR17250615_PF-Replicons_[date].gamma
┃ ┃ ┗ 📜SRR17250615_PF-Replicons_[date].psl
┃ ┣ 📂kraken2_asmbld*
┃ ┃ ┣ 📂krona
┃ ┃ ┃ ┣ 📜SRR17250615_asmbld.html
┃ ┃ ┃ ┗ 📜SRR17250615_asmbld.krona
┃ ┃ ┣ 📜SRR17250615.asmbld_summary.txt
┃ ┃ ┣ 📜SRR17250615.classified.fasta.gz
┃ ┃ ┣ 📜SRR17250615.kraken2_asmbld.classifiedreads.txt
┃ ┃ ┣ 📜SRR17250615.kraken2_asmbld.summary.txt
┃ ┃ ┣ 📜SRR17250615.kraken2_asmbld.top_kraken_hit.txt
┃ ┃ ┣ 📜SRR17250615.mpa
┃ ┃ ┗ 📜SRR17250615.unclassified.fasta.gz
┃ ┣ 📂kraken2_asmbld_weighted
┃ ┃ ┣ 📂krona
┃ ┃ ┃ ┣ 📜SRR17250615_wtasmbld.html
┃ ┃ ┃ ┗ 📜SRR17250615_wtasmbld.krona
┃ ┃ ┣ 📜SRR17250615.kraken2_wtasmbld.classifiedreads.txt
┃ ┃ ┣ 📜SRR17250615.kraken2_wtasmbld.summary.txt
┃ ┃ ┣ 📜SRR17250615.kraken2_wtasmbld.top_kraken_hit.txt
┃ ┃ ┗ 📜SRR17250615.wtasmbld_summary.txt
┃ ┣ 📂kraken2_trimd‡
┃ ┃ ┣ 📂krona
┃ ┃ ┃ ┣ 📜SRR17250615_trimd.html
┃ ┃ ┃ ┗ 📜SRR17250615_trimd.krona
┃ ┃ ┣ 📜SRR17250615.classified_1.fasta.gz
┃ ┃ ┣ 📜SRR17250615.classified_2.fasta.gz
┃ ┃ ┣ 📜SRR17250615.kraken2_trimd.classifiedreads.txt
┃ ┃ ┣ 📜SRR17250615.kraken2_trimd.summary.txt
┃ ┃ ┣ 📜SRR17250615.mpa
┃ ┃ ┣ 📜SRR17250615.kraken2_trimd.top_kraken_hit.txt
┃ ┃ ┣ 📜SRR17250615.unclassified_1.fasta.gz
┃ ┃ ┗ 📜SRR17250615.unclassified_2.fasta.gz
┃ ┣ 📂mlst
┃ ┃ ┣ 📜SRR17250615.tsv
┃ ┃ ┗ 📜SRR17250615_combined.tsv
┃ ┣ 📂qc_stats‡
┃ ┃ ┣ 📜SRR17250615_1_fastqc.html
┃ ┃ ┣ 📜SRR17250615_1_fastqc.zip
┃ ┃ ┣ 📜SRR17250615_2_fastqc.html
┃ ┃ ┣ 📜SRR17250615_2_fastqc.zip
┃ ┃ ┣ 📜SRR17250615.bbduk.log
┃ ┃ ┗ 📜SRR17250615_trimmed_read_counts.txt
┃ ┣ 📂quast
┃ ┃ ┗ 📜SRR17250615_summary.tsv
┃ ┣ 📂raw_stats‡
┃ ┃ ┣ 📜SRR17250615_FAIry_synopsis.txt
┃ ┃ ┗ 📜SRR17250615_raw_read_counts.txt
┃ ┣ 📂srst2*‡
┃ ┃ ┗ 📜SRR17250615__fullgenes__ResGANNCBI_[date]_srst2__results.txt
┃ ┣ 📜SRR17250615.synopsis
┃ ┣ 📜SRR17250615.tax
┃ ┣ 📜SRR17250615_Assembly_ratio_[date].txt
┃ ┣ 📜SRR17250615_GC_content_[date].txt
┃ ┗ 📜SRR17250615_summaryline.tsv
┣ 📂multiqc
┃ ┣ 📂multiqc_data
┃ ┃ ┣ 📜mqc_fastqc_per_base_n_content_plot_1.txt
┃ ┃ ┣ 📜mqc_fastqc_per_base_sequence_quality_plot_1.txt
┃ ┃ ┣ 📜mqc_fastqc_per_sequence_gc_content_plot_Counts.txt
┃ ┃ ┣ 📜mqc_fastqc_per_sequence_gc_content_plot_Percentages.txt
┃ ┃ ┣ 📜mqc_fastqc_per_sequence_quality_scores_plot_1.txt
┃ ┃ ┣ 📜mqc_fastqc_sequence_counts_plot_1.txt
┃ ┃ ┣ 📜mqc_fastqc_sequence_duplication_levels_plot_1.txt
┃ ┃ ┣ 📜mqc_fastqc_sequence_length_distribution_plot_1.txt
┃ ┃ ┣ 📜multiqc.log
┃ ┃ ┣ 📜multiqc_data.json
┃ ┃ ┣ 📜multiqc_fastqc.txt
┃ ┃ ┣ 📜multiqc_general_stats.txt
┃ ┃ ┗ 📜multiqc_sources.txt
┃ ┣ 📂multiqc_plots
┃ ┃ ┣ 📂pdf
┃ ┃ ┃ ┣ 📜mqc_fastqc_per_base_n_content_plot_1.pdf
┃ ┃ ┃ ┣ 📜mqc_fastqc_per_base_sequence_quality_plot_1.pdf
┃ ┃ ┃ ┣ 📜mqc_fastqc_per_sequence_gc_content_plot_Counts.pdf
┃ ┃ ┃ ┣ 📜mqc_fastqc_per_sequence_gc_content_plot_Percentages.pdf
┃ ┃ ┃ ┣ 📜mqc_fastqc_per_sequence_quality_scores_plot_1.pdf
┃ ┃ ┃ ┣ 📜mqc_fastqc_sequence_counts_plot_1.pdf
┃ ┃ ┃ ┣ 📜mqc_fastqc_sequence_counts_plot_1_pc.pdf
┃ ┃ ┃ ┣ 📜mqc_fastqc_sequence_duplication_levels_plot_1.pdf
┃ ┃ ┃ ┗ 📜mqc_fastqc_sequence_length_distribution_plot_1.pdf
┃ ┃ ┣ 📂png
┃ ┃ ┃ ┣ 📜mqc_fastqc_per_base_n_content_plot_1.png
┃ ┃ ┃ ┣ 📜mqc_fastqc_per_base_sequence_quality_plot_1.png
┃ ┃ ┃ ┣ 📜mqc_fastqc_per_sequence_gc_content_plot_Counts.png
┃ ┃ ┃ ┣ 📜mqc_fastqc_per_sequence_gc_content_plot_Percentages.png
┃ ┃ ┃ ┣ 📜mqc_fastqc_per_sequence_quality_scores_plot_1.png
┃ ┃ ┃ ┣ 📜mqc_fastqc_sequence_counts_plot_1.png
┃ ┃ ┃ ┣ 📜mqc_fastqc_sequence_counts_plot_1_pc.png
┃ ┃ ┃ ┣ 📜mqc_fastqc_sequence_duplication_levels_plot_1.png
┃ ┃ ┃ ┗ 📜mqc_fastqc_sequence_length_distribution_plot_1.png
┃ ┃ ┗ 📂svg
┃ ┃ ┃ ┣ 📜mqc_fastqc_per_base_n_content_plot_1.svg
┃ ┃ ┃ ┣ 📜mqc_fastqc_per_base_sequence_quality_plot_1.svg
┃ ┃ ┃ ┣ 📜mqc_fastqc_per_sequence_gc_content_plot_Counts.svg
┃ ┃ ┃ ┣ 📜mqc_fastqc_per_sequence_gc_content_plot_Percentages.svg
┃ ┃ ┃ ┣ 📜mqc_fastqc_per_sequence_quality_scores_plot_1.svg
┃ ┃ ┃ ┣ 📜mqc_fastqc_sequence_counts_plot_1.svg
┃ ┃ ┃ ┣ 📜mqc_fastqc_sequence_counts_plot_1_pc.svg
┃ ┃ ┃ ┣ 📜mqc_fastqc_sequence_duplication_levels_plot_1.svg
┃ ┃ ┃ ┗ 📜mqc_fastqc_sequence_length_distribution_plot_1.svg
┃ ┗ 📜multiqc_report.html
┣ 📂pipeline_info
┃ ┣ 📜execution_report_2022-06-16_09-34-32.html
┃ ┣ 📜execution_timeline_2022-06-16_09-34-32.html
┃ ┣ 📜execution_trace_2022-06-16_09-34-32.txt
┃ ┣ 📜pipeline_dag_2022-06-16_09-34-32.svg
┃ ┣ 📜samplesheet.valid.csv
┃ ┗ 📜software_versions.yml
┣ 📜results_GRiPHin_Summary.xlsx
┣ 📜Phoenix_Summary.tsv
┣ 📜BiosampleAttributes_Microbe.1.0.xlsx
┗ 📜SRA_Microbe.1.0.xlsx

This is the file tree for running one sample.

  • *Designates files that will only be generated when you use the -entry CDC_PHOENIX for the pipeline.
  • ‡Designates steps that are SKIPPED when you use -entry CDC_SCAFFOLDS or -entry SCAFFOLDS.
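
If you want to check a sample folder against the tree above, the two footnotes can be turned into a small lookup. The sketch below reads the footnotes literally (starred folders only for CDC_PHOENIX, ‡ folders skipped for the scaffold entry points); the function name expected_sample_dirs is hypothetical and the folder sets are copied from the tree, so treat the result as a guide rather than a guarantee.

```python
ALL_DIRS = {
    "AMRFinder", "ANI", "annotation", "assembly", "BUSCO", "fastp_trimd",
    "gamma_ar", "gamma_hv", "gamma_pf", "kraken2_asmbld",
    "kraken2_asmbld_weighted", "kraken2_trimd", "mlst", "qc_stats",
    "quast", "raw_stats", "srst2",
}
CDC_ONLY = {"BUSCO", "kraken2_asmbld", "srst2"}  # marked * in the tree
READ_STEPS = {  # marked ‡ in the tree
    "assembly", "fastp_trimd", "kraken2_trimd", "qc_stats", "raw_stats", "srst2",
}


def expected_sample_dirs(entry):
    """Return the sample-level folders expected for a given -entry value."""
    dirs = set(ALL_DIRS)
    if entry != "CDC_PHOENIX":
        dirs -= CDC_ONLY
    if entry in ("SCAFFOLDS", "CDC_SCAFFOLDS"):
        dirs -= READ_STEPS
    return sorted(dirs)
```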

Output File Overview

The following is an explanation of the files that are output:

  • ANI - Output of FastANI and Mash dist
  • AMRFinder - Output of AMRFinder: AR gene calls and resistance-associated point mutations
  • assembly - Assembly output from SPAdes and filtering/header renaming steps.
  • annotation - Annotation output from PROKKA.
  • BUSCO - Output from BUSCO run on scaffolds summarizing assembly completeness.
  • fastp_trimd - Output of raw reads filtering and stats for trimmed, raw and unpaired reads.
  • GAMMA
    • gamma_ar - Output of GAMMA hits from curated AR database
    • gamma_hv - Output of GAMMA hits from hypervirulence gene database
    • gamma_pf - Output of GAMMA-S hits from plasmid finder database
  • Kraken2
    • kraken2_trimd - Output of Kraken2 run on trimmed reads and Krona plots
    • kraken2_asmbld - Output of Kraken2 run on the assembly and Krona plots
    • kraken2_asmbld_weighted - Output of Kraken2 run on the assembly weighted by sequence length and Krona plots
  • mlst - Output of MLST scans for assembly files against traditional PubMLST typing schemes
  • qc_stats - Output of fastqc on trimmed reads, BBDUK log (remove adapters and PhiX reads) and Fastp summary file _trimmed_read_counts.txt
  • quast - Assembly QC metrics
  • raw_stats - Summary file of Fastp run _raw_read_counts.txt
  • srst2 - Output from SRST2 after mapping trimmed reads to a curated AR database
  • Sample Specific Files - Files that summarize the results for a sample
  • Run Specific Files - A file that summarizes multiple samples. A good first place to start.
  • MultiQC - Aggregate report describing results and FastQC from the whole pipeline
  • Pipeline information - Report metrics generated during the workflow execution

ANI

Output files
  • ANI/
    • *_REFSEQ_[date].ani.txt: Output of FastANI, which shows the ANI estimate between the assembly and the top 20 closest genomes (determined via the mash distance). The first two columns are the query assembly and the reference genome. The remaining columns are the ANI estimate, the number of sequence fragments that were aligned as orthologous matches, and the total sequence fragments from the assembly. For further details see the FastANI documentation.
    • ANI/fastANI
      • *_REFSEQ_[date].fastANI.txt: This is a reformatted version of *.ani.txt that list matches in order of ANI and includes the top match information as the first line of the file to be extracted in downstream processes for reporting.
    • ANI/mash_dist
      • *_REFSEQ_[date].txt: Output of mash distance. For further details see the Mash documentation.
      • *_REFSEQ_[date]_best_MASH_hits.txt: A list of the top 20 matches found via mash dist that is passed to FastANI to calculate ANI.

FastANI is developed for fast alignment-free computation of whole-genome Average Nucleotide Identity (ANI). ANI is defined as mean nucleotide identity of orthologous gene pairs shared between two microbial genomes. FastANI avoids expensive sequence alignments and uses Mashmap as its MinHash based sequence mapping engine to compute the orthologous mappings and alignment identity estimates.
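
As an illustration of how the FastANI columns can be used, the sketch below ranks hits by ANI, similar in spirit to the reformatted *.fastANI.txt file. It assumes the standard five-column, tab-separated FastANI output (query, reference, ANI, matched fragments, total fragments); the function name rank_fastani_hits and the "coverage" field are made up for this example and are not PHoeNIx's reformatting code.

```python
def rank_fastani_hits(text):
    """Parse tab-separated FastANI output and return hits sorted by ANI, best first."""
    hits = []
    for line in text.splitlines():
        fields = line.split("\t")
        if len(fields) != 5:
            continue  # skip blank or malformed lines
        _query, reference, ani, matched, total = fields
        hits.append({
            "reference": reference,
            "ani": float(ani),
            # fraction of assembly fragments that mapped to the reference
            "coverage": int(matched) / int(total),
        })
    return sorted(hits, key=lambda hit: hit["ani"], reverse=True)
```

The first element of the returned list is the top match, which is the kind of information that would be pulled out for reporting.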

AMRFinder

Output files
  • AMRFinder/
    • *_AMRFinder_Organism.csv: This file just contains the organism (if found) to be passed to amrfinder using the --organism parameter. Read more about the organism option in AMRFinder's documentation.
    • *_all_mutations.tsv: File generated by passing the --mutation_all argument to AMRFinder. Read more about the mutation option in AMRFinder's documentation.
    • *_all_genes.tsv: The AR gene calls by AMRFinder. Only the point mutations are reported in Phoenix_Output_Report.tsv.

AMRFinder and the accompanying database identify acquired antimicrobial resistance genes in bacterial protein and/or assembled nucleotide sequences as well as known resistance-associated point mutations for several taxa. AMRFinderPlus has added select members of additional classes of genes such as virulence factors, biocide, heat, acid, and metal resistance genes.

Assembly

Output files
  • assembly/
    • *.assembly.gfa.gz: Contains SPAdes assembly graph and scaffolds paths in GFA 1.0 format
    • *.bbmap_filtered.log: The log file of bbmap, which is used to remove scaffolds shorter than 500 bp
    • *.contigs.fa.gz: Contains contigs generated by SPAdes
    • *.filtered.scaffolds.fa.gz: Scaffolds file with sequences shorter than 500 bp removed.
    • *.renamed.scaffolds.fa.gz: Same as the *.filtered.scaffolds.fa.gz file, but headers contain the sample name.
    • *.scaffolds.fa.gz: Contains scaffolds generated by SPAdes
    • *.spades.log: SPAdes log
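
The length filter behind *.filtered.scaffolds.fa.gz can be pictured with a few lines of Python. PHoeNIx uses bbmap for this step; the code below is only an illustration of the same idea on plain FASTA text, and the function names read_fasta and drop_short_scaffolds are hypothetical.

```python
def read_fasta(text):
    """Return a list of [name, sequence] records from FASTA-formatted text."""
    records = []
    for line in text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            records.append([line[1:], ""])
        elif line and records:
            records[-1][1] += line  # join wrapped sequence lines
    return records


def drop_short_scaffolds(text, min_length=500):
    """Keep only scaffolds that are at least min_length bases long."""
    return [(name, seq) for name, seq in read_fasta(text) if len(seq) >= min_length]
```

A scaffold of exactly 500 bp is kept, since only sequences shorter than 500 bp are removed.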

Annotation

Output files
  • annotation/
    • *.faa: Protein FASTA file of the translated CDS sequences.
    • *.fna: Nucleotide FASTA file of the input contig sequences.
    • *.gff: This is the master annotation in GFF3 format, containing both sequences and annotations. It can be viewed directly in Artemis or IGV.

PROKKA – ⚡ ♒ Rapid prokaryotic genome annotation – Whole genome annotation is the process of identifying features of interest in a set of genomic DNA sequences, and labelling them with useful information. Prokka is a software tool to annotate bacterial, archaeal and viral genomes quickly and produce standards-compliant output files. For further reading and documentation see PROKKA Output Details.

BUSCO - only run with -entry CDC_PHOENIX

BUSCO output is based on evolutionarily-informed expectations of gene content of near-universal single-copy orthologs, thus the BUSCO metric is complementary to technical metrics like N50.

Output files
  • BUSCO/
    • *-auto-busco.batch_summary.txt:
    • short_summary.generic.*.filtered.scaffolds.fa.json: Contains a summary of the results in JSON form.
    • short_summary.generic.*.filtered.scaffolds.fa.txt: Contains a plain text summary of the results in BUSCO notation.
    • short_summary.specific.*.filtered.scaffolds.fa.json: Contains a summary of the results in JSON form.
    • short_summary.specific.*.filtered.scaffolds.fa.txt: Contains a plain text summary of the results in BUSCO notation.

For further reading and documentation see the BUSCO Users Guide.

Fastp

Output files
  • fastp_trimd/
    • *.fastp.html: Html output of fastp run on raw reads.
    • *.fastp.json: Same as the html output, just in json format
    • *.singles.fastq.gz: Unpaired reads that passed the QC filters when running fastp on the raw reads.
    • *_1.trim.fastq.gz: Forward reads from paired-end reads that passed the QC filters of fastp.
    • *_2.trim.fastq.gz: Reverse reads from paired-end reads that passed the QC filters of fastp.
    • *_singles.fastp.html: Html output of fastp run on unpaired reads.
    • *_singles.fastp.json: Same as the html output, just in json format.

FastP is a tool designed to provide fast all-in-one preprocessing for FastQ files. This tool is developed in C++ with multithreading supported to afford high performance. For further reading and documentation see the Fastp documentation.

GAMMA

Output files
  • gamma_ar/
    • *_ResGANNCBI_[date]_srst2.gamma: Output of GAMMA that are the best matches from the curated AR gene database.
    • *_ResGANNCBI_[date]_srst2.psl: blat output in psl format
  • gamma_hv/
    • *_HyperVirulence_[date].gamma: Output of GAMMA that are the best matches from the hypervirulence database.
    • *_HyperVirulence_[date].psl: blat output in psl format
  • gamma_pf/
    • *_PF-Replicons_[date].gamma: Output of GAMMA-S that are the best matches from the plasmid finder database without translating them.
    • *_PF-Replicons_[date].psl: blat output in psl format

GAMMA (Gene Allele Mutation Microbial Assessment) is a command line tool that finds gene matches in microbial genomic data using protein coding (rather than nucleotide) identity, and then translates and annotates the match by providing the type (i.e., mutant, truncation, etc.) and a translated description (i.e., Y190S mutant, truncation at residue 110, etc.). Because microbial gene families often have multiple alleles and existing databases are rarely exhaustive, GAMMA is helpful in both identifying and explaining how unique alleles differ from their closest known matches. GAMMA-S (Gene Allele Mutation Microbial Assessment-Sequence) finds best matches from a gene database without translating them, so it finds the best match by nucleotides rather than by the translated protein sequence. For further reading and documentation see GAMMA's GitHub.
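
To make descriptions like "Y190S mutant" or "truncation at residue 110" concrete, here is a toy Python sketch that compares a reference protein with a translated match. It is not GAMMA's algorithm: it ignores indels and alignment entirely, the function name describe_match is hypothetical, "*" is assumed to mark a stop codon, and reporting the 1-based position of the stop is an assumption about the numbering.

```python
def describe_match(reference, query):
    """Describe how a translated query protein differs from its reference."""
    stop = query.find("*")
    if stop != -1 and stop < len(reference):
        # A premature stop codon ends the protein early.
        return f"truncation at residue {stop + 1}"
    changes = [
        f"{ref}{position}{alt}"
        for position, (ref, alt) in enumerate(zip(reference, query), start=1)
        if ref != alt
    ]
    if not changes:
        return "exact match"
    return ", ".join(changes) + " mutant"
```

For instance, a query that has serine where the reference has tyrosine at position 3 is described as "Y3S mutant".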

Kraken2

Output files
  • kraken2_asmbld/
    • krona/
      • *_asmbld.html: Interactive hierarchical chart of kraken2's taxa calls on the assembly that can be viewed with any modern web browser.
      • *_asmbld.krona: Krona file used to make the *_asmbld.html file.
    • *.asmbld_summary.txt: The kraken2 best hit for the scaffolds.
    • *.classified.fasta.gz: The sequences that were able to be classified by kraken2.
    • *.kraken2_asmbld.classifiedreads.txt: Standard Kraken2 output on assembly scaffolds.
    • *.kraken2_asmbld.report.txt: Kraken2 report for assembly scaffolds.
    • *.mpa: Converted Kraken report style output to a mpa (MetaPhlAn)-style TEXT file. Used downstream to collect final stats.
    • *.unclassified.fasta.gz: The sequences that were unable to be classified by kraken2.
  • kraken2_asmbld_weighted/
    • krona/
      • *_wtasmbld.html: Interactive hierarchical chart of kraken2's taxa calls on the weighted assembly that can be viewed with any modern web browser.
      • *_wtasmbld.krona: Krona file used to make the *_wtasmbld.html file.
    • *.kraken2_wtasmbld.report.txt: Kraken2 report for weighted assembly.
    • *.wtasmbld_summary.txt: The kraken2 best hit for the weighted assembly.
  • kraken2_trimd/
    • krona/
      • *_trimd.html: Interactive hierarchical chart of kraken2's taxa calls on the trimmed reads that can be viewed with any modern web browser.
      • *_trimd.krona: Krona file used to make the *_trimd.html file.
    • *.classified_1.fasta.gz: The forward reads that were able to be classified by kraken2.
    • *.classified_2.fasta.gz: The reverse reads that were able to be classified by kraken2.
    • *.kraken2_trimd.classifiedreads.txt: Standard Kraken2 output on trimmed reads.
    • *.kraken2_trimd.report.txt: Kraken2 report for trimmed reads.
    • *.mpa: Converted Kraken report style output to a mpa (MetaPhlAn)-style TEXT file. Used downstream to collect final stats.
    • *.trimd_summary.txt: The kraken2 best hit for the trimmed reads.
    • *.unclassified_1.fasta.gz: The forward reads that were unable to be classified by kraken2.
    • *.unclassified_2.fasta.gz: The reverse reads that were unable to be classified by kraken2.

Kraken2 is a taxonomic classification system using exact k-mer matches to achieve high accuracy and fast classification speeds. This classifier matches each k-mer within a query sequence to the lowest common ancestor (LCA) of all genomes containing the given k-mer. For further reading and documentation see Kraken2's GitHub.

Krona allows hierarchical data to be explored with zooming, multi-layered pie charts. The resulting interactive charts are self-contained and can be viewed with any modern web browser. For further reading and documentation see Krona's GitHub.
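
The k-mer to LCA idea described for Kraken2 can be illustrated with a toy example. The sketch below is not how Kraken2 is implemented (it uses a far more compact index and a scoring step on top); the tiny taxonomy, the made-up sequences and the function names are all assumptions for illustration.

```python
# Toy taxonomy: child -> parent
PARENT = {
    "Klebsiella pneumoniae": "Klebsiella",
    "Klebsiella oxytoca": "Klebsiella",
    "Klebsiella": "Enterobacteriaceae",
    "Escherichia coli": "Escherichia",
    "Escherichia": "Enterobacteriaceae",
    "Enterobacteriaceae": "root",
}


def lineage(taxon):
    """Return the path from a taxon up to the root."""
    path = [taxon]
    while path[-1] in PARENT:
        path.append(PARENT[path[-1]])
    return path


def lowest_common_ancestor(taxa):
    """Return the deepest node shared by the lineages of all taxa."""
    taxa = list(taxa)
    common = lineage(taxa[0])
    for taxon in taxa[1:]:
        other = set(lineage(taxon))
        common = [node for node in common if node in other]
    return common[0]


def kmer_to_lca(genomes, k):
    """Map every k-mer to the LCA of all genomes that contain it."""
    owners = {}
    for taxon, sequence in genomes.items():
        for i in range(len(sequence) - k + 1):
            owners.setdefault(sequence[i:i + k], set()).add(taxon)
    return {kmer: lowest_common_ancestor(sorted(taxa)) for kmer, taxa in owners.items()}
```

A k-mer found in only one genome maps to that species, while a k-mer shared by two Klebsiella species maps to the genus, which is why shared sequence pushes a classification higher up the tree.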

MLST

Output files
  • mlst/
    • All files will contain all schemes relevant to the identified taxonomy (e.g., Acinetobacter baumannii and Escherichia coli will have 2 schemes each)
    • *.tsv: Output of MLST that contains the filename, matching PubMLST scheme name, ST (sequence type), and allele IDs.
      • This output has the following allele markers:
        • '~' : full length novel allele
        • '?' : partial match (>min_cov & > min_ID). Default min_cov = 10, Default min_ID=95%
        • '-' : Allele is missing

Example output of a novel allele:

source_file  Database  ST  locus_1 locus_2 locus_3 locus_4 locus_5 locus_6 locus_7 locus_8 locus_9 locus_10
sample_A.filtered.scaffolds.fa   koxytoca        -       gapA(16)        infB(~28)       mdh(63) pgi(~37)        phoE(~7)        rpoB(20)        tonB(40?)

Example output of a partial allele match:

source_file  Database  ST  locus_1 locus_2 locus_3 locus_4 locus_5 locus_6 locus_7 locus_8 locus_9 locus_10
sample_B.filtered.scaffolds.fa   klebsiella      -     gapA(3) infB(3?) mdh(1)  pgi(1)  phoE(1) rpoB(1) tonB(79)

Example output of missing allele:

source_file  Database  ST  locus_1 locus_2 locus_3 locus_4 locus_5 locus_6 locus_7 locus_8 locus_9 locus_10
sample_C.filtered.scaffolds.fa   klebsiella      -     gapA(3) infB(3) mdh(-)  pgi(1)  phoE(1) rpoB(1) tonB(79)
  • *_srst2.mlst: Output of srst2 MLST that contains Sample, database, ST, mismatches, uncertainty, depth, maxMAF as well as all loci for the sample/database.
    • This output has the following allele markers:
      • '*' : Full length match with 1+ SNP (Novel)
      • '?' : edge depth is below N or average depth is below X (Default edge_depth = 2, Default average_depth = 5)
      • '-' : No allele assigned, usually because no alleles achieved >90% coverage

Example of novel allele:

Sample  database        ST      mismatches      uncertainty     depth   maxMAF  locus_1 locus_2 locus_3 locus_4 locus_5 locus_6 locus_7 locus_8 locus_9 locus_10
sample_D Klebsiella_pneumoniae   NF*     phoE_594/1snp   -       28.0804285714   0.25    gapA(3) infB(3) mdh(88) pgi(1)  phoE(594*)      rpoB(1) tonB(79)

Example output of low edge depth allele:

Sample  database        ST      mismatches      uncertainty     depth   maxMAF  locus_1 locus_2 locus_3 locus_4 locus_5 locus_6 locus_7 locus_8 locus_9 locus_10
sample_E Klebsiella_pneumoniae   NF*?    gapA_178/31holes;mdh_88/19holes;pgi_1/1snp;phoE_594/2snp7holes;tonB_79/24holes  gapA_178/edge0.0;infB_3/edge1.0;mdh_88/edge0.0;pgi_1/edge1.0;phoE_594/edge1.0;tonB_79/edge0.0   4.65714285714   0.5 gapA(178*?)      infB(3?)        mdh(88*?)       pgi(1*?)        phoE(594*?)     rpoB(1) tonB(79*?)

Example output of missing allele:

Sample  database        ST      mismatches      uncertainty     depth   maxMAF  locus_1 locus_2 locus_3 locus_4 locus_5 locus_6 locus_7 locus_8 locus_9 locus_10
sample_F Escherichia_coli#1      10      0       -       32.0842857143   0.0666666666667 adk(10) fumC(11)        gyrB(4) icd(8)  mdh(8)  purA(8) recA(2)
sample_F Escherichia_coli#2      NF       0       -       24.11425        0.1     dinB(8) icdA(-) pabB(7) polB(3) putP(7) trpA(1) trpB(4) uidA(2)
  • *_combined.tsv: Combines the MLST and srst2 MLST results, if available, and simplifies the reasoning given when a type cannot be assigned.

Examples of what the above entries look like after being passed through the clean-up script:

Sample  Source  Pulled on       Database        ST      locus_1 locus_2 locus_3 locus_4 locus_5 locus_6 locus_7 locus_8 locus_9 locus_10
sample_A standard/srst2  2023-01-17      koxytoca     Novel_allele      gapA(16) infB(~28) mdh(63) pgi(~37) phoE(~7) rpoB(20) tonB(40?)
Sample  Source  Pulled on       Database        ST      locus_1 locus_2 locus_3 locus_4 locus_5 locus_6 locus_7 locus_8 locus_9 locus_10
sample_B standard/srst2  2023-01-17      klebsiella     Novel_allele     gapA(3) infB(3?) mdh(1)  pgi(1)  phoE(1) rpoB(1) tonB(79)
Sample  Source  Pulled on       Database        ST      locus_1 locus_2 locus_3 locus_4 locus_5 locus_6 locus_7 locus_8 locus_9 locus_10
sample_C standard/srst2  2023-01-17      klebsiella     Novel_allele     gapA(3) infB(3) mdh(-)  pgi(1)  phoE(1) rpoB(1) tonB(79)
Sample  Source  Pulled on       Database        ST      locus_1 locus_2 locus_3 locus_4 locus_5 locus_6 locus_7 locus_8 locus_9 locus_10
sample_D srst2   2023-01-17      klebsiella      Novel_allele    gapA(3) infB(3) mdh(88) pgi(1)  phoE(594*)      rpoB(1) tonB(79)
sample_D standard        2023-01-17      klebsiella      258     gapA(3) infB(3) mdh(1)  pgi(1)  phoE(1) rpoB(1) tonB(79)
Sample  Source  Pulled on       Database        ST      locus_1 locus_2 locus_3 locus_4 locus_5 locus_6 locus_7 locus_8 locus_9 locus_10
sample_E srst2   2023-01-17      klebsiella      Novel_allele    gapA(178*?)     infB(3?)        mdh(88*?)       pgi(1*?)        phoE(594*?)     rpoB(1) tonB(79*?)
sample_E standard        2023-01-17      klebsiella      258     gapA(3) infB(3) mdh(1)  pgi(1)  phoE(1) rpoB(1) tonB(79)
Sample  Source  Pulled on       Database        ST      locus_1 locus_2 locus_3 locus_4 locus_5 locus_6 locus_7 locus_8 locus_9 locus_10
sample_F standard/srst2  2023-01-17      ecoli_2(Pasteur)        Novel_allele       dinB(8) icdA(-) pabB(7) polB(3) putP(7) trpA(1) trpB(4) uidA(2)
sample_F standard/srst2  2023-01-17      ecoli(Achtman)  10      adk(10) fumC(11)        gyrB(4) icd(8)  mdh(8)  purA(8) recA(2)
- If both assembly-based (MLST) and read-based (srst2 MLST) typing are run and they disagree while still hitting the same database, the results are placed on separate lines. If they agree, the Source column (#2) will indicate that it is a match.

MLST scans assembly files against traditional PubMLST typing schemes. srst2_MLST scans read files against traditional PubMLST typing schemes.
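The allele markers listed above can be decoded programmatically when reviewing many samples. The following is a minimal sketch that assumes each locus field has the form `locus(call)` as in the example rows; the function name `parse_allele` and the `MARKERS` table are illustrative and not part of PHoeNIx.

```python
import re

# A locus field looks like "infB(~28)", "tonB(40?)", "mdh(-)" or "gapA(178*?)".
ALLELE_RE = re.compile(r"^(?P<locus>[^()]+)\((?P<call>[^()]*)\)$")

# Marker meanings as described in the lists above.
MARKERS = {
    "~": "full length novel allele (MLST)",
    "*": "full length match with 1+ SNP (srst2)",
    "?": "partial match (MLST) or low depth (srst2)",
    "-": "allele missing / not assigned",
}


def parse_allele(field):
    """Split one locus field into (locus, allele_id, flags)."""
    match = ALLELE_RE.match(field.strip())
    if match is None:
        raise ValueError(f"unrecognized allele field: {field!r}")
    call = match.group("call")
    flags = [marker for marker in MARKERS if marker in call]
    # Remove the marker characters to leave the bare allele number, if any.
    allele_id = call.strip("~*?-") or None
    return match.group("locus"), allele_id, flags


row = "gapA(16) infB(~28) mdh(-) tonB(40?)".split()
for locus, allele_id, flags in map(parse_allele, row):
    notes = "; ".join(MARKERS[f] for f in flags) or "exact match"
    print(f"{locus}\t{allele_id}\t{notes}")
```

Any locus that carries a flag explains why the ST column shows "-" or a novel/NF call rather than a numbered sequence type.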

QC_Stats

Output files
  • qc_stats/
    • *_fastqc.html: FastQC report containing quality metrics.
    • *_fastqc.zip: Zip archive containing the FastQC report, tab-delimited data file and plot images.
    • *.bbduk.log: log file for the bbduk run.
    • *_trimmed_read_counts.txt: Parsed *.fastp.json on trimmed reads and single reads with custom stat calculations.

FastQC gives general quality metrics about your sequenced reads. It provides information about the quality score distribution across your reads, per base sequence content (%A/T/G/C), adapter contamination and overrepresented sequences. For further reading and documentation see the FastQC help pages.

BBDUK was developed to combine most common data-quality-related trimming, filtering, and masking operations into a single high-performance tool. For further reading and documentation see BBDUK's Manual.

QUAST

Output files
  • quast/
    • *_report.tsv: tab-separated version of the summary, suitable for spreadsheets.

QUAST evaluates genome assemblies. For further reading and documentation see QUAST's Manual.
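One of the assembly metrics QUAST reports is N50, which also appears in the *.synopsis file described later on this page. As a rough illustration of what that number means, the sketch below computes N50 from a list of scaffold lengths using the usual definition (the length at which scaffolds of that size or longer cover at least half of the assembly). It is not QUAST's code, and the example lengths are made up.

```python
def n50(lengths):
    """Return the N50 of a collection of scaffold lengths (0 if empty)."""
    ordered = sorted(lengths, reverse=True)
    half_total = sum(ordered) / 2
    running = 0
    # Add scaffolds from longest to shortest until half the assembly is covered.
    for length in ordered:
        running += length
        if running >= half_total:
            return length
    return 0


# Made-up scaffold lengths for illustration only.
scaffold_lengths = [2, 3, 4, 5, 6]
print(n50(scaffold_lengths))
```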

Raw Stats

Output files
  • raw_stats/
    • *_raw_read_counts.txt: Parsed *.fastp.json on raw reads and custom stat calculations.
    • *_FAIry_synopsis.txt: FAIry stands for "FASTQ file Assessment of Integrity". This report will describe if the raw fastq files are corrupted and if there is a difference in the number of reads in R1 and R2.

srst2 - only run with -entry CDC_PHOENIX

Output files
  • srst2/
    • *._fullgenes__[date]__results.txt: STs will be printed in tab-delim format to a file called [outputprefix]mlst[db]__results.txt, output is explained further here.

SRST2 performs Short Read Sequence Typing for Bacterial Pathogens. For further reading and documentation see SRST2's GitHub.

Sample Specific Files

Output files
  • *.synopsis: This file contains a summary of stats for the sample and will provide warnings and alerts for metrics that don't meet quality standards. This is an example output:
---------- Checking SRR12352153 for successful completion on Fri Sep 16 15:27:15 EDT 2022 ----------
Summarized                    : SUCCESS  : Fri Sep 16 15:27:15 EDT 2022
FASTQs                        : SUCCESS  : R1: 165249420bps R2: 165400291bps
RAW_READ_COUNTS               : SUCCESS  : 1445464 individual reads found in sample (722732 paired reads)
RAW_Q30_R1%                   : WARNING  : Q30_R1% at 88% (Threshold is 90%)
RAW_Q30_R2%                   : SUCCESS  : Q30_R2% at 71% (Threshold is 70%)
TRIMMED_BPS                   : SUCCESS  : R1: 103339444bps R2: 84715304bps Unpaired: 28753074bps
TRIMMED_READ_COUNTS           : SUCCESS  : 1147210 individual reads found in sample (494219 paired reads, 158772 singled reads)
TRIMMED_Q30_R1%               : SUCCESS  : Q30_R1% at 98% (Threshold is 90%)
TRIMMED_Q30_R2%               : SUCCESS  : Q30_R2% at 96% (Threshold is 70%)
KRAKEN2_CLASSIFY_READS        : SUCCESS  : 32.86% Klebsiella pneumoniae with 1.63% unclassified reads
KRAKEN2_READS_CONTAM          : SUCCESS  : Only one genus has been found above the 25% threshold
ASSEMBLY                      : SUCCESS  : 195 scaffolds found
SCAFFOLD_TRIM                 : SUCCESS  : 123 scaffolds remain. 72 were removed due to shortness
KRAKEN2_CLASSIFY_WEIGHTED     : SUCCESS  : Klebsiella(97.59%) pneumoniae(97.28%) with 0.00% unclassified scaffolds
KRAKEN2_WEIGHTED_CONTAM       : SUCCESS  : Only one genus has been found above the 25% threshold
QUAST                         : SUCCESS  : #-123 length-5608577 n50-177421 %GC-57.19
QUAST_GC_Content              : SUCCESS  : %GC-57.19 is within 56.09796-58.11529 (2.58*0.39096stdevs) away from the mean of 57.10662.
TAXA-ANI_REFSEQ               : SUCCESS  : Klebsiella pneumoniae
ASSEMBLY_RATIO(SD)            : SUCCESS  : 1.0004x(.0074-SD) against K.pneumoniae
COVERAGE                      : ALERT    : 38.65x coverage based on trimmed reads (Target:40x)
FASTANI_REFSEQ                : SUCCESS  : 99.85%ID-94.76%COV-Klebsiella pneumoniae(Klebsiella_pneumoniae_GCF_001855315.1_ASM185531v1_genomic.fna.gz)
MLST-KLEBSIELLA               : SUCCESS  : ST147
GAMMA_AR                      : SUCCESS  : 26 AR gene(s) found from ResGANNCBI_20210507
AMRFINDER                     : SUCCESS  : 5 point mutation(s) found
PLASMID_REPLICONS             : SUCCESS  : 10 replicon(s) found from SRR12352153_PF-Replicons
HYPERVIRULENCE                : SUCCESS  : No hypervirulence genes were found from SRR12352153_HyperVirulence
Auto Pass/FAIL                : PASS     : Minimum Requirements met for coverage(30x)/ratio_stdev(<2.58)/min_length(>1000000) to pass auto QC filtering
---------- SRR12352153 completed as WARNING ----------
WARNINGS: out of line with what is expected and MAY cause problems downstream.
ALERT: something to note, does not mean it is a poor-quality assembly.
  • *.tax: This file contains the best taxa id. The number after the ":" is the NCBI assigned taxID for easy lookup. This is an example output:
ANI_REFSEQ	99.86	2002178_REFSEQ_20230504.fastANI.txt
K:2	Bacteria
P:1224	Pseudomonadota
C:1236	Gammaproteobacteria
O:91347	Enterobacterales
F:543	Enterobacteriaceae
G:561	Escherichia
s:562	coli
  • *_Assembly_ratio_[date].txt: This file contains information on the assembly ratio and standard dev for the sample. This is an example output:
Tax: Klebsiella pneumoniae
NCBI_TAXID: 573
Species_St.Dev: 264827
Isolate_St.Devs: .0074
Actual_length: 5608577
Expected_length: 5606613
Ratio: 1.0004
  • *_GC_content_[date].txt: This file contains information on the GC content of the sample and the GC statistics (standard deviation, min, max, mean, and count) of the reference genomes for the species. This is an example output:
Tax: Enterobacter sp.MGH-14
NCBI_TAXID: 1329823
Species_GC_StDev: Not calculated on species with n<10 references
Species_GC_Min: 54.7183
Species_GC_Max: 54.8
Species_GC_Mean: 54.75915
Species_GC_Count: 2
Sample_GC_Percent: 54.66
  • *_summaryline.tsv: This is a one line summary that contains the columns:
    • ID - The name of the sample ID, which is determined from the samplesheet.
    • Auto_QC_Outcome - Either PASS or FAIL of the Auto PASS/FAIL
    • Warning_Count - The number of warnings for the sample. Warnings can be viewed in the *.synopsis file.
    • Estimated_Coverage - Estimated coverage as determined by (total trimmed bases / assembly length)
    • Genome_Length - Length of the assembled genome in base pairs.
    • Assembly_Ratio_(STDev) - The calculated assembly ratio (assembly size / median genome size of species) with the sample's standard deviation. Standard deviation is only calculated when there are >=10 reference genomes for that taxon.
    • #of_Scaffolds>500bp - The number of scaffolds in the genome that are >500bp, those <500bp were filtered out of downstream analysis.
    • GC_% - % of G/C in the assembled genome.
    • Species - The Taxa determined by either FastANI or Kraken2.
    • Taxa_Confidence - The confidence value for the taxonomic assignment, which depends on the method used to determine taxa (FastANI, Kraken2_Weighted, or Kraken2_Trimd).
    • Taxa_Source - This column will say which method was used to determine taxonomy. PHoeNIx will assign taxonomy based on the best match from FastANI that compares genomes from RefSeq. If FastANI fails PHoeNIx will fall back on the taxonomic assignment from Kraken2_Weighted and if no assembly was created then it will use Kraken2_Trimd.
    • Kraken2_Trimd - Taxa determined by running kraken2 on the cleaned reads. The percent of reads per genus/species is presented in parentheses next to the respective taxa level.
    • Kraken2_Weighted - Taxonomic assignment based on the assembly (scaffolds); the % is generated by weighting the scaffolds by their length. The percent per genus/species is presented in parentheses next to the respective taxa level.
    • MLST_Scheme_1 - Primary MLST scheme used.
    • MLST_1 - Primary MLST alleles.
    • MLST_Scheme_2 - If there is a secondary scheme it will be listed here.
    • MLST_2 - If there was a secondary scheme then those MLST alleles are listed here.
    • GAMMA_Beta_Lactam_Resistance_Genes - GAMMA hits against our custom database that combines AMRFinderPlus, ARG-ANNOT, and ResFinder filtered to only report the beta lactam genes.
    • GAMMA_Other_AR_Genes - Same as above only non-beta lactam genes.
    • AMRFinder_Point_Mutations - Point mutations as determined by AMRFinderPlus.
    • Hypervirulence_Genes - GAMMA hits against the database of hypervirulence genes from Russo et al.
    • Plasmid_Incompatibility_Replicons - GAMMA hits against the PlasmidFinder database.
    • Auto_QC_Failure_Reason - The reason the sample failed the Auto PASS/FAIL check.
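The assembly ratio reported in *_Assembly_ratio_[date].txt and in the Assembly_Ratio_(STDev) column can be checked by hand. The sketch below parses the example file shown above and recomputes both numbers. The ratio follows the definition given in the column description; the standard deviation formula (absolute length difference divided by the species standard deviation) is an assumption inferred from the example values, not taken from PHoeNIx's source.

```python
EXAMPLE = """Tax: Klebsiella pneumoniae
NCBI_TAXID: 573
Species_St.Dev: 264827
Isolate_St.Devs: .0074
Actual_length: 5608577
Expected_length: 5606613
Ratio: 1.0004"""


def parse_key_value(text):
    """Parse 'key: value' lines into a dict, splitting on the first colon only."""
    record = {}
    for line in text.splitlines():
        if ":" not in line:
            continue
        key, value = line.split(":", 1)
        record[key.strip()] = value.strip()
    return record


def assembly_ratio(record):
    """Return (ratio, stdevs_from_expected) for a parsed assembly ratio file."""
    actual = float(record["Actual_length"])
    expected = float(record["Expected_length"])
    species_sd = float(record["Species_St.Dev"])
    ratio = actual / expected
    # Assumed formula: distance from the expected length in species std devs.
    stdevs = abs(actual - expected) / species_sd
    return ratio, stdevs


record = parse_key_value(EXAMPLE)
ratio, stdevs = assembly_ratio(record)
print(f"ratio={ratio:.4f} stdevs={stdevs:.4f}")
```

Rounded to four decimals, both values agree with the Ratio and Isolate_St.Devs lines of the example file.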

Each sample's *_summaryline.tsv file is combined into the run-level Phoenix_Summary.tsv.
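When many samples need to be triaged, the *.synopsis file shown earlier in this section can be parsed line by line. The sketch below assumes every check follows the `NAME : STATUS : detail` layout of the example and uses a few lines copied from it. It also recomputes the estimated coverage as total trimmed bases / assembly length, as defined for Estimated_Coverage. The function names are illustrative and not part of PHoeNIx.

```python
import re
from collections import Counter

# Matches "CHECK_NAME   : STATUS  : free text detail".
LINE_RE = re.compile(r"^(?P<check>\S.*?)\s+:\s+(?P<status>[A-Z]+)\s+:\s+(?P<detail>.*)$")

# A few lines copied from the example synopsis above.
SYNOPSIS = """\
RAW_Q30_R1%                   : WARNING  : Q30_R1% at 88% (Threshold is 90%)
TRIMMED_BPS                   : SUCCESS  : R1: 103339444bps R2: 84715304bps Unpaired: 28753074bps
QUAST                         : SUCCESS  : #-123 length-5608577 n50-177421 %GC-57.19
COVERAGE                      : ALERT    : 38.65x coverage based on trimmed reads (Target:40x)
"""


def parse_synopsis(text):
    """Return {check_name: (status, detail)} for every check line found."""
    checks = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            checks[match.group("check")] = (match.group("status"), match.group("detail"))
    return checks


def estimated_coverage(checks):
    """Total trimmed bases (R1 + R2 + unpaired) divided by assembly length."""
    trimmed_bases = sum(int(n) for n in re.findall(r"(\d+)bps", checks["TRIMMED_BPS"][1]))
    assembly_length = int(re.search(r"length-(\d+)", checks["QUAST"][1]).group(1))
    return trimmed_bases / assembly_length


checks = parse_synopsis(SYNOPSIS)
status_counts = Counter(status for status, _ in checks.values())
coverage = estimated_coverage(checks)
print(dict(status_counts), f"{coverage:.2f}x")
```

The recomputed coverage lands within 0.01x of the value on the COVERAGE line; the small difference is likely down to rounding.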

Run Specific Files

Output files
  • Phoenix_Summary.tsv: A combination of all *_summaryline.tsv files that gives a good overview of the entire run.
  • *_GRiPHin_Summary.xlsx: An Excel file that summarizes the output of PHoeNIx for all samples.
    • ID - The name of the sample ID, which is determined from the samplesheet.
    • QC Metrics
      • Minimum_QC_Check - Either PASS or FAIL of the Auto PASS/FAIL
      • Minimum_QC_Issues - The reason(s) the sample failed the Auto PASS/FAIL check.
      • Warnings - Short descriptions of the warnings for the sample. Further details on the warnings can be viewed in the *.synopsis file. See Pipeline Overview for details on the warnings.
      • Alerts - Short descriptions of the alerts for the sample. Further details on the alerts can be viewed in the *.synopsis file. See Pipeline Overview for details on the alerts.
      • Raw_Q30_R1[%] - The percentage of raw R1 bp that are at >=Q30.
      • Raw_Q30_R2[%] - The percentage of raw R2 bp that are at >=Q30.
      • Total_Raw_[reads] - The total raw read count.
      • Paired_Trimmed_[reads] - The total paired reads (R1 and R2) that passed the filter and QC process.
      • Total_Trimmed_[reads] - The total paired reads (R1 and R2) and singletons that passed the filter and QC process. We use paired and singleton reads for assembly.
      • Estimated_Trimmed_Coverage - Estimated coverage as determined by (total trimmed bases / assembly length).
      • GC[%] - G/C % in the assembled genome.
      • Scaffolds - The number of scaffolds in the genome that are >500bp, those <500bp were filtered out of downstream analysis.
      • Assembly_Length - Length of the assembled genome in base pairs.
      • Assembly_Ratio - The calculated assembly ratio (assembly size / median genome size of species) with the sample's standard deviation. Standard deviation is only calculated when there are >=10 reference genomes for that taxa.
      • Assembly_StDev - The standard deviation for the species. Standard deviation is only calculated when there are >=10 reference genomes for that taxon.
    • Taxonomic Information
      • Taxa_Source - This column will say which method was used to determine taxonomy. PHoeNIx will assign taxonomy based on the best match from FastANI that compares genomes from RefSeq. If FastANI fails PHoeNIx will fall back on the taxonomic assignment from Kraken2_Weighted and if no assembly was created then it will use Kraken2_Trimd.
      • BUSCO_Lineage - The BUSCO lineage dataset used to assess the assembly.
      • BUSCO_%Match - The % of BUSCO genes from that lineage that were found in the assembly.
      • Kraken_ID_Raw_Reads_% - Taxa determined by running kraken2 on the cleaned reads. The percent of reads per genus/species is presented in parentheses next to the respective taxa level.
      • Kraken_ID_WtAssembly_% - Taxonomic assignment based on the assembly (scaffolds); the % is generated by weighting the scaffolds by their length. The percent per genus/species is presented in parentheses next to the respective taxa level.
      • FastANI_Organism - The taxa (genus and species) IDed by FastANI.
      • FastANI_%ID - Calculated % average nucleotide identity (ANI). ANI is the mean nucleotide identity of orthologous gene pairs shared between two microbial genomes.
      • FastANI_%Coverage - % of scaffolds in the query genome (your assembled genome) that aligned successfully to the reference sequence (the closest hit in RefSeq to your assembled genome as determined by mash).
      • Species_Support_ANI - The genome in RefSeq that is the closest match to your assembled genome.
    • MLST Schemes
      • Primary_MLST_Scheme - Primary MLST scheme used.
      • Primary_MLST_Source - The source of the primary MLST determination (assembly: MLST or reads: srst2).
      • Primary_MLST - The sequence type based on primary MLST scheme.
      • Primary_MLST_Alleles - Primary MLST alleles.
      • Secondary_MLST_Scheme - If there is a secondary scheme it will be listed here.
      • Secondary_MLST_Source - The source of the secondary MLST determination (assembly: MLST or reads: srst2).
      • Secondary_MLST - If there was a secondary MLST scheme the sequence type.
      • Secondary_MLST_Alleles - Secondary MLST alleles.
    • Antibiotic Resistance Genes
      • AR_Database - The database used for identifying AR genes. Genes IDed by GAMMA are filtered to require a minimum of 98% amino acid identity and 90% of the length to be included in the report. Similarly for SRST2, genes are filtered with a threshold of >=98% nucleotide identity and >=90% of the length to be included in the report.
      • The next series of columns gives the details of each AR gene identified in the isolates. As an example, the column name blaPAO_1_AY083595_(beta-lactam) translates to the gene name "blaPAO", the accession "AY083595", and the drug class it confers resistance to, "beta-lactam". A gene identified via GAMMA that passes the filter is reported as [%Nuc_Identity/%AA_Identity/%Coverage: Contig Number the gene is found on]. Similarly, SRST2-identified genes are reported as [%Nuc_Identity/%Coverage].
    • Hypervirulence_Genes
      • HV_Database - The database used for identifying Hypervirulence genes.
      • The next series of columns gives the details of each hypervirulence gene. These have the same output structure as the AR genes (i.e. [%Nuc_Identity/%AA_Identity/%Coverage: Contig Number the gene is found on]). There is no filter for the quality of these hits!
    • Plasmid_Incompatibility_Replicons
      • Plasmid_Replicon_Database - The database used for identifying plasmid incompatibility replicons.
      • The next series of columns gives the details of each plasmid incompatibility replicon identified. These have the same output structure as the AR and hypervirulence genes (i.e. [%Nuc_Identity/%Coverage: Contig Number the gene is found on]).
  • *_GRiPHin_Summary.tsv: A tsv file that summarizes the output of PHoeNIx for all samples. Just a tsv version of *_GRiPHin_Summary.xlsx, but with less flair (aka highlighting) for easier parsing.
  • Biosample_Attribute_Microbe.1.0.xlsx: To facilitate data sharing, PHoeNIx generates the required template for Biosample creation when run with --create_ncbi_sheet. Metadata fields in this template are optimized for CDC partners planning to upload raw Illumina sequencing reads to the CDC's HAISeq Umbrella BioProject on NCBI, including designating the appropriate sub-BioProject within HAISeq. PHoeNIx also fills out as much metadata as possible from the pipeline results. A small number of required variables (Host, Isolate Source, Collection Date) need to be filled out by the submitter before completing NCBI SRA submissions. Additional information on uploading data to the HAISeq Umbrella BioProject can be found in the General Guidance for WGS of HAI AR Pathogens v2.
  • SRA_Microbe.1.0.xlsx: To facilitate data sharing, PHoeNIx generates the required template for describing experimental data (associated with the sequencing process that generated the raw fastq files) when run with --create_ncbi_sheet. Metadata fields in this template are optimized for CDC partners planning to upload raw Illumina sequencing reads to the CDC's HAISeq Umbrella BioProject on NCBI. A small number of required variables (Instrument Model, Design Description) need to be filled out by the submitter before completing NCBI SRA submissions. Additional information on uploading data to the HAISeq Umbrella BioProject can be found in the General Guidance for WGS of HAI AR Pathogens v2.
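The AR gene column names in the GRiPHin summary pack several pieces of information into one string, as described for blaPAO_1_AY083595_(beta-lactam) above. The sketch below splits such a name into its parts. The regular expression is an assumption based only on that single example; the meaning of the number between the gene name and the accession is not described on this page, so it is kept as an unnamed `index`, and gene names or accessions with other layouts may need a different pattern.

```python
import re

# Assumed layout: <gene>_<index>_<accession>_(<drug class>)
AR_COLUMN_RE = re.compile(
    r"^(?P<gene>.+?)_(?P<index>\d+)_(?P<accession>.+)_\((?P<drug_class>.+)\)$"
)


def parse_ar_column(name):
    """Split an AR gene column name into gene, index, accession and drug class."""
    match = AR_COLUMN_RE.match(name)
    if match is None:
        raise ValueError(f"column name does not match the assumed layout: {name!r}")
    return match.groupdict()


print(parse_ar_column("blaPAO_1_AY083595_(beta-lactam)"))
```

Grouping columns by `drug_class` is one way to reproduce the beta-lactam versus other split used in the *_summaryline.tsv columns.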

MultiQC

Output files
  • multiqc/
    • multiqc_report.html: a standalone HTML file that can be viewed in your web browser.
    • multiqc_data/: directory containing parsed statistics from the different tools used in the pipeline.
    • multiqc_plots/: directory containing static images from the report in various formats.

MultiQC is a visualization tool that generates a single HTML report summarizing all samples in your project. Most of the pipeline QC results are visualized in the report and further statistics are available in the report data directory.

Results generated by MultiQC collate pipeline QC from supported tools e.g. FastQC. The pipeline has special steps which also allow the software versions to be reported in the MultiQC output for future traceability. For more information about how to use MultiQC reports, see http://multiqc.info.

Pipeline information

Output files
  • pipeline_info/
    • Reports generated by Nextflow: execution_report.html, execution_timeline.html, execution_trace.txt and pipeline_dag.dot/pipeline_dag.svg.
    • Reports generated by the pipeline: pipeline_report.html, pipeline_report.txt and software_versions.yml. The pipeline_report* files will only be present if the --email / --email_on_fail parameters are used when running the pipeline.
    • Reformatted samplesheet files used as input to the pipeline: samplesheet.valid.csv.

Nextflow provides excellent functionality for generating various reports relevant to the running and execution of the pipeline. This will allow you to troubleshoot errors with the running of the pipeline, and also provide you with other information such as launch commands, run times and resource usage.
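As an example of using these reports for troubleshooting, the sketch below reads a Nextflow trace file and lists the tasks that did not finish successfully. It assumes the trace is tab-separated with `name` and `status` columns, which is the default Nextflow layout as far as we know; the rows and process names in `TRACE_TEXT` are made up for illustration, so check the header of your own execution_trace.txt before relying on it.

```python
import csv
import io
from collections import Counter

# Made-up trace content; a real file would be read from pipeline_info/.
TRACE_TEXT = (
    "task_id\thash\tname\tstatus\texit\tduration\n"
    "1\tab/12cd34\tEXAMPLE_STEP_A (sample_A)\tCOMPLETED\t0\t1m 2s\n"
    "2\tcd/56ef78\tEXAMPLE_STEP_B (sample_A)\tCOMPLETED\t0\t4m 10s\n"
    "3\tef/90ab12\tEXAMPLE_STEP_C (sample_B)\tFAILED\t1\t12s\n"
)


def summarize_trace(trace_text):
    """Return (status counts, names of tasks that were not completed or cached)."""
    rows = list(csv.DictReader(io.StringIO(trace_text), delimiter="\t"))
    counts = Counter(row["status"] for row in rows)
    not_ok = [row["name"] for row in rows if row["status"] not in ("COMPLETED", "CACHED")]
    return counts, not_ok


counts, not_ok = summarize_trace(TRACE_TEXT)
print(dict(counts))
print(not_ok)
```

To run it on a real trace, replace `TRACE_TEXT` with the contents of the trace file in the pipeline_info/ directory.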
