The Pathogen Annotation and Submission pipeline facilitates the running of several Python scripts, which validate metadata (QC), annotate assembled genomes, and submit to NCBI. The current implementation was tested using MPOX, but future testing will seek to make the pipeline pathogen-agnostic.
- Overview
- Table of Contents
- Pipeline Summary
- Setup
- Quickstart
- Running the Pipeline
- Profile Options & Input Files
- Outputs
- Parameters
- Helpful Links
- Acknowledgements
The validation workflow checks whether the metadata conforms to NCBI standards and matches the input fasta file. The script also splits a multi-sample xlsx file into a separate .tsv file for each sample.
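As a rough illustration of the two checks described above, the sketch below compares FASTA record names against metadata sample names and writes one .tsv per sample. It is not the pipeline's validation script: the `sample_name` column, the function names, and the assumption that the spreadsheet rows are already loaded as dictionaries are all hypothetical.

```python
import csv
from pathlib import Path


def fasta_ids(fasta_text):
    """Return the record names found on the '>' header lines of FASTA text."""
    ids = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            header = line[1:].strip()
            if header:
                # Keep only the first whitespace-delimited token of the header.
                ids.append(header.split()[0])
    return ids


def compare_samples(metadata_rows, fasta_text, name_column="sample_name"):
    """Return (missing_from_fasta, missing_from_metadata) as sorted lists."""
    meta_names = {row[name_column] for row in metadata_rows}
    seq_names = set(fasta_ids(fasta_text))
    return sorted(meta_names - seq_names), sorted(seq_names - meta_names)


def write_tsv_per_sample(metadata_rows, out_dir, name_column="sample_name"):
    """Write one tab-separated file per metadata row and return the paths."""
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)
    paths = []
    for row in metadata_rows:
        path = out_dir / f"{row[name_column]}.tsv"
        with path.open("w", newline="") as handle:
            writer = csv.DictWriter(handle, fieldnames=list(row), delimiter="\t")
            writer.writeheader()
            writer.writerow(row)
        paths.append(path)
    return paths
```

A non-empty result from `compare_samples` would be the kind of mismatch the validation step is meant to surface before annotation or submission begins.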
The liftoff workflow annotates input fasta-formatted genomes and produces accompanying gff and genbank tbl files. The inputs are the reference genome fasta, the reference gff, and your multi-sample fasta and metadata in .xlsx format. This workflow integrates the Liftoff tool, which maps annotations from a reference genome onto assembled genomes.
The submission workflow generates the necessary files for Genbank submission, generates a BioSample ID, then optionally uploads Fastq files via FTP to SRA. This workflow was adapted from the SeqSender public database submission pipeline.
The environment setup needs to occur within a terminal, or can optionally be handled by the Nextflow pipeline according to the conda block of the nextflow.config file.
- Note: With mamba and nextflow installed, when you run nextflow it will create the environment from the provided environment.yml.
- If you want to create a personalized environment, you can do so as long as its name matches the environment name given in the environment.yml file.
curl -L -O https://github.com/conda-forge/miniforge/releases/latest/download/Mambaforge-$(uname)-$(uname -m).sh
bash Mambaforge-$(uname)-$(uname -m).sh -b -p $HOME/mambaforge
export PATH="$HOME/mambaforge/bin:$PATH"
(3) Now you can create the conda environment and install the dependencies set in your environment.yml:
mamba create -n tostadas -f environment.yml
(4) After the environment is created activate the environment. Always make sure to activate the environment with each new session.
source activate tostadas
(5) To check which environment is active, run conda env list; the active environment is marked with an asterisk (*).
- First make sure your path is set correctly and you are active in your tostadas environment. Then run the following command to install nextflow with Conda:
mamba install -c bioconda nextflow
Access the link provided for help with installing nextflow
To clone the code from the repo to your local machine:
git clone https://github.com/CDCgov/tostadas.git
If the following applies to you:
- CDC user with access to the Monkeypox group on Gitlab (https://git.biotech.cdc.gov/monkeypox)
- Require access to available submission config files
Then, follow the cloning instructions outlined here: cdc_configs_access
The configs are set-up to run the default params with the test option
* Version of nextflow should be >=22.10.0
This is the default directory set in the nextflow.config file to allow for running the nextflow pipeline with the provided test input files.
(3) Change the submission_config parameter within test_params.config to the location of your personal submission config file.
(4) Run the following nextflow command to execute the scripts with default parameters and with local run environment:
nextflow run main.nf -profile test,conda
The outputs of the pipeline will appear in the "nf_test_results" folder within the project directory (update this in the standard params set for a different output path).
The typical command to run the pipeline with your custom parameters saved in standard_params.config (more information about profiles and parameter sets is given below) and the conda environment you created is as follows:
nextflow run main.nf -profile standard,conda
OR, to set individual parameters (named as in the .json/.yaml files) on the command line, use the following command:
nextflow run main.nf -profile standard,conda --<param name> <param value>
Other options for the run environment include docker and singularity. These options can be used simply by replacing the second profile option:
nextflow run main.nf -profile standard,<docker or singularity>
Either of the above commands will launch the nextflow pipeline and show the progress of each subworkflow:process and check. The output will look similar to the example below, depending on the entrypoint specified.
N E X T F L O W ~ version 22.10.0
Launching `main.nf` [festering_spence] DSL2 - revision: 3441f714f2
executor > local (7)
[e5/9dbcbc] process > VALIDATE_PARAMS [100%] 1 of 1 ✔
[53/a833be] process > CLEANUP_FILES [100%] 1 of 1 ✔
[e4/a50c97] process > with_submission:METADATA_VALIDATION (1) [100%] 1 of 1 ✔
[81/badd3b] process > with_submission:LIFTOFF (1) [100%] 1 of 1 ✔
[d7/16d16a] process > with_submission:RUN_SUBMISSION:SUBMISSION (1) [100%] 1 of 1 ✔
[3c/8c7ba4] process > with_submission:RUN_SUBMISSION:GET_WAIT_TIME (1) [100%] 1 of 1 ✔
[13/85f6f3] process > with_submission:RUN_SUBMISSION:WAIT (1) [ 0%] 0 of 1
[- ] process > with_submission:RUN_SUBMISSION:UPDATE_SUBMISSION -
USING CONDA
** NOTE: The default wait time between initial submission and updating the submitted samples is three minutes or 180 seconds per sample. To override this default calculation, you can modify the submission_wait_time parameter within your config or through the command line (in terms of seconds):
nextflow run main.nf -profile <param set>,<env> --submission_wait_time 360
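The default calculation described in the note above can be written out as a small function. This is only a sketch of the arithmetic (180 seconds per sample, replaced by an explicit value when one is given); the function name and its arguments are illustrative and not taken from the pipeline code.

```python
def submission_wait_seconds(sample_count, override=None, per_sample_seconds=180):
    """Return the wait in seconds between initial submission and the update step."""
    if sample_count < 0:
        raise ValueError("sample_count must not be negative")
    if override is not None:
        # Mirrors passing --submission_wait_time explicitly.
        return int(override)
    return per_sample_seconds * sample_count
```

For a batch of four samples the default would be 720 seconds, while passing an override of 360 corresponds to the command shown above.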
Outputs will be generated in the nf_test_results folder (if running the test parameter set) unless otherwise specified in your standard_params.config file as output_dir param.
This section walks through the available parameters to customize your workflow.
Input files | File type | Description |
---|---|---|
fasta | .fasta | Multi-sample fasta file with your input sequences |
metadata | .xlsx | Multi-sample metadata matching metadata spreadsheets provided in input_files |
ref_fasta | .fasta | Reference genome to use for the liftoff_submission branch of the pipeline |
ref_gff | .gff | Reference GFF3 file to use for the liftoff_submission branch of the pipeline |
Input files | File type | Description |
---|---|---|
fasta | .fasta | Multi-sample fasta file with your input sequences |
metadata | .xlsx | Multi-sample metadata matching metadata spreadsheets provided in input_files |
ref_fasta | .fasta | Reference genome to use for the liftoff_submission branch of the pipeline |
ref_gff | .gff | Reference GFF3 file to use for the liftoff_submission branch of the pipeline |
submission_config | .yaml | Configuration file for submitting to NCBI; sample versions can be found in the repo |
The standard_params.config file found within the conf directory is where parameters can be adjusted based on preference for running the pipeline. First you will want to ensure the file paths are correctly set for the params listed above depending on your preference for submitting your results.
- Adjust your file inputs within standard_params.config ensuring accurate file paths for the inputs listed above.
- The params can be changed within the standard_params.config or you can change the standard.yml/standard.json file inside the nf_params directory and pass it in with:
-params-file <standard_params.yml or standard_params.json>
- Note: DO NOT EDIT the main.nf file or other paths in the nextflow.config unless familiar with editing nextflow workflows
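If you prefer to generate a parameters file rather than edit one by hand, a minimal sketch along these lines may help. The parameter names come from the tables in this README, but the helper name, the output file name, and the choice to check only the four input paths are assumptions made for illustration.

```python
import json
from pathlib import Path

# Input parameters listed in this README; checking only these is an assumption.
REQUIRED_INPUTS = ("fasta_path", "ref_fasta_path", "meta_path", "ref_gff_path")


def write_params_file(params, destination):
    """Write params as JSON for use with -params-file and return the path."""
    missing = [name for name in REQUIRED_INPUTS if not params.get(name)]
    if missing:
        raise ValueError(f"missing required parameters: {', '.join(missing)}")
    destination = Path(destination)
    destination.write_text(json.dumps(params, indent=2) + "\n")
    return destination
```

The resulting file (for example a hypothetical my_params.json) could then be passed as: nextflow run main.nf -profile standard,conda -params-file my_params.json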
Within the nextflow pipeline, the -profile option is required as an input. The parameter-set profiles available with the pipeline are test and standard; both are listed in the nextflow.config file. The test params should remain the same for testing purposes, but the standard profile can be changed to fit user preferences. The second profile input selects the run environment. Nextflow expects at least one option for each of these configurations to be passed in: -profile <test/standard>,<conda/docker/singularity>
Now that your file paths are set within your standard.yml, standard.json, or standard_params.config file, you will want to define whether to run the full pipeline with or without submission. This is defined within the standard_params.config file under the subworkflow section as run_submission = true/false
- Apart from this main bifurcation, there exist entrypoints that you can use to access specific processes. More information is listed in the table below.
The submission piece of the pipeline uses processes that are directly integrated from the SeqSender public database submission pipeline. It lets the user create a config file to select which databases to upload to, and it allows for any metadata fields by using a YAML file to pair the database's metadata fields with your personal metadata columns. The requirements for this portion of the pipeline to run are listed below.
(A) Create Appropriate Accounts as needed for the SeqSender public database submission pipeline integrated into TOSTADAS:
- NCBI: If uploading to NCBI, an account is required along with a center account approved for submitting via FTP. Contact gb-admin@ncbi.nlm.nih.gov for account creation.
- GISAID: A GISAID account is required for submission to GISAID; you can register for an account at https://www.gisaid.org/. Test submissions are required before a final submission can be made. When your first test submission is complete, contact GISAID at hcov-19@gisaid.org to receive a personal CID. GISAID support is not yet implemented, but it may be added in the future.
(B) Config File Set-up:
- The template for the submission config file can be found in bin/default_config_files within the repo. This is where you can edit the various parameters you want to include in your submission.
Table of entrypoints available for the nextflow pipeline:
Workflow | Description |
---|---|
only_validate_params | Validates parameters utilizing the validate params process within the utility sub-workflow |
only_cleanup_files | Cleans-up files utilizing the clean-up process within the utility sub-workflow |
only_validation | Runs the metadata validation process only |
only_liftoff | Runs the liftoff annotation process only |
only_submission | Runs submission sub-workflow only |
only_initial_submission | Runs the initial submission process but not follow-up within the submission sub-workflow |
only_update_submission | Updates NCBI submissions |
- Documentation for using entrypoints with NF can be found at Nextflow_Entrypoints under section 5.
The following command can be used to specify entrypoints for the workflow:
nextflow run main.nf -profile <param set>,<env> -entry <insert option from table above>
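When launching the pipeline from a script, it can help to assemble and check the command before running it. The sketch below builds the argument list from the profile options and entrypoints named in this README; the function and constant names are hypothetical, and rejecting unknown values up front is a convenience of this example, not pipeline behavior.

```python
PARAM_SETS = ("test", "standard")
RUN_ENVS = ("conda", "docker", "singularity")
ENTRYPOINTS = (
    "only_validate_params",
    "only_cleanup_files",
    "only_validation",
    "only_liftoff",
    "only_submission",
    "only_initial_submission",
    "only_update_submission",
)


def build_command(param_set, run_env, entry=None, overrides=None):
    """Return the nextflow command as an argument list."""
    if param_set not in PARAM_SETS:
        raise ValueError(f"unknown parameter set: {param_set}")
    if run_env not in RUN_ENVS:
        raise ValueError(f"unknown run environment: {run_env}")
    command = ["nextflow", "run", "main.nf", "-profile", f"{param_set},{run_env}"]
    if entry is not None:
        if entry not in ENTRYPOINTS:
            raise ValueError(f"unknown entrypoint: {entry}")
        command += ["-entry", entry]
    for name, value in (overrides or {}).items():
        if isinstance(value, bool):
            # Write booleans in lowercase, as in the config examples.
            value = str(value).lower()
        command += [f"--{name}", str(value)]
    return command
```

The returned list can be handed to `subprocess.run` from the project directory, for example `subprocess.run(build_command("standard", "conda", entry="only_validation"), check=True)`.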
The following section walks through the outputs from the pipeline.
The workflow will generate outputs in the following order:
- Validation
- Responsible for QC of metadata
- Aligns sample metadata .xlsx to sample .fasta
- Formats metadata into .tsv format
- Annotation
- Extracts features from .gff
- Aligns features
- Annotates sample genomes outputting .gff
- Submission
- Formats for database submission
- This section runs twice, with the second run occurring after a wait time to allow for all samples to be uploaded to NCBI. The only_update_submission entrypoint can be run as many times as necessary until all files are fully uploaded.
The outputs are recorded in the directory specified within the nextflow.config file and will contain the following:
- validation_outputs (**name configurable with val_output_dir)
- sample_metadata_run
- errors
- tsv_per_sample
- liftoff_outputs (**name configurable with final_liftoff_output_dir)
- final_sample_metadata_file
- errors
- fasta
- liftoff
- tbl
- submission_outputs (**name and path configurable with submission_output_dir)
- individual_sample_batch_info
- biosample_sra
- genbank
- accessions.csv
- terminal_outputs
- commands_used
- liftoffCommand.txt
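
After a run, a quick check that the three top-level output folders exist can catch a misconfigured output path early. The sketch below uses the default folder names and the parameters that rename them, as listed above; the helper itself is hypothetical and deliberately does not check the nested folders.

```python
from pathlib import Path

# Default folder names from the list above, keyed by the parameter that renames them.
DEFAULT_SUBDIRS = {
    "val_output_dir": "validation_outputs",
    "final_liftoff_output_dir": "liftoff_outputs",
    "submission_output_dir": "submission_outputs",
}


def missing_output_dirs(output_dir, names=None):
    """Return the parameters whose output folder is absent under output_dir."""
    resolved = {**DEFAULT_SUBDIRS, **(names or {})}
    root = Path(output_dir)
    # An absolute folder value (allowed for submission_output_dir) replaces root here.
    return sorted(param for param, sub in resolved.items() if not (root / sub).is_dir())
```

An empty list means all three folders were found; any names returned point to the parameter whose folder is missing.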
The pipeline outputs include:
- metadata.tsv files for each sample
- separate fasta files for each sample
- separate gff files for each sample
- separate tbl files containing feature information for each sample
- submission log file
- This output is found in the submission_outputs directory within your specified output directory
- If the file cannot be found, you can run the only_update_submission entrypoint for the pipeline
Default parameters are given in the nextflow.config file. The tables below list the parameters that can be set to a value, a path, or true/false. When changing these parameters, pay attention to the required inputs and make sure that paths line up and values are within range. To change a parameter, either pass it as a flag after the nextflow command or edit it within your standard_params.config or standard.yaml file.
- Please note the correct formatting and the default calculation of submission_wait_time at the bottom of the params table.
Param | Description | Input Required |
---|---|---|
--fasta_path | Path to fasta file | Yes (path as string) |
--ref_fasta_path | Reference Sequence file path | Yes (path as string) |
--meta_path | Meta-data file path for samples | Yes (path as string) |
--ref_gff_path | Reference gff file path for annotation | Yes (path as string) |
--env_yml | Path to environment.yml file | Yes (path as string) |
Param | Description | Input Required |
---|---|---|
--scicomp | Flag for whether running on Scicomp or not | Yes (true/false as bool) |
--docker_container | Name of the Docker container | Yes, if running with docker profile (name as string) |
Param | Description | Input Required |
---|---|---|
--run_submission | Toggle for running submission | Yes (true/false as bool) |
--cleanup | Toggle for running cleanup subworkflows | Yes (true/false as bool) |
Param | Description | Input Required |
---|---|---|
--clear_nextflow_log | Clears nextflow work log | Yes (true/false as bool) |
--clear_nextflow_dir | Clears nextflow working directory | Yes (true/false as bool) |
--clear_work_dir | Param to clear work directory created during workflow | Yes (true/false as bool) |
--clear_conda_env | Clears conda environment | Yes (true/false as bool) |
--clear_nf_results | Remove results from nextflow outputs | Yes (true/false as bool) |
Param | Description | Input Required |
---|---|---|
--output_dir | File path to submit outputs from pipeline | Yes (path as string) |
--overwrite_output | Toggle for overwriting output files in the directory | Yes (true/false as bool) |
Param | Description | Input Required |
---|---|---|
--val_output_dir | File path for outputs specific to validate sub-workflow | Yes (folder name as string) |
--val_date_format_flag | Flag to change date output | Yes (-s, -o, or -v as string) |
--val_keep_pi | Flag to keep personal identifying info if provided; otherwise it will return an error | Yes (true/false as bool) |
Param | Description | Input Required |
---|---|---|
--final_liftoff_output_dir | File path to liftoff specific sub-workflow outputs | Yes (folder name as string) |
--lift_print_version_exit | Print version and exit the program | Yes (true/false as bool) |
--lift_print_help_exit | Print help and exit the program | Yes (true/false as bool) |
--lift_parallel_processes | # of parallel processes to use for liftoff | Yes (integer) |
--lift_delete_temp_files | Deletes the temporary files after finishing transfer | Yes (true/false as string) |
--lift_child_feature_align_threshold | Designate a feature mapped only if its child features (usually exons/CDS) align with sequence identity ≥S | Yes (float) |
--lift_unmapped_feature_file_name | Name of unmapped features file name | Yes (path as string) |
--lift_copy_threshold | Minimum sequence identity in exons/CDS for which a gene is considered a copy; must be greater than -s; default is 1.0 | Yes (float) |
--lift_distance_scaling_factor | Distance scaling factor; by default D=2.0 | Yes (float) |
--lift_flank | Amount of flanking sequence to align as a fraction of gene length | Yes (float between [0.0-1.0]) |
--lift_overlap | Maximum fraction of overlap allowed by 2 features | Yes (float between [0.0-1.0]) |
--lift_mismatch | Mismatch penalty in exons when finding best mapping; by default M=2 | Yes (integer) |
--lift_gap_open | Gap open penalty in exons when finding best mapping; by default GO=2 | Yes (integer) |
--lift_gap_extend | Gap extend penalty in exons when finding best mapping; by default GE=1 | Yes (integer) |
--lift_infer_transcripts | Use if annotation file only includes exon/CDS features and does not include transcripts/mRNA | Yes (True/False as string) |
--lift_copies | Look for extra gene copies in the target genome | Yes (True/False as string) |
--lift_minimap_path | Path to minimap if you did not use conda or pip | Yes (N/A or path as string) |
--lift_feature_database_name | Name of the feature database, if none, then will use ref gff path to construct one | Yes (N/A or name as string) |
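
The range constraints stated in the liftoff table can be checked before launching a run. The following is a minimal sketch covering only the constraints written above (fractions between 0.0 and 1.0, and the copy threshold exceeding the child-feature threshold); the function name is hypothetical and the pipeline's own parameter validation may differ.

```python
def check_liftoff_params(params):
    """Return a list of problems found; an empty list means none were found."""
    problems = []
    for name in ("lift_flank", "lift_overlap"):
        value = params.get(name)
        if value is not None and not 0.0 <= float(value) <= 1.0:
            problems.append(f"{name} must be between 0.0 and 1.0, got {value}")
    copy_threshold = params.get("lift_copy_threshold")
    child_threshold = params.get("lift_child_feature_align_threshold")
    if copy_threshold is not None and child_threshold is not None:
        if not float(copy_threshold) > float(child_threshold):
            problems.append(
                "lift_copy_threshold must be greater than "
                "lift_child_feature_align_threshold"
            )
    return problems
```

Parameters that are absent from the dictionary are skipped, so the check can be run on a partial set of overrides.
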
Param | Description | Input Required |
---|---|---|
--submission_output_dir | Either name or relative/absolute path for the outputs from submission | Yes (name or path as string) |
--submission_prod_or_test | Whether to submit samples for test or actual production | Yes (prod or test as string) |
--submission_only_meta | Full path directly to the dirs containing validated metadata files | Yes (path as string) |
--submission_only_gff | Full path directly to the directory with reformatted GFFs | Yes (path as string) |
--submission_only_fasta | Full path directly to the directory with split fastas for each sample | Yes (path as string) |
--submission_config | Configuration file for submission to public repos | Yes (path as string) |
--submission_wait_time | Calculated based on sample number (3 * 60 secs * sample_num) | integer (seconds) |
--batch_name | Name of the batch to prefix samples with during submission | Yes (name as string) |
--send_submission_email | Toggle email notification on/off | Yes (true/false as bool) |
--req_col_config | Path to the required_columns.yaml file | Yes (path as string) |
--processed_samples | Path to the directory containing processed samples for update only submission entrypoint (containing <batch_name>.<sample_name> dirs) | Yes (path as string) |
** Important note about send_submission_email: An email is only triggered if Genbank is being submitted to AND table2asn is the genbank_submission_type. The recipient must be specified within your submission config file under 'general' as 'notif_email_recipient'.
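The conditions in this note can be summarized as a single check. The sketch below is illustrative only: how the pipeline decides that Genbank is a submission target is not shown in this README, so it is passed in as a plain flag, and treating a missing recipient as "no email" is an assumption.

```python
def email_will_be_sent(send_submission_email, submits_to_genbank,
                       genbank_submission_type, submission_config):
    """Return True only when every condition from the note above holds."""
    if not (send_submission_email and submits_to_genbank):
        return False
    if genbank_submission_type != "table2asn":
        return False
    # The recipient is read from the 'general' section of the submission config.
    recipient = submission_config.get("general", {}).get("notif_email_recipient")
    return bool(recipient)
```

Here `submission_config` stands for the submission config file already loaded into a dictionary.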
🔗 Anaconda Install: https://docs.anaconda.com/anaconda/install/
🔗 Nextflow Documentation: https://www.nextflow.io/docs/latest/getstarted.html
🔗 SeqSender Documentation: https://github.com/CDCgov/seqsender
🔗 Liftoff Documentation: https://github.com/agshumate/Liftoff
🔗 VADR Documentation: https://github.com/ncbi/vadr.git
🔗 table2asn Documentation: https://github.com/svn2github/NCBI_toolkit/blob/master/src/app/table2asn/table2asn.cpp
Michael Desch | Ethan Hetrick | Nick Johnson | Kristen Knipe | Shatavia Morrison
Yuanyuan Wang | Michael Weigand | Dhwani Batra | Jason Caravas | Ankush Gupta
Kyle O'Connell | Yesh Kulasekarapandian | Cole Tindall | Lynsey Kovar | Hunter Seabolt
Crystal Gigante | Christina Hutson | Brent Jenkins | Yu Li | Ana Litvintseva
Matt Mauldin | Dakota Howard | Ben Rambo-Martin | James Heuser | Justin Lee | Mili Sheth