This Nextflow workflow is a compilation of several subworkflows for different stages of genome annotation, where the overall genome annotation process is:
```mermaid
graph TD
  preprocessing[Annotation Preprocessing] --> evidenceAlignment[Evidence alignment]
  transcriptAssembly[Transcript Assembly] --> evidenceAlignment
  evidenceAlignment --> evidenceMaker[Evidence-based Maker]
  denovoRepeatLibrary[De novo Repeat Library] ---> evidenceMaker
  transcriptAssembly --> pasa[PASA]
  preprocessing --> pasa
  pasa --> evidenceMaker
  evidenceMaker --> abinitioTraining[Abinitio Training]
  abinitioTraining --> abinitioMaker[Abinitio-based Maker]
  evidenceMaker --> abinitioMaker
  pasa --> functionalAnnotation[Functional Annotation]
  abinitioMaker --> functionalAnnotation
  functionalAnnotation --> EMBLmyGFF3
```
The subworkflow is selected using the `subworkflow` parameter.
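For example, a subworkflow can be selected directly on the command line. The invocation below is illustrative only, using the bundled test profile described later:

```bash
nextflow run NBISweden/pipelines-nextflow -profile singularity,test --subworkflow annotation_preprocessing
```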
If you use these pipelines in your work, please acknowledge NBIS within your communication according to this example: "Support by NBIS (National Bioinformatics Infrastructure Sweden) is gratefully acknowledged."
These workflows were based on the Bpipe workflows written by Marc Höppner (@marchoeppner) and Jacques Dainat (@Juke34).
Thank you to everyone who contributes to this project.
- Mahesh Binzer-Panchal (@mahesh-panchal)
  - Expertise: Nextflow workflow development
- Jacques Dainat (@Juke34)
  - Expertise: Genome annotation, Nextflow workflow development
- Lucile Soler (@LucileSol)
  - Expertise: Genome annotation
Requirements:
- Nextflow
- A container platform (recommended) such as Singularity or Docker, or the conda/mamba package manager if a container platform is not available. If containers or conda/mamba are unavailable, then tool dependencies must be accessible from your `PATH`.
Install Nextflow directly:
```bash
curl -s https://get.nextflow.io | bash
mv ./nextflow ~/bin
```
Alternatively, installation can be managed with conda (or mamba) in its own conda environment:
```bash
conda create -c conda-forge -c bioconda -n nextflow-env nextflow
conda activate nextflow-env
```
See Nextflow: Get started - installation for further details.
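After installation, you can confirm that Nextflow is found on your `PATH`:

```bash
nextflow -version
```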
A workflow is run in the following way:
```bash
nextflow run NBISweden/pipelines-nextflow \
  [-profile <profile_name1>[,<profile_name2>,...] ] \
  [-c workflow.config ] \
  [-resume] \
  -params-file workflow_parameters.yml
```
where `-profile` selects from the predefined profiles (see the list of profiles below), and `-c workflow.config` loads a custom configuration for altering existing process settings (defined in `nextflow.config`, which is loaded by default), such as the number of cpus, time allocation, memory, output prefixes, and tool command-line options. The `-params-file` is a YAML-formatted file listing workflow parameters, e.g.:
```yaml
subworkflow: 'annotation_preprocessing'
genome: '/path/to/genome'
busco_lineage:
  - 'eukaryota_odb10'
  - 'bacteria_odb10'
outdir: '/path/to/save/results'
```
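A custom configuration supplied with `-c` could look like the following sketch. The process name, resource values, and the `ext.args` override are illustrative assumptions (see the cluster-specific notes below for recommended settings):

```nextflow
// workflow.config - illustrative process overrides
process {
    withName: 'BLAST_BLASTN' {
        cpus   = 12
        time   = 2.d
        memory = 16.GB
        // Passing extra tool command-line options via ext.args is an
        // assumption based on nf-core-style modules; check the module
        // configuration of the subworkflow you run.
        ext.args = '-evalue 1e-5'
    }
}
```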
Note
If running on a compute cluster infrastructure, `nextflow` must be able to communicate with the workload manager at all times, otherwise tasks will be cancelled. The best way to do this is to run `nextflow` using a `screen` or `tmux` terminal.
E.g. Screen:

```bash
# Open a named screen terminal session
screen -S my_nextflow_run
# load nextflow with conda
conda activate nextflow-env
# run nextflow
nextflow run -c <config> -profile <profile> <nextflow_script>
# "Detach" screen terminal
<ctrl + a> <ctrl + d>
# list screen sessions
screen -ls
# "Attach" screen session
screen -r my_nextflow_run
```
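A minimal tmux equivalent of the above (the session name is just an example):

```bash
# Open a named tmux session
tmux new -s my_nextflow_run
# run nextflow as in the screen example, then "detach" the session
<ctrl + b> d
# list tmux sessions
tmux ls
# "Attach" the session again
tmux attach -t my_nextflow_run
```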
- uppmax: A profile for the Uppmax clusters. Tasks are submitted to the SLURM workload manager, executed within Singularity (unless otherwise noted), and use the `$SNIC_TMP` scratch space. Note: The workflow parameter `project` is mandatory when using the Uppmax clusters (see the example after this list).
- conda: A general purpose profile that uses conda to manage software dependencies.
- mamba: A general purpose profile that uses mamba to manage software dependencies.
- docker: A general purpose profile that uses docker to manage software dependencies.
- singularity: A general purpose profile that uses singularity to manage software dependencies.
- nbis: A profile for the NBIS annotation cluster. Tasks are submitted to the SLURM workload manager, and use the disk space `/scratch` for task execution. Software should be managed using one of the general purpose profiles above.
- gitpod: A profile to set local executor settings in the Gitpod environment.
- test: A profile supplying test data to check if the workflows run on your system.
- pipeline_report: Adds a folder to the `outdir` which includes workflow execution reports.
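For example, on Uppmax a run could be launched as below, where the project code is a placeholder for your own compute allocation:

```bash
nextflow run NBISweden/pipelines-nextflow \
  -profile uppmax \
  -params-file workflow_parameters.yml \
  --project snic20XX-YY-ZZZ
```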
Note
Nextflow is enabled using the module system on Uppmax.
```bash
module load bioinfo-tools Nextflow
```

The following configuration in your `workflow.config` is recommended when running workflows on Uppmax.

```nextflow
// Set your work directory to a folder in your project directory under nobackup
workDir = '/proj/<snic_storage_project>/nobackup/work'

// Restart workflows from last successful execution (i.e. use cached results where possible).
resume = true

// Add any overriding process directives here, e.g.,
process {
    withName: 'BLAST_BLASTN' {
        cpus = 12
        time = 2.d
    }
}
```
Note
Both singularity and conda are installed; however, singularity is preferred for speed and reproducibility.

```bash
module load Singularity
```

The following configuration in your `workflow.config` is recommended when running workflows on the annotation cluster.

```nextflow
// Set your work directory to a folder on the /active partition
workDir = '/active/<project_id>/nobackup/work'

// Restart workflows from last successful execution (i.e. use cached results where possible).
resume = true

// Add any overriding process directives here, e.g.,
process {
    withName: 'BLAST_BLASTN' {
        cpus = 12
        time = 2.d
    }
}

// Use a shared cache folder for singularity images
singularity.cacheDir = '/active/nxf_singularity_cachedir'

// If using conda, use a shared cache for conda environments
conda.cacheDir = '/active/nxf_conda_cachedir'

// Use mamba for speed over conda
conda.useMamba = true
```

Project results should be published to `/projects`, work directories should be on `/active`, while computations are performed on the local `/scratch` partitions.