diff --git a/.github/CONTRIBUTING.md b/.github/CONTRIBUTING.md index f19789aa38..b53404407d 100644 --- a/.github/CONTRIBUTING.md +++ b/.github/CONTRIBUTING.md @@ -18,8 +18,9 @@ If you'd like to write some code for nf-core/sarek, the standard workflow is as 1. Check that there isn't already an issue about your idea in the [nf-core/sarek issues](https://github.com/nf-core/sarek/issues) to avoid duplicating work * If there isn't one already, please create one so that others know you're working on this 2. [Fork](https://help.github.com/en/github/getting-started-with-github/fork-a-repo) the [nf-core/sarek repository](https://github.com/nf-core/sarek) to your GitHub account -3. Make the necessary changes / additions within your forked repository -4. Submit a Pull Request against the `dev` branch and wait for the code to be reviewed and merged +3. Make the necessary changes / additions within your forked repository following [Pipeline conventions](#pipeline-contribution-conventions) +4. Use `nf-core schema build .` and add any new parameters to the pipeline JSON schema (requires [nf-core tools](https://github.com/nf-core/tools) >= 1.10). +5. Submit a Pull Request against the `dev` branch and wait for the code to be reviewed and merged If you're not used to this workflow with git, you can start with some [docs from GitHub](https://help.github.com/en/github/collaborating-with-issues-and-pull-requests) or even their [excellent `git` resources](https://try.github.io/). @@ -30,14 +31,14 @@ Typically, pull-requests are only fully reviewed when these tests are passing, t There are typically two types of tests that run: -### Lint Tests +### Lint tests `nf-core` has a [set of guidelines](https://nf-co.re/developers/guidelines) which all pipelines must adhere to. To enforce these and ensure that all pipelines stay in sync, we have developed a helper tool which runs checks on the pipeline code. This is in the [nf-core/tools repository](https://github.com/nf-core/tools) and once installed can be run locally with the `nf-core lint ` command. If any failures or warnings are encountered, please follow the listed URL for more documentation. -### Pipeline Tests +### Pipeline tests Each `nf-core` pipeline should be set up with a minimal set of test-data. `GitHub Actions` then runs the pipeline on this data to ensure that it exits successfully. @@ -54,4 +55,74 @@ These tests are run both with the latest available version of `Nextflow` and als ## Getting help -For further information/help, please consult the [nf-core/sarek documentation](https://nf-co.re/sarek/docs) and don't hesitate to get in touch on the nf-core Slack [#sarek](https://nfcore.slack.com/channels/sarek) channel ([join our Slack here](https://nf-co.re/join/slack)). +For further information/help, please consult the [nf-core/sarek documentation](https://nf-co.re/sarek/usage) and don't hesitate to get in touch on the nf-core Slack [#sarek](https://nfcore.slack.com/channels/sarek) channel ([join our Slack here](https://nf-co.re/join/slack)). + +## Pipeline contribution conventions + +To make the nf-core/sarek code and processing logic more understandable for new contributors and to ensure quality, we semi-standardise the way the code and other contributions are written. + +### Adding a new step + +If you wish to contribute a new step, please use the following coding standards: + +1. Define the corresponding input channel into your new process from the expected previous process channel +2. Write the process block (see below). +3. 
Define the output channel if needed (see below).
4. Add any new flags/options to `nextflow.config` with a default (see below).
5. Add any new flags/options to `nextflow_schema.json` with help text (with `nf-core schema build .`)
6. Add any new flags/options to the help message (for integer/text parameters, print to help the corresponding `nextflow.config` parameter).
7. Add sanity checks for all relevant parameters.
8. Add any new software to the `scrape_software_versions.py` script in `bin/` and the version command to the `scrape_software_versions` process in `main.nf`.
9. Test locally that the new code works properly and as expected.
10. Add a new test command in `.github/workflows/ci.yml`.
11. If applicable, add a [MultiQC](https://multiqc.info/) module.
12. Update the MultiQC config `assets/multiqc_config.yaml` so relevant suffixes, name clean-up, General Statistics Table column order, and module figures are in the right order.
13. Optional: Add any descriptions of MultiQC report sections and output files to `docs/output.md`.

### Default values

Parameters should be initialised / defined with default values in `nextflow.config` under the `params` scope.

Once there, use `nf-core schema build .` to add to `nextflow_schema.json`.

### Default processes resource requirements

Sensible defaults for process resource requirements (CPUs / memory / time) for a process should be defined in `conf/base.config`. These should generally be specified generically with `withLabel:` selectors so they can be shared across multiple processes/steps of the pipeline. An nf-core standard set of labels that should be followed where possible can be seen in the [nf-core pipeline template](https://github.com/nf-core/tools/blob/master/nf_core/pipeline-template/%7B%7Bcookiecutter.name_noslash%7D%7D/conf/base.config), which has the default process as a single-core process, and then different levels of multi-core configurations for increasingly large memory requirements defined with standardised labels.

The process resources can be passed on to the tool dynamically within the process with the `${task.cpus}` and `${task.memory}` variables in the `script:` block.

### Naming schemes

Please use the following naming schemes, to make it easy to understand what is going where.

* initial process channel: `ch_output_from_<process>`
* intermediate and terminal channels: `ch_<previousprocess>_for_<nextprocess>`

### Nextflow version bumping

If you are using a new feature from core Nextflow, you may bump the minimum required version of Nextflow in the pipeline with: `nf-core bump-version --nextflow . [min-nf-version]`

### Software version reporting

If you add a new tool to the pipeline, please ensure you add the information of the tool to the `get_software_versions` process.

Add to the script block of the process something like the following (where `<YOUR_TOOL>` is a placeholder for the tool's command):

```bash
<YOUR_TOOL> --version &> v_<YOUR_TOOL>.txt 2>&1 || true
```

or

```bash
<YOUR_TOOL> --help | head -n 1 &> v_<YOUR_TOOL>.txt 2>&1 || true
```

You then need to edit the script `bin/scrape_software_versions.py` to:

1. Add a Python regex for your tool's `--version` output (as stored in the `v_<YOUR_TOOL>.txt` file), to ensure the version is reported as a `v` followed by the version number, e.g. `v2.1.1`
2. Add an HTML entry to the `OrderedDict` for formatting in MultiQC.

### Images and figures

For overview images and other documents we follow the nf-core [style guidelines and examples](https://nf-co.re/developers/design_guidelines).
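As an illustration of the resource conventions described under "Default processes resource requirements" above, here is a minimal sketch of how a labelled process picks up shared defaults from `conf/base.config` and forwards them to the tool. The `process_medium` values follow the nf-core template; the `EXAMPLE_TOOL` process and its command line are hypothetical, not part of this diff.

```nextflow
// conf/base.config -- shared defaults selected by label, not by process name
process {
    withLabel:process_medium {
        cpus   = 6
        memory = 42.GB
        time   = 8.h
    }
}
```

```nextflow
// main.nf -- a hypothetical process forwarding its allocation to the tool
process EXAMPLE_TOOL {
    label 'process_medium'

    script:
    """
    example_tool --threads ${task.cpus} --max-mem ${task.memory.toGiga()}g
    """
}
```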
diff --git a/.github/ISSUE_TEMPLATE/bug_report.md b/.github/ISSUE_TEMPLATE/bug_report.md
index 0d16230bc1..cfbeb5b758 100644
--- a/.github/ISSUE_TEMPLATE/bug_report.md
+++ b/.github/ISSUE_TEMPLATE/bug_report.md
@@ -1,6 +1,6 @@
---
-name: nf-core/sarek Bug report
-about: Create a report to help us improve
+name: Bug report
+about: Report something that is broken or incorrect
title: "[BUG]"
labels: bug
assignees: MaxUlysse
@@ -8,13 +8,21 @@ assignees: MaxUlysse
---
+## Check Documentation
+
+I have checked the following places for my error:
+
+- [ ] [nf-core website: troubleshooting](https://nf-co.re/usage/troubleshooting)
+- [ ] [nf-core/sarek pipeline documentation](https://nf-co.re/sarek/usage)
+
## Description of the bug
@@ -30,6 +38,13 @@ Steps to reproduce the behaviour:
+## Log files
+
+Have you provided the following extra information/files:
+
+- [ ] The command used to run the pipeline
+- [ ] The `.nextflow.log` file
+
## System
- Hardware:
@@ -43,7 +58,7 @@ Steps to reproduce the behaviour:
## Container engine
-- Engine:
+- Engine:
- version:
- Image tag:
diff --git a/.github/ISSUE_TEMPLATE/config.yml b/.github/ISSUE_TEMPLATE/config.yml
new file mode 100644
index 0000000000..f390abc544
--- /dev/null
+++ b/.github/ISSUE_TEMPLATE/config.yml
@@ -0,0 +1,8 @@
+blank_issues_enabled: false
+contact_links:
+  - name: Join nf-core
+    url: https://nf-co.re/join
+    about: Please join the nf-core community here
+  - name: "Slack #sarek channel"
+    url: https://nfcore.slack.com/channels/sarek
+    about: Discussion about the nf-core/sarek pipeline
diff --git a/.github/ISSUE_TEMPLATE/feature_request.md b/.github/ISSUE_TEMPLATE/feature_request.md
index 8dfa1140dc..cdbe339c4d 100644
--- a/.github/ISSUE_TEMPLATE/feature_request.md
+++ b/.github/ISSUE_TEMPLATE/feature_request.md
@@ -1,6 +1,6 @@
---
-name: nf-core/sarek Feature request
-about: Create a report to help us improve
+name: Feature request
+about: Suggest an idea for the nf-core/sarek pipeline
title: "[FEATURE]"
labels: enhancement
assignees: MaxUlysse
@@ -8,6 +8,8 @@ assignees: MaxUlysse
---
diff --git a/docs/output.md b/docs/output.md
--- a/docs/output.md
+++ b/docs/output.md
-This document describes the output produced by the pipeline.
+## :warning: Please read this documentation on the nf-core website: [https://nf-co.re/sarek/output](https://nf-co.re/sarek/output)
-The directories listed below will be created in the results directory after the pipeline has finished.
-All paths are relative to the top-level results directory.
+> _Documentation of pipeline parameters is generated automatically from the pipeline schema and can no longer be found in markdown files._
+
+## Introduction
+
+This document describes the output produced by the pipeline. Most of the plots are taken from the MultiQC report, which summarises results at the end of the pipeline.
+
+The directories listed below will be created in the results directory after the pipeline has finished. All paths are relative to the top-level results directory.
## Pipeline overview -The pipeline is built using [Nextflow](https://www.nextflow.io/) and processes data using the following steps: +The pipeline is built using [Nextflow](https://www.nextflow.io/) +and processes data using the following steps: - [Preprocessing](#preprocessing) - [Map to Reference](#map-to-reference) diff --git a/docs/usage.md b/docs/usage.md index 89fdbf5de4..1fb93ca060 100644 --- a/docs/usage.md +++ b/docs/usage.md @@ -1,148 +1,14 @@ -# nf-core/sarek: Usage - -- [Running the pipeline](#running-the-pipeline) - - [Updating the pipeline](#updating-the-pipeline) - - [Reproducibility](#reproducibility) -- [Core Nextflow arguments](#core-nextflow-arguments) - - [-profile](#-profile) - - [-resume](#-resume) - - [-c](#-c) - - [Custom resource requests](#custom-resource-requests) - - [Running in the background](#running-in-the-background) - - [Nextflow memory requirements](#nextflow-memory-requirements) -- [Pipeline specific arguments](#pipeline-specific-arguments) - - [--step](#--step) - - [--input](#--input) - - [--input <FASTQ> --step mapping](#--input-fastq---step-mapping) - - [--input <uBAM> --step mapping](#--input-ubam---step-mapping) - - [--input <sample/> --step mapping](#--input-sample---step-mapping) - - [--input <TSV> --step prepare_recalibration](#--input-tsv---step-prepare_recalibration) - - [--input <TSV> --step prepare_recalibration --skip_markduplicates](#--input-tsv---step-prepare_recalibration---skip_markduplicates) - - [--input <TSV> --step recalibrate](#--input-tsv---step-recalibrate) - - [--input <TSV> --step recalibrate --skip_markduplicates](#--input-tsv---step-recalibrate---skip_markduplicates) - - [--input <TSV> --step variant_calling](#--input-tsv---step-variant_calling) - - [--input <TSV> --step Control-FREEC](#--input-tsv---step-control-freec) - - [--input <VCF> --step annotate](#--input-vcf---step-annotate) - - [--help](#--help) - - [--no_intervals](#--no_intervals) - - [--nucleotides_per_second](#--nucleotides_per_second) - - [--sentieon](#--sentieon) - - [Alignment](#alignment) - - [Germline SNV/INDEL Variant Calling - DNAseq](#germline-snvindel-variant-calling---dnaseq) - - [Germline SNV/INDEL Variant Calling - DNAscope](#germline-snvindel-variant-calling---dnascope) - - [Somatic SNV/INDEL Variant Calling - TNscope](#somatic-snvindel-variant-calling---tnscope) - - [Structural Variant Calling](#structural-variant-calling) - - [--skip_qc](#--skip_qc) - - [--target_bed](#--target_bed) - - [--tools for Variant Calling](#--tools-for-variant-calling) - - [Germline variant calling](#germline-variant-calling) - - [Somatic variant calling with tumor - normal pairs](#somatic-variant-calling-with-tumor---normal-pairs) - - [Somatic variant calling with tumor only samples](#somatic-variant-calling-with-tumor-only-samples) - - [--tools --sentieon](#--tools---sentieon) - - [--tools for Annotation](#--tools-for-annotation) - - [Annotation tools](#annotation-tools) - - [Using genome specific containers](#using-genome-specific-containers) - - [Download cache](#download-cache) - - [Using downloaded cache](#using-downloaded-cache) - - [Using VEP CADD plugin](#using-vep-cadd-plugin) - - [Downloading CADD files](#downloading-cadd-files) - - [Using VEP GeneSplicer plugin](#using-vep-genesplicer-plugin) -- [Modify fastqs (trim/split)](#modify-fastqs-trimsplit) - - [--trim_fastq](#--trim_fastq) - - [--clip_r1](#--clip_r1) - - [--clip_r2](#--clip_r2) - - [--three_prime_clip_r1](#--three_prime_clip_r1) - - [--three_prime_clip_r2](#--three_prime_clip_r2) - - 
[--trim_nextseq](#--trim_nextseq) - - [--save_trimmed](#--save_trimmed) - - [--split_fastq](#--split_fastq) -- [Preprocessing](#preprocessing) - - [--aligner](#--aligner) - - [--markdup_java_options](#--markdup_java_options) - - [--no_gatk_spark](#--no_gatk_spark) - - [--save_bam_mapped](#--save_bam_mapped) - - [--skip_markduplicates](#--skip_markduplicates) -- [Variant Calling](#variant-calling) - - [--ascat_ploidy](#--ascat_ploidy) - - [--ascat_purity](#--ascat_purity) - - [--cf_coeff](#--cf_coeff) - - [--cf_ploidy](#--cf_ploidy) - - [--cf_window](#--cf_window) - - [--no_gvcf](#--no_gvcf) - - [--no_strelka_bp](#--no_strelka_bp) - - [--pon](#--pon) - - [--pon_index](#--pon_index) - - [--ignore_soft_clipped_bases](#--ignore_soft_clipped_bases) - - [--umi](#--umi) - - [--read_structure1](#--read_structure1) - - [--read_structure2](#--read_structure2) -- [Annotation](#annotation) - - [--annotate_tools](#--annotate_tools) - - [--annotation_cache](#--annotation_cache) - - [--snpeff_cache](#--snpeff_cache) - - [--vep_cache](#--vep_cache) - - [--cadd_cache](#--cadd_cache) - - [--cadd_indels](#--cadd_indels) - - [--cadd_indels_tbi](#--cadd_indels_tbi) - - [--cadd_wg_snvs](#--cadd_wg_snvs) - - [--cadd_wg_snvs_tbi](#--cadd_wg_snvs_tbi) - - [--genesplicer](#--genesplicer) -- [Reference genomes](#reference-genomes) - - [--genome (using iGenomes)](#--genome-using-igenomes) - - [--igenomes_base](#--igenomes_base) - - [--igenomes_ignore](#--igenomes_ignore) - - [--genomes_base](#--genomes_base) - - [--save_reference](#--save_reference) - - [--ac_loci](#--ac_loci) - - [--ac_loci_gc](#--ac_loci_gc) - - [--bwa](#--bwa) - - [--chr_dir](#--chr_dir) - - [--chr_length](#--chr_length) - - [--dbsnp](#--dbsnp) - - [--dbsnp_index](#--dbsnp_index) - - [--dict](#--dict) - - [--fasta](#--fasta) - - [--fasta_fai](#--fasta_fai) - - [--germline_resource](#--germline_resource) - - [--germline_resource_index](#--germline_resource_index) - - [--intervals](#--intervals) - - [--known_indels](#--known_indels) - - [--known_indels_index](#--known_indels_index) - - [--mappability](#--mappability) - - [--snpeff_db](#--snpeff_db) - - [--species](#--species) - - [--vep_cache_version](#--vep_cache_version) -- [Other command line parameters](#other-command-line-parameters) - - [--outdir](#--outdir) - - [--publish_dir_mode](#--publish_dir_mode) - - [--sequencing_center](#--sequencing_center) - - [--multiqc_config](#--multiqc_config) - - [--monochrome_logs](#--monochrome_logs) - - [--email](#--email) - - [--email_on_fail](#--email_on_fail) - - [--plaintext_email](#--plaintext_email) - - [--max_multiqc_email_size](#--max_multiqc_email_size) - - [-name](#-name) - - [--custom_config_version](#--custom_config_version) - - [--custom_config_base](#--custom_config_base) -- [Job resources](#job-resources) - - [Automatic resubmission](#automatic-resubmission) - - [--max_memory](#--max_memory) - - [--max_time](#--max_time) - - [--max_cpus](#--max_cpus) - - [--single_cpu_mem](#--single_cpu_mem) -- [Containers](#containers) - - [Building your owns](#building-your-owns) - - [Build with Conda](#build-with-conda) - - [Build with Docker](#build-with-docker) - - [Pull with Docker](#pull-with-docker) - - [Pull with Singularity](#pull-with-singularity) -- [AWSBatch specific parameters](#awsbatch-specific-parameters) - - [--awsqueue](#--awsqueue) - - [--awsregion](#--awsregion) - - [--awscli](#--awscli) -- [Troubleshooting](#troubleshooting) - - [Spark related issues](#spark) +# nf-core/sarek: Usage + +## :warning: Please read this documentation on the 
nf-core website: [https://nf-co.re/sarek/usage](https://nf-co.re/sarek/usage)
+
+> _Documentation of pipeline parameters is generated automatically from the pipeline schema and can no longer be found in markdown files._
+
+## Introduction
+
+Sarek is a workflow designed to detect variants in whole genome or targeted sequencing data.
+Initially designed for human and mouse, it can work on any species with a reference genome.
+Sarek can also handle tumor/normal pairs, and can include additional relapse samples.

## Running the pipeline

@@ -152,8 +18,7 @@

The typical command for running the pipeline is as follows:

nextflow run nf-core/sarek --input -profile docker
```

-This will launch the pipeline with the `docker` configuration profile.
-See below for more information about profiles.
+This will launch the pipeline with the `docker` configuration profile. See below for more information about profiles.

Note that the pipeline will create the following files in your working directory:

@@ -164,15 +29,9 @@

results # Finished results (configurable, see below)
# Other nextflow hidden files, eg. history of pipeline runs and old logs.
```

-The nf-core/sarek pipeline comes with more documentation about running the pipeline, found in the `docs/` directory:
-
-- [Output and how to interpret the results](output.md)
-
### Updating the pipeline

-When you run the above command, `Nextflow` automatically pulls the pipeline code from `GitHub` and stores it as a cached version.
-When running the pipeline after this, it will always use the cached version if available - even if the pipeline has been updated since.
-To make sure that you're running the latest version of the pipeline, make sure that you regularly update the cached version of the pipeline:
+When you run the above command, Nextflow automatically pulls the pipeline code from GitHub and stores it as a cached version. When running the pipeline after this, it will always use the cached version if available - even if the pipeline has been updated since. To make sure that you're running the latest version of the pipeline, make sure that you regularly update the cached version of the pipeline:

```bash
nextflow pull nf-core/sarek
@@ -180,9 +39,7 @@ nextflow pull nf-core/sarek

### Reproducibility

-It's a good idea to specify a pipeline version when running the pipeline on your data.
-This ensures that a specific version of the pipeline code and software are used when you run your pipeline.
-If you keep using the same tag, you'll be running the same version of the pipeline, even if there have been changes to the code since.
+It's a good idea to specify a pipeline version when running the pipeline on your data. This ensures that a specific version of the pipeline code and software are used when you run your pipeline. If you keep using the same tag, you'll be running the same version of the pipeline, even if there have been changes to the code since.

First, go to the [nf-core/sarek releases page](https://github.com/nf-core/sarek/releases) and find the latest version number - numeric only (eg. `2.6.1`). Then specify this when running the pipeline with `-r` (one hyphen) - eg. `-r 2.6.1`.

@@ -191,86 +48,79 @@ This version number will be logged in reports when you run the pipeline, so that

## Core Nextflow arguments

-> **NB:** These options are part of `Nextflow` and use a _single_ hyphen (pipeline parameters use a double-hyphen).
+> **NB:** These options are part of Nextflow and use a _single_ hyphen (pipeline parameters use a double-hyphen).
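For instance, combining the release pin and a container profile from above into one fully reproducible invocation looks like the following sketch (the input TSV path is a placeholder):

```bash
nextflow run nf-core/sarek -r 2.6.1 -profile docker --input ./samples.tsv
```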
-### -profile +### `-profile` -Use this parameter to choose a configuration profile. -Profiles can give configuration presets for different compute environments. +Use this parameter to choose a configuration profile. Profiles can give configuration presets for different compute environments. -Several generic profiles are bundled with the pipeline which instruct the pipeline to use software packaged using different methods (`Docker`, `Singularity`, `Conda`) - see below. +Several generic profiles are bundled with the pipeline which instruct the pipeline to use software packaged using different methods (Docker, Singularity, Podman, Conda) - see below. -> We highly recommend the use of `Docker` or `Singularity` containers for full pipeline reproducibility, however when this is not possible, `Conda` is also supported. +> We highly recommend the use of Docker or Singularity containers for full pipeline reproducibility, however when this is not possible, Conda is also supported. -The pipeline also dynamically loads configurations from [github.com/nf-core/configs](https://github.com/nf-core/configs) when it runs, making multiple config profiles for various institutional clusters available at run time. -For more information and to see if your system is available in these configs please see the [nf-core/configs documentation](https://github.com/nf-core/configs#documentation). +The pipeline also dynamically loads configurations from [https://github.com/nf-core/configs](https://github.com/nf-core/configs) when it runs, making multiple config profiles for various institutional clusters available at run time. For more information and to see if your system is available in these configs please see the [nf-core/configs documentation](https://github.com/nf-core/configs#documentation). Note that multiple profiles can be loaded, for example: `-profile test,docker` - the order of arguments is important! They are loaded in sequence, so later profiles can overwrite earlier profiles. -If `-profile` is not specified, the pipeline will run locally and expect all software to be installed and available on the `PATH`. -This is _not_ recommended. - -- `docker` - - A generic configuration profile to be used with [Docker](http://docker.com/) - - Pulls software from DockerHub: [`nfcore/sarek`](http://hub.docker.com/r/nfcore/sarek/) -- `singularity` - - A generic configuration profile to be used with [Singularity](https://sylabs.io/docs/) - - Pulls software from DockerHub: [`nfcore/sarek`](http://hub.docker.com/r/nfcore/sarek/) -- `conda` - - Please only use `Conda` as a last resort i.e. when it's not possible to run the pipeline with `Docker` or `Singularity`. 
- - A generic configuration profile to be used with [conda](https://conda.io/docs/) - - Pulls most software from [Bioconda](https://bioconda.github.io/) -- `test` - - A profile with a complete configuration for automated testing - - Includes links to test data so needs no other parameters -- `test_annotation` - - A profile with a complete configuration for automated testing - - Input data is a `VCF` for testing annotation -- `test_no_gatk_spark` - - A profile with a complete configuration for automated testing - - Specify `--no_gatk_spark` -- `test_split_fastq` - - A profile with a complete configuration for automated testing - - Specify `--split_fastq 500` -- `test_targeted` - - A profile with a complete configuration for automated testing - - Include link to a target `BED` file and use `Manta` and `Strelka` for Variant Calling -- `test_tool` - - A profile with a complete configuration for automated testing - - Test directly Variant Calling with a specific TSV file and `--step variantcalling` -- `test_trimming` - - A profile with a complete configuration for automated testing - - Test trimming options -- `test_umi_qiaseq` - - A profile with a complete configuration for automated testing - - Test a specific `UMI` structure -- `test_umi_tso` - - A profile with a complete configuration for automated testing - - Test a specific `UMI` structure - -### -resume - -Specify this when restarting a pipeline. -`Nextflow` will used cached results from any pipeline steps where the inputs are the same, continuing from where it got to previously. - -You can also supply a run name or a session ID to resume a specific run: `-resume [run-name/session id]`. -Use the `nextflow log` command to show previous run names and session IDs. - -### -c - -Specify the path to a specific config file (this is a core `Nextflow` command). -See the [nf-core website documentation](https://nf-co.re/usage/configuration) for more information. +If `-profile` is not specified, the pipeline will run locally and expect all software to be installed and available on the `PATH`. This is _not_ recommended. + +* `docker` + * A generic configuration profile to be used with [Docker](https://docker.com/) + * Pulls software from Docker Hub: [`nfcore/sarek`](http://hub.docker.com/r/nfcore/sarek/) +* `singularity` + * A generic configuration profile to be used with [Singularity](https://sylabs.io/docs/) + * Pulls software from Docker Hub: [`nfcore/sarek`](http://hub.docker.com/r/nfcore/sarek/) +* `podman` + * A generic configuration profile to be used with [Podman](https://podman.io/) + * Pulls software from Docker Hub: [`nfcore/sarek`](http://hub.docker.com/r/nfcore/sarek/) +* `conda` + * Please only use Conda as a last resort i.e. when it's not possible to run the pipeline with Docker, Singularity or Podman. 
+ * A generic configuration profile to be used with [Conda](https://conda.io/docs/)
+ * Pulls most software from [Bioconda](https://bioconda.github.io/)
+* `test`
+ * A profile with a complete configuration for automated testing
+ * Includes links to test data so needs no other parameters
+* `test_annotation`
+ * A profile with a complete configuration for automated testing
+ * Input data is a `VCF` for testing annotation
+* `test_no_gatk_spark`
+ * A profile with a complete configuration for automated testing
+ * Specify `--no_gatk_spark`
+* `test_split_fastq`
+ * A profile with a complete configuration for automated testing
+ * Specify `--split_fastq 500`
+* `test_targeted`
+ * A profile with a complete configuration for automated testing
+ * Include link to a target `BED` file and use `Manta` and `Strelka` for Variant Calling
+* `test_tool`
+ * A profile with a complete configuration for automated testing
+ * Test directly Variant Calling with a specific TSV file and `--step variantcalling`
+* `test_trimming`
+ * A profile with a complete configuration for automated testing
+ * Test trimming options
+* `test_umi_qiaseq`
+ * A profile with a complete configuration for automated testing
+ * Test a specific `UMI` structure
+* `test_umi_tso`
+ * A profile with a complete configuration for automated testing
+ * Test a specific `UMI` structure
+
+### `-resume`
+
+Specify this when restarting a pipeline. Nextflow will use cached results from any pipeline steps where the inputs are the same, continuing from where it got to previously.
+
+You can also supply a run name to resume a specific run: `-resume [run-name]`. Use the `nextflow log` command to show previous run names.
+
+### `-c`
+
+Specify the path to a specific config file (this is a core Nextflow command). See the [nf-core website documentation](https://nf-co.re/usage/configuration) for more information.

#### Custom resource requests

-Each step in the pipeline has a default set of requirements for number of CPUs, memory and time.
-For most of the steps in the pipeline, if the job exits with an error code of `143` (exceeded requested resources), it will automatically resubmit with higher requests (2 x original, then 3 x original).
-If it still fails after three times then the pipeline is stopped.
+Each step in the pipeline has a default set of requirements for number of CPUs, memory and time. For most of the steps in the pipeline, if the job exits with an error code of `143` (exceeded requested resources) it will automatically resubmit with higher requests (2 x original, then 3 x original). If it still fails after three times then the pipeline is stopped.

-Whilst these default requirements will hopefully work for most people with most data, you may find that you want to customise the compute resources that the pipeline requests.
-You can do this by creating a custom config file.
-For example, to give the workflow process `VEP` 32GB of memory, you could use the following config:
+Whilst these default requirements will hopefully work for most people with most data, you may find that you want to customise the compute resources that the pipeline requests. You can do this by creating a custom config file. For example, to give the workflow process `VEP` 32GB of memory, you could use the following config:

```nextflow
process {
@@ -282,72 +132,54 @@ process {

See the main [Nextflow documentation](https://www.nextflow.io/docs/latest/config.html) for more information.
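The body of the config block above is elided by the diff hunk boundary; for reference, a complete version of the `VEP` memory override described in the text would look like this minimal sketch:

```nextflow
process {
  withName: VEP {
    memory = 32.GB
  }
}
```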
-If you are likely to be running `nf-core` pipelines regularly it may be a good idea to request that your custom config file is uploaded to the `nf-core/configs` git repository.
-Before you do this please can you test that the config file works with your pipeline of choice using the `-c` parameter see [-c section](#-c).
-You can then create a pull request to the `nf-core/configs` repository with the addition of your config file, associated documentation file (see examples in [`nf-core/configs/docs`](https://github.com/nf-core/configs/tree/master/docs)), and amending [`nfcore_custom.config`](https://github.com/nf-core/configs/blob/master/nfcore_custom.config) to include your custom profile.
+If you are likely to be running `nf-core` pipelines regularly it may be a good idea to request that your custom config file is uploaded to the `nf-core/configs` git repository. Before you do this, please test that the config file works with your pipeline of choice using the `-c` parameter (see definition above). You can then create a pull request to the `nf-core/configs` repository with the addition of your config file, associated documentation file (see examples in [`nf-core/configs/docs`](https://github.com/nf-core/configs/tree/master/docs)), and amending [`nfcore_custom.config`](https://github.com/nf-core/configs/blob/master/nfcore_custom.config) to include your custom profile.

If you have any questions or issues please send us a message on [Slack](https://nf-co.re/join/slack) on the [`#configs` channel](https://nfcore.slack.com/channels/configs).

### Running in the background

-`Nextflow` handles job submissions and supervises the running jobs.
-The `Nextflow` process must run until the pipeline is finished.
+Nextflow handles job submissions and supervises the running jobs. The Nextflow process must run until the pipeline is finished.

-The `Nextflow` `-bg` flag launches Nextflow in the background, detached from your terminal so that the workflow does not stop if you log out of your session. The logs are saved to a file.
+The Nextflow `-bg` flag launches Nextflow in the background, detached from your terminal so that the workflow does not stop if you log out of your session. The logs are saved to a file.

Alternatively, you can use `screen` / `tmux` or a similar tool to create a detached session which you can log back into at a later time.
Some HPC setups also allow you to run Nextflow within a cluster job submitted to your job scheduler (from where it submits more jobs).

#### Nextflow memory requirements

-In some cases, the `Nextflow` Java virtual machines can start to request a large amount of memory.
+In some cases, the Nextflow Java virtual machines can start to request a large amount of memory.
We recommend adding the following line to your environment to limit this (typically in `~/.bashrc` or `~/.bash_profile`):

```bash
NXF_OPTS='-Xms1g -Xmx4g'
```

-## Pipeline specific arguments
-
-### --step
-
-> **NB** only one step must be specified
-
-Use this to specify the starting step
-
-Default: `mapping`
-
-Available: `mapping`, `prepare_recalibration`, `recalibrate`, `variant_calling`, `annotate`, `Control-FREEC`
-
-> **NB** step can be specified with no concern for case, or the presence of `-` or `_`
-
-### --input
+## Troubleshooting

-Use this to specify the location of your input `TSV` (Tab Separated Values) file.
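To make the memory cap from the snippet above persist across sessions, it can be appended to the shell start-up file, e.g. (assuming a bash shell):

```bash
echo "export NXF_OPTS='-Xms1g -Xmx4g'" >> ~/.bashrc
```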
+### TSV file

> **NB** Delimiter is the tab (`\t`) character, and no header is required

There are different kinds of `TSV` files that can be used as input, depending on the input files available (`FASTQ`, `unmapped BAM`, `recalibrated BAM`...).
-The `TSV` file should correspond to the correct step, see [`--step`](#--step) for more information.
+The `TSV` file should correspond to the correct step.
For all possible `TSV` files, described in the next sections, here is an explanation of what the columns refer to:

`Sarek` auto-generates `TSV` files for all samples and for each individual sample, depending on the options specified.

-- `subject` designates the subject, it should be the ID of the subject, and it must be unique for each subject, but one subject can have multiple samples (e.g.
+* `subject` designates the subject, it should be the ID of the subject, and it must be unique for each subject, but one subject can have multiple samples (e.g. normal and tumor)
-- `sex` are the sex chromosomes of the subject, (ie `XX`, `XY`...)
-- `status` is the status of the measured sample, (`0` for Normal or `1` for Tumor)
-- `sample` designates the sample, it should be the ID of the sample (it is possible to have more than one tumor sample for each subject, i.e.
-a tumor and a relapse), it must be unique, but samples can have multiple lanes (which will later be merged)
-- `lane` is used when the sample is multiplexed on several lanes, it must be unique for each lane in the same sample (but does not need to be the original lane name), and must contain at least one character
-- `fastq1` is the path to the first pair of the `FASTQ` file
-- `fastq2` is the path to the second pair of the `FASTQ` file
-- `bam` is the path to the `BAM` file
-- `bai` is the path to the `BAM` index file
-- `recaltable` is the path to the recalibration table
-- `mpileup` is the path to the mpileup file
-
-It is recommended to add the absolute path of the files, but relative path should also work.
+* `sex` are the sex chromosomes of the subject (i.e. `XX`, `XY`...), and will only be used for Copy-Number Variation in a tumor/normal pair.
+* `status` is the status of the measured sample, (`0` for Normal or `1` for Tumor)
+* `sample` designates the sample, it should be the ID of the sample (it is possible to have more than one tumor sample for each subject, i.e. a tumor and a relapse), it must be unique, but samples can have multiple lanes (which will later be merged)
+* `lane` is used when the sample is multiplexed on several lanes, it must be unique for each lane in the same sample (but does not need to be the original lane name), and must contain at least one character
+* `fastq1` is the path to the first pair of the `FASTQ` file
+* `fastq2` is the path to the second pair of the `FASTQ` file
+* `bam` is the path to the `BAM` file
+* `bai` is the path to the `BAM` index file
+* `recaltable` is the path to the recalibration table
+* `mpileup` is the path to the mpileup file

+It is recommended to use the absolute path of the files, but relative paths should also work.

If necessary, a tumor sample can be associated with a normal sample as a pair, if specified with the same `subject` and a different `sample`.
An additional tumor sample (such as a relapse, for example) can be added if specified with the same `subject` and a different `sample`.
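To make the column layout concrete, a minimal tumor/normal `FASTQ` input for one subject could look like the following sketch; the IDs and paths are hypothetical, and the columns must be separated by tab characters (shown here as wide spacing):

```text
SUBJECT_ID  XX  0  SAMPLE_ID1  LANE_1  /data/normal_1.fastq.gz  /data/normal_2.fastq.gz
SUBJECT_ID  XX  1  SAMPLE_ID2  LANE_1  /data/tumor_1.fastq.gz   /data/tumor_2.fastq.gz
```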
@@ -360,7 +192,7 @@ Output from Variant Calling and/or Annotation will be in a specific directory fo #### --input <FASTQ> --step mapping -The `TSV` file to start with the step mapping with paired-end `FASTQs` should contain the columns: +The `TSV` file to start with the mapping step (`--step mapping`) with paired-end `FASTQs` should contain the columns: `subject sex status sample lane fastq1 fastq2` @@ -394,9 +226,9 @@ In this example (`example_pair_fastq.tsv`), there are 3 read groups for the norm #### --input <uBAM> --step mapping -The `TSV` file for starting the mapping from `unmapped BAM` files should contain the columns: +The `TSV` file to start with the mapping step (`--step mapping`) with `unmapped BAM` files should contain the columns: -- `subject sex status sample lane bam` +`subject sex status sample lane bam` In this example (`example_ubam.tsv`), there are 3 read groups. @@ -426,64 +258,6 @@ In this example (`example_pair_ubam.tsv`), there are 3 read groups for the norma --input example_pair_ubam.tsv ``` -#### --input <sample/> --step mapping - -Use this to specify the location to a directory with `FASTQ` files for the `mapping` step of a single germline sample only. -For example: - -```bash ---input -``` - -> **NB** All of the found `FASTQ` files are considered to belong to the same sample. - -The input folder, containing the `FASTQ` files for one subject (ID) should be organized into one sub-folder for every sample. -The given directory is searched recursively for `FASTQ` files that are named `*_R1_*.fastq.gz`, and a matching pair with the same name except `_R2_` instead of `_R1_` is expected to exist alongside. -All `FASTQ` files for that sample should be collected here. - -```text -ID -+--sample1 -+------sample1___lane1_R1_1000.fastq.gz -+------sample1___lane1_R2_1000.fastq.gz -+------sample1___lane2_R1_1000.fastq.gz -+------sample1___lane2_R2_1000.fastq.gz -+--sample2 -+------sample2___lane1_R1_1000.fastq.gz -+------sample2___lane1_R2_1000.fastq.gz -+--sample3 -+------sample3___lane1_R1_1000.fastq.gz -+------sample3___lane1_R2_1000.fastq.gz -+------sample3___lane2_R1_1000.fastq.gz -+------sample3___lane2_R2_1000.fastq.gz -``` - -`FASTQ` filename structure: - -- `____R1_.fastq.gz` and -- `____R2_.fastq.gz` - -Where: - -- `sample` = sample id -- `lib` = identifier of library preparation -- `flowcell-index` = identifier of flow cell for the sequencing run -- `lane` = identifier of the lane of the sequencing run - -Read group information will be parsed from `FASTQ` file names according to this: - -- `RGID` = "sample_lib_flowcell_index_lane" -- `RGPL` = "Illumina" -- `PU` = sample -- `RGLB` = lib - -Each `FASTQ` file pair gets its own read group (`@RG`) in the resulting `BAM` file in the following way. - -- The sample name (`SM`) is derived from the the last component of the path given to `--input`. -That is, you should make sure that that directory has a meaningful name! For example, with `--input=/my/fastqs/sample123`, the sample name will be `sample123`. -- The read group id is set to *flowcell.samplename.lane*. -The flowcell id and lane number are auto-detected from the name of the first read in the `FASTQ` file. - #### --input <TSV> --step prepare_recalibration To start from the preparation of the recalibration step (`--step prepare_recalibration`), a `TSV` file needs to be given as input containing the paths to the `non-recalibrated BAM` files. 
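Picking up from the sentence above, restarting at this step is a single command; the sketch below uses a placeholder path for the `Sarek`-generated `TSV` described in the next hunk:

```bash
# the TSV path is a placeholder for the Sarek-generated file under results/Preprocessing/TSV/
nextflow run nf-core/sarek --step prepare_recalibration --input results/Preprocessing/TSV/<your_run>.tsv
```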
@@ -491,7 +265,7 @@ The `Sarek`-generated `TSV` file is stored under `results/Preprocessing/TSV/dupl The `TSV` contains the following columns: -- `subject sex status sample bam bai` +`subject sex status sample bam bai` | | | | | | | |-|-|-|-|-|-| @@ -527,7 +301,7 @@ The `Sarek`-generated `TSV` file is stored under `results/Preprocessing/TSV/dupl The `TSV` contains the following columns: -- `subject sex status sample bam bai recaltable` +`subject sex status sample bam bai recaltable` | | | | | | | | |-|-|-|-|-|-|-| @@ -563,7 +337,7 @@ The `Sarek`-generated `TSV` file is stored under `results/Preprocessing/TSV/reca The `TSV` file should contain the columns: -- `subject sex status sample bam bai` +`subject sex status sample bam bai` Here is an example for two samples from the same subject: @@ -585,7 +359,7 @@ The `Sarek`-generated `TSV` file is stored under `results/VariantCalling/TSV/con The `TSV` file should contain the columns: -- `subject sex status sample mpileup` +`subject sex status sample mpileup` Here is an example for one normal/tumor pair from one subjects: @@ -594,47 +368,88 @@ Here is an example for one normal/tumor pair from one subjects: |SUBJECT_ID|XX|0|SAMPLE_ID1|/samples/normal.pileup| |SUBJECT_ID|XX|1|SAMPLE_ID2|/samples/tumor.pileup| -#### --input <VCF> --step annotate +### --input <sample/> --step mapping -Input files for Sarek can be specified using the path to a `VCF` file given to the `--input` command only with the annotation step (`--step annotate`). -As `Sarek` will use `bgzip` and `tabix` to compress and index `VCF` files annotated, it expects `VCF` files to be sorted. -Multiple `VCF` files can be specified, using a [glob path](https://docs.oracle.com/javase/tutorial/essential/io/fileOps.html#glob), if enclosed in quotes. +Use this to specify the location to a directory with `FASTQ` files for the `mapping` step of a single germline sample only. For example: ```bash ---step annotate --input "results/VariantCalling/*/{HaplotypeCaller,Manta,Mutect2,Strelka,TIDDIT}/*.vcf.gz" +--input +``` + +> **NB** All of the found `FASTQ` files are considered to belong to the same sample. + +The input folder, containing the `FASTQ` files for one subject (ID) should be organized into one sub-folder for every sample. +The given directory is searched recursively for `FASTQ` files that are named `*_R1_*.fastq.gz`, and a matching pair with the same name except `_R2_` instead of `_R1_` is expected to exist alongside. +All `FASTQ` files for that sample should be collected here. + +```text +ID ++--sample1 ++------sample1___lane1_R1_1000.fastq.gz ++------sample1___lane1_R2_1000.fastq.gz ++------sample1___lane2_R1_1000.fastq.gz ++------sample1___lane2_R2_1000.fastq.gz ++--sample2 ++------sample2___lane1_R1_1000.fastq.gz ++------sample2___lane1_R2_1000.fastq.gz ++--sample3 ++------sample3___lane1_R1_1000.fastq.gz ++------sample3___lane1_R2_1000.fastq.gz ++------sample3___lane2_R1_1000.fastq.gz ++------sample3___lane2_R2_1000.fastq.gz ``` -### --help +`FASTQ` filename structure: + +* `____R1_.fastq.gz` and +* `____R2_.fastq.gz` + +Where: -Will display the help message. 
+* `sample` = sample id
+* `lib` = identifier of library preparation
+* `flowcell-index` = identifier of flow cell for the sequencing run
+* `lane` = identifier of the lane of the sequencing run
-### --no_intervals
+Read group information will be parsed from `FASTQ` file names according to this:
+
+* `RGID` = "sample_lib_flowcell_index_lane"
+* `RGPL` = "Illumina"
+* `PU` = sample
+* `RGLB` = lib
+
+Each `FASTQ` file pair gets its own read group (`@RG`) in the resulting `BAM` file in the following way.
-Disable usage of [`intervals`](#--intervals) file.
+* The sample name (`SM`) is derived from the last component of the path given to `--input`.
+That is, you should make sure that that directory has a meaningful name! For example, with `--input=/my/fastqs/sample123`, the sample name will be `sample123`.
+* The read group id is set to *flowcell.samplename.lane*.
+The flowcell id and lane number are auto-detected from the name of the first read in the `FASTQ` file.
-### --nucleotides_per_second
+### --input <VCF> --step annotate
-Use this to estimate of how many seconds it will take to call variants on any interval, the default value is `1000` when not specified in the [`intervals`](#--intervals) file.
+Input files for Sarek can be specified using the path to a `VCF` file given to the `--input` command only with the annotation step (`--step annotate`).
+As `Sarek` will use `bgzip` and `tabix` to compress and index annotated `VCF` files, it expects `VCF` files to be sorted.
+Multiple `VCF` files can be specified, using a [glob path](https://docs.oracle.com/javase/tutorial/essential/io/fileOps.html#glob), if enclosed in quotes.
+For example:
-### --sentieon
+```bash
+--step annotate --input "results/VariantCalling/*/{HaplotypeCaller,Manta,Mutect2,Strelka,TIDDIT}/*.vcf.gz"
+```
-[Sentieon](https://www.sentieon.com/) is a commercial solution to process genomics data with high computing efficiency, fast turnaround time, exceptional accuracy, and 100% consistency.
+### Sentieon
-If [Sentieon](https://www.sentieon.com/) is available, use this `--sentieon` params to enable with `Sarek` to use some `Sentieon Analysis Pipelines & Tools`.
-Adds the following tools for the [`--tools`](#--tools---sentieon) options: `DNAseq`, `DNAscope` and `TNscope`.
+Sentieon is a commercial solution to process genomics data with high computing efficiency, fast turnaround time, exceptional accuracy, and 100% consistency.

Please refer to the [nf-core/configs](https://github.com/nf-core/configs#adding-a-new-pipeline-specific-config) repository on how to make a pipeline-specific configuration file based on the [munin-sarek specific configuration file](https://github.com/nf-core/configs/blob/master/conf/pipeline/sarek/munin.config).

Or ask us on the [nf-core Slack](http://nf-co.re/join/slack) on the following channels: [#sarek](https://nfcore.slack.com/channels/sarek) or [#configs](https://nfcore.slack.com/channels/configs).

-The following `Sentieon Analysis Pipelines & Tools` are available within `Sarek`:
-
#### Alignment

> Sentieon BWA matches BWA-MEM with > 2X speedup.

-This tool is enabled by default within `Sarek` if `--sentieon` is specified and if the pipeline is started with the `mapping` [step](usage.md#--step).
+This tool is enabled by default within `Sarek` if both `--sentieon` and `--step mapping` are specified.
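To sketch how these Sentieon flags combine in practice, a run starting from mapping could look like the following; the TSV path is a placeholder, and a site-specific configuration (such as the munin example linked above) is assumed to provide the Sentieon installation and licence:

```bash
nextflow run nf-core/sarek --input samples.tsv --step mapping --sentieon --tools DNAseq
```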
#### Germline SNV/INDEL Variant Calling - DNAseq @@ -642,14 +457,14 @@ This tool is enabled by default within `Sarek` if `--sentieon` is specified and > Matches GATK 3.3-4.1, and without down-sampling. > Results up to 10x faster and 100% consistent every time. -This tool is enabled within `Sarek` if `--sentieon` is specified and if `--tools DNAseq` is specified cf [--tools --sentieon](#--tools--sentieon). +This tool is enabled within `Sarek` if both `--sentieon` and `--tools DNAseq` are specified. #### Germline SNV/INDEL Variant Calling - DNAscope > Improved accuracy and genome characterization. > Machine learning enhanced filtering producing top variant calling accuracy. -This tool is enabled within `Sarek` if `--sentieon` is specified and if `--tools DNAscope` is specified cf [--tools --sentieon](#--tools--sentieon). +This tool is enabled within `Sarek` if both `--sentieon` and `--tools DNAscope` are specified. #### Somatic SNV/INDEL Variant Calling - TNscope @@ -657,162 +472,73 @@ This tool is enabled within `Sarek` if `--sentieon` is specified and if `--tools > Improved accuracy, machine learning enhanced filtering. > Supports molecular barcodes and unique molecular identifiers. -This tool is enabled within `Sarek` if `--sentieon` is specified and if `--tools TNscope` is specified cf [--tools --sentieon](#--tools--sentieon). +This tool is enabled within `Sarek` if both `--sentieon` and `--tools TNscope` are specified. #### Structural Variant Calling > Germline and somatic SV calling, including translocations, inversions, duplications and large INDELs -This tool is enabled within `Sarek` if `--sentieon` is specified and if `--tools DNAscope` is specified cf [--tools --sentieon](#--tools--sentieon). - -### --skip_qc - -Use this to disable specific QC and Reporting tools. -Multiple tools can be specified, separated by commas. - -Available: `all`, `bamQC`, `BaseRecalibrator`, `BCFtools`, `Documentation`, `FastQC`, `MarkDuplicates`, `MultiQC`, `samtools`, `vcftools`, `versions` - -Default: `None` - -> **NB** `--skip_qc MarkDuplicates` does not skip `MarkDuplicates` but prevent the collection of duplicate metrics that slows down performance - -### --target_bed - -Use this to specify the target `BED` file for targeted or whole exome sequencing. - -The `--target_bed` parameter does _not_ imply that the workflow is running alignment or variant calling only for the supplied targets. -Instead, we are aligning for the whole genome, and selecting variants only at the very end by intersecting with the provided target file. -Adding every exon as an interval in case of `WES` can generate >200K processes or jobs, much more forks, and similar number of directories in the Nextflow work directory. -Furthermore, primers and/or baits are not 100% specific, (certainly not for MHC and KIR, etc.), quite likely there going to be reads mapping to multiple locations. -If you are certain that the target is unique for your genome (all the reads will certainly map to only one location), and aligning to the whole genome is an overkill, better to change the reference itself. - -The recommended flow for targeted sequencing data is to use the workflow as it is, but also provide a `BED` file containing targets for all steps using the `--target_bed` option. -The workflow will pick up these intervals, and activate any `--exome` flag in any tools that allow it to process deeper coverage. -It is advised to pad the variant calling regions (exons or target) to some extent before submitting to the workflow. 
-To add the target `BED` file configure the command line like: - -```bash ---target_bed -``` - -### --tools for Variant Calling - -Use this parameter to specify the variant calling and annotation tools to be used. -Multiple tools can be specified, separated by commas. -For example: - -```bash ---tools Strelka,mutect2,SnpEff -``` - -Available variant callers: `ASCAT`, `Control-FREEC`, `FreeBayes`, `HaplotypeCaller`, `Manta`, `mpileup`, `MSIsensor`, `Mutect2`, `Strelka`, `TIDDIT`. - -> **NB** Tools can be specified with no concern for case - -For more information on the individual variant callers, and where to find the variant calling results, check the [output](output.md) documentation. - -> **WARNING** Not all variant callers are available for both germline and somatic variant calling. +This tool is enabled within `Sarek` if both `--sentieon` and `--tools DNAscope` are specified. -For more information, please read the following documentation on [Germline variant calling](#germline-variant-calling), [Somatic variant calling with tumor - normal pairs](#somatic-variant-calling-with-tumor---normal-pairs) and [Somatic variant calling with tumor only samples](#somatic-variant-calling-with-tumor-only-samples) +### Containers -#### Germline variant calling - -Using `Sarek`, germline variant calling will always be performed if a variant calling tool with a germline mode is selected. -Germline variant calling can currently only be performed with the following variant callers: - -- *FreeBayes* -- *HaplotypeCaller* -- *Manta* -- *mpileup* -- *Sentieon* -- *Strelka* -- *TIDDIT* - -For more information on the individual variant callers, and where to find the variant calling results, check the [output](output.md) documentation. - -#### Somatic variant calling with tumor - normal pairs - -Using `Sarek`, somatic variant calling will be performed, if your input tsv file contains tumor / normal pairs (see [input](#--input) documentation for more information). -Different samples belonging to the same patient, where at least one is marked as normal (`0` in the `Status` column) and at least one is marked as tumor (`1` in the `Status` column) are treated as tumor / normal pairs. - -If tumor-normal pairs are provided, both germline variant calling and somatic variant calling will be performed, provided that the selected variant caller allows for it. -If the selected variant caller allows only for somatic variant calling, then only somatic variant calling results will be generated. - -Here is a list of the variant calling tools that support somatic variant calling: - -- *ASCAT* -- *Control-FREEC* -- *FreeBayes* -- *Manta* -- *MSIsensor* -- *Mutect2* -- *Sentieon* -- *Strelka* - -For more information on the individual variant callers, and where to find the variant calling results, check the [output](output.md) documentation. - -#### Somatic variant calling with tumor only samples - -Somatic variant calling with only tumor samples (no matching normal sample), is not recommended by the `GATK best practices`. -This is just supported for a limited variant callers. - -Here is a list of the variant calling tools that support tumor-only somatic variant calling: - -- *Manta* -- *mpileup* -- *Mutect2* -- *TIDDIT* - -### --tools --sentieon - -> **WARNING** Only with `--sentieon` - -If [Sentieon](https://www.sentieon.com/) is available, use this `--sentieon` params to enable with `Sarek` to use some `Sentieon Analysis Pipelines & Tools`. 
-Adds the following tools for the [`--tools`](#--tools) options: `DNAseq`, `DNAscope` and `TNscope`. - -### --tools for Annotation - -Available annotation tools: `VEP`, `SnpEff`, `merge`. -For more details, please check the [annotation](#annotation-tools) documentation. - -#### Annotation tools +`sarek`, our main container is designed using [Conda](https://conda.io/). -With `Sarek`, annotation is done using `snpEff`, `VEP`, or even both consecutively: +[![sarek-docker status](https://img.shields.io/docker/automated/nfcore/sarek.svg)](https://hub.docker.com/r/nfcore/sarek) -- `--tools snpEff` - - To annotate using `snpEff` -- `--tools VEP` - - To annotate using `VEP` -- `--tools snpEff,VEP` - - To annotate using `snpEff` and `VEP` -- `--tools merge` - - To annotate using `snpEff` followed by `VEP` +Based on [nfcore/base:1.12.1](https://hub.docker.com/r/nfcore/base/tags), it contains: + +* **[ASCAT](https://github.com/Crick-CancerGenomics/ascat)** 2.5.2 +* **[AlleleCount](https://github.com/cancerit/alleleCount)** 4.0.2 +* **[BCFTools](https://github.com/samtools/bcftools)** 1.9 +* **[bwa](https://github.com/lh3/bwa)** 0.7.17 +* **[bwa-mem2](https://github.com/bwa-mem2/bwa-mem2)** 2.0 +* **[CNVkit](https://github.com/etal/cnvkit)** 0.9.6 +* **[Control-FREEC](https://github.com/BoevaLab/FREEC)** 11.6 +* **[FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)** 0.11.9 +* **[fgbio](https://github.com/fulcrumgenomics/fgbio)** 1.1.0 +* **[FreeBayes](https://github.com/ekg/freebayes)** 1.3.2 +* **[GATK4-spark](https://github.com/broadinstitute/gatk)** 4.1.7.0 +* **[GeneSplicer](https://ccb.jhu.edu/software/genesplicer/)** 1.0 +* **[ggplot2](https://github.com/tidyverse/ggplot2)** 3.3.0 +* **[HTSlib](https://github.com/samtools/htslib)** 1.9 +* **[Manta](https://github.com/Illumina/manta)** 1.6.0 +* **[msisensor](https://github.com/ding-lab/msisensor)** 0.5 +* **[MultiQC](https://github.com/ewels/MultiQC/)** 1.8 +* **[Qualimap](http://qualimap.bioinfo.cipf.es)** 2.2.2d +* **[SAMBLASTER](https://github.com/GregoryFaust/samblaster)** 0.1.24 +* **[samtools](https://github.com/samtools/samtools)** 1.9 +* **[snpEff](http://snpeff.sourceforge.net/)** 4.3.1t +* **[Strelka2](https://github.com/Illumina/strelka)** 2.9.10 +* **[TIDDIT](https://github.com/SciLifeLab/TIDDIT)** 2.7.1 +* **[pigz](https://zlib.net/pigz/)** 2.3.4 +* **[Trim Galore](https://github.com/FelixKrueger/TrimGalore)** 0.6.5 +* **[VCFanno](https://github.com/brentp/vcfanno)** 0.3.2 +* **[VCFtools](https://vcftools.github.io/index.html)** 0.1.16 +* **[VEP](https://github.com/Ensembl/ensembl-vep)** 99.2 + +For annotation, the main container can be used, but then cache has to be downloaded, or additional containers are available with cache. -`VCF` produced by `Sarek` will be annotated if `snpEff` or `VEP` are specified with the `--tools` command. -As Sarek will use `bgzip` and `tabix` to compress and index VCF files annotated, it expects VCF files to be sorted. +`sareksnpeff`, our `snpeff` container is designed using [Conda](https://conda.io/). -In these examples, all command lines will be launched starting with `--step annotate`. -It can of course be started directly from any other step instead. 
+[![sareksnpeff-docker status](https://img.shields.io/docker/automated/nfcore/sareksnpeff.svg)](https://hub.docker.com/r/nfcore/sareksnpeff)
-#### Using genome specific containers
+Based on [nfcore/base:1.12.1](https://hub.docker.com/r/nfcore/base/tags), it contains:
-`Sarek` has already designed containers with `snpEff` and `VEP` files for Human (`GRCh37`, `GRCh38`), Mouse (`GRCm38`), Dog (`CanFam3.1`) and Roundworm (`WBcel235`).
+* **[snpEff](http://snpeff.sourceforge.net/)** 4.3.1t
+* Cache for `GRCh37`, `GRCh38`, `GRCm38`, `CanFam3.1` or `WBcel235`
-Default settings will run using these containers.
-The main `Sarek` container has also `snpEff` and `VEP` installed, but without the cache files that can be downloaded separately.
-See [containers documentation](#containers) for more information.
+`sarekvep`, our `vep` container is designed using [Conda](https://conda.io/).
-#### Download cache
+[![sarekvep-docker status](https://img.shields.io/docker/automated/nfcore/sarekvep.svg)](https://hub.docker.com/r/nfcore/sarekvep)
-A `Nextflow` helper script has been designed to help downloading `snpEff` and `VEP` caches.
-Such files are meant to be shared between multiple users, so this script is mainly meant for people administrating servers, clusters and advanced users.
+Based on [nfcore/base:1.12.1](https://hub.docker.com/r/nfcore/base/tags), it contains:
-```bash
-nextflow run download_cache.nf --snpeff_cache --snpeff_db --genome
-nextflow run download_cache.nf --vep_cache --species --vep_cache_version --genome
-```
+* **[GeneSplicer](https://ccb.jhu.edu/software/genesplicer/)** 1.0
+* **[VEP](https://github.com/Ensembl/ensembl-vep)** 99.2
+* Cache for `GRCh37`, `GRCh38`, `GRCm38`, `CanFam3.1` or `WBcel235`

-#### Using downloaded cache
+### Using downloaded cache

Both `snpEff` and `VEP` enable usage of cache.
If cache is available on the machine where `Sarek` is run, it is possible to run annotation using cache.

@@ -826,955 +552,75 @@
nextflow run nf-core/sarek --tools snpEff --step annotate --sample
nextflow run nf-core/sarek --tools VEP --step annotate --sample --vep_cache --annotation_cache
```

-#### Using VEP CADD plugin
-
-To enable the use of the `VEP` `CADD` plugin:
-
-- Download the `CADD` files
-- Specify them (either on the command line, like in the example or in a configuration file)
-- use the `--cadd_cache` flag
+### Spark related issues
-Example:
+If you have problems running processes that make use of Spark, such as `MarkDuplicates`, you are probably experiencing issues with the limit of open files in your system.
+You can check your current limit by typing the following:

```bash
-nextflow run nf-core/sarek --step annotate --tools VEP --sample --cadd_cache \
- --cadd_indels \
- --cadd_indels_tbi \
- --cadd_wg_snvs \
- --cadd_wg_snvs_tbi
+ulimit -n
```

-#### Downloading CADD files
+The default limit size is usually 1024, which is quite low for running Spark jobs.
+In order to increase the size limit permanently you can:
-An helper script has been designed to help downloading `CADD` files.
-Such files are meant to be share between multiple users, so this script is mainly meant for people administrating servers, clusters and advanced users.
+Edit the file `/etc/security/limits.conf` and add the lines:

```bash
-nextflow run download_cache.nf --cadd_cache  --cadd_version  --genome
+* soft nofile 65535
+* hard nofile 65535
```

-#### Using VEP GeneSplicer plugin
-
-To enable the use of the `VEP` `GeneSplicer` plugin:
-
-- use the `--genesplicer` flag
-
-Example:
+Edit the file `/etc/sysctl.conf` and add the line:

```bash
-nextflow run nf-core/sarek --step annotate --tools VEP --sample  --genesplicer
+fs.file-max = 65535
```

-## Modify fastqs (trim/split)
-
-### --trim_fastq
-
-Use this to perform adapter trimming with [Trim Galore](https://github.com/FelixKrueger/TrimGalore/blob/master/Docs/Trim_Galore_User_Guide.md)
-
-### --clip_r1
-
-Instructs `Trim Galore` to remove a number of bp from the 5' end of read 1 (or single-end reads).
-This may be useful if the qualities were very poor, or if there is some sort of unwanted bias at the 5' end.
-
-### --clip_r2
-
-Instructs `Trim Galore` to remove a number of bp from the 5' end of read 2 (paired-end reads only).
-This may be useful if the qualities were very poor, or if there is some sort of unwanted bias at the 5' end.
-
-### --three_prime_clip_r1
-
-Instructs `Trim Galore` to remove a number of bp from the 3' end of read 1 (or single-end reads) AFTER adapter/quality trimming has been performed.
-This may remove some unwanted bias from the 3' end that is not directly related to adapter sequence or basecall quality.
-
-### --three_prime_clip_r2
-
-Instructs `Trim Galore` to remove a number of bp from the 3' end of read 2 AFTER adapter/quality trimming has been performed.
-This may remove some unwanted bias from the 3' end that is not directly related to adapter sequence or basecall quality.
-
-### --trim_nextseq
-
-This enables the option `--nextseq-trim=3'CUTOFF` within `Cutadapt`, which will set a quality cutoff (that is normally given with `-q` instead), but qualities of G bases are ignored.
-This trimming is common for the `NextSeq` and `NovaSeq` platforms, where basecalls without any signal are called as high-quality G bases.
-
-### --save_trimmed
-
-Option to keep trimmed `FASTQs`
-
-### --split_fastq
-
-Use the `Nextflow` [`splitFastq`](https://www.nextflow.io/docs/latest/operator.html#splitfastq) operator to specify how many reads should be contained in the split fastq file.
-For example:
+Edit the file `/etc/sysconfig/docker` and add the new limits to OPTIONS like this:

```bash
---split_fastq 10000
+OPTIONS="--default-ulimit nofile=65535:65535"
```

-## Preprocessing
-
-### --aligner
-
-To control which aligner is used for mapping the reads.
+Restart your session.

-Available: `bwa-mem` and `bwa-mem2`
+Note that the way to increase the open file limit in your system may be slightly different or require additional steps.

-Default: `bwa-mem`
+### Download cache

-Example:
+A `Nextflow` helper script has been designed to help download `snpEff` and `VEP` caches.
+Such files are meant to be shared between multiple users, so this script is mainly meant for people administering servers or clusters, and for advanced users.

```bash
---aligner "bwa-mem"
+nextflow run download_cache.nf --snpeff_cache --snpeff_db --genome
+nextflow run download_cache.nf --vep_cache --species --vep_cache_version --genome
```
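+
+For example, a hypothetical invocation for `GRCh37` could look as follows (the cache paths, snpEff DB name, species and cache version here are illustrative assumptions, not pipeline defaults):
+
+```bash
+# Illustrative values only: adjust paths, DB name, species and version to your setup
+nextflow run download_cache.nf --snpeff_cache /path/to/snpEff/cache --snpeff_db GRCh37.75 --genome GRCh37
+nextflow run download_cache.nf --vep_cache /path/to/VEP/cache --species homo_sapiens --vep_cache_version 99 --genome GRCh37
+```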

-> **WARNING** Current indices for `bwa` in AWS iGenomes are not compatible with `bwa-mem2`.
-> Use `--bwa=false` to have `Sarek` build them automatically.
->
-> **WARNING** BWA-mem2 is in active development
-> Sarek might not be able to require the right amount of resources for it at the moment
-> We recommend to use pre-built indexes
-
-Example:
-
-```bash
---aligner "bwa-mem2" --bwa=false
-```
+### Using VEP CADD plugin

-### --markdup_java_options
+To enable the use of the `VEP` `CADD` plugin:

-To control the java options necessary for the `GATK MarkDuplicates` process, you can set this parameter.
+* Download the `CADD` files
+* Specify them (either on the command line, as in the example below, or in a configuration file)
+* Use the `--cadd_cache` flag

-Default: "-Xms4000m -Xmx7g"
-For example:
+Example:

```bash
---markdup_java_options "-Xms4000m -Xmx7g"
+nextflow run nf-core/sarek --step annotate --tools VEP --sample --cadd_cache \
+    --cadd_indels \
+    --cadd_indels_tbi \
+    --cadd_wg_snvs \
+    --cadd_wg_snvs_tbi
```

-### --no_gatk_spark
-
-Use this to disable usage of `Spark` implementation of the `GATK` tools in local mode.
-
-### --save_bam_mapped
-
-Will save `mapped BAMs`.
-
-### --skip_markduplicates
-
-Will skip `GATK MarkDuplicates`.
-This params will also save the `mapped BAMS`, to enable restart from step `prepare_recalibration`
-
-## Variant Calling
-
-### --ascat_ploidy
-
-Use this parameter to overwrite default behavior from `ASCAT` regarding `ploidy`.
-Requires that [`--ascat_purity`](#--ascat_purity) is set.
-
-### --ascat_purity
-
-Use this parameter to overwrite default behavior from `ASCAT` regarding `purity`.
-Requires that [`--ascat_ploidy`](#--ascat_ploidy) is set.
-
-### --cf_coeff
-
-Use this parameter to overwrite default behavior from `Control-FREEC` regarding `coefficientOfVariation`
+### Downloading CADD files

-Default: `0.05`
-
-### --cf_ploidy
-
-Use this parameter to overwrite default behavior from `Control-FREEC` regarding `ploidy`
-
-Default: `2`
-
-### --cf_window
-
-Use this parameter to overwrite default behavior from `Control-FREEC` regarding `window size`
-It is recommended to use a window size of 0 for exome data.
-
-Default: Disabled
-
-### --no_gvcf
-
-Use this to disable g.vcf output from `GATK HaplotypeCaller`.
-
-### --no_strelka_bp
-
-Use this not to use `Manta` `candidateSmallIndels` for `Strelka` (not recommended by Broad Institute's Best Practices).
-
-### --pon
-
-When a [panel of normals (PON)](https://gatk.broadinstitute.org/hc/en-us/articles/360035890631-Panel-of-Normals-PON) is defined, it will be use to filter somatic calls.
-Without `PON`, there will be no calls with `PASS` in the `INFO` field, only an _unfiltered_ `VCF` is written.
-It is recommended to make your own `PON`, as it depends on sequencer and library preparation.
-For tests in `iGenomes` there is a dummy `PON` file in the Annotation/GermlineResource directory, but it _should not be used_ as a real panel-of-normals file.
-Provide your `PON` with:
+A helper script has been designed to help download `CADD` files.
+Such files are meant to be shared between multiple users, so this script is mainly meant for people administering servers or clusters, and for advanced users.

```bash
---pon
+nextflow run download_cache.nf --cadd_cache --cadd_version --genome
```

-`PON` file should be bgzipped.
-
-### --pon_index
-
-Tabix index of the `PON` bgzipped VCF file.
-If none provided, will be generated automatically from the `PON` bgzipped VCF file.
-
-### --ignore_soft_clipped_bases
-
-Do not analyze soft clipped bases in the reads for `GATK Mutect2` with the `--dont-use-soft-clipped-bases` params.
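+
+For example, a hypothetical invocation for `GRCh37` could look as follows (the cache path and `CADD` version here are illustrative assumptions, not pipeline defaults):
+
+```bash
+# Illustrative values only: adjust the path and CADD version to your setup
+nextflow run download_cache.nf --cadd_cache /path/to/CADD/cache --cadd_version 1.5 --genome GRCh37
+```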
- -### --umi - -If provided, `Unique Molecular Identifiers (UMIs)` steps will be run to extract and annotate the reads with `UMIs` and create consensus reads. -This part of the pipeline uses [fgbio](https://github.com/fulcrumgenomics/fgbio) to convert the `FASTQ` files into a `unmapped BAM`, where reads are tagged with the `UMIs` extracted from the `FASTQ` sequences. -In order to allow the correct tagging, the `UMI` sequence must be contained in the read sequence itself, and not in the `FASTQ` filename. -Following this step, the `unmapped BAM` is aligned and reads are then grouped based on mapping position and `UMI` tag. -Finally, reads in the same groups are collapsed to create a consensus read. -To create consensus, we have chosen to use the *adjacency method* [ref](https://cgatoxford.wordpress.com/2015/08/14/unique-molecular-identifiers-the-problem-the-solution-and-the-proof/). -In order for the correct tagging to be performed, a read structure needs to be specified as indicated below. - -### --read_structure1 - -When processing `UMIs`, a read structure should always be provided for each of the `FASTQ` files, to allow the correct annotation of the `BAM` file. -If the read does not contain any `UMI`, the structure will be +T (i.e. only template of any length). -The read structure follows a format adopted by different tools, and described [here](https://github.com/fulcrumgenomics/fgbio/wiki/Read-Structures) - -### --read_structure2 - -When processing `UMIs`, a read structure should always be provided for each of the `FASTQ` files, to allow the correct annotation of the `BAM` file. -If the read does not contain any UMI, the structure will be +T (i.e. only template of any length). -The read structure follows a format adopted by different tools, and described [here](https://github.com/fulcrumgenomics/fgbio/wiki/Read-Structures) - -## Annotation - -### --annotate_tools - -Specify from which tools `Sarek` should look for `VCF` files to annotate, only for step `Annotate`. - -Available: `HaplotypeCaller`, `Manta`, `Mutect2`, `Strelka`, `TIDDIT` - -Default: `None` - -### --annotation_cache - -Enable usage of annotation cache, and disable usage of already built containers within `Sarek`. -For more information, follow the [downloaded cache guidelines](#using-downloaded-cache). - -### --snpeff_cache - -To be used conjointly with [`--annotation_cache`](#--annotation_cache), specify the cache `snpEff` directory: - -```bash ---snpeff_cache -``` - -### --vep_cache - -To be used conjointly with [`--annotation_cache`](#--annotation_cache), specify the cache `VEP` directory: - -```bash ---vep_cache -``` - -### --cadd_cache - -Enable `CADD` cache. - -### --cadd_indels - -Path to `CADD InDels` file. - -### --cadd_indels_tbi - -Path to `CADD InDels` index. - -### --cadd_wg_snvs - -Path to `CADD SNVs` file. - -### --cadd_wg_snvs_tbi - -Path to `CADD SNVs` index. - -### --genesplicer - -Enable `genesplicer` within `VEP`. - -## Reference genomes - -The pipeline config files come bundled with paths to the `Illumina iGenomes` reference index files. -The configuration is set up to use the [AWS-iGenomes](https://ewels.github.io/AWS-iGenomes/) resource. - -### --genome (using iGenomes) - -Sarek is using [AWS iGenomes](https://ewels.github.io/AWS-iGenomes/), which facilitate storing and sharing references. -To run the pipeline, you must specify which to use with the `--genome` flag. - -You can find the keys to specify the genomes in the [iGenomes config file](../conf/igenomes.config)). 
-Genomes that are supported are: - -- Homo sapiens - - `--genome GRCh37` (GATK Bundle) - - `--genome GRCh38` (GATK Bundle) - -- Mus musculus - - `--genome GRCm38` (Ensembl) - -Limited support for: - -- Arabidopsis thaliana - - `--genome TAIR10` (Ensembl) - -- Bacillus subtilis 168 - - `--genome EB2` (Ensembl) - -- Bos taurus - - `--genome UMD3.1` (Ensembl) - - `--genome bosTau8` (UCSC) - -- Caenorhabditis elegans - - `--genome WBcel235` (Ensembl) - - `--genome ce10` (UCSC) - -- Canis familiaris - - `--genome CanFam3.1` (Ensembl) - - `--genome canFam3` (UCSC) - -- Danio rerio - - `--genome GRCz10` (Ensembl) - - `--genome danRer10` (UCSC) - -- Drosophila melanogaster - - `--genome BDGP6` (Ensembl) - - `--genome dm6` (UCSC) - -- Equus caballus - - `--genome EquCab2` (Ensembl) - - `--genome equCab2` (UCSC) - -- Escherichia coli K 12 DH10B - - `--genome EB1` (Ensembl) - -- Gallus gallus - - `--genome Galgal4` (Ensembl) - - `--genome galgal4` (UCSC) - -- Glycine max - - `--genome Gm01` (Ensembl) - -- Homo sapiens - - `--genome hg19` (UCSC) - - `--genome hg38` (UCSC) - -- Macaca mulatta - - `--genome Mmul_1` (Ensembl) - -- Mus musculus - - `--genome mm10` (Ensembl) - -- Oryza sativa japonica - - `--genome IRGSP-1.0` (Ensembl) - -- Pan troglodytes - - `--genome CHIMP2.1.4` (Ensembl) - - `--genome panTro4` (UCSC) - -- Rattus norvegicus - - `--genome Rnor_6.0` (Ensembl) - - `--genome rn6` (UCSC) - -- Saccharomyces cerevisiae - - `--genome R64-1-1` (Ensembl) - - `--genome sacCer3` (UCSC) - -- Schizosaccharomyces pombe - - `--genome EF2` (Ensembl) - -- Sorghum bicolor - - `--genome Sbi1` (Ensembl) - -- Sus scrofa - - `--genome Sscrofa10.2` (Ensembl) - - `--genome susScr3` (UCSC) - -- Zea mays - - `--genome AGPv3` (Ensembl) - -Note that you can use the same configuration setup to save sets of reference files for your own use, even if they are not part of the `AWS iGenomes` resource. -See the [Nextflow documentation](https://www.nextflow.io/docs/latest/config.html) for instructions on where to save such a file. - -The syntax for this reference configuration is as follows: - -```nextflow -params { - genomes { - '' { - ac_loci = '' - ac_loci_gc = '' - bwa = '' - chr_dir = '' - chr_length = '' - dbsnp = '' - dbsnp_index = '' - dict = '' - fasta = '' - fasta_fai = '' - germline_resource = '' - germline_resource_index = '' - intervals = '' - known_indels = '' - known_indels_index = '' - mappability = '' - snpeff_db = '' - species = '' - vep_cache_version = ' -``` - -### --ac_loci_gc - -If you prefer, you can specify the full path to your reference genome when you run the pipeline: - -```bash ---ac_loci_gc -``` - -### --bwa - -If you prefer, you can specify the full path to your reference genome when you run the pipeline: - -> If none provided, will be generated automatically from the fasta reference. - -```bash ---bwa -``` - -### --chr_dir - -If you prefer, you can specify the full path to your reference genome when you run the pipeline: - -```bash ---chr_dir -``` - -### --chr_length - -If you prefer, you can specify the full path to your reference genome when you run the pipeline: - -```bash ---chr_length -``` - -### --dbsnp - -If you prefer, you can specify the full path to your reference genome when you run the pipeline: - -```bash ---dbsnp -``` - -### --dbsnp_index - -If you prefer, you can specify the full path to your reference genome when you run the pipeline: - -> If none provided, will be generated automatically from dbsnp `VCF` file. 
- -```bash ---dbsnp_index -``` - -### --dict - -If you prefer, you can specify the full path to your reference genome when you run the pipeline: - -> If none provided, will be generated automatically from the fasta reference. - -```bash ---dict -``` - -### --fasta - -If you prefer, you can specify the full path to your reference genome when you run the pipeline: - -```bash ---fasta -``` - -### --fasta_fai - -> If none provided, will be generated automatically from the fasta reference. - -If you prefer, you can specify the full path to your reference genome when you run the pipeline: - -```bash ---fasta_fai -``` - -### --germline_resource - -The [germline resource VCF file](https://software.broadinstitute.org/gatk/documentation/tooldocs/current/org_broadinstitute_hellbender_tools_walkers_mutect_Mutect2.php#--germline-resource) (bgzipped and tabixed) needed by GATK4 Mutect2 is a collection of calls that are likely present in the sample, with allele frequencies. -The `AF` info field must be present. -You can find a smaller, stripped gnomAD `VCF` file (most of the annotation is removed and only calls signed by PASS are stored) in the `AWS iGenomes` `Annotation/GermlineResource` folder. -If you prefer, you can specify the full path to your reference genome when you run the pipeline: - -```bash ---germline_resource -``` - -### --germline_resource_index - -Tabix index of the germline resource specified at [`--germline_resource`](#--germline_resource). -If you prefer, you can specify the full path to your reference genome when you run the pipeline: - -> If none provided, will be generated automatically from the germline resource `VCF` file. - -```bash ---germline_resource_index -``` - -### --intervals - -To speed up some preprocessing and variant calling processes, the reference is chopped into smaller pieces. -The intervals are chromosomes cut at their centromeres (so each chromosome arm processed separately) also additional unassigned contigs. -We are ignoring the `hs37d5` contig that contains concatenated decoy sequences. -Parts of preprocessing and variant calling are done by these intervals, and the different resulting files are then merged. -This can parallelize processes, and push down wall clock time significantly. - -The calling intervals can be defined using a `.list` or a `.bed` file. -A `.list` file contains one interval per line in the format `chromosome:start-end` (1-based coordinates). - -When the intervals file is in `BED` format, the file must be a tab-separated text file with one interval per line. -There must be at least three columns: chromosome, start, and end. -In `BED` format, the coordinates are 0-based, so the interval `chrom:1-10` becomes `chrom010`. - -Additionally, the `score` column of the `BED` file can be used to provide an estimate of how many seconds it will take to call variants on that interval. -The fourth column remains unused. - -|||||| -|-|-|-|-|-| -|chr1|10000|207666|NA|47.3| - -This indicates that variant calling on the interval chr1:10001-207666 takes approximately 47.3 seconds. - -The runtime estimate is used in two different ways. -First, when there are multiple consecutive intervals in the file that take little time to compute, they are processed as a single job, thus reducing the number of processes that needs to be spawned. -Second, the jobs with largest processing time are started first, which reduces wall-clock time. -If no runtime is given, a time of 1000 nucleotides per second is assumed. 
-Actual figures vary from 2 nucleotides/second to 30000 nucleotides/second. -If you prefer, you can specify the full path to your reference genome when you run the pipeline: - -> If none provided, will be generated automatically from the fasta reference. -> Use [--no_intervals](#--no_intervals) to disable automatic generation - -```bash ---intervals -``` - -### --known_indels - -If you prefer, you can specify the full path to your reference genome when you run the pipeline: - -```bash ---known_indels -``` - -### --known_indels_index - -If you prefer, you can specify the full path to your reference genome when you run the pipeline: - -> If none provided, will be generated automatically from the known indels `VCF` file. - -```bash ---known_indels_index -``` - -### --mappability - -If you prefer, you can specify the full path to your Control-FREEC mappability when you run the pipeline: - -```bash ---mappability -``` - -### --snpeff_db - -If you prefer, you can specify the DB version when you run the pipeline: - -```bash ---snpeff_db -``` - -### --species - -This specifies the species used for running `VEP` annotation. -If you use iGenomes or a local resource with `genomes.conf`, this has already been set for you appropriately. - -```bash ---species -``` - -### --vep_cache_version - -If you prefer, you can specify the cache version when you run the pipeline: - -```bash ---vep_cache_version -``` - -## Other command line parameters - -### --outdir - -The output directory where the results will be saved. - -Default: `results/` - -### --publish_dir_mode - -The file publishing method. - -Available: `symlink`, `rellink`, `link`, `copy`, `copyNoFollow`, `move` - -Default: `copy` - -### --sequencing_center - -The sequencing center that will be used in the `BAM` `CN` field - -### --multiqc_config - -Specify a path to a custom `MultiQC` configuration file. - -### --monochrome_logs - -Set to disable colourful command line output and live life in monochrome. - -### --email - -Set this parameter to your e-mail address to get a summary e-mail with details of the run sent to you when the workflow exits. -If set in your user config file (`~/.nextflow/config`) then you don't need to specify this on the command line for every run. - -### --email_on_fail - -This works exactly as with `--email`, except emails are only sent if the workflow is not successful. - -### --plaintext_email - -Set to receive plain-text e-mails instead of HTML formatted. - -### --max_multiqc_email_size - -Threshold size for `MultiQC` report to be attached in notification email. -If file generated by pipeline exceeds the threshold, it will not be attached. - -Default: `25MB`. - -### -name - -Name for the pipeline run. -If not specified, `Nextflow` will automatically generate a random mnemonic. - -This is used in the `MultiQC` report (if not default) and in the summary HTML / e-mail (always). - -**NB:** Single hyphen (core `Nextflow` option) - -### --custom_config_version - -Provide git commit id for custom Institutional configs hosted at `nf-core/configs`. -This was implemented for reproducibility purposes. - -Default is set to `master`. - -```bash -## Download and use config file with following git commid id ---custom_config_version d52db660777c4bf36546ddb188ec530c3ada1b96 -``` - -### --custom_config_base - -If you're running offline, `Nextflow` will not be able to fetch the institutional config files -from the internet. -If you don't need them, then this is not a problem. 
-If you do need them, you should download the files from the repository and tell `Nextflow` where to find them with the `custom_config_base` option. -For example: - -```bash -NXF_OPTS='-Xms1g -Xmx4g' -``` - -> Note that the nf-core/tools helper package has a `download` command to download all required pipeline files + singularity containers + institutional configs in one go for you, to make this process easier. - -## Job resources - -### Automatic resubmission - -Each step in the pipeline has a default set of requirements for number of CPUs, memory and time. -For most of the steps in the pipeline, if the job exits with an error code of `143` (exceeded requested resources) it will automatically resubmit with higher requests (2 x original, then 3 x original). -If it still fails after three times then the pipeline is stopped. - -### --max_memory - -Use to set a top-limit for the default memory requirement for each process. -Should be a string in the format integer-unit eg. `--max_memory '8.GB'` - -### --max_time - -Use to set a top-limit for the default time requirement for each process. -Should be a string in the format integer-unit eg. `--max_time '2.h'` - -### --max_cpus - -Use to set a top-limit for the default CPU requirement for each process. -Should be a string in the format integer-unit eg. `--max_cpus 1` - -### --single_cpu_mem - -Use to set memory for a single CPU. -Should be a string in the format integer-unit eg. `--single_cpu_mem '8.GB'` - -## Containers - -`sarek`, our main container is designed using [Conda](https://conda.io/). - -[![sarek-docker status](https://img.shields.io/docker/automated/nfcore/sarek.svg)](https://hub.docker.com/r/nfcore/sarek) - -Based on [nfcore/base:1.10.2](https://hub.docker.com/r/nfcore/base/tags), it contains: - -- **[ASCAT](https://github.com/Crick-CancerGenomics/ascat)** 2.5.2 -- **[AlleleCount](https://github.com/cancerit/alleleCount)** 4.0.2 -- **[BCFTools](https://github.com/samtools/bcftools)** 1.9 -- **[bwa](https://github.com/lh3/bwa)** 0.7.17 -- **[bwa-mem2](https://github.com/bwa-mem2/bwa-mem2)** 2.0 -- **[CNVkit](https://github.com/etal/cnvkit)** 0.9.6 -- **[Control-FREEC](https://github.com/BoevaLab/FREEC)** 11.5 -- **[FastQC](http://www.bioinformatics.babraham.ac.uk/projects/fastqc/)** 0.11.9 -- **[fgbio](https://github.com/fulcrumgenomics/fgbio)** 1.1.0 -- **[FreeBayes](https://github.com/ekg/freebayes)** 1.3.2 -- **[GATK4-spark](https://github.com/broadinstitute/gatk)** 4.1.6.0 -- **[GeneSplicer](https://ccb.jhu.edu/software/genesplicer/)** 1.0 -- **[ggplot2](https://github.com/tidyverse/ggplot2)** 3.3.0 -- **[HTSlib](https://github.com/samtools/htslib)** 1.9 -- **[Manta](https://github.com/Illumina/manta)** 1.6.0 -- **[msisensor](https://github.com/ding-lab/msisensor)** 0.5 -- **[MultiQC](https://github.com/ewels/MultiQC/)** 1.8 -- **[Qualimap](http://qualimap.bioinfo.cipf.es)** 2.2.2d -- **[SAMBLASTER](https://github.com/GregoryFaust/samblaster)** 0.1.24 -- **[samtools](https://github.com/samtools/samtools)** 1.9 -- **[snpEff](http://snpeff.sourceforge.net/)** 4.3.1t -- **[Strelka2](https://github.com/Illumina/strelka)** 2.9.10 -- **[TIDDIT](https://github.com/SciLifeLab/TIDDIT)** 2.7.1 -- **[pigz](https://zlib.net/pigz/)** 2.3.4 -- **[Trim Galore](https://github.com/FelixKrueger/TrimGalore)** 0.6.5 -- **[VCFanno](https://github.com/brentp/vcfanno)** 0.3.2 -- **[VCFtools](https://vcftools.github.io/index.html)** 0.1.16 -- **[VEP](https://github.com/Ensembl/ensembl-vep)** 99.2 - -For annotation, the main container can be used, but 
the cache has to be downloaded, or additional containers are available with cache (see [annotation documentation](#using-downloaded-cache)): - -`sareksnpeff`, our `snpeff` container is designed using [Conda](https://conda.io/). - -[![sareksnpeff-docker status](https://img.shields.io/docker/automated/nfcore/sareksnpeff.svg)](https://hub.docker.com/r/nfcore/sareksnpeff) - -Based on [nfcore/base:1.10.2](https://hub.docker.com/r/nfcore/base/tags), it contains: - -- **[snpEff](http://snpeff.sourceforge.net/)** 4.3.1t -- Cache for `GRCh37`, `GRCh38`, `GRCm38`, `CanFam3.1` or `WBcel235` - -`sarekvep`, our `vep` container is designed using [Conda](https://conda.io/). - -[![sarekvep-docker status](https://img.shields.io/docker/automated/nfcore/sarekvep.svg)](https://hub.docker.com/r/nfcore/sarekvep) - -Based on [nfcore/base:1.10.2](https://hub.docker.com/r/nfcore/base/tags), it contains: - -- **[GeneSplicer](https://ccb.jhu.edu/software/genesplicer/)** 1.0 -- **[VEP](https://github.com/Ensembl/ensembl-vep)** 99.2 -- Cache for `GRCh37`, `GRCh38`, `GRCm38`, `CanFam3.1` or `WBcel235` - -### Building your owns - -Our containers are designed using [Conda](https://conda.io/). -The [`environment.yml`](../environment.yml) file can be modified if particular versions of tools are more suited to your needs. - -The following commands can be used to build/download containers on your own system: - -- Adjust `VERSION` for sarek version (typically a release or `dev`). - -#### Build with Conda - -```Bash -conda env create -f environment.yml -``` - -#### Build with Docker - -- `sarek` - -```Bash -docker build -t nfcore/sarek: . -``` - -- `sareksnpeff` - -Adjust arguments for `GENOME` version and snpEff `CACHE_VERSION` - -```Bash -docker build -t nfcore/sareksnpeff:. containers/snpeff/. --build-arg GENOME= --build-arg CACHE_VERSION= -``` - -- `sarekvep` - -Adjust arguments for `GENOME` version, `SPECIES` name and VEP `VEP_VERSION` - -```Bash -docker build -t nfcore/sarekvep:. containers/vep/. --build-arg GENOME= --build-arg SPECIES= --build-arg VEP_VERSION= -``` - -#### Pull with Docker - -- `sarek` - -```Bash -docker pull nfcore/sarek: -``` - -- `sareksnpeff` - -Adjust arguments for `GENOME` version - -```Bash -docker pull nfcore/sareksnpeff:. -``` - -- `sarekvep` - -Adjust arguments for `GENOME` version - -```Bash -docker pull nfcore/sarekvep:. -``` - -#### Pull with Singularity - -You can directly pull singularity image, in the path used by the Nextflow ENV variable `NXF_SINGULARITY_CACHEDIR`, ie: - -```Bash -cd $NXF_SINGULARITY_CACHEDIR -singularity build ... -``` - -- `sarek` - -```Bash -singularity build nfcore-sarek-.img docker://nfcore/sarek: -``` - -- `sareksnpeff` - -Adjust arguments for `GENOME` version - -```Bash -singularity build nfcore-sareksnpeff-..img docker://nfcore/sareksnpeff:. -``` - -- `sarekvep` - -Adjust arguments for `GENOME` version - -```Bash -singularity build nfcore-sarekvep-..img docker://nfcore/sarekvep:. -``` - -## AWSBatch specific parameters - -Running the pipeline on AWSBatch requires a couple of specific parameters to be set according to your AWSBatch configuration. -Please use [`-profile awsbatch`](https://github.com/nf-core/configs/blob/master/conf/awsbatch.config) and then specify all of the following parameters. - -### --awsqueue - -The JobQueue that you intend to use on AWSBatch. - -### --awsregion - -The AWS region to run your job in. - -Default is set to `eu-west-1` but can be adjusted to your needs. 
- -### --awscli - -The [AWS CLI](https://www.nextflow.io/docs/latest/awscloud.html#aws-cli-installation) path in your custom AMI. - -Default: `/home/ec2-user/miniconda/bin/aws`. - -Please make sure to also set the `-w/--work-dir` and `--outdir` parameters to a S3 storage bucket of your choice - you'll get an error message notifying you if you didn't. - -## Troubleshooting - -### Spark related issues - -If you have problems running processes that make use of Spark such as ```MarkDuplicates```. -You are probably experiencing issues with the limit of open files in your system. -You can check your current limit by typing the following: - -```bash -ulimit -n -``` - -The default limit size is usually 1024 which is quite low to run Spark jobs. -In order to increase the size limit permanently you can: - -Edit the file ```/etc/security/limits.conf``` and add the lines: - -```bash -* soft nofile 65535 -* hard nofile 65535 -``` - -Edit the file ```/etc/sysctl.conf``` and add the line: - -```bash -fs.file-max = 65535 -``` - -Edit the file ```/etc/sysconfig/docker``` and add the new limits to OPTIONS like this: - -```bash -OPTIONS=”—default-ulimit nofile=65535:65535" -``` - -Re-start your session. - -Note that the way to increase the open file limit in your system may be slightly different or require additional steps. diff --git a/environment.yml b/environment.yml index 7b7e46560c..f16b2d11e9 100644 --- a/environment.yml +++ b/environment.yml @@ -1,6 +1,6 @@ # You can use this file to create a conda environment for this pipeline: # conda env create -f environment.yml -name: nf-core-sarek-3.0dev +name: nf-core-sarek-2.7dev channels: - conda-forge - bioconda diff --git a/main.nf b/main.nf index 3721d2a541..4ffb187953 100644 --- a/main.nf +++ b/main.nf @@ -174,7 +174,7 @@ def helpMessage() { AWSBatch options: --awsqueue [str] The AWSBatch JobQueue that needs to be set when running on AWSBatch - --awsregion [str] The AWS Region for your AWSBatch job to run on + --awsregion [str] The AWS Region for your AWS Batch job to run on --awscli [str] Path to the AWS CLI tool """.stripIndent() } @@ -226,6 +226,7 @@ if (params.umi && !(params.read_structure1 && params.read_structure2)) exit 1, ' custom_runName = params.name if (!(workflow.runName ==~ /[a-z]+_[a-z]+/)) custom_runName = workflow.runName +// Check AWS batch settings if (workflow.profile.contains('awsbatch')) { // AWSBatch sanity checking if (!params.awsqueue || !params.awsregion) exit 1, "Specify correct --awsqueue and --awsregion parameters on AWSBatch!" @@ -238,10 +239,10 @@ if (workflow.profile.contains('awsbatch')) { // MultiQC // Stage config files -ch_multiqc_config = file("$baseDir/assets/multiqc_config.yaml", checkIfExists: true) +ch_multiqc_config = file("$projectDir/assets/multiqc_config.yaml", checkIfExists: true) ch_multiqc_custom_config = params.multiqc_config ? 
Channel.fromPath(params.multiqc_config, checkIfExists: true) : Channel.empty() -ch_output_docs = file("$baseDir/docs/output.md", checkIfExists: true) -ch_output_docs_images = file("$baseDir/docs/images/", checkIfExists: true) +ch_output_docs = file("$projectDir/docs/output.md", checkIfExists: true) +ch_output_docs_images = file("$projectDir/docs/images/", checkIfExists: true) // Handle input tsvPath = null @@ -496,11 +497,8 @@ if (params.config_profile_description) summary['Config Description'] = params.co if (params.config_profile_contact) summary['Config Contact'] = params.config_profile_contact if (params.config_profile_url) summary['Config URL'] = params.config_profile_url -if (params.email || params.email_on_fail) { - summary['E-mail Address'] = params.email - summary['E-mail on failure'] = params.email_on_fail - summary['MultiQC maxsize'] = params.max_multiqc_email_size -} +summary['Config Files'] = workflow.configFiles.join(', ') + if (workflow.profile.contains('awsbatch')) { summary['AWS Region'] = params.awsregion @@ -508,9 +506,13 @@ if (workflow.profile.contains('awsbatch')) { summary['AWS CLI'] = params.awscli } -log.info summary.collect { k, v -> "${k.padRight(18)}: $v" }.join("\n") -if (params.monochrome_logs) log.info "----------------------------------------------------" -else log.info "-\033[2m--------------------------------------------------\033[0m-" +if (params.email || params.email_on_fail) { + summary['E-mail Address'] = params.email + summary['E-mail on failure'] = params.email_on_fail + summary['MultiQC maxsize'] = params.max_multiqc_email_size +} +log.info summary.collect { k,v -> "${k.padRight(18)}: $v" }.join("\n") +log.info "-\033[2m--------------------------------------------------\033[0m-" if ('mutect2' in tools && !(params.pon)) log.warn "[nf-core/sarek] Mutect2 was requested, but as no panel of normals were given, results will not be optimal" if (params.sentieon) log.warn "[nf-core/sarek] Sentieon will be used, only works if Sentieon is available where nf-core/sarek is run" @@ -3115,7 +3117,7 @@ process Ascat { --normalbaf ${bafNormal} \ --normallogr ${logrNormal} \ --tumorname ${idSampleTumor} \ - --basedir ${baseDir} \ + --basedir ${projectDir} \ --gcfile ${acLociGC} \ --gender ${gender} \ ${purity_ploidy} @@ -3967,18 +3969,18 @@ workflow.onComplete { // Render the TXT template def engine = new groovy.text.GStringTemplateEngine() - def tf = new File("$baseDir/assets/email_template.txt") + def tf = new File("$projectDir/assets/email_template.txt") def txt_template = engine.createTemplate(tf).make(email_fields) def email_txt = txt_template.toString() // Render the HTML template - def hf = new File("$baseDir/assets/email_template.html") + def hf = new File("$projectDir/assets/email_template.html") def html_template = engine.createTemplate(hf).make(email_fields) def email_html = html_template.toString() // Render the sendmail template - def smail_fields = [ email: email_address, subject: subject, email_txt: email_txt, email_html: email_html, baseDir: "$baseDir", mqcFile: mqc_report, mqcMaxSize: params.max_multiqc_email_size.toBytes() ] - def sf = new File("$baseDir/assets/sendmail_template.txt") + def smail_fields = [ email: email_address, subject: subject, email_txt: email_txt, email_html: email_html, projectDir: "$projectDir", mqcFile: mqc_report, mqcMaxSize: params.max_multiqc_email_size.toBytes() ] + def sf = new File("$projectDir/assets/sendmail_template.txt") def sendmail_template = engine.createTemplate(sf).make(smail_fields) def sendmail_html = 
sendmail_template.toString() diff --git a/nextflow.config b/nextflow.config index 20e0c51558..b1c7019d03 100644 --- a/nextflow.config +++ b/nextflow.config @@ -127,24 +127,19 @@ try { } profiles { - conda { - docker.enabled = false - process.conda = "$baseDir/environment.yml" - singularity.enabled = false - } + conda { process.conda = "$projectDir/environment.yml" } debug { process.beforeScript = 'echo $HOSTNAME' } docker { - docker { - enabled = true - fixOwnership = true - } - singularity.enabled = false + docker.enabled = true + docker.fixOwnership = true } singularity { - docker.enabled = false singularity.autoMounts = true singularity.enabled = true } + podman { + podman.enabled = true + } test { includeConfig 'conf/test.config' } test_annotation { includeConfig 'conf/test_annotation.config' } test_no_gatk_spark { includeConfig 'conf/test_no_gatk_spark.config' } @@ -170,6 +165,13 @@ env { R_ENVIRON_USER = "/.Renviron" } +// Export these variables to prevent local Python/R libraries from conflicting with those in the container +env { + PYTHONNOUSERSITE = 1 + R_PROFILE_USER = "/.Rprofile" + R_ENVIRON_USER = "/.Renviron" +} + // Capture exit codes from upstream processes when piping process.shell = ['/bin/bash', '-euo', 'pipefail'] @@ -196,8 +198,8 @@ manifest { homePage = 'https://github.com/nf-core/sarek' description = 'An open-source analysis pipeline to detect germline or somatic variants from whole genome or targeted sequencing' mainScript = 'main.nf' - nextflowVersion = '>=19.10.0' - version = '3.0dev' + nextflowVersion = '>=20.04.0' + version = '2.7dev' } // Return the minimum between requirements and a maximum limit to ensure that resource requirements don't go over diff --git a/nextflow_schema.json b/nextflow_schema.json index 97801d75b3..7f278d29a9 100644 --- a/nextflow_schema.json +++ b/nextflow_schema.json @@ -18,15 +18,15 @@ "input": { "type": "string", "fa_icon": "fas fa-dna", - "description": "Path to input file.", - "help_text": "Use this to specify the location of your input TSV file. Input TSV file on `mapping`, `prepare_recalibration`, `recalibrate`, `variant_calling` and `Control-FREEC` steps\nMultiple TSV files can be specified with quotes\nWorks also with the path to a directory on `mapping` step with a single germline sample only\nAlternatively, path to VCF input file on `annotate` step\nMultiple VCF files can be specified with quotes." 
+ "description": "Path to input file(s).", + "help_text": "Use this to specify the location of your input TSV file on `mapping`, `prepare_recalibration`, `recalibrate`, `variant_calling` and `Control-FREEC` steps (multiple files can be specified with quotes).\nIt can also be used to specify the path to a directory on `mapping` step with a single germline sample only.\nAlternatively, it can be used to specify the path to VCF input file on `annotate` step (multiple files can be specified with quotes).\n\n> **NB** " }, "step": { "type": "string", "default": "mapping", "fa_icon": "fas fa-play", - "description": "Starting step.", - "help_text": "(only one)", + "description": "The starting step.", + "help_text": "Only one step must be specified.\n> **NB** step can be specified with no concern for case, or the presence of `-` or `_`\n", "enum": [ "mapping", "prepare_recalibration", @@ -39,15 +39,8 @@ "outdir": { "type": "string", "description": "The output directory where the results will be saved.", - "default": "./results", + "default": "./results", "fa_icon": "fas fa-folder-open" - }, - "email": { - "type": "string", - "description": "Email address for completion summary.", - "fa_icon": "fas fa-envelope", - "help_text": "Set this parameter to your e-mail address to get a summary e-mail with details of the run sent to you when the workflow exits. If set in your user config file (`~/.nextflow/config`) then you don't need to specify this on the command line for every run.", - "pattern": "^([a-zA-Z0-9_\\-\\.]+)@([a-zA-Z0-9_\\-\\.]+)\\.([a-zA-Z]{2,5})$" } } }, @@ -61,8 +54,8 @@ "type": "string", "default": "null", "fa_icon": "fas fa-toolbox", - "description": "Specify tools to use for variant calling and/or for annotation.", - "help_text": "Multiple separated with commas\n\n`DNAseq`, `DNAscope` and `TNscope` are only available with `--sentieon`", + "description": "Tools to use for variant calling and/or for annotation.", + "help_text": "Multiple separated with commas.\n\nGermline variant calling can currently only be performed with the following variant callers:\n- FreeBayes, HaplotypeCaller, Manta, mpileup, Strelka, TIDDIT\n\nSomatic variant calling can currently only be performed with the following variant callers:\n- ASCAT, Control-FREEC, FreeBayes, Manta, MSIsensor, Mutect2, Strelka\n\nTumor-only somatic variant calling can currently only be performed with the following variant callers:\n- Control-FREEC, Manta, mpileup, Mutect2, TIDDIT\n\nAnnotation is done using snpEff, VEP, or even both consecutively.\n\n> **NB** As Sarek will use bgzip and tabix to compress and index VCF files annotated, it expects VCF files to be sorted.\n\n\n\n`DNAseq`, `DNAscope` and `TNscope` are only available with `--sentieon`\n\n> **NB** tools can be specified with no concern for case, or the presence of `-` or `_`\n", "enum": [ "null", "ASCAT", @@ -101,13 +94,13 @@ "type": "boolean", "fa_icon": "fas fa-tools", "description": "Enable Sentieon if available.", - "help_text": "Adds the following options for --tools: DNAseq, DNAscope and TNscope" + "help_text": "Sentieon is a commercial solution to process genomics data with high computing efficiency, fast turnaround time, exceptional accuracy, and 100% consistency.\n\n> **NB** Adds the following tools for the `--tools` options: `DNAseq`, `DNAscope` and `TNscope`." 
},
"skip_qc": {
"type": "string",
"fa_icon": "fas fa-forward",
- "description": "Specify which QC tools to skip.",
- "help_text": "Multiple separated with commas\n\n`--skip_qc BaseRecalibrator` does not skip the process, but is actually just not saving the reports",
+ "description": "Disable specified QC and Reporting tools.",
+ "help_text": "Multiple tools can be specified, separated by commas.\n\n> **NB** `--skip_qc BaseRecalibrator` is actually just not saving the reports.\n> **NB** `--skip_qc MarkDuplicates` does not skip `MarkDuplicates` but prevents the collection of duplicate metrics, which slows down performance.",
"enum": [
"null",
"all",
@@ -116,6 +109,7 @@
"BCFtools",
"Documentation",
"FastQC",
+ "MarkDuplicates",
"MultiQC",
"samtools",
"vcftools",
@@ -126,7 +120,8 @@
"target_bed": {
"type": "string",
"fa_icon": "fas fa-crosshairs",
- "description": "Target BED file for whole exome or targeted sequencing."
+ "description": "Target BED file for whole exome or targeted sequencing.",
+ "help_text": "This parameter does _not_ imply that the workflow is running alignment or variant calling only for the supplied targets.\nInstead, we are aligning for the whole genome, and selecting variants only at the very end by intersecting with the provided target file.\nAdding every exon as an interval in case of `WES` can generate >200K processes or jobs, many more forks, and a similar number of directories in the Nextflow work directory.\nFurthermore, primers and/or baits are not 100% specific (certainly not for MHC and KIR, etc.), so quite likely there are going to be reads mapping to multiple locations.\nIf you are certain that the target is unique for your genome (all the reads will certainly map to only one location), and aligning to the whole genome is overkill, it is actually better to change the reference itself.\n\nThe recommended flow for targeted sequencing data is to use the workflow as it is, but also provide a `BED` file containing targets for all steps using the `--target_bed` option.\nThe workflow will pick up these intervals, and activate the `--exome` flag in any tools that allow it to process deeper coverage.\nIt is advised to pad the variant calling regions (exons or target) to some extent before submitting to the workflow."
}
},
"fa_icon": "fas fa-user-cog"
@@ -141,56 +136,56 @@
"trim_fastq": {
"type": "boolean",
"fa_icon": "fas fa-cut",
- "description": "Run Trim Galore",
- "hidden": true
+ "description": "Run Trim Galore.",
+ "hidden": true,
+ "help_text": "Use this to perform adapter trimming with Trim Galore.\ncf https://github.com/FelixKrueger/TrimGalore/blob/master/Docs/Trim_Galore_User_Guide.md"
},
"clip_r1": {
"type": "integer",
"fa_icon": "fas fa-cut",
- "description": "Remove bp from the 5' end of read 1",
- "help_text": "With Trim Galore",
+ "description": "Remove bp from the 5' end of read 1.",
+ "help_text": "This may be useful if the qualities were very poor, or if there is some sort of unwanted bias at the 5' end.\n",
"hidden": true
},
"clip_r2": {
"type": "integer",
- "description": "Remove bp from the 5' end of read 5",
- "help_text": "With Trim Galore",
+ "description": "Remove bp from the 5' end of read 2.",
+ "help_text": "This may be useful if the qualities were very poor, or if there is some sort of unwanted bias at the 5' end.\n",
"fa_icon": "fas fa-cut",
"hidden": true
},
"three_prime_clip_r1": {
"type": "integer",
"fa_icon": "fas fa-cut",
- "description": "Remove bp from the 3' end of read 1 AFTER adapter/quality trimming has been performed",
- "help_text": "With Trim Galore",
+ "description": "Remove bp from the 3' end of read 1 AFTER adapter/quality trimming has been performed.",
+ "help_text": "This may remove some unwanted bias from the 3' end that is not directly related to adapter sequence or basecall quality.\n",
"hidden": true
},
"three_prime_clip_r2": {
"type": "integer",
"fa_icon": "fas fa-cut",
- "description": "Remove bp from the 3' end of read 2 AFTER adapter/quality trimming has been performed",
- "help_text": "With Trim Galore",
+ "description": "Remove bp from the 3' end of read 2 AFTER adapter/quality trimming has been performed.",
+ "help_text": "This may remove some unwanted bias from the 3' end that is not directly related to adapter sequence or basecall quality.\n",
"hidden": true
},
"trim_nextseq": {
"type": "integer",
"fa_icon": "fas fa-cut",
- "description": "Apply the --nextseq=X option, to trim based on quality after removing poly-G tails",
- "help_text": "With Trim Galore",
+ "description": "Apply the --nextseq=X option, to trim based on quality after removing poly-G tails.",
+ "help_text": "This enables the option --nextseq-trim=3'CUTOFF within Cutadapt, which will set a quality cutoff (that is normally given with -q instead), but qualities of G bases are ignored.",
"hidden": true
},
"save_trimmed": {
"type": "boolean",
"fa_icon": "fas fa-save",
"description": "Save trimmed FastQ file intermediates",
- "help_text": "If none specified, FastQs won't be split",
"hidden": true
},
"split_fastq": {
"type": "number",
"fa_icon": "fas fa-cut",
"description": "Specify how many reads should be contained in the split FastQ file",
- "help_text": "If not used, FastQs won't be split",
+ "help_text": "Use the Nextflow splitFastq operator to specify how many reads should be contained in the split FASTQ file.\ncf https://www.nextflow.io/docs/latest/operator.html#splitfastq",
"hidden": true
}
}
@@ -210,8 +205,8 @@
"bwa-mem",
"bwa-mem2"
],
- "description": "Specify which aligner to be used to map reads to reference genome",
- "help_text": "> **WARNING** Current indices for `bwa` in AWS iGenomes are not compatible with `bwa-mem2`.\n> Use `--bwa=false` to have `Sarek` build them automatically.",
+ "description": "Specify aligner to be used to map reads to reference genome.",
+ "help_text": "> **WARNING** Current indices for `bwa` in AWS iGenomes are not compatible with `bwa-mem2`.\n> Use `--bwa=false` to have `Sarek` build them automatically.\n\n> **WARNING** BWA-mem2 is in active development\n> Sarek might not be able to request the right amount of resources for it at the moment\n> We recommend using pre-built indexes",
"hidden": true
},
"markdup_java_options": {
@@ -236,7 +231,8 @@
"skip_markduplicates": {
"type": "boolean",
"fa_icon": "fas fa-forward",
- "description": "Skip GATK MarkDuplicates"
+ "description": "Skip GATK MarkDuplicates",
+ "help_text": "This param will also save the mapped BAMs, to enable restart from step prepare_recalibration"
}
}
},
@@ -276,7 +272,8 @@
"cf_window": {
"type": "number",
"fa_icon": "fas fa-wrench",
- "description": "Overwrite Control-FREEC window size"
+ "description": "Overwrite Control-FREEC window size",
+ "help_text": "It is recommended to use a window size of 0 for exome data"
},
"no_gvcf": {
"type": "boolean",
@@ -293,37 +290,39 @@
"type": "string",
"fa_icon": "fas fa-file",
"description": "Panel-of-normals VCF (bgzipped) for GATK Mutect2 / Sentieon TNscope",
- "help_text": "See https://gatk.broadinstitute.org/hc/en-us/articles/360042479112-CreateSomaticPanelOfNormals-BETA"
+ "help_text": "Without PON, there will be no calls with PASS in the INFO field, only an unfiltered VCF is written.\nIt is recommended to make your own PON, as it depends on sequencer and library preparation.\nFor tests in iGenomes there is a dummy PON file in the Annotation/GermlineResource directory, but it should not be used as a real PON file.\n\nSee https://gatk.broadinstitute.org/hc/en-us/articles/360042479112-CreateSomaticPanelOfNormals-BETA\n> **NB** PON file should be bgzipped."
},
"pon_index": {
"type": "string",
"fa_icon": "fas fa-file",
"description": "Index of PON panel-of-normals VCF",
- "help_text": "If none provided, will be generated automatically from the PON"
+ "help_text": "If none provided, will be generated automatically from the PON bgzipped VCF file."
},
"ignore_soft_clipped_bases": {
"type": "boolean",
"fa_icon": "fas fa-ban",
- "description": "Do not analyze soft clipped bases in the reads for GATK Mutect2"
+ "description": "Do not analyze soft clipped bases in the reads for GATK Mutect2",
+ "help_text": "Uses the --dont-use-soft-clipped-bases parameter with GATK."
},
"umi": {
"type": "boolean",
"fa_icon": "fas fa-tape",
- "description": "If provided, UMIs steps will be run to extract and annotate the reads with UMI and create consensus reads"
+ "description": "If provided, UMI steps will be run to extract and annotate the reads with UMIs and create consensus reads",
+ "help_text": "This part of the pipeline uses fgbio to convert the FASTQ files into an unmapped BAM, where reads are tagged with the UMIs extracted from the FASTQ sequences.\nIn order to allow the correct tagging, the UMI sequence must be contained in the read sequence itself, and not in the FASTQ filename.\nFollowing this step, the unmapped BAM is aligned and reads are then grouped based on mapping position and UMI tag.\nFinally, reads in the same groups are collapsed to create a consensus read.\nTo create consensus, we have chosen to use the adjacency method\n\ncf https://github.com/fulcrumgenomics/fgbio\ncf https://cgatoxford.wordpress.com/2015/08/14/unique-molecular-identifiers-the-problem-the-solution-and-the-proof/\n\n> **NB** In order for the correct tagging to be performed, a read structure needs to be specified with --read_structure1 and --read_structure2"
},
"read_structure1": {
"type": "string",
"default": "null",
"fa_icon": "fas fa-clipboard-list",
"description": "When processing UMIs, a read structure should always be provided for each of the fastq files.",
- "help_text": "If the read does not contain any UMI, the structure will be +T (i.e. only template of any length).\nhttps://github.com/fulcrumgenomics/fgbio/wiki/Read-Structures"
+ "help_text": "If the read does not contain any UMI, the structure will be +T (i.e. only template of any length).\nThe read structure follows a format adopted by different tools and described in the fgbio documentation\ncf https://github.com/fulcrumgenomics/fgbio/wiki/Read-Structures"
},
"read_structure2": {
"type": "string",
"default": "null",
"fa_icon": "fas fa-clipboard-list",
"description": "When processing UMIs, a read structure should always be provided for each of the fastq files.",
- "help_text": "If the read does not contain any UMI, the structure will be +T (i.e. only template of any length).\nhttps://github.com/fulcrumgenomics/fgbio/wiki/Read-Structures"
+ "help_text": "If the read does not contain any UMI, the structure will be +T (i.e. only template of any length).\nThe read structure follows a format adopted by different tools and described in the fgbio documentation\ncf https://github.com/fulcrumgenomics/fgbio/wiki/Read-Structures"
}
}
},
@@ -354,49 +353,50 @@
"type": "boolean",
"fa_icon": "fas fa-database",
"description": "Enable the use of cache for annotation",
- "help_text": "To be used with `--snpeff_cache` and/or `--vep_cache`",
+ "help_text": "This also disables the usage of the Sarek snpEff and VEP specific containers for annotation.\n\nTo be used with `--snpeff_cache` and/or `--vep_cache`",
"hidden": true
},
"cadd_cache": {
"type": "string",
"default": "null",
"fa_icon": "fas fa-database",
- "description": "Enable CADD cache",
+ "description": "Enable CADD cache.",
"hidden": true
},
"cadd_indels": {
"type": "string",
"default": "null",
"fa_icon": "fas fa-file",
- "description": "Path to CADD InDels file",
+ "description": "Path to CADD InDels file.",
"hidden": true
},
"cadd_indels_tbi": {
"type": "string",
"default": "null",
"fa_icon": "fas fa-file",
- "description": "Path to CADD InDels index",
+ "description": "Path to CADD InDels index.",
"hidden": true
},
"cadd_wg_snvs": {
"type": "string",
"default": "null",
"fa_icon": "fas fa-file",
- "description": "Path to CADD SNVs file",
+ "description": "Path to CADD SNVs file.",
"hidden": true
},
"cadd_wg_snvs_tbi": {
"type": "string",
"default": "null",
"fa_icon": "fas fa-file",
- "description": "Path to CADD SNVs index",
+ "description": "Path to CADD SNVs index.",
"hidden": true
},
"genesplicer": {
"type": "boolean",
"fa_icon": "fas fa-gavel",
- "description": "Enable genesplicer within VEP",
- "hidden": true
+ "description": "Enable the use of the VEP GeneSplicer plugin.",
+ "hidden": true,
+ "help_text": "```bash\nnextflow run nf-core/sarek --step annotate --tools VEP --sample --genesplicer\n```"
},
"snpeff_cache": {
"type": "string",
@@ -420,45 +420,56 @@
"title": "Reference genome options",
"type": "object",
"fa_icon": "fas fa-dna",
- "description": "Options for the reference genome indices used to align reads.",
+ "description": "Options for the reference genome files.",
"properties": {
"genome": {
"type": "string",
"description": "Name of iGenomes reference.",
"fa_icon": "fas fa-book",
- "help_text": "If using a reference genome configured in the pipeline using iGenomes, use this parameter to give the ID for the reference. This is then used to build the full paths for all required reference genome files e.g. `--genome GRCh38`.\n\nSee the [nf-core website docs](https://nf-co.re/usage/reference_genomes) for more details."
+ "help_text": "If using a reference genome configured in the pipeline using iGenomes, use this parameter to give the ID for the reference. This is then used to build the full paths for all required reference genome files e.g. `--genome GRCh38`.\n\nSee the [nf-core website docs](https://nf-co.re/usage/reference_genomes) for more details.\n"
},
"ac_loci": {
"type": "string",
- "fa_icon": "fas fa-file"
+ "fa_icon": "fas fa-file",
+ "description": "Path to ASCAT loci file."
},
"ac_loci_gc": {
"type": "string",
- "fa_icon": "fas fa-file"
+ "fa_icon": "fas fa-file",
+ "description": "Path to ASCAT GC correction file."
},
"bwa": {
"type": "string",
- "fa_icon": "fas fa-copy"
+ "fa_icon": "fas fa-copy",
+ "description": "Path to BWA mem indices.",
+ "help_text": "> **NB** If none provided, will be generated automatically from the FASTA reference."
},
"chr_dir": {
"type": "string",
- "fa_icon": "fas fa-folder-open"
+ "fa_icon": "fas fa-folder-open",
+ "description": "Path to chromosomes folder."
},
"chr_length": {
"type": "string",
- "fa_icon": "fas fa-file"
+ "fa_icon": "fas fa-file",
+ "description": "Path to chromosomes length file."
},
"dbsnp": {
"type": "string",
- "fa_icon": "fas fa-file"
+ "fa_icon": "fas fa-file",
+ "description": "Path to dbsnp file."
},
"dbsnp_index": {
"type": "string",
- "fa_icon": "fas fa-file"
+ "fa_icon": "fas fa-file",
+ "description": "Path to dbsnp index.",
+ "help_text": "> **NB** If none provided, will be generated automatically from the dbsnp file, if provided"
},
"dict": {
"type": "string",
- "fa_icon": "fas fa-file"
+ "fa_icon": "fas fa-file",
+ "description": "Path to FASTA dictionary file.",
+ "help_text": "> **NB** If none provided, will be generated automatically from the FASTA reference."
},
"fasta": {
"type": "string",
@@ -468,43 +479,59 @@
},
"fasta_fai": {
"type": "string",
- "fa_icon": "fas fa-file"
+ "fa_icon": "fas fa-file",
+ "help_text": "> **NB** If none provided, will be generated automatically from the FASTA reference",
+ "description": "Path to FASTA reference index."
},
"germline_resource": {
"type": "string",
- "fa_icon": "fas fa-file"
+ "fa_icon": "fas fa-file",
+ "description": "Path to GATK Mutect2 Germline Resource File",
+ "help_text": "The germline resource VCF file (bgzipped and tabixed) needed by GATK4 Mutect2 is a collection of calls that are likely present in the sample, with allele frequencies.\nThe AF info field must be present.\nYou can find a smaller, stripped gnomAD VCF file (most of the annotation is removed and only calls signed by PASS are stored) in the AWS iGenomes Annotation/GermlineResource folder."
},
"germline_resource_index": {
"type": "string",
- "fa_icon": "fas fa-file"
+ "fa_icon": "fas fa-file",
+ "description": "Path to GATK Mutect2 Germline Resource Index",
+ "help_text": "> **NB** If none provided, will be generated automatically from the Germline Resource file, if provided"
},
"intervals": {
"type": "string",
- "fa_icon": "fas fa-file-alt"
+ "fa_icon": "fas fa-file-alt",
+ "help_text": "To speed up some preprocessing and variant calling processes, the reference is chopped into smaller pieces.\nThe intervals are chromosomes cut at their centromeres (so each chromosome arm is processed separately), as well as additional unassigned contigs.\nWe are ignoring the `hs37d5` contig that contains concatenated decoy sequences.\nParts of preprocessing and variant calling are done by these intervals, and the different resulting files are then merged.\nThis can parallelize processes, and push down wall clock time significantly.\n\nThe calling intervals can be defined using a .list or a BED file.\nA .list file contains one interval per line in the format `chromosome:start-end` (1-based coordinates).\nA BED file must be a tab-separated text file with one interval per line.\nThere must be at least three columns: chromosome, start, and end (0-based coordinates).\nAdditionally, the score column of the BED file can be used to provide an estimate of how many seconds it will take to call variants on that interval.\nThe fourth column remains unused.\n\n```\n|chr1|10000|207666|NA|47.3|\n```\nThis indicates that variant calling on the interval chr1:10001-207666 takes approximately 47.3 seconds.\n\nThe runtime estimate is used in two different ways.\nFirst, when there are multiple consecutive intervals in the file that take little time to compute, they are processed as a single job, thus reducing the number of processes that need to be spawned.\nSecond, the jobs with largest processing time are started first, which reduces wall-clock time.\nIf no runtime is given, a time of 1000 nucleotides per second is assumed.\nActual figures vary from 2 nucleotides/second to 30000 nucleotides/second.\nIf you prefer, you can specify the full path to your intervals file when you run the pipeline.\n\n> **NB** If none provided, will be generated automatically from the FASTA reference\n> **NB** Use --no_intervals to disable automatic generation",
+ "description": "Path to intervals file"
},
"known_indels": {
"type": "string",
- "fa_icon": "fas fa-copy"
+ "fa_icon": "fas fa-copy",
+ "description": "Path to known indels file"
},
"known_indels_index": {
"type": "string",
- "fa_icon": "fas fa-copy"
+ "fa_icon": "fas fa-copy",
+ "description": "Path to known indels file index",
+ "help_text": "> **NB** If none provided, will be generated automatically from the known indels file, if provided"
},
"mappability": {
"type": "string",
- "fa_icon": "fas fa-file"
+ "fa_icon": "fas fa-file",
+ "description": "Path to Control-FREEC mappability file"
},
"snpeff_db": {
"type": "string",
- "fa_icon": "fas fa-database"
+ "fa_icon": "fas fa-database",
+ "description": "snpEff DB version"
},
"species": {
"type": "string",
- "fa_icon": "fas fa-microscope"
+ "fa_icon": "fas fa-microscope",
+ "description": "snpEff species",
+ "help_text": "If you use AWS iGenomes or a local resource with genomes.conf, this has already been set for you appropriately."
},
"vep_cache_version": {
"type": "string",
- "fa_icon": "fas fa-tag"
+ "fa_icon": "fas fa-tag",
+ "description": "VEP cache version"
},
"save_reference": {
"type": "boolean",
@@ -514,7 +541,7 @@
"igenomes_base": {
"type": "string",
"description": "Directory / URL base for iGenomes references.",
- "default": "s3://ngi-igenomes/igenomes/",
+ "default": "s3://ngi-igenomes/igenomes/",
"fa_icon": "fas fa-cloud-download-alt",
"hidden": true
},
@@ -531,9 +558,10 @@
"description": "Do not load the iGenomes reference config.",
"fa_icon": "fas fa-ban",
"hidden": true,
- "help_text": "Do not load `igenomes.config` when running the pipeline. You may choose this option if you observe clashes between custom parameters and those supplied in `igenomes.config`."
+ "help_text": "Do not load `igenomes.config` when running the pipeline.\nYou may choose this option if you observe clashes between custom parameters and those supplied in `igenomes.config`.\nThis option will load the `genomes.config` file instead.\n\n> **NB** You can then specify a custom genome and provide at least a FASTA genome file."
}
- }
+ },
+ "help_text": "The pipeline config files come bundled with paths to the Illumina iGenomes reference index files.\nThe configuration is set up to use the AWS-iGenomes resource.\ncf https://ewels.github.io/AWS-iGenomes/\n"
},
"generic_options": {
"title": "Generic options",
"type": "object",
@@ -546,7 +574,8 @@
"type": "boolean",
"description": "Display help text.",
"hidden": true,
- "fa_icon": "fas fa-question-circle"
+ "fa_icon": "fas fa-question-circle",
+ "help_text": "You're reading it."
},
"publish_dir_mode": {
"type": "string",
@@ -561,7 +590,7 @@
"link",
"copy",
"copyNoFollow",
- "mov"
+ "move"
]
},
"name": {
@@ -571,6 +600,13 @@
"hidden": true,
"help_text": "A custom name for the pipeline run. Unlike the core nextflow `-name` option with one hyphen this parameter can be reused multiple times, for example if using `-resume`.
Passed through to steps such as MultiQC and used for things like report filenames and titles." }, + "email": { + "type": "string", + "description": "Email address for completion summary.", + "fa_icon": "fas fa-envelope", + "help_text": "Set this parameter to your e-mail address to get a summary e-mail with details of the run sent to you when the workflow exits. If set in your user config file (`~/.nextflow/config`) then you don't need to specify this on the command line for every run.", + "pattern": "^([a-zA-Z0-9_\\-\\.]+)@([a-zA-Z0-9_\\-\\.]+)\\.([a-zA-Z]{2,5})$" + }, "email_on_fail": { "type": "string", "description": "Email address for completion summary, only when pipeline fails.", @@ -603,14 +639,14 @@ }, "multiqc_config": { "type": "string", - "description": "Custom config file to supply to MultiQC.", + "description": "Path to MultiQC custom config file.", "fa_icon": "fas fa-cog", "hidden": true }, "tracedir": { "type": "string", "description": "Directory to keep pipeline Nextflow logs and reports.", - "default": "${params.outdir}/pipeline_info", + "default": "${params.outdir}/pipeline_info", "fa_icon": "fas fa-cogs", "hidden": true }, @@ -618,7 +654,8 @@ "type": "string", "default": "null", "fa_icon": "fas fa-university", - "description": "Name of sequencing center to be displayed in BAM file" + "description": "Name of sequencing center to be displayed in BAM file", + "help_text": "It will be in the CN field" } } }, @@ -632,12 +669,15 @@ "cpus": { "type": "integer", "default": 8, - "fa_icon": "fas fa-microchip" + "fa_icon": "fas fa-microchip", + "help_text": "Should be an integer e.g. `--cpus 7`" }, "single_cpu_mem": { "type": "string", "default": "7 GB", - "fa_icon": "fas fa-sd-card" + "fa_icon": "fas fa-sd-card", + "description": "Use to set memory for a single CPU.", + "help_text": "Should be a string in the format integer-unit eg. `--single_cpu_mem '8.GB'`" }, "max_cpus": { "type": "integer", @@ -683,7 +723,7 @@ "custom_config_base": { "type": "string", "description": "Base directory for Institutional configs.", - "default": "https://raw.githubusercontent.com/nf-core/configs/master", + "default": "https://raw.githubusercontent.com/nf-core/configs/master", "hidden": true, "help_text": "If you're running offline, nextflow will not be able to fetch the institutional config files from the internet. If you don't need them, then this is not a problem. If you do need them, you should download the files from the repo and tell nextflow where to find them with the `custom_config_base` option. For example:\n\n```bash\n## Download and unzip the config files\ncd /path/to/my/configs\nwget https://github.com/nf-core/configs/archive/master.zip\nunzip master.zip\n\n## Run the pipeline\ncd /path/to/my/data\nnextflow run /path/to/pipeline/ --custom_config_base /path/to/my/configs/configs-master/\n```\n\n> Note that the nf-core/tools helper package has a `download` command to download all required pipeline files + singularity containers + institutional configs in one go for you, to make this process easier.", "fa_icon": "fas fa-users-cog" @@ -747,4 +787,4 @@ "$ref": "#/definitions/institutional_config_options" } ] -} +} \ No newline at end of file
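
The `intervals` help text added above is dense, so here is a minimal sketch of a calling-intervals BED file and how it might be supplied. The file names `targets.bed` and `samples.tsv` are illustrative assumptions; only `--intervals` and `--no_intervals` come from the schema itself:

```bash
# Build a tab-separated intervals file: chromosome, 0-based start, end,
# an unused fourth column, and an optional runtime estimate in seconds
# (used to pack fast intervals into a single job).
printf 'chr1\t10000\t207666\tNA\t47.3\n'  > targets.bed
printf 'chr2\t10000\t500000\tNA\t12.0\n' >> targets.bed

# Pass the file explicitly instead of relying on automatic generation
# from the FASTA reference (or use --no_intervals to disable intervals).
nextflow run nf-core/sarek --input samples.tsv --intervals targets.bed
```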
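Similarly, a hedged sketch of how the Mutect2 germline-resource parameters described in the schema might be used. The gnomAD file names are placeholders, and `--tools mutect2` assumes Mutect2 is among the pipeline's selectable tools; per the help text, the index flag can be omitted and the index will be generated:

```bash
# Hypothetical Mutect2 run with an AF-annotated, bgzipped and tabixed
# germline resource; the index is auto-generated when not provided.
nextflow run nf-core/sarek -profile docker \
    --input samples.tsv \
    --tools mutect2 \
    --germline_resource af-only-gnomad.vcf.gz \
    --germline_resource_index af-only-gnomad.vcf.gz.tbi
```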
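After hand-editing `nextflow_schema.json` as in this diff, it is worth checking that the file is still well-formed. A quick sketch, assuming the `nf-core schema lint` subcommand from nf-core/tools >= 1.10 is available:

```bash
# Syntax check using only the Python standard library
python -m json.tool nextflow_schema.json > /dev/null && echo "valid JSON"

# Lint the schema against the nf-core conventions
nf-core schema lint nextflow_schema.json
```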