Merge pull request #664 from maxulysse/dev_3.0_PR
Update comments merged on the 3.0 PR back to dev
maxulysse authored Jul 20, 2022
2 parents cff880f + fc09504 commit 467b2e6
Showing 4 changed files with 116 additions and 35 deletions.
12 changes: 11 additions & 1 deletion CHANGELOG.md
@@ -5,7 +5,17 @@ All notable changes to this project will be documented in this file.
The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0.html).

## [dev](https://github.com/nf-core/sarek/tree/dev)
## [2.7.2](https://github.com/nf-core/sarek/releases/tag/2.7.2) - Áhkká

Áhkká is one of the massifs just outside of the Sarek National Park.

### Fixed

- [#566](https://github.com/nf-core/sarek/pull/566) - Fix caching bug affecting a variable number of `MapReads` jobs due to non-deterministic state of `statusMap` during caching evaluation

## [2.7.1](https://github.com/nf-core/sarek/releases/tag/2.7.1) - Pårtejekna

Pårtejekna is one of glaciers of the Pårte Massif.

### Added

4 changes: 2 additions & 2 deletions README.md
@@ -85,7 +85,6 @@ Friederike Hanssen and Gisela Gabernet at [QBiC](https://www.qbic.uni-tuebingen.

Main authors:

- [Gisela Gabernet](https://github.com/ggabernet)
- [Maxime Garcia](https://github.com/maxulysse)
- [Friederike Hanssen](https://github.com/FriederikeHanssen)
- [Szilveszter Juhos](https://github.com/szilvajuhos)
@@ -99,6 +98,7 @@ We thank the following people for their extensive assistance in the development
- [Chela James](https://github.com/chelauk)
- [David Mas-Ponte](https://github.com/davidmasp)
- [Francesco L](https://github.com/nibscles)
- [Gisela Gabernet](https://github.com/ggabernet)
- [Harshil Patel](https://github.com/drpatelh)
- [James A. Fellows Yates](https://github.com/jfy133)
- [Jesper Eisfeldt](https://github.com/J35P312)
@@ -108,7 +108,7 @@ We thank the following people for their extensive assistance in the development
- [Lucia Conde](https://github.com/lconde-ucl)
- [Malin Larsson](https://github.com/malinlarsson)
- [Marcel Martin](https://github.com/marcelm)
- [Nick Smith](https://github,com/nickhsmith)
- [Nick Smith](https://github.com/nickhsmith)
- [Nilesh Tawari](https://github.com/nilesh-tawari)
- [Olga Botvinnik](https://github.com/olgabot)
- [Oskar Wacker](https://github.com/WackerO)
133 changes: 102 additions & 31 deletions docs/usage.md
@@ -680,36 +680,6 @@ Recent updates to Samtools have been introduced, which can speed-up performance
The current workflow does not handle duplex UMIs (i.e. where opposite strands of a duplex molecule have been tagged with a different UMI), and best practices have been proposed to process this type of data.
Both changes will be implemented in a future release.
## How to run sarek when no(t all) reference files are in igenomes
For common genomes, such as GRCh38 and GRCh37, the pipeline is shipped with (almost) all necessary reference files. However, sometimes it is necessary to use custom references for some or all files:
### No igenomes reference files are used
If none of your required genome files are in igenomes, `--igenomes_ignore` must be set to ignore any igenomes input, and `--genome null` must be specified. The `fasta` file is the only required input file and must be provided to run the pipeline. All other possible reference files can be provided in addition. For details, see the parameter documentation.
Minimal example for custom genomes:
```
nextflow run nf-core/sarek --genome null --igenomes_ignore --fasta <custom.fasta>
```
### Overwrite specific reference files
If you don't want to use some of the provided reference files, you can either overwrite them by providing a new file or set the respective file parameter to `false` if the file should be ignored:
Example for using a custom known indels file:
```
nextflow run nf-core/sarek --known_indels <my_known_indels.vcf.gz> --genome GRCh38.GATK
```
Example for not using known indels, but all other provided reference files:
```
nextflow run nf-core/sarek --known_indels false --genome GRCh38.GATK
```
### Where do the used reference genomes originate from
_under construction - help needed_
@@ -744,6 +714,36 @@ GATK.GRCh38:
| vep_genome | | 'GRCh38' | |
| chr_dir | | "${params.igenomes_base}/Homo_sapiens/GATK/GRCh38/Sequence/Chromosomes" | |
## How to run sarek when no(t all) reference files are in igenomes
For common genomes, such as GRCh38 and GRCh37, the pipeline is shipped with (almost) all necessary reference files. However, sometimes it is necessary to use custom references for some or all files:
### No igenomes reference files are used
If none of your required genome files are in igenomes, `--igenomes_ignore` must be set to ignore any igenomes input, and `--genome null` must be specified. The `fasta` file is the only required input file and must be provided to run the pipeline. All other possible reference files can be provided in addition. For details, see the parameter documentation.
Minimal example for custom genomes:
```
nextflow run nf-core/sarek --genome null --igenomes_ignore --fasta <custom.fasta>
```
### Overwrite specific reference files
If you don't want to use some of the provided reference files, you can either overwrite them by providing a new file or set the respective file parameter to `false` if the file should be ignored:
Example for using a custom known indels file:
```
nextflow run nf-core/sarek --known_indels <my_known_indels.vcf.gz> --genome GRCh38.GATK
```
Example for not using known indels, but all other provided reference files:
```
nextflow run nf-core/sarek --known_indels false --genome GRCh38.GATK
```
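The same overrides can also be kept in a small custom configuration file and passed to Nextflow with `-c`. Below is a minimal sketch; the file name and the indels path are placeholders, not pipeline defaults:

```
// custom_refs.config -- illustrative only; adjust values to your setup
params {
    genome       = 'GRCh38.GATK'
    // Point to your own known-indels file, or set to false to skip it
    known_indels = '/path/to/my_known_indels.vcf.gz'
}
```

It would then be used as `nextflow run nf-core/sarek -c custom_refs.config`, with the remaining required parameters (such as `--input`) supplied as usual.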
## How to customise SnpEff and VEP annotation
_under construction - help needed_
@@ -784,7 +784,7 @@ Based on [nfcore/base:1.12.1](https://hub.docker.com/r/nfcore/base/tags), it con
### Using downloaded cache
Both `snpEff` and `VEP` can use a downloaded cache if no pre-built container is available.
The cache needs to made available on the machine where Sarek is run.
The cache needs to be made available on the machine where Sarek is run.
You need to specify the cache directory using `--snpeff_cache` and `--vep_cache` on the command line or within configuration files.
Example:
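As an illustration of how these flags fit into a command line (a sketch only: the paths are placeholders, and the `--tools` values shown are assumptions to be checked against the parameter documentation):

```
nextflow run nf-core/sarek --input <samplesheet.csv> --genome GRCh38.GATK --tools snpeff,vep --snpeff_cache </path/to/snpEff/cache> --vep_cache </path/to/VEP/cache>
```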
@@ -844,6 +844,77 @@ nextflow run download_cache.nf --cadd_cache </path/to/CADD/cache> --cadd_version
Resource requests are difficult to generalize and are often dependent on input data size. Currently, the number of cpus and memory requested by default were adapted from tests on 5 ICGC paired whole-genome sequencing samples with approximately 40X and 80X depth.
For targeted data analysis, these defaults are far more than needed. In this case, resources for each process can be limited by either setting `--max_memory` and `--max_cpus` or tailoring the request by process name as described [here](#resource-requests). If you are using sarek for a certain data type regularly, and would like to make these requests available to others on your system, an institution-specific, pipeline-specific config file can be added [here](https://github.com/nf-core/configs/tree/master/conf/pipeline/sarek).
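For instance, a small custom config passed with `-c` could cap the defaults for a targeted run. This is only a sketch: the values are illustrative, and the process selector is an assumption that must be matched to the actual process names in your run:

```
// targeted_resources.config -- illustrative sketch, not a recommendation
params {
    max_cpus   = 8
    max_memory = '32.GB'
}

process {
    // Selector name is an example; adjust it to the process you want to limit
    withName: 'BWAMEM1_MEM' {
        cpus   = 8
        memory = 16.GB
    }
}
```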
## Spark related issues
If you have problems running processes that make use of Spark, such as `MarkDuplicates`, this might be due to a limit on the number of simultaneously open files on your system.
You can check your current limit by typing the following:
```bash
ulimit -n
```
The default limit is usually 1024, which is quite low for running Spark jobs.
To increase the limit permanently, you can:
Edit the file `/etc/security/limits.conf` and add the lines:
```bash
* soft nofile 65535
* hard nofile 65535
```
Edit the file `/etc/sysctl.conf` and add the line:
```bash
fs.file-max = 65535
```
Edit the file `/etc/sysconfig/docker` and add the new limits to OPTIONS like this:
```bash
OPTIONS="--default-ulimit nofile=65535:65535"
```
Restart your session.
Note that the way to increase the open file limit in your system may be slightly different or require additional steps.
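If you cannot change the system-wide limits, raising the limit for the current shell session before launching the pipeline is often sufficient; a minimal sketch (the command-line options are placeholders):

```bash
# Raise the open-file limit for this shell session only
# (the value cannot exceed the hard limit configured for your user).
ulimit -n 65535

# Launch the pipeline from the same session.
nextflow run nf-core/sarek -profile docker --input <samplesheet.csv> --genome GRCh38.GATK
```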
### Cannot delete work folder when using docker + Spark
Currently, when running Spark-based tools in combination with Docker, it is required to set `docker.userEmulation = false`. This can unfortunately cause permission issues when `work/` is written with root permissions. If this happens, you might need to configure Docker to run without `userEmulation` (see [here](https://github.com/Midnighter/nf-core-adr/blob/main/docs/adr/0008-refrain-from-using-docker-useremulation-in-nextflow.md)).
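A minimal custom config expressing the setting mentioned above, to be passed with `-c` (a sketch; the file name is arbitrary):

```
// spark_docker.config -- disables Docker user emulation for Spark-based tools
docker {
    userEmulation = false
}
```

Files written to `work/` may then be owned by root, so if cleanup fails, removing the folder with elevated permissions is a common workaround.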
## How to handle UMIs
Sarek can process UMI reads using [fgbio](http://fulcrumgenomics.github.io/fgbio/tools/latest/) tools.
In order to use reads containing UMI tags as your initial input, you need to include `--umi_read_structure <UMI_string>` in your parameters.
This enables pre-processing of the reads and calling of UMI consensus reads, which are then used to continue the workflow from the mapping step. Depending on the experimental setup, duplicate marking and base quality recalibration can be skipped after UMI processing with `--skip_tools`.
### UMI Read Structure
This parameter is a string that follows a [convention](https://github.com/fulcrumgenomics/fgbio/wiki/Read-Structures) to describe the structure of the UMI.
If your reads contain a UMI on only one end, the string should represent a single structure (e.g. "2M11S+T"); if your reads contain a UMI on both ends, the string will contain two structures separated by a blank space (e.g. "2M11S+T 2M11S+T").
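For example, a run on reads carrying a UMI on both ends might be launched like this (a sketch; the samplesheet and genome are placeholders, and any additional options are left out):

```
nextflow run nf-core/sarek --input <samplesheet.csv> --genome GRCh38.GATK --umi_read_structure "2M11S+T 2M11S+T"
```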
### Limitations and future updates
Recent updates to Samtools have been introduced, which can speed up the performance of the fgbio tools used in this workflow.
The current workflow does not handle duplex UMIs (i.e. where opposite strands of a duplex molecule have been tagged with a different UMI), and best practices have been proposed to process this type of data.
Both changes will be implemented in a future release.
## MultiQC related issues
### Plots for SnpEff are missing
When plots are missing, it is possible that the fasta file and the custom SnpEff database do not match (see the [SnpEff FAQ](https://pcingola.github.io/SnpEff/se_faq/#error_chromosome_not_found-details)).
SnpEff completes without throwing an error, so Nextflow finishes successfully. An indication of the problem is the presence of lines like these in the `.command` files:
```
ERRORS: Some errors were detected
Error type Number of errors
ERROR_CHROMOSOME_NOT_FOUND 17522411
```
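To check whether a run was affected, the task work directories can be searched for that error string; a sketch, assuming the default `work/` location:

```bash
# List task files whose logs mention the missing-chromosome error
grep -rl "ERROR_CHROMOSOME_NOT_FOUND" work/
```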
## How to set sarek up to use sentieon
Sarek 3.0 does not currently support Sentieon. Support is planned for the upcoming 3.1 release. In the meantime, please use the previous release, 2.7.2.
2 changes: 1 addition & 1 deletion workflows/sarek.nf
@@ -1234,7 +1234,7 @@ def extract_csv(csv_file) {
System.exit(1)
}
} else {
log.warn "Missing or unknown field in csv file header. Please check your samplesheet"
log.error "Missing or unknown field in csv file header. Please check your samplesheet"
System.exit(1)
}
}
