
[BUG] Sarek 2.7.1 doesn't reuse all cached MapReads jobs #555

Closed
4 tasks done
BrunoGrandePhD opened this issue May 19, 2022 · 4 comments · Fixed by #566
Labels: bug

Comments


BrunoGrandePhD commented May 19, 2022

Check Documentation

I have checked the following places for this error, but I didn't find anything related to caching.

Description of the bug

When I relaunch a Sarek workflow, a seemingly random subset of MapReads jobs (ranging from 5% to 50%) is re-run despite cached results being available. This causes all downstream jobs for the affected samples to be re-run as well. For large datasets, especially those processed in the cloud, this bug incurs a significant cost whenever the workflow doesn't complete successfully on the first try, particularly when most of the workflow has finished before an error arises.

I tried using previous versions of Sarek (specifically, 2.6.1, 2.5.2, and 2.5), and they all seem to run into this bug.


Steps to reproduce

I adapted the input file from the test profile by artificially increasing the number of rows to 100. This makes it easier to spot this issue, which doesn't always occur with only a few samples. It's also easier to notice the issue if you monitor the workflow in Nextflow Tower.

  1. Run the following Nextflow command where params.yml contains the parameters copied below. It should complete successfully. On a local machine, this command took less than 2 minutes.
    nextflow run 'https://github.com/nf-core/sarek' -r '2.7.1' -dsl1 -profile 'test,docker' -params-file 'params.yml'
  2. Re-run the same Nextflow command with -resume. It should also complete successfully, but a subset of MapReads jobs should have been re-run instead of being cached.

Parameters

The parameters aside from input are included to skip steps that aren't relevant to the reproducible example.

input: https://gist.githubusercontent.com/BrunoGrandePhD/6869bf506e6acf7b920a22b666f4e443/raw/ff7ea0757ebbd324cd125b1d48f54e6ec9f969c9/sarek-test.tsv
skip_qc: "bamqc,baserecalibrator,bcftools,documentation,fastqc,markduplicates,multiqc,samtools,sentieon,vcftools,versions"
no_intervals: true
skip_markduplicates: true
known_indels: null
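
For anyone unfamiliar with the format: if I recall the layout correctly, each row in that TSV uses Sarek's tab-separated FASTQ input columns (subject, gender, status, sample, lane, fastq1, fastq2), where status is 0 for normal and 1 for tumour samples. An illustrative row (made-up values, not copied from the gist) looks like this:

S000    XX    0    S000-1234N    M1    /data/S000-1234N_M1_R1.fastq.gz    /data/S000-1234N_M1_R2.fastq.gz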

Expected behaviour

I expect that workflow re-runs will leverage cached results unless there's a reason to believe that the job needs to be updated. In the situation I describe above, nothing changes between the two runs, so there shouldn't be a reason for jobs to be re-run, especially not a seemingly random subset of jobs.

Log files

Have you provided the following extra information/files:

  • The command used to run the pipeline (see "Steps to reproduce" above)
  • The .nextflow.log files

The Nextflow logs below correspond to my walkthrough of the steps I listed in "Steps to reproduce".

firstpass-nextflow.log
secondpass-nextflow.log

System

  • Hardware: AWS EC2 instances
  • Executor: Tried on both local EC2 instances and on AWS Batch
  • OS: Amazon Linux
  • Version: 2 (Kernel: Linux 4.14.232-176.381.amzn2.x86_64)

Nextflow Installation

  • Version: 22.04.2

Container engine

  • Engine: Docker
  • Version: 20.10.13
  • Image tag: nfcore/sarek:2.7.1

Additional context

BrunoGrandePhD added the bug label on May 19, 2022
BrunoGrandePhD pushed a commit that referenced this issue May 20, 2022
BrunoGrandePhD (Author) commented

To isolate the root cause of this issue, I created a stripped-down version of Sarek 2.7.1 with only two processes, FastQCFQ and MapReads. I picked these two because, in my experience, the caching issue only affects MapReads, so keeping FastQCFQ intact gives me an unaffected process to compare against.

I was able to reproduce the issue with the stripped-down version (see commands and output below). Note how the second run with -resume only uses 35 cached jobs instead of all 100. I'm also using a slightly updated params.yml file (included at the bottom of this comment) to skip reference file generation, since I removed those processes from main.nf.

$ nextflow run main.nf -dsl1 -profile 'test,docker' -params-file 'params.yml' 
[...]
[d6/fea675] process > FastQCFQ (S096-S096-9876T_M1) [100%] 100 of 100 ✔
[f8/84d5f3] process > MapReads (S098-S098-9876T_M3) [100%] 100 of 100 ✔

$ nextflow run main.nf -dsl1 -profile 'test,docker' -params-file 'params.yml' -resume 
[...]
[7f/b623d2] process > FastQCFQ (S094-S094-1234N_M6) [100%] 100 of 100, cached: 100 ✔
[0d/e88ddd] process > MapReads (S080-S080-1234N_M1) [100%] 100 of 100, cached: 35 ✔

After some fiddling with the stripped-down version, I think I've isolated the root cause to the reference to statusMap in the MapReads script block (sketched after the run outputs below). If I comment out the relevant line, caching works as expected. In the first run below, all of the MapReads jobs are re-run because the change to the script block invalidates the previously cached results. In the second run, all 100 jobs are cached as expected.

$ nextflow run main.nf -dsl1 -profile 'test,docker' -params-file 'params.yml' -resume 
[...]
[32/f54766] process > FastQCFQ (S089-S089-9876T_M4) [100%] 100 of 100, cached: 100 ✔
[1b/11dfc5] process > MapReads (S097-S097-9876T_M2) [100%] 100 of 100 ✔

$ nextflow run main.nf -dsl1 -profile 'test,docker' -params-file 'params.yml' -resume 
[...]
[d6/fea675] process > FastQCFQ (S096-S096-9876T_M1) [100%] 100 of 100, cached: 100 ✔
[4d/5e1024] process > MapReads (S096-S096-9876T_M1) [100%] 100 of 100, cached: 100 ✔
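
For reference, here is a minimal sketch of what the offending reference looks like (paraphrased from memory, not the verbatim Sarek 2.7.1 source; the bwa command and reference file are placeholders). Because statusMap is an ordinary Groovy map referenced from the script block rather than a declared channel input, Nextflow folds its current contents into the task's cache hash:

process MapReads {
    input:
        set idPatient, idSample, idRun, file(inputFile1), file(inputFile2) from inputPairReads

    output:
        set idPatient, idSample, idRun, file("${idSample}_${idRun}.bam") into bamMapped

    script:
        // statusMap is populated asynchronously by extractInfos(); whatever it happens
        // to contain when this task is hashed becomes part of the cache key.
        status = statusMap[idPatient, idSample]   // 0 = normal, 1 = tumour
        extra = status == 1 ? "-B 3" : ""         // e.g. a tumour-specific bwa option
        """
        bwa mem ${extra} -t ${task.cpus} ref.fasta ${inputFile1} ${inputFile2} | samtools sort - > ${idSample}_${idRun}.bam
        """
}

Commenting out the statusMap lookup (and hard-coding extra) is what makes the caching deterministic in the runs above.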

I confirmed that a similar change in the full version of Sarek 2.7.1 also fixes the caching issue.

$ nextflow run main.nf -dsl1 -profile 'test,docker' -params-file 'params.yml' | grep "> MapReads" | tail -n 1 
[95/68f3b7] process > MapReads (S095-S095-1234N_M7)  [100%] 100 of 100 ✔

$ nextflow run main.nf -dsl1 -profile 'test,docker' -params-file 'params.yml' -resume | grep "> MapReads" | tail -n 1 
[6b/8adbd3] process > MapReads (S097-S097-9876T_M2)  [100%] 100 of 100, cached: 100 ✔

My next step is to figure out why this is happening and, more importantly, what the minimal change is that fixes the issue while ensuring that the results remain the same.

Updated Parameters

input: https://gist.githubusercontent.com/BrunoGrandePhD/6869bf506e6acf7b920a22b666f4e443/raw/ff7ea0757ebbd324cd125b1d48f54e6ec9f969c9/sarek-test.tsv
skip_qc: "bamqc,baserecalibrator,bcftools,documentation,markduplicates,multiqc,samtools,sentieon,vcftools,versions"
no_intervals: true
skip_markduplicates: true
known_indels: null
fasta: /home/ssm-user/ref/fai/human_g1k_v37_decoy.small.fasta
fasta_fai: /home/ssm-user/ref/fai/human_g1k_v37_decoy.small.fasta.fai
bwa: /home/ssm-user/ref/bwa/human_g1k_v37_decoy.small.fasta.{amb,ann,bwt,pac,sa}
dict: /home/ssm-user/ref/dict/human_g1k_v37_decoy.small.dict
dbsnp: /home/ssm-user/ref/dbsnp/dbsnp_138.b37.small.vcf.gz
dbsnp_index: /home/ssm-user/ref/dbsnp/dbsnp_138.b37.small.vcf.gz.tbi

BrunoGrandePhD (Author) commented

Using my stripped-down version of Sarek, I compared the hashes for a job that was re-run despite -resume being used (see attachments below). I can confirm that the overall cache hash is different (hence the re-run), and the specific "sub-hash" that differs is the one for statusMap.

< INFO  nextflow.processor.TaskProcessor - [MapReads (S005-S005-1234N_M7)] cache hash: d8ad399be1e7065e50e02cc88513effc; mode: STANDARD; entries: 
---
> INFO  nextflow.processor.TaskProcessor - [MapReads (S005-S005-1234N_M7)] cache hash: 515d26d1f3d7a5d82129f328131b7b46; mode: STANDARD; entries: 

<   d2e0e780b8df5e88047f20236e6c5534 [java.util.HashMap$EntrySet] [statusMap={[S000, S000-1234N]=0, [S001, S001-1234N]=0, [S002, S002-1234N]=0, [S003, S003-1234N]=0, [S004, S004-1234N]=0, [S005, S005-1234N]=0}, task.cpus=2] 
---
>   f8df28f90d16aceac48daa7004411db4 [java.util.HashMap$EntrySet] [statusMap={[S000, S000-1234N]=0, [S001, S001-1234N]=0, [S002, S002-1234N]=0, [S003, S003-1234N]=0, [S004, S004-1234N]=0, [S005, S005-1234N]=0, [S006, S006-9876T]=1}, task.cpus=2]

Indeed, if you compare the values in statusMap, there is one additional element, [S006, S006-9876T]=1, while all other elements are identical. This makes me wonder whether statusMap is still being populated while the MapReads jobs are being assessed for caching.

{[S000, S000-1234N]=0, [S001, S001-1234N]=0, [S002, S002-1234N]=0, [S003, S003-1234N]=0, [S004, S004-1234N]=0, [S005, S005-1234N]=0}
{[S000, S000-1234N]=0, [S001, S001-1234N]=0, [S002, S002-1234N]=0, [S003, S003-1234N]=0, [S004, S004-1234N]=0, [S005, S005-1234N]=0, [S006, S006-9876T]=1}
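
(For anyone wanting to reproduce this comparison: if I remember correctly, Nextflow only writes these per-task cache-hash entries to .nextflow.log when the run includes the -dump-hashes option; the relevant excerpts from each run can then be diffed, e.g.:)

$ nextflow run main.nf -dsl1 -profile 'test,docker' -params-file 'params.yml' -resume -dump-hashes
$ diff first.txt second.txt   # first.txt / second.txt = the MapReads hash entries copied from each run's log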

Attachments

first.txt
second.txt

BrunoGrandePhD (Author) commented May 26, 2022

Adding a sleep(5000) so that statusMap is fully populated seems to do the trick: all MapReads jobs are now cached. I need to find a better way to achieve this, though. Basically, I need to wait until the channel is fully processed before allowing Nextflow to determine which MapReads jobs can be run.

def extractInfos(channel) {
    def genderMap = [:]
    def statusMap = [:]
    channel = channel.map{ it ->
        def idPatient = it[0]
        def gender = it[1]
        def status = it[2]
        def idSample = it[3]
        genderMap[idPatient] = gender
        statusMap[idPatient, idSample] = status
        [idPatient] + it[3..-1]
    }
    // Crude workaround: give the asynchronous map{} closure time to run for every
    // element so statusMap is complete before any process that references it is hashed.
    sleep(5000)
    [genderMap, statusMap, channel]
}

$ nextflow run main.nf -dsl1 -profile 'test,docker' -params-file 'params.yml' -resume
[...]
[83/94d427] process > FastQCFQ (S097-S097-9876T_M2) [100%] 100 of 100, cached: 100 ✔
[b3/022ea7] process > MapReads (S046-S046-9876T_M1) [100%] 100 of 100, cached: 50 ✔

$ nextflow run main.nf -dsl1 -profile 'test,docker' -params-file 'params.yml' -resume
[...]
[aa/b841d0] process > FastQCFQ (S098-S098-9876T_M3) [100%] 100 of 100, cached: 100 ✔
[85/1589cc] process > MapReads (S094-S094-1234N_M6) [100%] 100 of 100, cached: 100 ✔

BrunoGrandePhD (Author) commented May 26, 2022

I was also able to fix the caching issue by including a small snippet immediately before the return value, which forces the function to wait until the full list has accumulated before re-creating the channel and letting the function return. This doesn't feel conventional, but I'm still waiting to hear back about a better solution.

def extractInfos(channel) {
    def genderMap = [:]
    def statusMap = [:]
    channel = channel.map{ it ->
        def idPatient = it[0]
        def gender = it[1]
        def status = it[2]
        def idSample = it[3]
        genderMap[idPatient] = gender
        statusMap[idPatient, idSample] = status
        [idPatient] + it[3..-1]
    }
    // toList() emits a single list once the source channel completes, and .val blocks
    // until that list is available, so genderMap and statusMap are guaranteed to be
    // fully populated before the re-created channel is handed to downstream processes.
    channel = Channel.fromList(channel.toList().val)
    [genderMap, statusMap, channel]
}

Edit: Just confirming that the above fix works as expected in Sarek 2.7.1:

$ nextflow run main.nf -dsl1 -profile 'test,docker' -params-file 'params.yml' -resume | grep "> MapReads" | tail -n 1 
[f8/4c89a3] process > MapReads (S099-S099-9876T_M4)  [100%] 100 of 100 ✔

$ nextflow run main.nf -dsl1 -profile 'test,docker' -params-file 'params.yml' -resume | grep "> MapReads" | tail -n 1 
[ca/57702f] process > MapReads (S098-S098-9876T_M3)  [100%] 100 of 100, cached: 100 ✔

BrunoGrandePhD pushed a commit to Sage-Bionetworks-Workflows/sarek that referenced this issue May 26, 2022