
[BUG] Sarek 2.7.1 doesn't reuse all cached MapReads jobs #555

Closed
4 tasks done
BrunoGrandePhD opened this issue May 19, 2022 · 4 comments · Fixed by #566
Labels: bug

Comments


BrunoGrandePhD commented May 19, 2022

Check Documentation

I have checked the following places for this error, but I didn't find anything related to caching.

Description of the bug

When I relaunch a Sarek workflow, a seemingly random subset of MapReads jobs (ranging from 5% to 50%) is re-run despite cached results being available. This causes all downstream jobs for the affected samples to be re-run as well. For large datasets, especially those processed in the cloud, this bug incurs a significant cost whenever the workflow doesn't complete successfully on the first try, particularly when most of the workflow has finished before an error arises.

I tried using previous versions of Sarek (specifically, 2.6.1, 2.5.2, and 2.5), and they all seem to run into this bug.


Steps to reproduce

I adapted the input file from the test profile by artificially increasing the number of rows to 100. This makes it easier to spot this issue, which doesn't always occur with only a few samples. It's also easier to notice the issue if you monitor the workflow in Nextflow Tower.

  1. Run the following Nextflow command where params.yml contains the parameters copied below. It should complete successfully. On a local machine, this command took less than 2 minutes.
    nextflow run 'https://github.com/nf-core/sarek' -r '2.7.1' -dsl1 -profile 'test,docker' -params-file 'params.yml'
  2. Re-run the same Nextflow command with -resume. It should also complete successfully, but a subset of MapReads jobs should have been re-run instead of being cached.

Parameters

The parameters aside from input are included to skip steps that aren't relevant to the reproducible example.

input: https://gist.githubusercontent.com/BrunoGrandePhD/6869bf506e6acf7b920a22b666f4e443/raw/ff7ea0757ebbd324cd125b1d48f54e6ec9f969c9/sarek-test.tsv
skip_qc: "bamqc,baserecalibrator,bcftools,documentation,fastqc,markduplicates,multiqc,samtools,sentieon,vcftools,versions"
no_intervals: true
skip_markduplicates: true
known_indels: null
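
For anyone unfamiliar with the format: if I recall the layout correctly, each row in that TSV uses Sarek's tab-separated FASTQ input columns (subject, gender, status, sample, lane, fastq1, fastq2), where status is 0 for normal and 1 for tumour samples. An illustrative row (made-up values, not copied from the gist) looks like this:

S000    XX    0    S000-1234N    M1    /data/S000-1234N_M1_R1.fastq.gz    /data/S000-1234N_M1_R2.fastq.gz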

Expected behaviour

I expect that workflow re-runs will leverage cached results unless there's a reason to believe that the job needs to be updated. In the situation I describe above, nothing changes between the two runs, so there shouldn't be a reason for jobs to be re-run, especially not a seemingly random subset of jobs.

Log files

Have you provided the following extra information/files:

  • The command used to run the pipeline (see "Steps to reproduce" above)
  • The .nextflow.log files

The Nextflow logs below correspond to my walkthrough of the steps I listed in "Steps to reproduce".

firstpass-nextflow.log
secondpass-nextflow.log

System

  • Hardware: AWS EC2 instances
  • Executor: Tried on both local EC2 instances and on AWS Batch
  • OS: Amazon Linux
  • Version: 2 (Kernel: Linux 4.14.232-176.381.amzn2.x86_64)

Nextflow Installation

  • Version: 22.04.2

Container engine

  • Engine: Docker
  • Version: 20.10.13
  • Image tag: nfcore/sarek:2.7.1

Additional context

BrunoGrandePhD added the bug label on May 19, 2022
BrunoGrandePhD pushed a commit that referenced this issue May 20, 2022
BrunoGrandePhD (Author) commented

To isolate the root cause of this issue, I created a stripped-down version of Sarek 2.7.1 with only two processes, FastQCFQ and MapReads. I picked these two because, in my experience, the caching issue only affects MapReads, so keeping FastQCFQ intact gives me an unaffected process to compare against.

I was able to reproduce the issue with the stripped-down version (see commands and output below). Note how the second run with -resume only uses 35 cached jobs instead of all 100. I'm also using a slightly updated params.yml file (included at the bottom of this comment) to skip reference file generation, since I removed those processes from main.nf.

$ nextflow run main.nf -dsl1 -profile 'test,docker' -params-file 'params.yml' 
[...]
[d6/fea675] process > FastQCFQ (S096-S096-9876T_M1) [100%] 100 of 100 ✔
[f8/84d5f3] process > MapReads (S098-S098-9876T_M3) [100%] 100 of 100 ✔

$ nextflow run main.nf -dsl1 -profile 'test,docker' -params-file 'params.yml' -resume 
[...]
[7f/b623d2] process > FastQCFQ (S094-S094-1234N_M6) [100%] 100 of 100, cached: 100 ✔
[0d/e88ddd] process > MapReads (S080-S080-1234N_M1) [100%] 100 of 100, cached: 35 ✔

After some fiddling with the stripped-down version, I think I've isolated the root cause to the reference to statusMap in the MapReads script block (sketched after the run outputs below). If I comment out the relevant line, caching works as expected. In the first run below, all of the MapReads jobs are re-run because the change to the script block invalidates the previously cached results. In the second run, all 100 jobs are cached as expected.

$ nextflow run main.nf -dsl1 -profile 'test,docker' -params-file 'params.yml' -resume 
[...]
[32/f54766] process > FastQCFQ (S089-S089-9876T_M4) [100%] 100 of 100, cached: 100 ✔
[1b/11dfc5] process > MapReads (S097-S097-9876T_M2) [100%] 100 of 100 ✔

$ nextflow run main.nf -dsl1 -profile 'test,docker' -params-file 'params.yml' -resume 
[...]
[d6/fea675] process > FastQCFQ (S096-S096-9876T_M1) [100%] 100 of 100, cached: 100 ✔
[4d/5e1024] process > MapReads (S096-S096-9876T_M1) [100%] 100 of 100, cached: 100 ✔
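
For reference, here is a minimal sketch of what the offending reference looks like (paraphrased from memory, not the verbatim Sarek 2.7.1 source; the bwa command and reference file are placeholders). Because statusMap is an ordinary Groovy map referenced from the script block rather than a declared channel input, Nextflow folds its current contents into the task's cache hash:

process MapReads {
    input:
        set idPatient, idSample, idRun, file(inputFile1), file(inputFile2) from inputPairReads

    output:
        set idPatient, idSample, idRun, file("${idSample}_${idRun}.bam") into bamMapped

    script:
        // statusMap is populated asynchronously by extractInfos(); whatever it happens
        // to contain when this task is hashed becomes part of the cache key.
        status = statusMap[idPatient, idSample]   // 0 = normal, 1 = tumour
        extra = status == 1 ? "-B 3" : ""         // e.g. a tumour-specific bwa option
        """
        bwa mem ${extra} -t ${task.cpus} ref.fasta ${inputFile1} ${inputFile2} | samtools sort - > ${idSample}_${idRun}.bam
        """
}

Commenting out the statusMap lookup (and hard-coding extra) is what makes the caching deterministic in the runs above.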

I confirmed that a similar change in the full version of Sarek 2.7.1 also fixes the caching issue.

$ nextflow run main.nf -dsl1 -profile 'test,docker' -params-file 'params.yml' | grep "> MapReads" | tail -n 1 
[95/68f3b7] process > MapReads (S095-S095-1234N_M7)  [100%] 100 of 100 ✔

$ nextflow run main.nf -dsl1 -profile 'test,docker' -params-file 'params.yml' -resume | grep "> MapReads" | tail -n 1 
[6b/8adbd3] process > MapReads (S097-S097-9876T_M2)  [100%] 100 of 100, cached: 100 ✔

My next step is to figure out why this is happening and, more importantly, what the minimal change is that fixes the issue while ensuring that the results remain the same.

Updated Parameters

input: https://gist.githubusercontent.com/BrunoGrandePhD/6869bf506e6acf7b920a22b666f4e443/raw/ff7ea0757ebbd324cd125b1d48f54e6ec9f969c9/sarek-test.tsv
skip_qc: "bamqc,baserecalibrator,bcftools,documentation,markduplicates,multiqc,samtools,sentieon,vcftools,versions"
no_intervals: true
skip_markduplicates: true
known_indels: null
fasta: /home/ssm-user/ref/fai/human_g1k_v37_decoy.small.fasta
fasta_fai: /home/ssm-user/ref/fai/human_g1k_v37_decoy.small.fasta.fai
bwa: /home/ssm-user/ref/bwa/human_g1k_v37_decoy.small.fasta.{amb,ann,bwt,pac,sa}
dict: /home/ssm-user/ref/dict/human_g1k_v37_decoy.small.dict
dbsnp: /home/ssm-user/ref/dbsnp/dbsnp_138.b37.small.vcf.gz
dbsnp_index: /home/ssm-user/ref/dbsnp/dbsnp_138.b37.small.vcf.gz.tbi

BrunoGrandePhD (Author) commented

Using my stripped-down version of Sarek, I compared the hashes for a job that was re-run despite -resume being used (see attachments below). I can confirm that the overall cache hash is different (hence the re-run), and the specific "sub-hash" that differs is the one for statusMap.

< INFO  nextflow.processor.TaskProcessor - [MapReads (S005-S005-1234N_M7)] cache hash: d8ad399be1e7065e50e02cc88513effc; mode: STANDARD; entries: 
---
> INFO  nextflow.processor.TaskProcessor - [MapReads (S005-S005-1234N_M7)] cache hash: 515d26d1f3d7a5d82129f328131b7b46; mode: STANDARD; entries: 

<   d2e0e780b8df5e88047f20236e6c5534 [java.util.HashMap$EntrySet] [statusMap={[S000, S000-1234N]=0, [S001, S001-1234N]=0, [S002, S002-1234N]=0, [S003, S003-1234N]=0, [S004, S004-1234N]=0, [S005, S005-1234N]=0}, task.cpus=2] 
---
>   f8df28f90d16aceac48daa7004411db4 [java.util.HashMap$EntrySet] [statusMap={[S000, S000-1234N]=0, [S001, S001-1234N]=0, [S002, S002-1234N]=0, [S003, S003-1234N]=0, [S004, S004-1234N]=0, [S005, S005-1234N]=0, [S006, S006-9876T]=1}, task.cpus=2]

Indeed, if you compare the values in statusMap, there is one additional element, [S006, S006-9876T]=1, while all other elements are identical. This makes me wonder whether statusMap is still being populated while the MapReads jobs are being assessed for caching.

{[S000, S000-1234N]=0, [S001, S001-1234N]=0, [S002, S002-1234N]=0, [S003, S003-1234N]=0, [S004, S004-1234N]=0, [S005, S005-1234N]=0}
{[S000, S000-1234N]=0, [S001, S001-1234N]=0, [S002, S002-1234N]=0, [S003, S003-1234N]=0, [S004, S004-1234N]=0, [S005, S005-1234N]=0, [S006, S006-9876T]=1}
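
(For anyone wanting to reproduce this comparison: if I remember correctly, Nextflow only writes these per-task cache-hash entries to .nextflow.log when the run includes the -dump-hashes option; the relevant excerpts from each run can then be diffed, e.g.:)

$ nextflow run main.nf -dsl1 -profile 'test,docker' -params-file 'params.yml' -resume -dump-hashes
$ diff first.txt second.txt   # first.txt / second.txt = the MapReads hash entries copied from each run's log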

Attachments

first.txt
second.txt

BrunoGrandePhD (Author) commented May 26, 2022

Adding a sleep(5000) so that statusMap is fully populated seems to do the trick: all MapReads jobs are now cached. I need to find a better way to achieve this, though. Basically, I need to wait until the channel is fully processed before allowing Nextflow to determine which MapReads jobs can be run.

def extractInfos(channel) {
    def genderMap = [:]
    def statusMap = [:]
    channel = channel.map{ it ->
        def idPatient = it[0]
        def gender = it[1]
        def status = it[2]
        def idSample = it[3]
        genderMap[idPatient] = gender
        statusMap[idPatient, idSample] = status
        [idPatient] + it[3..-1]
    }
    // Crude workaround: give the asynchronous map{} closure time to run for every
    // element so statusMap is complete before any process that references it is hashed.
    sleep(5000)
    [genderMap, statusMap, channel]
}

$ nextflow run main.nf -dsl1 -profile 'test,docker' -params-file 'params.yml' -resume
[...]
[83/94d427] process > FastQCFQ (S097-S097-9876T_M2) [100%] 100 of 100, cached: 100 ✔
[b3/022ea7] process > MapReads (S046-S046-9876T_M1) [100%] 100 of 100, cached: 50 ✔

$ nextflow run main.nf -dsl1 -profile 'test,docker' -params-file 'params.yml' -resume
[...]
[aa/b841d0] process > FastQCFQ (S098-S098-9876T_M3) [100%] 100 of 100, cached: 100 ✔
[85/1589cc] process > MapReads (S094-S094-1234N_M6) [100%] 100 of 100, cached: 100 ✔

BrunoGrandePhD (Author) commented May 26, 2022

I was also able to fix the caching issue by including a small snippet immediately before the return value, which forces the function to wait until the full list has accumulated before re-creating the channel and letting the function return. This doesn't feel conventional, but I'm still waiting to hear back about a better solution.

def extractInfos(channel) {
    def genderMap = [:]
    def statusMap = [:]
    channel = channel.map{ it ->
        def idPatient = it[0]
        def gender = it[1]
        def status = it[2]
        def idSample = it[3]
        genderMap[idPatient] = gender
        statusMap[idPatient, idSample] = status
        [idPatient] + it[3..-1]
    }
    // toList() emits a single list once the source channel completes, and .val blocks
    // until that list is available, so genderMap and statusMap are guaranteed to be
    // fully populated before the re-created channel is handed to downstream processes.
    channel = Channel.fromList(channel.toList().val)
    [genderMap, statusMap, channel]
}

Edit: Just confirming that the above fix works as expected in Sarek 2.7.1:

$ nextflow run main.nf -dsl1 -profile 'test,docker' -params-file 'params.yml' -resume | grep "> MapReads" | tail -n 1 
[f8/4c89a3] process > MapReads (S099-S099-9876T_M4)  [100%] 100 of 100 ✔

$ nextflow run main.nf -dsl1 -profile 'test,docker' -params-file 'params.yml' -resume | grep "> MapReads" | tail -n 1 
[ca/57702f] process > MapReads (S098-S098-9876T_M3)  [100%] 100 of 100, cached: 100 ✔

BrunoGrandePhD pushed a commit to Sage-Bionetworks-Workflows/sarek that referenced this issue May 26, 2022