[BUG] Sarek 2.7.1 doesn't reuse all cached MapReads jobs #555

Comments
In order to isolate the root cause of this issue, I created a stripped-down version of Sarek 2.7.1 with only two processes (FastQCFQ and MapReads), and I was able to reproduce the issue with it (see commands and output below). Note how the second run with -resume reuses only 35 of the 100 cached MapReads jobs:

```
$ nextflow run main.nf -dsl1 -profile 'test,docker' -params-file 'params.yml'
[...]
[d6/fea675] process > FastQCFQ (S096-S096-9876T_M1) [100%] 100 of 100 ✔
[f8/84d5f3] process > MapReads (S098-S098-9876T_M3) [100%] 100 of 100 ✔

$ nextflow run main.nf -dsl1 -profile 'test,docker' -params-file 'params.yml' -resume
[...]
[7f/b623d2] process > FastQCFQ (S094-S094-1234N_M6) [100%] 100 of 100, cached: 100 ✔
[0d/e88ddd] process > MapReads (S080-S080-1234N_M1) [100%] 100 of 100, cached: 35 ✔
```

After some fiddling with the stripped-down version, I think that I've isolated the root cause to the extractInfos() function (see the follow-up comments below). After changing it, the first -resume run re-executes every MapReads job (presumably because the task hashes changed), but the second -resume run reuses all 100 cached jobs:

```
$ nextflow run main.nf -dsl1 -profile 'test,docker' -params-file 'params.yml' -resume
[...]
[32/f54766] process > FastQCFQ (S089-S089-9876T_M4) [100%] 100 of 100, cached: 100 ✔
[1b/11dfc5] process > MapReads (S097-S097-9876T_M2) [100%] 100 of 100 ✔

$ nextflow run main.nf -dsl1 -profile 'test,docker' -params-file 'params.yml' -resume
[...]
[d6/fea675] process > FastQCFQ (S096-S096-9876T_M1) [100%] 100 of 100, cached: 100 ✔
[4d/5e1024] process > MapReads (S096-S096-9876T_M1) [100%] 100 of 100, cached: 100 ✔
```

I confirmed that a similar change in the full version of Sarek 2.7.1 also fixes the caching issue:

```
$ nextflow run main.nf -dsl1 -profile 'test,docker' -params-file 'params.yml' | grep "> MapReads" | tail -n 1
[95/68f3b7] process > MapReads (S095-S095-1234N_M7) [100%] 100 of 100 ✔

$ nextflow run main.nf -dsl1 -profile 'test,docker' -params-file 'params.yml' -resume | grep "> MapReads" | tail -n 1
[6b/8adbd3] process > MapReads (S097-S097-9876T_M2) [100%] 100 of 100, cached: 100 ✔
```

My next step is to figure out why this is happening and, more importantly, what's the minimal change required to fix the issue while ensuring that the results remain the same.

Updated Parameters (params.yml):

```yaml
input: https://gist.githubusercontent.com/BrunoGrandePhD/6869bf506e6acf7b920a22b666f4e443/raw/ff7ea0757ebbd324cd125b1d48f54e6ec9f969c9/sarek-test.tsv
skip_qc: "bamqc,baserecalibrator,bcftools,documentation,markduplicates,multiqc,samtools,sentieon,vcftools,versions"
no_intervals: true
skip_markduplicates: true
known_indels: null
fasta: /home/ssm-user/ref/fai/human_g1k_v37_decoy.small.fasta
fasta_fai: /home/ssm-user/ref/fai/human_g1k_v37_decoy.small.fasta.fai
bwa: /home/ssm-user/ref/bwa/human_g1k_v37_decoy.small.fasta.{amb,ann,bwt,pac,sa}
dict: /home/ssm-user/ref/dict/human_g1k_v37_decoy.small.dict
dbsnp: /home/ssm-user/ref/dbsnp/dbsnp_138.b37.small.vcf.gz
dbsnp_index: /home/ssm-user/ref/dbsnp/dbsnp_138.b37.small.vcf.gz.tbi
```
Using my stripped-down version of Sarek, I'm comparing the hashes for a job that was re-run despite cached results being available. Indeed, if you compare the values in the statusMap between the two runs, there is one additional element (see attachments).
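To illustrate the kind of race this points at, here is a minimal, hypothetical DSL1 sketch (not Sarek code; the process name and channel contents are invented) in which a map {} closure populates a shared map as a side effect while a downstream process reads from that same map:

```groovy
// Minimal sketch, assuming Nextflow DSL1 semantics. Channel operators run
// asynchronously, so statusMap may be only partially populated by the time
// the first tasks below are hashed for -resume.
statusMap = [:]

samples = Channel
    .from(['patient1', 0, 'sampleA'], ['patient2', 1, 'sampleB'])
    .map { idPatient, status, idSample ->
        statusMap[idPatient, idSample] = status  // side effect inside the operator
        [idPatient, idSample]
    }

process UseStatus {
    input:
    tuple val(idPatient), val(idSample) from samples

    script:
    // This read can observe different map contents from run to run, which
    // changes the resolved command and therefore the task hash.
    def status = statusMap[idPatient, idSample]
    """
    echo "${idPatient} ${idSample} status=${status}"
    """
}
```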
Adding a sleep(5000) immediately before the return value also changes the caching behaviour:

```groovy
def extractInfos(channel) {
    def genderMap = [:]
    def statusMap = [:]
    channel = channel.map{ it ->
        def idPatient = it[0]
        def gender = it[1]
        def status = it[2]
        def idSample = it[3]
        genderMap[idPatient] = gender
        statusMap[idPatient, idSample] = status
        [idPatient] + it[3..-1]
    }
    // Crude workaround: give the asynchronous map{} closure time to
    // finish populating genderMap and statusMap before returning.
    sleep(5000)
    [genderMap, statusMap, channel]
}
```

```
$ nextflow run main.nf -dsl1 -profile 'test,docker' -params-file 'params.yml' -resume
[...]
[83/94d427] process > FastQCFQ (S097-S097-9876T_M2) [100%] 100 of 100, cached: 100 ✔
[b3/022ea7] process > MapReads (S046-S046-9876T_M1) [100%] 100 of 100, cached: 50 ✔
$ nextflow run main.nf -dsl1 -profile 'test,docker' -params-file 'params.yml' -resume
[...]
[aa/b841d0] process > FastQCFQ (S098-S098-9876T_M3) [100%] 100 of 100, cached: 100 ✔
[85/1589cc] process > MapReads (S094-S094-1234N_M6) [100%] 100 of 100, cached: 100 ✔
```
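A quick way to convince yourself of the asynchrony that the sleep is papering over (a hypothetical standalone snippet, assuming DSL1 semantics; not Sarek code):

```groovy
// The map{} closure runs on a background thread as items flow through the
// channel, so the main script thread can read the map "too early".
counts = [:]

ch = Channel.from(1..100).map { counts[it] = true; it }

println "right after map(): ${counts.size()} entries"  // often fewer than 100
sleep(5000)
println "after sleeping 5s: ${counts.size()} entries"  // 100
```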
I was also able to fix the caching issue by including a little snippet immediately before the return value, which forces the function to wait until the full list has accumulated before re-creating a channel and allowing the function to continue. This doesn't feel conventional, but I'm still waiting to hear back about a better solution.

```groovy
def extractInfos(channel) {
    def genderMap = [:]
    def statusMap = [:]
    channel = channel.map{ it ->
        def idPatient = it[0]
        def gender = it[1]
        def status = it[2]
        def idSample = it[3]
        genderMap[idPatient] = gender
        statusMap[idPatient, idSample] = status
        [idPatient] + it[3..-1]
    }
    // Block until the channel has completed (toList().val waits for the
    // full list), then rebuild an equivalent channel from that list.
    channel = Channel.fromList(channel.toList().val)
    [genderMap, statusMap, channel]
}
```

Edit: Just confirming that the above fix works as expected in Sarek 2.7.1:

```
$ nextflow run main.nf -dsl1 -profile 'test,docker' -params-file 'params.yml' -resume | grep "> MapReads" | tail -n 1
[f8/4c89a3] process > MapReads (S099-S099-9876T_M4) [100%] 100 of 100 ✔

$ nextflow run main.nf -dsl1 -profile 'test,docker' -params-file 'params.yml' -resume | grep "> MapReads" | tail -n 1
[ca/57702f] process > MapReads (S098-S098-9876T_M3) [100%] 100 of 100, cached: 100 ✔
```
Check Documentation
I have checked the following places for your error, but I didn't find anything related to caching.
Description of the bug
When I relaunch a Sarek workflow, a seemingly random subset of MapReads jobs (ranging from 5 to 50%) is re-run despite cached results being available. This causes all downstream jobs for the affected samples to also be re-run. For large datasets, especially those being processed in the cloud, this bug incurs a significant cost whenever the workflow doesn't complete successfully on the first try, especially if most of the workflow is complete before an error arises. I tried using previous versions of Sarek (specifically 2.6.1, 2.5.2, and 2.5), and they all seem to run into this bug.

Steps to reproduce
I adapted the input file from the test profile by artificially increasing the number of rows to 100. This makes it easier to spot the issue, which doesn't always occur with only a few samples. It's also easier to notice the issue if you monitor the workflow in Nextflow Tower.

1. Run the command below, where params.yml contains the parameters copied below. It should complete successfully; on a local machine, this command took less than 2 minutes.

   ```
   nextflow run 'https://github.com/nf-core/sarek' -r '2.7.1' -dsl1 -profile 'test,docker' -params-file 'params.yml'
   ```

2. Run the same command again with -resume. It should also complete successfully, but a subset of MapReads jobs will have been re-run instead of being cached.

Parameters
The parameters aside from input are included to skip steps that aren't relevant to the reproducible example.

Expected behaviour
I expect that workflow re-runs will leverage cached results unless there's a reason to believe that the job needs to be updated. In the situation I describe above, nothing changes between the two runs, so there shouldn't be a reason for jobs to be re-run, especially not a seemingly random subset of jobs.
Log files
Extra information/files provided:

- .nextflow.log files

The Nextflow logs below correspond to my walkthrough of the steps I listed in "Steps to reproduce":

- firstpass-nextflow.log
- secondpass-nextflow.log
System

- OS: Linux (4.14.232-176.381.amzn2.x86_64)

Nextflow Installation

- Version: 22.04.2

Container engine

- Engine: Docker (20.10.13)
- Image: nfcore/sarek:2.7.1
Additional context