Pipeline execution hangs when native tasks fail to be submitted #2060
Comments
Mamma mia... which is the failing task among the 2.1k?
The module file for the native Groovy process is here. It only ever takes a single GTF file, so I have no idea why so many jobs are spawned. It is called in the main workflow here. As I mentioned above, this happens with a completely unrelated native Groovy process too, so there is some sort of pattern there. Let me know if you want me to test anything else.
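For context, a "native" Groovy process is one whose body is an `exec:` block that runs inside the Nextflow runtime (the head job) rather than being submitted to the executor. A minimal sketch of such a process, with a hypothetical name and logic that are not the actual nf-core/rnaseq module:

```groovy
// Minimal sketch of a native Groovy process: the exec: block runs in the
// Nextflow JVM itself, so no job is submitted to AWS Batch for it.
// Process name, inputs and logic are illustrative only.
process GTF_TO_GENE_LIST {
    input:
    path gtf

    output:
    path 'genes.txt'

    exec:
    // Pure Groovy: pull the gene_id values out of the GTF and write them
    // into the task work directory.
    def genes = [] as Set
    gtf.readLines().each { line ->
        def m = line =~ /gene_id "([^"]+)"/
        if (m.find()) genes << m.group(1)
    }
    task.workDir.resolve('genes.txt').text = genes.join('\n') + '\n'
}
```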
I mean, there are 2.1k failed tasks; what's the name of the task that is failing?
Yup, it says 0 of 1, but when you hover over that particular task in Tower it shows
Not able to replicate. Do you have the nextflow log file? You should be able to download it from the Tower UI.
Ok. I was able to reproduce and shared the run with you via Tower. What is weird is that NF raises an error but still carries on submitting more and more jobs 🤔
Uploading local `bin` scripts folder to s3://nf-core-awsmegatests/rnaseq/dev/work/tmp/e9/13da243989d816fbca30b9b8e80979/bin
WARN: Process 'SRA_DOWNLOAD:SRA_TO_SAMPLESHEET' cannot be executed by 'awsbatch' executor -- Using 'local' executor instead
WARN: Local executor only supports default file system -- Check work directory: s3://nf-core-awsmegatests/rnaseq/dev/work
Monitor the execution with Nextflow Tower using this url https://tower.nf/watch/2BhBHVOqsYejv0
[44/7513ef] Submitted process > SRA_DOWNLOAD:SRA_IDS_TO_RUNINFO (SRR11140746)
[56/41fd5d] Submitted process > SRA_DOWNLOAD:SRA_IDS_TO_RUNINFO (SRR11140744)
[37/27990b] Submitted process > SRA_DOWNLOAD:SRA_RUNINFO_TO_FTP (1)
[26/e62c35] Submitted process > SRA_DOWNLOAD:SRA_RUNINFO_TO_FTP (2)
[c2/33ed2f] Submitted process > SRA_DOWNLOAD:SRA_FASTQ_FTP (SRX7777164_T1)
[d3/ec9f21] Submitted process > SRA_DOWNLOAD:SRA_FASTQ_FTP (SRX7777166_T1)
Error executing process > 'SRA_DOWNLOAD:SRA_TO_SAMPLESHEET (SRX7777164_T1)'
Caused by:
Process requirement exceed available memory -- req: 6 GB; avail: 1 GB
The error is raised because there's a (global?) requirement of 6 GB, whereas the task, being native, runs on the head node, which only has 1 GB. Not sure what's happening, then...
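Presumably the 6 GB comes from a process-level resource directive attached to the process, for example via a label in the pipeline's base configuration. A hypothetical sketch of such a setting (the label name and values are illustrative only, not the actual nf-core/rnaseq config):

```groovy
// Hypothetical base-config fragment: a label-based default like this would
// attach a 6 GB memory requirement to every process carrying the label,
// including a native one that actually runs on the 1 GB head node.
process {
    withLabel: 'process_low' {
        cpus   = 2
        memory = 6.GB
    }
}
```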
Ok, I'm able to replicate. Now I need to find a solution 😬
Ok, pushed a patch. Thanks for reporting.
Thank you!
Note that the root cause was the excess memory request.
Yup. What is the best way to customise this via Tower? Do we need to change the default head node we are using?
It's better to limit the request for native (i.e. local) tasks; it makes no sense to request 6 GB for those.
PS: the amount of memory for the Tower head job can be set in the compute environment settings ("Head Job memory" under Advanced settings).
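A hypothetical `nextflow.config` fragment along those lines, capping the request for the native task so the local (head node) executor can accept it; the process name is taken from the log above and the values are illustrative only:

```groovy
// Sketch only (not the actual patch): lower the resource request for the
// native task so it fits within the head node's memory.
process {
    withName: 'SRA_TO_SAMPLESHEET' {
        memory = 512.MB
        cpus   = 1
    }
}
```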
I am testing the nf-core/rnaseq pipeline via Tower on AWS, and it appears that native Groovy processes that should only be executed once are somehow caught up in some sort of recursion that keeps on spawning more and more jobs. As you can see in the screenshot below, I killed the pipeline execution after 2144 jobs had been spawned for that single process.
Running the command locally and via GitHub Actions works perfectly fine and only spawns a single process:

nextflow run nf-core/rnaseq -profile test -r dev
You should be able to reproduce this in Tower with an AWS set-up:
The process is called in the main workflow here and the module file for the process is here.
I have another workflow in the pipeline that uses a native Groovy process and I observed the same issue.
If it's something quite low-level I am happy to try and find an alternative solution so we can get the pipeline out :)
Thanks a bunch!