Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Validation of task outputs #3711

Open
robsyme opened this issue Mar 2, 2023 · 5 comments · May be fixed by #5345
Open

Validation of task outputs #3711

robsyme opened this issue Mar 2, 2023 · 5 comments · May be fixed by #5345
Assignees
Milestone

Comments

@robsyme
Copy link
Collaborator

robsyme commented Mar 2, 2023

Fail tasks where stageOut fails

This is similar to existing issue #3372, but instead of being concerned with the publishDir step (run work directory to publish output location), this issue is concerned with the output step (task work directory to run work directory)

Usage scenario

When task outputs are copied from the task working directory to the run work directory using the aws cli, the aws s3 cp process can fail. Nextflow ignores this failure with a || true, which means that in the event of failure, the downstream tasks expecting this input will fail. This makes debugging harder than it should be, as users will be looking for the error in the downstream task rather than in the task that actually failed.

Suggest implementation

I would recommend that we remove the || true on L187 here:

return """\
IFS=\$'\\n'
for name in \$(eval "ls -1d ${escape.join(' ')}" | sort | uniq); do
${stageOutCommand('$name', targetDir, mode)} || true
done
unset IFS""".stripIndent(true)
}

In addition, it may be helpful to provide a more descriptive warning in the event of an upload failure, but I'm not 100% sure how best to communicate that to the Nextflow process.

@pditommaso
Copy link
Member

This is somehow related to this comment. @mribeirodantas tried removing || true but reported no difference.

Some Bash hacker may want to give it another try

@stale
Copy link

stale bot commented Aug 12, 2023

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@jchorl
Copy link
Contributor

jchorl commented Aug 12, 2024

I've hit this as well, when a cp fails for any reason and it silently fails to stage out. Even worse if the output is optional, because nextflow can't differentiate between not-produced and failed-to-stage-out. Seeing the linked issues here would indicate it's somewhat important.

This is somehow related to this comment. @mribeirodantas tried removing || true but reported no difference.

Some Bash hacker may want to give it another try

Do you recall why? in theory if set -e is set and a cp operation fails, the exit code should be non-zero and it should fail.

$ cp doesntexist foo
cp: cannot stat 'doesntexist': No such file or directory
$ echo $?
1

@pditommaso
Copy link
Member

Here there's a tentative solution but it was aborted

@bentsherman bentsherman assigned bentsherman and jorgee and unassigned bentsherman Sep 25, 2024
@jorgee
Copy link
Contributor

jorgee commented Sep 27, 2024

I think the @pditommaso solution #3858 (add set-e before nxf_unstage()) and the change to wait all nxf_parallel processes #4050 should detect the exit codes. The statement wait $p returns the exit code and the set -e is failing the process when the wait returns something different to 0.

I have tested with Azure forcing an error in the stage out. It is detecting the failure and setting the exit code of the azcopy process. I need to check with the other clouds.

Another issue that I see is the print of the errors. Stage-out errors are in the .command.log but not in the .command.err, and it is not printed when there is an error in a process execution.

@jorgee jorgee linked a pull request Sep 30, 2024 that will close this issue
@bentsherman bentsherman added this to the 24.10.0 milestone Oct 4, 2024
@bentsherman bentsherman modified the milestones: 24.10.0, 25.04.0 Nov 14, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
5 participants