Adding new choice to --on-error #1974

Open
wants to merge 15 commits into
base: main

@AlexTate (Contributor) commented Feb 4, 2024

Summary

This pull request introduces a new choice, kill, for the --on-error parameter.

Motivation

There is currently no way to make cwltool stop parallel jobs immediately when one of them fails. One might expect --on-error stop to do this, but its help string is specific and accurate: "do not submit any more steps". Because a scatter or subworkflow is treated as a single "step" within the parent workflow, cwltool is not wrong to wait for the rest of that step's parallel jobs to finish under --on-error stop. However, individual scatter jobs can take a long time to complete, so if one of them fails early, cwltool may wait a long time for the other scatter jobs before terminating the workflow. With --on-error kill, all running jobs are promptly notified and self-terminate when any one job fails.

Demonstration of the Issue

When running the following workflow with cwltool --parallel --on-error stop, the total runtime is ~33 seconds despite one of the scatterstep tasks terminating unexpectedly. Ideally the workflow would terminate immediately. --on-error kill accomplishes that.

#!/usr/bin/env cwl-runner

class: Workflow
cwlVersion: v1.2

inputs:
  sleeptime:
    type: int[]
    default: [ 33, 33, 33, 33, 33 ]
outputs: { }
requirements:
  - class: ScatterFeatureRequirement

steps:
  scatterstep:
    in: { sleeptime: sleeptime }
    out: [ ]
    scatter: sleeptime
    run:
      class: CommandLineTool
      baseCommand: sleep
      inputs:
        sleeptime: { type: int, inputBinding: { position: 1 } }
      outputs: { }
  kill:
    in: { }
    out: [ ]
    run:
      class: CommandLineTool
      baseCommand: [ 'bash', '-c' ]
      arguments:
        - |
          # Wait 1 second for scatter to spin up then select a random sleep process to kill
          sleep 1
          ps -ef | grep 'sleep 33' | grep -v grep | awk '{print $2}' | shuf | head -n 1 | xargs kill -9
      inputs: { }
      outputs: { }

Forum Post

https://cwl.discourse.group/t/how-to-fail-fast-during-parallel-scatter/868

Concerns

  • workflow_eval_lock.release() had to be moved to the finally block in MultithreadedJobExecutor.run_jobs()
  • Are any important steps skipped in JobBase._execute() due to if runtimeContext.kill_switch.is_set(): return? For that matter, shouldn't there be a finally block to contain some of these steps such as deleting runtime-generated files containing secrets?
  • The kill switch response in TaskQueue is fairly loose. Since the response is primarily handled at the job level, any tasks that start after the kill switch is activated will take care of themselves and self-terminate.
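
To make the second concern concrete, here is a minimal sketch (not cwltool's actual code) of an early return on a kill switch that still runs its cleanup because the cleanup lives in a finally block. The function execute_job and the temporary "secrets" file are hypothetical stand-ins for JobBase._execute() and its runtime-generated files.

```python
import os
import tempfile
import threading
from typing import Tuple


def execute_job(kill_switch: threading.Event) -> Tuple[str, str]:
    """Hypothetical stand-in for a job body; returns (status, secrets_path)."""
    # Simulate a runtime-generated file that holds secrets.
    fd, secrets_path = tempfile.mkstemp(suffix=".secrets")
    os.close(fd)
    try:
        if kill_switch.is_set():
            # Early exit: without the finally block this file would be leaked.
            return "killed", secrets_path
        return "success", secrets_path
    finally:
        # Runs on normal return, early return, and exceptions alike.
        os.remove(secrets_path)
```

With this shape, the kill-switch path and the normal path share the same cleanup, which is one way to address the "are any important steps skipped" question.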

@cwl-bot commented Feb 4, 2024

This pull request has been mentioned on Common Workflow Language Discourse. There might be relevant details there:

https://cwl.discourse.group/t/how-to-fail-fast-during-parallel-scatter/868/5

cwltool/job.py Outdated
nonlocal ks_tm
if kill_switch.is_set():
_logger.error("[job %s] terminating by kill switch", self.name)
if sproc.stdin: sproc.stdin.close()
Member

This needs to be two lines. Run make dev cleanup (or just make cleanup if you have already run make dev) to fix that automatically.

codecov bot commented Apr 17, 2024

Codecov Report

Attention: Patch coverage is 57.25191% with 56 lines in your changes missing coverage. Please review.

Project coverage is 77.06%. Comparing base (73b742f) to head (105fee9).

Files                     Patch %   Lines
cwltool/job.py            53.65%    29 Missing and 9 partials ⚠️
cwltool/task_queue.py     41.66%    5 Missing and 2 partials ⚠️
cwltool/executors.py      37.50%    4 Missing and 1 partial ⚠️
cwltool/errors.py         57.14%    3 Missing ⚠️
cwltool/workflow_job.py   84.61%    0 Missing and 2 partials ⚠️
cwltool/workflow.py       83.33%    0 Missing and 1 partial ⚠️

❗ There is a different number of reports uploaded between BASE (73b742f) and HEAD (105fee9).

HEAD has 5 fewer uploads than BASE.
Flag   BASE (73b742f)   HEAD (105fee9)
       17               12
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #1974      +/-   ##
==========================================
- Coverage   83.81%   77.06%   -6.76%     
==========================================
  Files          46       46              
  Lines        8262     8333      +71     
  Branches     2199     2120      -79     
==========================================
- Hits         6925     6422     -503     
- Misses        856     1350     +494     
- Partials      481      561      +80     


AlexTate and others added 6 commits April 20, 2024 20:32
… runtimeContext.on_error = "kill", then the switch is activated. WorkflowKillSwitch is raised so it can be handled at the workflow and executor levels
…ch's status in the monitor function. The monitor function, up to this point, has been for gathering memory usage statistics via a timer thread. A second timer thread now monitors the kill switch.
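
As a rough illustration of the timer-based monitoring described in this commit, the sketch below polls a threading.Event on a rescheduled threading.Timer and terminates a child process once the event is set. It is not the code in cwltool/job.py: monitor_kill_switch, run_until_killed, and the 0.1 second interval are assumptions made for the example.

```python
import subprocess
import sys
import threading
import time
from typing import Tuple


def monitor_kill_switch(proc: subprocess.Popen, kill_switch: threading.Event,
                        interval: float = 0.1) -> None:
    """Poll the kill switch on timer threads; terminate proc once it is set."""
    def check() -> None:
        if proc.poll() is not None:
            return  # Process already finished; stop rescheduling.
        if kill_switch.is_set():
            proc.terminate()
            return
        timer = threading.Timer(interval, check)
        timer.daemon = True
        timer.start()

    check()


def run_until_killed(delay: float = 0.3) -> Tuple[int, float]:
    """Start a long-running child, flip the switch after `delay`, and time it."""
    kill_switch = threading.Event()
    # A long-running child process standing in for a scatter job.
    proc = subprocess.Popen([sys.executable, "-c", "import time; time.sleep(30)"])
    monitor_kill_switch(proc, kill_switch)
    start = time.monotonic()
    time.sleep(delay)    # Another job "fails" after a short while...
    kill_switch.set()    # ...and activates the switch.
    returncode = proc.wait(timeout=10)
    return returncode, time.monotonic() - start
```

In this sketch the child lives for at most one polling interval after the switch is set, which mirrors the bound on a leaked job's lifespan mentioned elsewhere in this PR.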
…revent pending tasks from starting by simply draining the queue. This is a very loose policy, but since kill switch response is handled at the job level, any tasks that start after the kill switch is activated will take care of themselves and self terminate
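
The "drain the queue" policy can be sketched as follows. This is a simplified, single-worker illustration using the standard queue module, not the TaskQueue implementation from this PR; run_tasks is a hypothetical helper.

```python
import queue
import threading
from typing import Callable, List


def run_tasks(tasks: List[Callable[[], str]], kill_switch: threading.Event) -> List[str]:
    """Run tasks on one worker thread; drain the queue once the switch is set."""
    pending: "queue.Queue[Callable[[], str]]" = queue.Queue()
    for task in tasks:
        pending.put(task)
    results: List[str] = []

    def worker() -> None:
        while True:
            try:
                task = pending.get_nowait()
            except queue.Empty:
                return
            if kill_switch.is_set():
                continue  # Drain: discard the pending task without running it.
            results.append(task())

    thread = threading.Thread(target=worker)
    thread.start()
    thread.join()
    return results
```

If one of the tasks sets the switch while it runs, every task still in the queue behind it is discarded rather than started. With several workers the check is only advisory, which is why the job-level response remains the real safeguard.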
… an executor. The workflow_eval_lock release had to be moved to the finally block in MultithreadedJobExecutor.run_jobs(). Otherwise, TaskQueue threads running MultithreadedJobExecutor._runner() will never join() because _runner() waits indefinitely for the workflow_eval_lock in its own finally block.
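
The join() problem described in this commit can be reproduced in miniature. The sketch below is an assumption-laden simplification: it uses a plain threading.Lock in place of workflow_eval_lock, and run_jobs and runner are hypothetical names rather than the methods on MultithreadedJobExecutor.

```python
import threading


def run_jobs(n_workers: int = 3) -> bool:
    """Return True if every worker thread joined after a simulated kill."""
    workflow_eval_lock = threading.Lock()

    def runner() -> None:
        try:
            pass  # The job body would run here.
        finally:
            # As described above, the worker needs the lock in its own
            # finally block before it can finish.
            with workflow_eval_lock:
                pass

    threads = [threading.Thread(target=runner, daemon=True) for _ in range(n_workers)]
    workflow_eval_lock.acquire()
    try:
        for thread in threads:
            thread.start()
        raise RuntimeError("simulated kill switch")
    except RuntimeError:
        pass
    finally:
        # Releasing here guarantees the workers can take the lock and exit,
        # even though the normal code path was interrupted.
        workflow_eval_lock.release()
    for thread in threads:
        thread.join(timeout=5)
    return all(not thread.is_alive() for thread in threads)
```

If the release were only on the normal path, the exception would skip it and every worker would block forever on the lock.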
So that the runtime_context object can still be pickled.

Other cleanups
…askQueue. This helps to better synchronize the kill switch event and avoid adding/executing tasks after the switch has been set.

This approach is tighter than my previous draft, but a race condition still exists: a task might be started after the kill switch has been set and announced. If that happens, the leaked job's monitor function will kill it, so the subprocess lives for at most one monitor timer interval (currently 1 second). In this rare case the console output may be confusing, because it will show a new job starting after the kill switch has been announced.
… when exiting due to kill switch. Those actions have been placed under a `finally` block so that they are executed by both the "switching" job and the "responding" jobs.

However, some of these post actions added a lot of redundant and unhelpful terminal output when handling jobs killed DUE TO the kill switch, and that verbose output obscured the error's cause. Two new process statuses have been added in order to better handle the event:
- indeterminant: a default value for processStatus.
- killed: the job was killed due to the kill switch being set.

This approach also means that partial outputs aren't collected from jobs that have been killed.
1) Once a job has been terminated, all other parallel jobs should also terminate. In this test, the runtime of the workflow indicates whether the kill switch has been handled correctly. If the kill switch is successful then the workflow's runtime should be significantly shorter than sleep_time.

2) Outputs produced by a successful step should still be collected. In this case, the completed step is make_array.

To be frank, this test could be simplified by using a ToolTimeLimit requirement rather than process_roulette.cwl
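
The timing idea behind the first check can be sketched without cwltool at all. In the example below, sleeper, failer, and timed_run are hypothetical names, and threads waiting on a threading.Event stand in for scatter jobs; it is not the test in tests/test_parallel.py.

```python
import threading
import time
from concurrent.futures import ThreadPoolExecutor
from typing import List, Tuple


def sleeper(kill_switch: threading.Event, sleep_time: float) -> str:
    # Event.wait returns True as soon as the switch is set, False on timeout.
    return "killed" if kill_switch.wait(timeout=sleep_time) else "completed"


def failer(kill_switch: threading.Event) -> str:
    time.sleep(0.1)
    kill_switch.set()  # The failing job activates the switch.
    return "failed"


def timed_run(sleep_time: float = 5.0, n_sleepers: int = 4) -> Tuple[List[str], float]:
    """Run the sleepers and one failing job in parallel; return statuses and runtime."""
    kill_switch = threading.Event()
    start = time.monotonic()
    with ThreadPoolExecutor(max_workers=n_sleepers + 1) as pool:
        futures = [pool.submit(sleeper, kill_switch, sleep_time)
                   for _ in range(n_sleepers)]
        futures.append(pool.submit(failer, kill_switch))
        statuses = [future.result() for future in futures]
    return statuses, time.monotonic() - start
```

A test built on this shape would assert that the measured runtime is well below sleep_time and that every sleeper reports being killed rather than completed.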
…to this issue. Other changes were offered by the tool, but they are outside the scope of this issue.
@AlexTate AlexTate marked this pull request as ready for review August 7, 2024 02:20
@mr-c (Member) left a comment

Thank you, again, @AlexTate for your PR!

tests/test_parallel.py::test_on_error_kill is unfortunately failing.
