Reliable timeout for Prefect tasks/deployments #159

Open
sgoodm opened this issue Feb 1, 2023 · 0 comments

sgoodm commented Feb 1, 2023

This may ultimately be tied to several different issues within Prefect, but I think addressing one or two scenarios would cover the majority of our related issues.

Scenario 1: One task crashes while all others complete, but the run hangs. It appears that the crashed task does not exit properly (no retries are attempted). A timeout could end the task and allow Prefect to retry it, or at least let the broader flow recover and exit in a crashed state.

  • Incorporating a task timeout might resolve this
  • This could also potentially be tied to our handling of task futures in datasets.py
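To make the futures idea concrete, here is a minimal sketch of collecting task futures against a deadline so that one task that never returns cannot block the whole run. It uses only the standard library's `concurrent.futures`, not Prefect, and it is not the code in datasets.py; `collect_with_deadline` and the three toy tasks are hypothetical names for illustration. (Prefect 2's `@task` decorator also accepts `timeout_seconds` and `retries`, which may be the simpler fix if the hang sits inside the task itself.)

```python
import time
from concurrent.futures import Future, ThreadPoolExecutor, wait


def collect_with_deadline(futures: dict[str, Future], timeout_s: float):
    """Wait up to timeout_s for all futures, then sort them into three groups.

    Returns (results, failed, stuck):
      results: name -> return value for tasks that finished cleanly
      failed:  name -> exception for tasks that raised
      stuck:   names of tasks that were cancelled or still unfinished at the deadline
    """
    done, not_done = wait(futures.values(), timeout=timeout_s)
    results, failed, stuck = {}, {}, []
    for name, fut in futures.items():
        if fut in not_done:
            fut.cancel()  # only takes effect if the task has not started yet
            stuck.append(name)
        elif fut.cancelled():
            stuck.append(name)
        elif fut.exception() is not None:
            failed[name] = fut.exception()
        else:
            results[name] = fut.result()
    return results, failed, stuck


# Hypothetical stand-ins for real tasks.
def ok_task():
    return 1


def crashing_task():
    raise RuntimeError("task crashed")


def hanging_task():
    time.sleep(1.0)  # stands in for a task that never returns in time
    return 2


pool = ThreadPoolExecutor(max_workers=3)
futures = {
    "ok": pool.submit(ok_task),
    "crash": pool.submit(crashing_task),
    "hang": pool.submit(hanging_task),
}
results, failed, stuck = collect_with_deadline(futures, timeout_s=0.2)
pool.shutdown(wait=False, cancel_futures=True)  # do not block on the straggler

print(results, list(failed), stuck)
```

A thread that is already running cannot be killed this way, so this only lets the caller stop waiting and report the run as failed; actually terminating the work would need process-level isolation or a timeout enforced by the task runner.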

Scenario 2: Issues with the scheduler/agent result in runs or tasks being "lost" and hanging. The direct causes are unclear and seem to range from network communication lapses, to the agent being overwhelmed by tasks running directly on it (via the Dask task runner rather than HPC), to simply very long-running jobs.

  • Adhering to best practices for deployments (avoiding heavy overlaps, using only HPC task runners, handling errors better within the deployments themselves) may significantly reduce how often this scenario occurs.
  • Without a clear root cause, it may be worth exploring a way to detect runs that are stuck as a result of this scenario so that they can be cancelled and restarted.
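One possible shape for the detection step is sketched below. It assumes we can obtain, for each run, its current state and the time of its last state change or log entry; `RunRecord`, `ACTIVE_STATES`, and `find_stuck_runs` are hypothetical names, and the sketch deliberately does not call the Prefect API. In practice the records would come from whatever source we end up querying.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta, timezone

# Assumption: states in which a healthy run should keep producing updates.
ACTIVE_STATES = {"RUNNING", "PENDING"}


@dataclass
class RunRecord:
    run_id: str
    state: str
    last_update: datetime  # time of the last state change or log entry


def find_stuck_runs(runs, now, max_silence):
    """Return ids of active runs that have been silent for longer than max_silence."""
    return [
        r.run_id
        for r in runs
        if r.state in ACTIVE_STATES and now - r.last_update > max_silence
    ]


# Toy data with a fixed "now" so the result is deterministic.
now = datetime(2023, 2, 1, 12, 0, tzinfo=timezone.utc)
runs = [
    RunRecord("run-a", "RUNNING", now - timedelta(minutes=5)),   # recently active
    RunRecord("run-b", "RUNNING", now - timedelta(hours=7)),     # silent too long
    RunRecord("run-c", "COMPLETED", now - timedelta(hours=9)),   # finished, ignored
]
stuck_runs = find_stuck_runs(runs, now, max_silence=timedelta(hours=6))
print(stuck_runs)
```

The threshold would need to be set per deployment, since a very long-running job is one of the suspected causes and should not be flagged just for taking a long time while it is still reporting progress.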