
Introduce timeout in retire_workers #6252

Open
crusaderky opened this issue Apr 29, 2022 · 1 comment
crusaderky commented Apr 29, 2022

Follow-up from #6223

At the moment, retire_workers waits indefinitely (see the sketch after this list), until either

  • all in-memory tasks also exist somewhere else (worker transitions from closing_gracefully to closing), or
  • there are no other workers on the cluster which can accept the unique data and are neither in paused nor closing_gracefully state (worker transitions from closing_gracefully back to running).
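To make this concrete, here is a heavily simplified model of that open-ended wait. This is a sketch, not the actual scheduler code; `n_unique_keys` and `n_viable_recipients` are hypothetical callables standing in for the real scheduler-side bookkeeping:

```python
import asyncio
from typing import Callable

async def wait_for_retirement(
    n_unique_keys: Callable[[], int],        # hypothetical: unique keys left on the worker
    n_viable_recipients: Callable[[], int],  # hypothetical: workers able to receive them
) -> str:
    """Simplified model of the current behaviour: no timeout whatsoever."""
    while True:
        if n_unique_keys() == 0:
            return "closing"  # all in-memory data exists somewhere else
        if n_viable_recipients() == 0:
            return "running"  # everyone is paused or retiring; abort retirement
        await asyncio.sleep(0.1)
```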

There is an implicit timeout in the case where the Active Memory Manager tries to replicate keys from the retiring worker to other workers, but a recipient is unresponsive and the scheduler hasn't noticed yet: the AMM repeats the replication request every 2 seconds until the unresponsive recipient finally hits its own timeout and is removed from the cluster. Note that this behaviour has changed very recently (#6200). Once the recipient is removed, the AMM will try replicating towards a different recipient, or abort the retirement if no more recipients are available.
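For context, that retry behaviour amounts to roughly the following. Again a sketch with made-up names, not the actual code of the AMM's RetireWorker policy:

```python
import asyncio

AMM_INTERVAL = 2.0  # the AMM re-runs its policies every 2 seconds

async def replicate_until_done(pending, pick_recipient, send_request):
    """Hypothetical model of the retry loop: keys that still live only on
    the retiring worker are re-requested on every iteration. An unresponsive
    recipient is simply asked again, indefinitely, until the scheduler
    removes it and pick_recipient stops returning it.
    """
    while pending:  # `pending` shrinks elsewhere as replications complete
        for key in list(pending):
            recipient = pick_recipient(key)
            if recipient is None:
                return "abort"  # no viable recipients left: abort retirement
            send_request(key, recipient)
        await asyncio.sleep(AMM_INTERVAL)
    return "done"
```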

Notably, the above system will not protect you in case of

  • A deadlocked recipient which, however, is still regularly sending heartbeats to the scheduler
  • Yet-to-be-discovered deadlocks within the AMM itself. In particular, in early development stages I faced a lot of problems with tasks "ping-ponging" between workers at each AMM iteration, and fixed them only through fairly delicate logic.

It would be healthy to have a timeout, e.g. 60s, in retire_workers to mitigate this.
Such a timeout should be scheduler-side.
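A naive version would wrap the scheduler-side wait in a plain wall-clock timeout via asyncio.wait_for, reusing the hypothetical `wait_for_retirement` sketched above:

```python
import asyncio
from typing import Callable

async def retire_with_timeout(
    n_unique_keys: Callable[[], int],
    n_viable_recipients: Callable[[], int],
    timeout: float = 60.0,
) -> str:
    """Naive variant: give up after `timeout` wall-clock seconds,
    regardless of whether replication is still making progress."""
    try:
        return await asyncio.wait_for(
            wait_for_retirement(n_unique_keys, n_viable_recipients), timeout
        )
    except asyncio.TimeoutError:
        return "running"  # abort retirement; leave the worker running
```

However, a fixed wall-clock deadline is too blunt, as the next paragraph explains.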

It is important to consider the use case of a worker that is retired while it holds tens of GBs of spilled data and the spill disk is not very performant: retirement can then easily take far more than 60s. #5999 will aggravate this use case, since it will extend the mechanism that throttles the outgoing comms of a paused worker to workers in closing_gracefully state which are above the pause threshold.
To deal with this, I would suggest defining the timeout as "the number of unique keys on the retiring worker did not decrease compared to 60 seconds ago".
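In other words, the deadline resets whenever replication makes progress, and we only give up when the worker has been stuck for a full 60 seconds. A minimal sketch of such a progress-based watchdog, using the same hypothetical helper as above:

```python
import asyncio
from typing import Callable

async def wait_with_progress_timeout(
    n_unique_keys: Callable[[], int],  # hypothetical helper, as above
    timeout: float = 60.0,
    poll: float = 1.0,
) -> None:
    """Raise only if the number of unique keys on the retiring worker
    has not decreased for `timeout` seconds."""
    loop = asyncio.get_running_loop()
    best = n_unique_keys()
    deadline = loop.time() + timeout
    while best > 0:
        await asyncio.sleep(poll)
        n = n_unique_keys()
        if n < best:                   # progress: reset the clock
            best = n
            deadline = loop.time() + timeout
        elif loop.time() >= deadline:  # stalled for `timeout` seconds
            raise asyncio.TimeoutError(
                f"retire_workers stalled with {n} unique keys to replicate"
            )
```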

CC @fjetter @gjoseph92

mrocklin (Member) commented Apr 29, 2022 via email
