Introduce timeout in retire_workers #6252
I'm fine with a "still making progress" timeout. I think that we should also have a "time to be done" timeout. Many systems will want to kick us out after two minutes (HPC job schedulers, AWS spot instances), and my guess is that it's better if we do this ourselves.

Regardless, it'll be easy to make this configurable.
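Purely as a sketch of how the two timeouts could be exposed side by side — these configuration keys are hypothetical and do not exist in Dask today:

```python
import dask
from dask.utils import parse_timedelta

# Hypothetical configuration keys -- neither exists in Dask today; they only
# illustrate a "still making progress" timeout next to a hard "time to be done"
# timeout, both read through the normal Dask config machinery.
progress_timeout = parse_timedelta(
    dask.config.get("distributed.scheduler.retire-workers.progress-timeout", "60s")
)
hard_timeout = parse_timedelta(
    dask.config.get("distributed.scheduler.retire-workers.hard-timeout", "120s")
)
```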
Follow-up from #6223
At the moment, retire_workers waits indefinitely, until either:

- all in-memory tasks also exist somewhere else (worker transitions from closing_gracefully to closing), or
- there are no other workers on the cluster which can accept the unique data and are neither in paused nor closing_gracefully state (worker transitions from closing_gracefully back to running).

There is an implicit timeout where the Active Memory Manager tries to replicate keys from the retiring worker to other workers, but the recipients are unresponsive and the scheduler hasn't noticed yet: the AMM will repeat the replication request every 2 seconds, until the unresponsive recipient finally hits its timeout and is removed. Note that this behaviour has changed very recently (#6200). When this happens, the AMM will try replicating towards a different recipient, or abort retirement if no more recipients are available.
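For illustration only, here is a simplified, standalone sketch of the retry behaviour just described. It is not the actual ActiveMemoryManager code; all the callables are hypothetical stand-ins for scheduler state:

```python
import asyncio
from typing import Callable, Optional, Set

async def replicate_until_safe(
    key: str,
    candidates: Set[str],
    request_replication: Callable[[str, str], None],
    key_is_replicated: Callable[[str, str], bool],
    recipient_is_alive: Callable[[str], bool],
    interval: float = 2.0,
) -> bool:
    """Keep asking a recipient to copy ``key`` until the copy exists, switching
    recipients if the current one is removed, or give up when none are left."""
    recipient: Optional[str] = next(iter(candidates), None)
    while recipient is not None:
        request_replication(key, recipient)    # (re-)send the replication request
        await asyncio.sleep(interval)          # one AMM iteration (every 2 seconds)
        if key_is_replicated(key, recipient):
            return True                        # key is now safe on another worker
        if not recipient_is_alive(recipient):
            # the unresponsive recipient finally timed out and was removed by
            # the scheduler; try a different recipient, if any are left
            candidates.discard(recipient)
            recipient = next(iter(candidates), None)
    return False                               # no recipients left: abort retirement
```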
Notably, the above system will not protect you in case of:

- a deadlocked recipient, which however is still regularly sending heartbeats to the scheduler;
- yet-to-be-discovered deadlocks within the AMM. In particular, in early development stages I faced a lot of problems with tasks "ping-ponging" between workers at each AMM iteration, and fixed them only through fairly delicate logic.
It would be healthy to have a timeout, e.g. 60s, in retire_workers to mitigate this.
Such a timeout should be scheduler-side.
It is important to consider the use case of a worker that is retired while it holds tens of GB of spilled data and the spill disk is not very performant. This can easily mean that retirement takes way more than 60s. #5999 will aggravate this use case, since it will extend the mechanism that chokes the outgoing comms of a paused worker to workers in closing_gracefully state which are above the pause threshold.
To deal with this, I would suggest defining the timeout as "the number of unique keys has not decreased compared to 60 seconds ago".
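A minimal sketch of such a stall-based timeout — not actual scheduler code; count_unique_keys is a hypothetical callable returning how many keys exist only on the retiring worker:

```python
import asyncio
from typing import Callable

async def wait_for_no_unique_keys(
    count_unique_keys: Callable[[], int],
    check_every: float = 60.0,
) -> bool:
    """Wait until the retiring worker holds no unique keys. Give up (return
    False) if the count did not decrease at all over one full interval."""
    previous = count_unique_keys()
    while previous > 0:
        await asyncio.sleep(check_every)
        current = count_unique_keys()
        if current >= previous:
            return False  # no progress in the last 60 seconds: time out
        previous = current
    return True  # all unique keys are replicated; safe to close the worker
```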
CC @fjetter @gjoseph92