Introduce timeout in retire_workers #6252
I'm fine with a "still making progress" timeout. I think that we should also have a "time to be done" timeout. Many systems will want to kick us out after two minutes (HPC job schedulers, AWS spot instances), and my guess is that it's better if we do this ourselves.

Regardless, it'll be easy to make this configurable.
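Purely as a sketch of how the two timeouts could be exposed side by side — these configuration keys are hypothetical and do not exist in Dask today:

```python
import dask
from dask.utils import parse_timedelta

# Hypothetical configuration keys -- neither exists in Dask today; they only
# illustrate a "still making progress" timeout next to a hard "time to be done"
# timeout, both read through the normal Dask config machinery.
progress_timeout = parse_timedelta(
    dask.config.get("distributed.scheduler.retire-workers.progress-timeout", "60s")
)
hard_timeout = parse_timedelta(
    dask.config.get("distributed.scheduler.retire-workers.hard-timeout", "120s")
)
```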
Follow-up from #6223
At the moment, retire_workers waits indefinitely, until either:

- all in-memory tasks also exist somewhere else (worker transitions from closing_gracefully to closing), or
- there are no other workers on the cluster which can accept the unique data and are neither in paused nor closing_gracefully state (worker transitions from closing_gracefully back to running).

There is an implicit timeout where the Active Memory Manager tries to replicate keys from the retiring worker to other workers, but the recipients are unresponsive and the scheduler hasn't noticed yet: the AMM will repeat the replication request every 2 seconds, until the unresponsive recipient finally hits its timeout and is removed. Note that this behaviour has changed very recently (#6200). When this happens, the AMM will try replicating towards a different recipient, or abort retirement if no more recipients are available.
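For illustration only, here is a simplified, standalone sketch of the retry behaviour just described. It is not the actual ActiveMemoryManager code; all the callables are hypothetical stand-ins for scheduler state:

```python
import asyncio
from typing import Callable, Optional, Set

async def replicate_until_safe(
    key: str,
    candidates: Set[str],
    request_replication: Callable[[str, str], None],
    key_is_replicated: Callable[[str, str], bool],
    recipient_is_alive: Callable[[str], bool],
    interval: float = 2.0,
) -> bool:
    """Keep asking a recipient to copy ``key`` until the copy exists, switching
    recipients if the current one is removed, or give up when none are left."""
    recipient: Optional[str] = next(iter(candidates), None)
    while recipient is not None:
        request_replication(key, recipient)    # (re-)send the replication request
        await asyncio.sleep(interval)          # one AMM iteration (every 2 seconds)
        if key_is_replicated(key, recipient):
            return True                        # key is now safe on another worker
        if not recipient_is_alive(recipient):
            # the unresponsive recipient finally timed out and was removed by
            # the scheduler; try a different recipient, if any are left
            candidates.discard(recipient)
            recipient = next(iter(candidates), None)
    return False                               # no recipients left: abort retirement
```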
Notably, the above system will not protect you in case of:

- a deadlocked recipient, which however is still regularly sending heartbeats to the scheduler;
- yet-to-be-discovered deadlocks within the AMM. In particular, in early development stages I faced a lot of problems with tasks "ping-ponging" between workers at each AMM iteration, and fixed them only through fairly delicate logic.
It would be healthy to have a timeout, e.g. 60s, in retire_workers to mitigate this.
Such a timeout should be scheduler-side.
It is important to consider the use case of a worker that is retired while it holds tens of GB of spilled data and the spill disk is not very performant. This can easily mean that retirement takes way more than 60s. #5999 will aggravate this use case, since it will extend the mechanism that chokes the outgoing comms of a paused worker to workers in closing_gracefully state which are above the pause threshold.
To deal with this, I would suggest defining the timeout as "the number of unique keys has not decreased compared to 60 seconds ago".
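A minimal sketch of such a stall-based timeout — not actual scheduler code; count_unique_keys is a hypothetical callable returning how many keys exist only on the retiring worker:

```python
import asyncio
from typing import Callable

async def wait_for_no_unique_keys(
    count_unique_keys: Callable[[], int],
    check_every: float = 60.0,
) -> bool:
    """Wait until the retiring worker holds no unique keys. Give up (return
    False) if the count did not decrease at all over one full interval."""
    previous = count_unique_keys()
    while previous > 0:
        await asyncio.sleep(check_every)
        current = count_unique_keys()
        if current >= previous:
            return False  # no progress in the last 60 seconds: time out
        previous = current
    return True  # all unique keys are replicated; safe to close the worker
```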
CC @fjetter @gjoseph92