AMM can leave a forgotten task forever in missing state #6479
Same problem for tasks in flight:

```python
@gen_cluster(client=True, nthreads=[("", 1)], timeout=3)
async def test_forget_acquire_replicas_flight(c, s, a):
    """If a dependency fetch finishes on a worker after the scheduler already released
    everything, the worker might be stuck with a redundant replica which is never
    cleaned up.
    """
    async with BlockedGatherDep(s.address) as b:
        x = c.submit(inc, 1, key="x", workers=[a.address])
        await x
        s.request_acquire_replicas(b.address, ["x"], stimulus_id="test")
        await b.in_gather_dep.wait()
        assert b.tasks["x"].state == "flight"
        x.release()
        while "x" in s.tasks:
            await asyncio.sleep(0.01)
        b.block_gather_dep.set()
        while b.tasks:
            await asyncio.sleep(0.01)
```

XREF test_forget_data_not_supposed_to_have for tasks acquired through compute-task
I remember having similar logic on the scheduler before, with "I don't know about this task, please forget it" in various circumstances. This led to deadlocks due to race conditions. Wouldn't a better fix be for the scheduler to instead remember whom it asked to acquire a replica?
Everything goes through the same batched comms, sequentially. The RPC commands that can cause a race condition are:
[EDIT] see epic #6604
It's possible, but also unnecessarily complicated IMHO.
@fjetter are you happy for me to proceed with my design?
Yes, sounds good. I just wanted to highlight the possibility. If BatchedSend ordering protects us, that's good news 👍
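The ordering argument above can be sketched with a toy model. This assumes a single FIFO channel per worker for all scheduler commands; it is an illustrative model, not the actual BatchedSend implementation:

```python
from collections import deque

# Assumed model: all scheduler->worker commands for one worker travel over
# the same batched comm, so the worker applies them strictly in send order.
# A "free-keys" sent after "acquire-replicas" can therefore never be
# overtaken by it.
channel = deque()  # one FIFO channel per worker
channel.append(("acquire-replicas", "x"))
channel.append(("free-keys", "x"))

applied = []
while channel:
    # The worker pops and applies commands in the order they were sent.
    applied.append(channel.popleft())

assert applied == [("acquire-replicas", "x"), ("free-keys", "x")]
```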
Reproducer:
The above test times out on the last line.
Proposed design
At the moment, the scheduler silently ignores missing keys in the
request-refresh-who-has
message. The scheduler should instead respond stating "I don't know about this key; you should forget it too", which would trigger the key to be forgotten on the worker as well.
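A minimal sketch of this proposal, assuming a plain dict for scheduler task state; the class and the "free-keys" reply field are illustrative names, not the actual distributed API:

```python
# Hypothetical sketch of the proposed scheduler-side behaviour: reply to a
# who_has refresh with both the known replica locations and an explicit
# list of keys the worker should forget.
class SchedulerSketch:
    def __init__(self, tasks):
        # Mapping of known keys -> workers holding a replica (who_has).
        self.tasks = tasks

    def handle_request_refresh_who_has(self, keys):
        """Return who_has for known keys, plus the keys the scheduler no
        longer tracks, so the requesting worker can release them instead
        of waiting forever in the missing state."""
        who_has = {k: self.tasks[k] for k in keys if k in self.tasks}
        free_keys = [k for k in keys if k not in self.tasks]
        return {"who_has": who_has, "free-keys": free_keys}
```

On receipt, the worker would release every key listed under "free-keys" rather than keep polling for replicas that will never appear.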
Blockers
CC @fjetter