-
-
Notifications
You must be signed in to change notification settings - Fork 719
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Zombie tasks after missing->released transition #5316
Conversation
2a7c2c0
to
92ed7c7
Compare
Ensure that no fetch/flight tasks are left in the task dict of a | ||
worker after everything was released | ||
""" | ||
a.total_in_connections = 0 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
a.total_in_connections = 0 | |
# Cause f1 to remain in "fetch" state on b forever | |
a.total_in_connections = 0 |
f2 = c.submit(inc, f1, key="f2", workers=[b.address]) | ||
|
||
while f1.key not in b.tasks: | ||
await asyncio.sleep(0) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
await asyncio.sleep(0) | |
await asyncio.sleep(0.01) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The zero sleep is actually deliberate to avoid any timing related problems. fetch->flight->memory can be very fast and I don't want to miss the transition. My sleeping 0
we're checking this upon every loop cycle. This is a bit excessive but guarantees us not to miss this
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Doesn't a.total_in_connections = 0
cause the state on b to be stuck in "fetch" forever?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
No, the tasks will transition fo flight
and trigger a get_data
to worker A. Based on the total_in_connections
it will then limit the number of concurrent outgoing connections (outgoing_current_count
) and will reply with a {"status": "busy"}
message.
This is a downstream issue introduced by #5046
After a task is transitioned to missing and is released afterwards, the transition
missing->released
is leaving a zombie task, i.e. an unreferenced task. There is no data connected and no execution or fetch scheduled afterwards so this is not a critical issue.The fix is already functional but I'll keep the WIP until the primary PR is merged