Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Clean up nanny WorkerProcess.kill #6972

Merged
merged 1 commit into from
Aug 31, 2022
Merged

Conversation

gjoseph92
Copy link
Collaborator

I'm hoping that this resolves #6971, though it may not.

This fixes a couple things:

  1. When the subprocess exits, it'll call mark_stopped, which sets child_stop_q to None among other things. So between self.child_stop_q.put and self.child_stop_q.close, self.child_stop_q could have become None, since there's an await in between. That caused the AttributeError: 'NoneType' object has no attribute 'close'.
  2. Once that's resolved, we'd get to try to call process.join, after process.close has been called (in mark_stopped). That will cause join to raise a ValueError. This is fine—it means we can just exit—but we should catch that. I went for a string match on the ValueError since ValueError feels like too generic of an exception type to just ignore IMO.

I have no idea how to test this. The condition we're looking for is the subprocess exiting in between an await sleep(0). That's extremely hard to trigger. (I'm also not sure it still needs to be there??)

Closes #6971

  • Tests added / passed
  • Passes pre-commit run --all-files

@fjetter
Copy link
Member

fjetter commented Aug 31, 2022

FWIW I started cleaning up all the queues in #6655 but never got around to finish it. I removed the sleep(0) IIRC

Copy link
Member

@fjetter fjetter left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't love the ValueError handling but the rest is a straight forward change so if there is a chance this will help, that's great.

@fjetter fjetter merged commit f07f384 into dask:main Aug 31, 2022
@gjoseph92 gjoseph92 deleted the cleanup-nanny-kill branch August 31, 2022 15:04
gjoseph92 added a commit to gjoseph92/distributed that referenced this pull request Oct 31, 2022
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Flaky test_worker_start_exception
2 participants