Fixed socket binding race conditions #412
Conversation
Not sure what broke the asyncio operations. I'll debug this over the weekend to figure out what's wrong.
@Carreau I'm going to look into the async test failure tonight, but I was hoping maybe we could get this prioritized during the sprints and possibly get a release? The race condition here gets hit frequently with the new parallel execution tests in nbconvert.
@minrk @Carreau If one of you has insight, I think there are some low-level issues the tests are hitting that I'm struggling to piece together. The change above fixes the race I was seeing in nbconvert as far as I can tell, but the test suite has test_shutdown failing to tear down. I also see the test suite hang after completing, waiting for a thread to terminate. A Ctrl-C brings this stack trace out:
If the atexit change isn't made I see similar errors in other tests around
Considering I had to add the commented atexit cleanup back in to get the async tests to pass, it looks suspiciously like zmq cleanup is failing in the tests due to a process or thread hanging indefinitely. Thoughts on places to investigate further given the zmq stack traces?
ipykernel/kernelapp.py (outdated)
- # Uncomment this to try closing the context.
- # atexit.register(context.term)
+ atexit.register(context.term)
This is the source of the hang. Calling context.term in atexit terminates the context before its sockets are closed, which blocks indefinitely. This isn't safe to do without closing all sockets first. While terminating the context is a good thing to do, I don't think you need to do it in this PR: we would have to safely notify all sockets in all threads to close, and only then terminate the context (see the sketch after the suggestion below).
Suggested change:
- atexit.register(context.term)
+ # atexit.register(context.term)
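As a minimal illustration of the ordering issue described above (plain pyzmq, not the PR's actual code): `context.term()` blocks until every socket created on the context has been closed, so calling it from atexit while sockets are still open hangs the interpreter, whereas closing sockets first lets termination return promptly.

```python
import zmq

ctx = zmq.Context()
sock = ctx.socket(zmq.PUSH)
sock.bind("tcp://127.0.0.1:5555")

# Calling ctx.term() here would block: termination waits until every
# socket created on the context is closed (subject to each socket's
# LINGER setting for unsent messages).
sock.close(linger=0)  # close all sockets first...
ctx.term()            # ...then termination returns promptly
```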
@minrk Makes sense. If I turn that back off I get "Kernel Stopped Responding" in the async tests, though it's mostly tests running later in the pool that start failing with timeouts. It looks like earlier tests weren't cleaning up their sockets, so later tests trying to find a free port time out. Any test run individually passes, but the full pytest run starts failing (see the build for the first commit of this PR). Ideas on what needs to be done there?
Commits: "avoids hangs during context termination", "because travis can be real slow" (force-pushed from 055179e to 3a63c51)
Those changes do look good. Any inclination as to why the shutdown failed on just the Python 3.5 suite? I'll have a little time this weekend to look at this again.
Sorry, I meant to make a comment about the changes, not silently update your branch! I went through and added some explicit cleanup of zmq sockets and contexts. The hang was being caused by garbage collection of a context that a prior test hadn't explicitly cleaned up, leaving its sockets around, and collecting a context with live sockets hangs. I don't really understand why this would happen, since sockets have a reference to the context but not vice versa, so gc order shouldn't allow contexts to be deleted before sockets 🤷‍♂️, but it's always a good idea to clean up explicitly anyway.

I'm trying to think about shutdown. My hunch is that the eventloop in the kernel is not stopping, or possibly something in zmq is not cleaning up in the right order. The first thing I'd look at is whether any package versions differ on 3.5. If it's random, and different versions of py3 >= 3.5 hang on re-run, then I'd be almost certain garbage collection of zmq objects is the cause, because Python 3.5 introduced forced random gc order at process shutdown. In that case, making sure we clean up sockets at process shutdown should help.
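A minimal sketch of the kind of explicit per-test cleanup described here (the fixture name is hypothetical; pyzmq's `Context.destroy()` closes every socket created on the context before terminating it):

```python
import pytest
import zmq

@pytest.fixture
def zmq_context():
    ctx = zmq.Context()
    yield ctx
    # Close every socket created on this context, then terminate it,
    # so nothing is left for gc to collect in an arbitrary order.
    ctx.destroy(linger=0)
```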
might fix garbage collection-related hangs on kernel exit
I added one more commit that performs explicit cleanup at exit. If gc issues were the problem, I'd expect this to fix it (unless there are more sockets somewhere we aren't tracking). edit: this is essentially restoring the cleanup-at-exit change you made earlier, but with the extra bookkeeping to ensure everything is cleaned up in the right order.
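In outline, that cleanup-at-exit bookkeeping could look like the following sketch (hypothetical names, assuming pyzmq, not the commit's actual code): track each socket as it's created, close them all at exit, and only then terminate the context.

```python
import atexit
import zmq

_context = zmq.Context()
_sockets = []

def _track(sock):
    # Remember every socket so it can be closed before the context.
    _sockets.append(sock)
    return sock

def _cleanup_at_exit():
    # Deterministic order: sockets first, context last, so interpreter
    # shutdown never relies on gc ordering of zmq objects.
    for sock in _sockets:
        if not sock.closed:
            sock.close(linger=0)
    _context.term()

atexit.register(_cleanup_at_exit)

# Usage: shell = _track(_context.socket(zmq.ROUTER))
```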
Awesome! Thanks for the help on the PR. Anything else needed for this to merge?
I'd say let's go!
And we should try to do a release soon.
Thanks! And yes, a release would be helpful so the nbconvert release doesn't have to note a caveat about ipykernels occasionally crashing when parallelizing conversions.
I have no objections; though I haven't been involved in ipykernel development in a while, so I would prefer someone more involved with it to do one. I'm not even sure if that would be a 5.2 release, a 5.1.2, or a 6.0.
@Carreau @minrk Given these commits: v5.1.1...master — the changes are all forward compatible and fix minor-to-niche problems, so I think a patch release (5.1.2) is reasonable if this roughly follows semver. A 5.2.0 would probably be fine as well if we wanted to better denote the parallel execution support.
@MSeal Sorry to break in. I have an issue when I use
I have set up a new virtualenv and pip installed notebook 6.0.0 before upgrading the
Hmm... something added in the changes to get the tests to pass broke the kernel when parallelized. I can reproduce it by running the nbconvert test suite with this change installed. @minrk I won't be able to dig into fixing this for a bit, FYI.
#419 should fix the shutdown cleanup issue |
Following fixes in jupyter_client around ZeroMQ thread-safety issues, I found and closed a couple of races in the ipykernel wrapper as well, discovered while testing concurrent patterns with nbconvert.

The first is ZeroMQ contexts using .instance(), which yields a shared object that isn't safe to use across threads or processes. Forcing new context objects to be allocated rather than using the global instance costs a small memory footprint but avoids race-condition failures around those context objects.

The second race was rare, but enough Travis builds produced the stack trace below. It points to port binding where a port is picked first and then used in a later call; that port can be picked by a concurrent execution at the same time, resulting in a port-binding error for one half of the race. To fix this I added logic to wrap port binding with a retry on address-taken exceptions. I've tested an early version of this fix extensively while tracking down concurrency issues in notebook execution in nbconvert and upstream libraries, so I'm quite confident this port-binding issue is what caused the errors seen in Travis builds. 100 retries, rather than infinite retries, seemed like a safe compromise: port-bind conflicts are difficult to trigger, and if you hit more than a couple dozen, something more serious is wrong.
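A minimal sketch of the two fixes described above, assuming pyzmq; `bind_socket_with_retry` is a hypothetical helper for illustration, not the PR's actual code:

```python
import zmq

# Allocate a private context rather than the shared zmq.Context.instance(),
# which is a process-global singleton that isn't thread/fork safe.
context = zmq.Context()

def bind_socket_with_retry(socket, ip, port=0, max_attempts=100):
    """Bind `socket`, retrying when a pre-selected port is grabbed
    by a concurrent process between selection and binding."""
    for _ in range(max_attempts):
        try:
            if port <= 0:
                # No pre-selected port: let zmq pick a free one.
                return socket.bind_to_random_port(f"tcp://{ip}")
            socket.bind(f"tcp://{ip}:{port}")
            return port
        except zmq.ZMQError as e:
            if e.errno != zmq.EADDRINUSE:
                raise
            # Lost the race for this port; try a random one next.
            port = 0
    raise RuntimeError(f"could not bind socket after {max_attempts} attempts")

# Example: bind a shell-style ROUTER socket on localhost.
shell_socket = context.socket(zmq.ROUTER)
shell_port = bind_socket_with_retry(shell_socket, "127.0.0.1")
```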