JupyterLab 3.0.0 + Enterprise Gateway: kernel fails to communicate #903
Comments
As noted on gitter, I'm still unable to start a kernel using JL 3. I see the initialization of the websocket logged in the JL 3 instance, but I can't determine whether it's local rather than coming from the gateway handler. I never see the same log on the EG server, so I suspect the former and that the initialization of the handlers during jupyter_server startup is not working correctly. I will focus on that next, but I have essentially run out of time to spend on this today. At this moment the gateway integration with JL 3 (and jupyter_server) doesn't work. It might be worth trying to get the nbclassic server extension working, which essentially runs the Notebook front-end on jupyter_server. Sorry about this.
I noticed that since _connect became async (native async, without the @gen decorator used in the jupyter/notebook version of the gateway websocket handler), it needs to be properly awaited. It seems that Tornado's IOLoop requires native async functions to be added to its event loop using spawn_callback (https://www.tornadoweb.org/en/stable/guide/coroutines.html#how-to-call-a-coroutine). Since _connect creates self.ws_future, I inlined it into _connect(..., message_callback). It seemed reasonable to avoid nested callbacks, but this is something you might want to change. As a result, JupyterLab 3.0.0 seemed to work just fine with EG 2.3 for me after that. You can see the changes here: ricklamers/jupyter_server@457059a. Let me know if you want a PR for it.
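A minimal sketch of the scheduling pattern described above, with purely illustrative class and method names (this is not the actual jupyter_server code): a natively-async `_connect` has to be scheduled on Tornado's IOLoop, for example via `spawn_callback`, rather than called bare.

```python
# Illustrative sketch only; names do not match the real jupyter_server handler.
from tornado.ioloop import IOLoop
from tornado.websocket import websocket_connect


class GatewayChannelsProxy:
    """Relays websocket frames between the front-end and the gateway."""

    def __init__(self, gateway_ws_url):
        self.gateway_ws_url = gateway_ws_url
        self.ws = None

    async def _connect(self, message_callback):
        # websocket_connect returns an awaitable that resolves to the
        # connection; on_message_callback fires for every frame received
        # from the gateway.
        self.ws = await websocket_connect(
            self.gateway_ws_url, on_message_callback=message_callback
        )

    def open(self, message_callback):
        # Calling self._connect(...) directly would only create a coroutine
        # object that is never awaited; spawn_callback schedules the native
        # coroutine on the running IOLoop, as the Tornado guide recommends.
        IOLoop.current().spawn_callback(self._connect, message_callback)
```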
Hi @ricklamers - this looks great! I do see an issue with restarts, but oddly, only with kernels remote to EG. Kernels running local to EG (python3) appear to restart fine. Remote kernels (spark python yarn cluster) do not appear to process any cells following the restart. The YARN process looks fine. Just curious whether you're running with EG-remote kernels and see these same restart issues? Thank you so much for your help on this. I don't think I would have found what you found, but it makes sense given that one of the changes in JS was to move to native async. Excellent - thank you!
Sorry, I failed to recommend that you open a jupyter_server PR in my previous comment - thank you! I'm fairly convinced the restart issue is something we've seen in Lab-only code. The reason remote kernels appear to be the victims is that their ports change on restart - unlike with local kernels - and Notebook performed an unconditional reconnect that Lab does not. It would be great, if you have any remote kernels at your disposal, to try a restart or two and see if you observe the same behavior. I'm using Lab 3 rc9.
Thank you for your kind words. Just wanted to say, by the way, that I think Enterprise Gateway is an amazing project! Fantastic job by you and your fellow contributors. I've spent a good amount of time pulling my hair out, since I was able to get the LocalProcessProxy to restart the kernel (in Lab) while remote kernels were hanging after restart, even though Lab doesn't reconnect the client-side websocket connection on kernel restart in either case. For a RemoteProcessProxy, the kernel restart action on the client does restart the kernel and update the connection info (ports and IP) on the Enterprise Gateway, but somehow the existing websocket connection is not properly "refreshed" or "redirected". I tried debugging the issue, but it's not exactly clear to me how the kernel communication is routed between the Lab server and the kernel. In the regular notebook view a new websocket connection is made on the client on kernel restart, which circumvents the issues above. I verified a fix @jasongrout proposed for Lab in which, after restartKernel, a call to _handleRestart is made to force a new websocket connection, and with that fix everything appears to be working. It would be nice if we could get the kernel restart working without having to force a websocket reconnect on the client (Lab) side. I think a lot is already in place to do it, but some sort of reconnect is necessary on the server side of things due to the changing connection info of the kernel. Unfortunately, I don't have time to dive in deeper now, but I'd be happy to help find out where the restart is failing to properly refresh the connection across client <> Lab server <> EG <> remote kernel sometime in the near future. I'll create the PR for awaiting _connect shortly.
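A rough sketch of what the server-side reconnect suggested above could look like; every name here is hypothetical and does not exist in jupyter_server or Enterprise Gateway. The idea is that when the server learns a remote kernel restarted and its connection info changed, it drops the stale gateway websocket and opens a fresh one, leaving the Lab client's connection untouched.

```python
# Hypothetical sketch only; illustrates refreshing the server<->gateway
# websocket after a remote kernel restart instead of forcing the Lab
# client to reconnect.
from tornado.ioloop import IOLoop
from tornado.websocket import websocket_connect


class ReconnectingGatewayChannels:
    def __init__(self, get_current_ws_url, message_callback):
        # get_current_ws_url returns the kernel's websocket URL as currently
        # known; after a restart it would reflect the new host/ports.
        self._get_current_ws_url = get_current_ws_url
        self._message_callback = message_callback
        self._ws = None

    async def _connect(self):
        self._ws = await websocket_connect(
            self._get_current_ws_url(),
            on_message_callback=self._message_callback,
        )

    def on_kernel_restarted(self):
        # Drop the connection still pointing at the old endpoint, then
        # schedule a fresh connection on the IOLoop.
        if self._ws is not None:
            self._ws.close()
            self._ws = None
        IOLoop.current().spawn_callback(self._connect)
```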
The difference is that local kernels do not change their ports on restarts. So the lab client-side isn't doing anything relative to manual restarts. Because the ports stay put on local kernels, the WebSocket connection continues to work (luckily). Since we update ports on remote kernel restarts (because the remote kernel may very well be at a different location (machine) altogether), and because no WebSocket reconnection occurs, the restart succeeds (as that is managed by the kernel manager), yet the kernel communication gets orphaned. If one manually reconnects the kernel (you have to dig deep into the lab menus to find it, while notebook classic has it as a first-class kernel option), the remote kernel will become operational again. In the notebook classic front end, an explicit reconnect occurs on kernel restarts and that's why we don't see any of this behavior with remote kernels and the notebook front-end.
The gateway config settings override the websocket channels handler, and that handler essentially routes the messages to/from the gateway from/to the front-end. But I know you're familiar with that aspect since that's the area where you fixed up the async code. That said, I'm not familiar with websocket programming and find that particular handler a bit confusing.
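For context, the gateway handlers are swapped in when the server is pointed at EG. A jupyter_server config snippet along these lines should do it (the URL is a placeholder; `c` is the configuration object provided inside a Jupyter config file):

```python
# Hypothetical jupyter_server_config.py snippet; the gateway URL is a
# placeholder. Setting GatewayClient.url is what puts the server into
# gateway mode, overriding the kernel management and websocket channel
# handlers discussed above.
c.GatewayClient.url = "http://my-gateway-host:8888"
```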
The front-end also looks for an 'autorestarting' status that will never occur; it doesn't exist. I've been experimenting with forcing a status update on manual restarts that also indicates 'restarting', but that seems to cause TWO reconnects to occur because two callback handlers get posted. (Still trying to figure out why that's the case.)
I agree. Unfortunately, this is one of those areas where having the two "handlers" (one for kernel lifecycle management, the other for kernel messaging) essentially independent of each other is problematic. I'd like to do some more investigation, but will likely need to post to and perhaps re-open jupyterlab/jupyterlab#8013.
Thanks for the extensive reply. I'm applying @jasongrout's fix manually by patching it in https://github.com/orchest/orchest-integration/blob/jupyterlab3/src/index.ts. I guess the websocket connection is proxied to the Enterprise Gateway through that handler on the Lab server. I think it would be useful if we could get details from the Lab team as to why they don't reconnect the websocket connection on kernel restart. Even though we might be able to make it work across a single persistent websocket connection (swapping out the kernel connection on the Enterprise Gateway), I'm not sure whether it makes sense to go that route if the lack of websocket reconnect is not an explicit design choice for JupyterLab. I also noticed the "autorestarting" code in there and saw that it would trigger a websocket reconnect. Is there any formal specification of desired kernel restart behavior somewhere that you're aware of? Or is this uncharted territory?
@ricklamers - the PR I submitted to Lab for reconnecting on restarts was merged; please build with master if you can. This, coupled with your fix to server, should get us out of the woods here. It would be great if you could get ricklamers/jupyter_server@457059a into a PR since we're considering a server 1.1 release shortly.
I think we've now closed the various loops here: jupyterlab/jupyterlab#8013 (comment). Thank you for your help, Rick!
Description
I wanted to find out whether the upcoming release of JupyterLab 3.0.0 is compatible with the latest version of Enterprise Gateway.
Screenshots / Logs
Also attached are log files captured with --debug:
log-enterprise-gateway.log
log-jupyterlab.log
logs-colored-rtf.zip
Environment
I think this part of the log (log-jupyterlab.log) in particular could indicate why it's failing to communicate with the kernel: