JupyterLab 3.0.0 + Enterprise Gateway: kernel fails to communicate #903

Closed · ricklamers opened this issue Nov 15, 2020 · 9 comments

@ricklamers (Contributor) commented Nov 15, 2020:

Description

I wanted to find out whether the upcoming release of JupyterLab 3.0.0 is compatible with the latest version of Enterprise Gateway.

Screenshots / Logs

(screenshot attached: 2020-11-15 at 10:06:53)

Also attached are log files captured with --debug:

log-enterprise-gateway.log
log-jupyterlab.log
logs-colored-rtf.zip

Environment

  • Enterprise Gateway Version: 2.3
  • Platform: (not using containers for this test)
  • Others: JupyterLab 3.0.0rc9

I think this part of the log (log-jupyterlab.log) in particular could indicate why it's failing to communicate with the kernel:

[D 2020-11-15 10:08:59.395 ServerApp] 101 GET /api/kernels/bbd3d6f8-8921-4fc2-80a3-6d8d08cdc1a4/channels?session_id=bc8d927d-30b3-4a51-90b2-3de002e8580f (::1) 3.65ms
/Users/ricklamers/juptyterlab3-enterprise/venv/lib/python3.8/site-packages/jupyter_server/gateway/handlers.py:194: RuntimeWarning: coroutine 'GatewayWebSocketClient._connect' was never awaited
  self._connect(kernel_id)
RuntimeWarning: Enable tracemalloc to get the object allocation traceback
ricklamers changed the title from "JupyterLab 3.0.0 (rc.9 tested) + Enterprise Gateway" to "JupyterLab 3.0.0 + Enterprise Gateway: kernel fails to communicate" on Nov 15, 2020
@kevin-bates (Member) commented:

As noted on gitter, I'm still unable to start a kernel using JL 3.

I see the initialization of the websocket logged in the JL3 instance, but I can't determine whether it's local rather than coming from the gateway handler. I never see the same log on the EG server, so I suspect the former, and that the initialization of the handlers during jupyter_server startup is not working correctly. When the --gateway-url option is used, we need the gateway's websocket handlers to be used rather than the defaults.
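
(For reference, a minimal sketch of what gateway mode means here; the hostname is a placeholder and this assumes the GatewayClient.url setting that --gateway-url maps to:)

    # jupyter_server_config.py, equivalent to passing --gateway-url on the command line
    c.GatewayClient.url = "http://my-eg-host:8888"

With that set, kernel lifecycle requests and the kernel websocket channels should be proxied to EG instead of being handled by the local kernel manager.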

I will focus on that next but have essentially run out of time to spend on this today. At this moment the gateway integration with JL 3 (and jupyter_server) doesn't work. It might be worth trying to get the nbclassic server extension working, which essentially uses the Notebook front-end on jupyter_server.

Sorry about this.

@ricklamers (Contributor, Author) commented:

I noticed that since _connect became async (native async, unlike the @gen-decorated coroutine used in the jupyter/notebook version of the gateway websocket handler), it now needs to be properly awaited.

It seems that Tornado's IOLoop requires native coroutines to be scheduled on its event loop using spawn_callback (https://www.tornadoweb.org/en/stable/guide/coroutines.html#how-to-call-a-coroutine).

Since _connect creates self.ws_future, I inlined it into _connect(..., message_callback). It seemed reasonable to avoid nested callbacks, but this is something you might want to change.
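
Roughly, the pattern looks like this (a simplified sketch, not the exact jupyter_server code; the class and method names are only illustrative):

    from tornado.ioloop import IOLoop

    class GatewayWebSocketClientSketch:
        """Illustrative only; the real class lives in jupyter_server/gateway/handlers.py."""

        def on_open(self, kernel_id):
            # Before: calling the native coroutine like a plain function only creates
            # the coroutine object and never runs it, hence the
            # "coroutine 'GatewayWebSocketClient._connect' was never awaited" warning.
            # self._connect(kernel_id)

            # After: hand the coroutine to Tornado's IOLoop so it is actually scheduled.
            IOLoop.current().spawn_callback(self._connect, kernel_id)

        async def _connect(self, kernel_id):
            # Opens the websocket to the gateway and sets self.ws_future; the message
            # callback is passed in here rather than registered via a nested callback.
            ...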

After that change, JupyterLab 3.0.0 seemed to work just fine with EG 2.3 for me.

You can see the changes here: ricklamers/jupyter_server@457059a

Let me know if you want a PR for it.

@kevin-bates (Member) commented:

Hi @ricklamers - this looks great! I do see an issue with restarts, but oddly, only with kernels remote to EG. Kernels running local to EG (python3) appear to restart fine. Remote kernels (spark python yarn cluster) do not appear to process any cells following the restart. The YARN process looks fine and I see the execute_request message logged in EG, so the restart request is getting across and, given EG-local kernels restart fine, I suspect something is amiss in my EG instance.

Just curious if you're running with EG-remote kernels and see these same restart issues?

Thank you so much for your help on this. I don't think I would have found what you found, but it makes sense given one of the changes in JS was to move to native async. Excellent - thank you!

@kevin-bates (Member) commented:

Sorry, I failed to recommend in my previous comment that you open a jupyter server PR - thank you! I'm fairly convinced the restart issue is something we've seen in Lab-only code. The reason remote kernels appear to be the victims is that the ports change on restarts - unlike with local kernels - and notebook performed an unconditional reconnect that Lab does not.

It would be great, if you have any remote kernels at your disposal, to try a restart or two and see if you see the same behavior. I'm using lab 3rc9.

@ricklamers (Contributor, Author) commented Nov 17, 2020:

Thank you for your kind words. By the way, I just wanted to say that I think Enterprise Gateway is an amazing project! Fantastic job by you and your fellow contributors.

I've spent a good amount of time pulling my hair out, as I was also able to get the LocalProcessProxy to restart the kernel (in Lab) while remote kernels were hanging after restart, even though Lab doesn't reconnect the client-side websocket connection on kernel restart.

For a RemoteProcessProxy the kernel restart action on the client does restart the kernel and update the connection info (ports and IP) on the Enterprise Gateway. But somehow the existing websocket connection is not properly “refreshed” or “redirected”. I tried debugging the issue but it’s not exactly clear to me how the kernel communication is routed between the Lab server and the kernel.

In the regular notebook view, a new websocket connection is made on the client on kernel restart, which circumvents the issues above. I verified a fix @jasongrout proposed for Lab in which, after restartKernel, a call to _handleRestart is made to force a new websocket connection; with that fix everything appears to be working.

It would be nice if we could get the kernel restart working without having to force a websocket reconnect on the client (Lab) side. I think a lot is already in place to do it, but some sort of reconnect is necessary on the server side due to the kernel's changing connection info.

Unfortunately, I don't have time to dive in deeper now, but I'd be happy to help find out where the restart is failing to properly refresh the connection across client <> lab server <> EG <> remote kernel sometime in the near future.

I’ll create the PR for the _connect awaiting shortly.

@kevin-bates (Member) commented:

> I've spent a good amount of time pulling my hair out, as I was also able to get the LocalProcessProxy to restart the kernel (in Lab) while remote kernels were hanging after restart, even though Lab doesn't reconnect the client-side websocket connection on kernel restart.

The difference is that local kernels do not change their ports on restarts. So the lab client-side isn't doing anything relative to manual restarts. Because the ports stay put on local kernels, the WebSocket connection continues to work (luckily). Since we update ports on remote kernel restarts (because the remote kernel may very well be at a different location (machine) altogether), and because no WebSocket reconnection occurs, the restart succeeds (as that is managed by the kernel manager), yet the kernel communication gets orphaned. If one manually reconnects the kernel (you have to dig deep into the lab menus to find it, while notebook classic has it as a first-class kernel option), the remote kernel will become operational again.

In the notebook classic front end, an explicit reconnect occurs on kernel restarts and that's why we don't see any of this behavior with remote kernels and the notebook front-end.

> For a RemoteProcessProxy the kernel restart action on the client does restart the kernel and update the connection info (ports and IP) on the Enterprise Gateway. But somehow the existing websocket connection is not properly “refreshed” or “redirected”. I tried debugging the issue but it’s not exactly clear to me how the kernel communication is routed between the Lab server and the kernel.

The gateway config settings override the websocket channels handler and that handler essentially routes the messages to/from the gateway from/to the front-end. But I know you're familiar with that aspect since that's the area you fixed up the async code. That said, I'm not familiar with websocket programming and find that particular handler a bit confusing.
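
In rough terms, the routing amounts to bridging two sockets; here is an illustrative sketch (made-up names, not the actual handler code):

    import tornado.websocket

    class GatewayBridgeSketch:
        """Illustrative only: one socket faces the browser, the other faces EG."""

        def __init__(self, gateway_ws_base, write_to_browser):
            self.gateway_ws_base = gateway_ws_base    # e.g. "ws://my-eg-host:8888"
            self.write_to_browser = write_to_browser  # callable that writes to the browser socket
            self.gateway_ws = None

        async def connect(self, kernel_id):
            # Open the outbound websocket from the Lab server to EG's kernel channels;
            # anything EG sends is pushed straight back to the browser.
            url = f"{self.gateway_ws_base}/api/kernels/{kernel_id}/channels"
            self.gateway_ws = await tornado.websocket.websocket_connect(
                url, on_message_callback=self.write_to_browser)

        def forward_to_gateway(self, message):
            # Frames arriving from the browser are written out to the EG socket.
            self.gateway_ws.write_message(message)

The sketch glosses over channel multiplexing and reconnection; it is only meant to show the two hops involved.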

> In the regular notebook view, a new websocket connection is made on the client on kernel restart, which circumvents the issues above. I verified a fix @jasongrout proposed for Lab in which, after restartKernel, a call to _handleRestart is made to force a new websocket connection; with that fix everything appears to be working.

I'm not finding the _handleRestart() method getting called at all on manual kernel restarts in JL3 - are you? I set breakpoints in restartKernel and _handleRestart, as well as reconnect and _reconnect, and the only breakpoint hit is in restartKernel. I also see that _handleRestart is only conditionally called based on an executionState of 'restarting'. That state is ONLY set by the server when a kernel triggers its automatic restart logic, not when it is manually restarted.

The front-end also looks for an 'autorestarting' status that will never occur; it doesn't exist.

I've been experimenting with forcing a status update on manual restarts that also indicates 'restarting'. But that seems to produce TWO reconnects because two callback handlers get posted. (Still trying to figure out why that's the case.)

> It would be nice if we could get the kernel restart working without having to force a websocket reconnect on the client (Lab) side. I think a lot is already in place to do it, but some sort of reconnect is necessary on the server side due to the kernel's changing connection info.

I agree. Unfortunately, this is one of those areas where having the two "handlers" (one for kernel lifecycle management, the other for kernel messaging) essentially independent of each other is problematic.

I'd like to do some more investigation, but will likely need to post to and perhaps re-open jupyterlab/jupyterlab#8013.

@ricklamers (Contributor, Author) commented:

Thanks for the extensive reply.

I'm applying @jasongrout's fix manually by patching restartKernel directly in the JupyterLab client JS so that, in addition to its regular body, it also executes _handleRestart() to force the websocket reconnect.

https://github.com/orchest/orchest-integration/blob/jupyterlab3/src/index.ts

I guess if the websocket connection is proxied to the Enterprise Gateway through the jupyter_server handler (the one with the async issue), then somehow we need to make sure the websocket connection is redirected on the EG to the appropriate kernel after a kernel restart. That's what I was trying to do, but I haven't been able to because I haven't had much time to spend on this issue.

I think it would be useful if we could get details from the Lab team as to why they don't reconnect the websocket connection on kernel restart. Even though we might be able to make it work across a single persistent websocket connection (swapping out the kernel connection on the Enterprise Gateway), I'm not sure it makes sense to go that route if the lack of a websocket reconnect is not an explicit design choice for JupyterLab. I also noticed the "autorestarting" code in there and saw that it would trigger a websocket reconnect, since it calls _handleRestart(). Hence, I'm not sure what the Lab team themselves believe should happen on kernel restart.

Is there any formal specification of desired kernel restart behavior somewhere that you're aware of? Or is this uncharted territory?

@kevin-bates (Member) commented Nov 20, 2020:

@ricklamers - the PR I submitted to Lab for reconnecting on restarts was merged; please build with master if you can.

This, coupled with your fix to server, should get us out of the woods here. It would be great if you could get ricklamers/jupyter_server@457059a into a PR, since we're considering a server 1.1 release shortly.

@kevin-bates (Member) commented:

I think we've now closed the various loops here: jupyterlab/jupyterlab#8013 (comment)

Thank you for your help, Rick!
