Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

All kernels hang in UI if supervisor process is terminated #5037

Open
jmcphers opened this issue Oct 15, 2024 · 0 comments
Open

All kernels hang in UI if supervisor process is terminated #5037

jmcphers opened this issue Oct 15, 2024 · 0 comments
Labels
area: kernels Issues related to Jupyter kernels and LSP servers bug Something isn't working

Comments

@jmcphers
Copy link
Collaborator

System details:

Positron and OS details:

Any version with supervisor enabled.

Interpreter details:

Any.

Describe the issue:

If the supervisor process ends for any reason, Positron doesn't respond gracefully. All the sessions stop responding without any notifications or user feedback.

It isn't likely that the process will end on its own, but it could exit abruptly in rare circumstances:

  • Something triggers a close of all terminals (since the supervisor runs in a terminal)
  • An OOM killer or other system level reaper forces the supervisor to terminate
  • The supervisor itself panics and crashes due to an unexpected runtime condition

Steps to reproduce the issue:

Ensure the supervisor is enabled, then start a Python session. Get the process ID of the supervisor using Python's os library to get the parent process ID of the Python session:

>>> os.getppid()
84461

Then, kill that process (e.g. kill 84461 in a terminal).

Nothing appears to happen in the UI, but you can no longer execute code, LSPs are dead, you don't get diagnostics, etc.

Expected or desired behavior:

At the very least:

  • A notification should be displayed to the user indicating that a crash has occurred, along with any output emitted by the supervisor that might be relevant (e.g. a backtrace)
  • All affected sessions should enter the Exited state

More ambitiously, we could do better at crash recovery:

  • The supervisor itself could run sessions in such a way that it could reattach after a crash
  • We could attempt to restart all the sessions that were running before the crash

Implementation-wise, we might do something like check the process ID to see if it's still running if we see our websocket disconnect or an API call fails with a connection refused error unexpectedly, or we could have some sort of heartbeat for the supervisor itself.

@jmcphers jmcphers added the area: kernels Issues related to Jupyter kernels and LSP servers label Oct 16, 2024
@sharon-wang sharon-wang added the bug Something isn't working label Oct 21, 2024
jmcphers added a commit that referenced this issue Oct 21, 2024
This change causes Positron to handle an abrupt or unexpected exit of
the supervisor process more gracefully. Before the change, the IDE
basically became unresponsive. After the change, the situation is still
Not Great, but it's recoverable; it's about the same as having all your
interpreters crash at once. You'll see some error popups, but you can
get back to a working state.

Because Positron has open websockets for all the active sessions, the
best clue that the supervisor process has been terminated is that these
sockets mysteriously all disconnect. So the main approach here is to
fire disconnect events from the sessions; if we get one of these for a
session that hasn't exited, we check the process table to see what's
going on.

There are a couple of significant caveats:

- This approach only works when the supervisor process is local. We will
probably want to implement a mechanism that works for remote supervisors
too, if we add support for those.
- This approach won't help if the supervisor process is alive but
unresponsive. Though that state hasn't occurred in nature so far, we
will eventually want to guard against it.

Addresses #5037.

### QA Notes

We don't expect supervisor exits to happen to anyone during the normal
course of events, and this change does not attempt to make a supervisor
exit a pleasant experience.

This change also has a drive-by fix that addresses an issue running
notebook kernels under the supervisor.

---------

Signed-off-by: Jonathan <jonathan@posit.co>
Co-authored-by: sharon <sharon-wang@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area: kernels Issues related to Jupyter kernels and LSP servers bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants