Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Detect kernel readiness more reliably #2207

Merged
merged 3 commits into from
Feb 13, 2024
Merged

Conversation

lionel-
Copy link
Contributor

@lionel- lionel- commented Feb 4, 2024

@DavisVaughan This is the idea we discussed yesterday to provide a more practical notion of readiness than just receiving a heartbeat from the kernel, in a way that's generic across kernels. Instead of switching to Ready as soon as we get a heartbeat, we first send a dummy kernel-info request to the backend. When the reply comes in we know the kernel init has sufficiently progressed that it's able to service a request about its internal state. That's our cue to switch to Ready.

Alternative to posit-dev/ark#234

@lionel-
Copy link
Contributor Author

lionel- commented Feb 5, 2024

Davis found this discussion: https://github.com/jupyter/enhancement-proposals/blob/master/65-jupyter-xpub/jupyter-xpub.md#current-solution

It warns against the possibility of Shell connecting and servicing requests before the IOPub socket is fully connected. When this happens, the output does not reach the frontend and the user loses data.

To prevent this, we now send dummy requests until we get a status message on IOPub. Our notion of readiness becomes:

  • The Shell socket is connected and able to service requests.
  • The IOPub socket is connected and ready to send output.

Comment on lines 1325 to 1338
// Ignore Busy and Idle statuses that emitted on IOPub during startup
// (e.g. during the kernel-info request that we emit to detect kernel
// readiness). This prevents switching from Starting to Busy before we
// switch to Ready.
if (this.isStarting() && this.isRuntimeStatus(status)) {
// We've got a status, which signals completion of the first half
// of the kernel startup
this._receivedInitialStatus = true;

// Return for now, we'll emit statuses after the second half of
// startup (reception of a kernel-info reply) has completed
return;
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It makes sense in my head to keep setStatus() as simple as possible.

How would you feel about moving this block into this._iopub?.onMessage(? From what I can tell, it would go right before the switch (state) { line. It feels "right" to me to push this as close as possible to the native handling of the raw message.

This would also prevent these initial statuses from getting into the _idleMessageIds and _busyMessageIds Sets, which seems nice.

It also aligns with the fact that the rest of the related code is in the similar hook, this._shell?.onMessage(

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good idea, I have now done that

Copy link
Collaborator

@jmcphers jmcphers left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, although I wish there were a better way to do this!

@lionel- lionel- merged commit f5e232f into main Feb 13, 2024
1 check passed
@lionel- lionel- deleted the feature/kernel-readiness branch February 13, 2024 16:53
lionel- added a commit to posit-dev/ark that referenced this pull request Oct 10, 2024
Closes #569

This PR fixes a race condition regarding subscriptions to IOPub that causes clients to miss IOPub messages:

- On startup a client connects to the server sockets of a kernel.
- The client sends a request on Shell.
- The kernel starts processing the request and emits busy on IOPub.

If the client hasn't been able to fully subscribe to IOPub, messages can be lost, in particular the Busy message that encloses the request output.

On the Positron side we fixed it by sending kernel-info requests in a loop until we get a Ready message on IOPub. This signals Positron that the kernel is fully connected and in the Ready state: posit-dev/positron#2207. We haven't implemented a similar fix in our dummy clients for integration tests and we believe this is what is causing the race condition described in #569.

As noted in posit-dev/positron#2207, there is an accepted JEP proposal (JEP 65) that aims at solving this problem by switching to XPUB.

https://jupyter.org/enhancement-proposals/65-jupyter-xpub/jupyter-xpub.html
jupyter/enhancement-proposals#65

The XPUB socket allows the server to get notified of all new subscriptions. A message of type `iopub_welcome` is sent to all connected clients. They should generally ignore it but clients that have just started up can use it as a cue that IOPub is correctly connected and that they won't miss any output from that point on.

Approach:

The subscription notification comes in as a message on the IOPub socket. This is problematic because the IOPub thread now needs to listens to its crossbeam channel and to the 0MQ socket at the same time, which isn't possible without resorting to timeout polling. So we use the same approach and infrastructure that we implemented in #58 for listeing to both input replies on the StdIn socket and interrupt notifications on a crossbeam channel. The forwarding thread now owns the IOPub socket and listens for subscription notifications and fowrards IOPub messages coming from the kernel components.

---

* Start moving IOPub messages to forwarding thread

* Remove unused import

* Resolve the roundabout `Message` problem

The solution was to move the conversion to `JupyterMessage<T>` up into the match, so we "know" what `T` is!

* Use correct `Welcome` `MessageType`

* Implement `SubscriptionMessage` support and switch to `XPUB`

* The `Welcome` message doesn't come from ark

* Use `amalthea::Result`

* Add more comments

---------


Co-authored-by: Davis Vaughan <davis@posit.co>
Co-authored-by: Lionel Henry <lionel@posit.co>
lionel- added a commit to posit-dev/ark that referenced this pull request Oct 10, 2024
Closes #569

This PR fixes a race condition regarding subscriptions to IOPub that causes clients to miss IOPub messages:

- On startup a client connects to the server sockets of a kernel.
- The client sends a request on Shell.
- The kernel starts processing the request and emits busy on IOPub.

If the client hasn't been able to fully subscribe to IOPub, messages can be lost, in particular the Busy message that encloses the request output.

On the Positron side we fixed it by sending kernel-info requests in a loop until we get a Ready message on IOPub. This signals Positron that the kernel is fully connected and in the Ready state: posit-dev/positron#2207. We haven't implemented a similar fix in our dummy clients for integration tests and we believe this is what is causing the race condition described in #569.

As noted in posit-dev/positron#2207, there is an accepted JEP proposal (JEP 65) that aims at solving this problem by switching to XPUB.

https://jupyter.org/enhancement-proposals/65-jupyter-xpub/jupyter-xpub.html
jupyter/enhancement-proposals#65

The XPUB socket allows the server to get notified of all new subscriptions. A message of type `iopub_welcome` is sent to all connected clients. They should generally ignore it but clients that have just started up can use it as a cue that IOPub is correctly connected and that they won't miss any output from that point on.

Approach:

The subscription notification comes in as a message on the IOPub socket. This is problematic because the IOPub thread now needs to listens to its crossbeam channel and to the 0MQ socket at the same time, which isn't possible without resorting to timeout polling. So we use the same approach and infrastructure that we implemented in #58 for listeing to both input replies on the StdIn socket and interrupt notifications on a crossbeam channel. The forwarding thread now owns the IOPub socket and listens for subscription notifications and fowrards IOPub messages coming from the kernel components.

---

* Start moving IOPub messages to forwarding thread

* Remove unused import

* Resolve the roundabout `Message` problem

The solution was to move the conversion to `JupyterMessage<T>` up into the match, so we "know" what `T` is!

* Use correct `Welcome` `MessageType`

* Implement `SubscriptionMessage` support and switch to `XPUB`

* The `Welcome` message doesn't come from ark

* Use `amalthea::Result`

* Add more comments

---------

Co-authored-by: Davis Vaughan <davis@posit.co>
Co-authored-by: Lionel Henry <lionel@posit.co>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants