Race condition between the R and Amalthea Shell threads #569
One potential hypothesis is: I don't think we can see this right now because we wouldn't see these logs (ark/crates/amalthea/src/socket/iopub.rs, Line 122 in cac78eb). It's possible this should be a panic: dropping a Busy or Idle message is quite bad, and there is no way to recover from that.
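For illustration, here is a minimal sketch of that trade-off (not ark's actual code; the channel setup and message strings are made up), contrasting a silent log line with a panic when a status message can't be relayed:

```rust
use crossbeam_channel::{bounded, Sender};

// If relaying a status message to the IOPub thread fails, logging lets the
// message silently vanish; panicking surfaces the unrecoverable state.
fn forward_status(tx: &Sender<&'static str>, status: &'static str) {
    if let Err(err) = tx.send(status) {
        // Dropping a Busy/Idle message breaks the bracketing for good, so
        // failing loudly is arguably the right call.
        panic!("failed to relay {status}: {err}");
    }
}

fn main() {
    let (tx, rx) = bounded(1);
    forward_status(&tx, "status: busy"); // succeeds: receiver still alive
    drop(rx); // receiver gone: the next relay fails and we panic
    forward_status(&tx, "status: idle");
}
```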
lionel- added a commit that referenced this issue on Oct 10, 2024 (merged):
Closes #569

This PR fixes a race condition regarding subscriptions to IOPub that causes clients to miss IOPub messages:

- On startup a client connects to the server sockets of a kernel.
- The client sends a request on Shell.
- The kernel starts processing the request and emits Busy on IOPub.

If the client hasn't been able to fully subscribe to IOPub, messages can be lost, in particular the Busy message that encloses the request output. On the Positron side we fixed it by sending kernel-info requests in a loop until we get a Ready message on IOPub. This signals Positron that the kernel is fully connected and in the Ready state: posit-dev/positron#2207.

We haven't implemented a similar fix in our dummy clients for integration tests, and we believe this is what is causing the race condition described in #569.

As noted in posit-dev/positron#2207, there is an accepted JEP proposal (JEP 65) that aims to solve this problem by switching to XPUB: https://jupyter.org/enhancement-proposals/65-jupyter-xpub/jupyter-xpub.html, jupyter/enhancement-proposals#65.

The XPUB socket allows the server to get notified of all new subscriptions. A message of type `iopub_welcome` is sent to all connected clients. They should generally ignore it, but clients that have just started up can use it as a cue that IOPub is correctly connected and that they won't miss any output from that point on.

Approach: The subscription notification comes in as a message on the IOPub socket. This is problematic because the IOPub thread now needs to listen to its crossbeam channel and to the 0MQ socket at the same time, which isn't possible without resorting to timeout polling. So we use the same approach and infrastructure that we implemented in #58 for listening to both input replies on the StdIn socket and interrupt notifications on a crossbeam channel. The forwarding thread now owns the IOPub socket, listens for subscription notifications, and forwards IOPub messages coming from the kernel components.

---

* Start moving IOPub messages to forwarding thread
* Remove unused import
* Resolve the roundabout `Message` problem. The solution was to move the conversion to `JupyterMessage<T>` up into the match, so we "know" what `T` is!
* Use correct `Welcome` `MessageType`
* Implement `SubscriptionMessage` support and switch to `XPUB`
* The `Welcome` message doesn't come from ark
* Use `amalthea::Result`
* Add more comments

Co-authored-by: Davis Vaughan <davis@posit.co>
Co-authored-by: Lionel Henry <lionel@posit.co>
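As a rough illustration of the XPUB mechanism this commit relies on, here is a hedged sketch using the Rust `zmq` crate (the address and printout are made up; this is not ark's code). On an XPUB socket, subscriptions arrive as regular messages, which is what lets the kernel react by sending `iopub_welcome`:

```rust
fn main() -> zmq::Result<()> {
    let ctx = zmq::Context::new();
    let iopub = ctx.socket(zmq::XPUB)?;
    iopub.bind("tcp://127.0.0.1:5555")?; // address made up for the sketch

    loop {
        // On an XPUB socket, each (un)subscription is delivered as a
        // message whose first byte is 1 (subscribe) or 0 (unsubscribe),
        // followed by the subscribed topic.
        let event = iopub.recv_bytes(0)?;
        if event.first() == Some(&1) {
            // A client just subscribed: this is the moment the kernel can
            // send the `iopub_welcome` message described above.
            println!("new subscription, topic: {:?}", &event[1..]);
        }
    }
}
```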
One run of Windows CI failed because a `status` message came in on IOPub before the expected `execute_result`. I think it's caused by us handling requests on two different threads (Amalthea Shell and R) interacting with a third thread (Amalthea IOPub).
When a Shell request comes in, Amalthea's Shell socket thread (dispatch in ark/crates/amalthea/src/socket/shell.rs, Line 160 in cac78eb) calls the `execute_request` handler (in ark/crates/amalthea/src/socket/shell.rs, Line 221 in cac78eb), which sends a `Busy` message on IOPub before handling the request and sends `Idle` to IOPub afterwards.

Note that IOPub messages are not sent immediately on the IOPub socket: they are sent to the IOPub thread, which relays them to the IOPub socket it owns.
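A minimal sketch of that relay pattern, assuming a crossbeam channel and a hypothetical `Message` type (this is not ark's actual code; a println stands in for the 0MQ send):

```rust
use crossbeam_channel::{unbounded, Sender};
use std::thread;

// Hypothetical stand-in for a serialized Jupyter message.
type Message = String;

// What the other threads call: "publishing" is just a channel send.
fn publish(tx: &Sender<Message>, msg: Message) {
    tx.send(msg).unwrap();
}

fn main() {
    let (tx, rx) = unbounded::<Message>();

    // The IOPub thread owns the socket; it drains the channel and relays
    // each message onto the socket.
    let iopub = thread::spawn(move || {
        for msg in rx.iter() {
            println!("IOPub socket <- {msg}");
        }
    });

    publish(&tx, "status: busy".into());
    publish(&tx, "execute_result".into());
    publish(&tx, "status: idle".into());
    drop(tx); // Close the channel so the IOPub thread exits.
    iopub.join().unwrap();
}
```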
Meanwhile, Ark's `execute_request` handler (ark/crates/ark/src/shell.rs, Line 205 in cac78eb) hands the request off to the R thread. When the request has been completed, as determined by iterating back into `ReadConsole`, the R thread sends an `execute_result` on IOPub (ark/crates/ark/src/interface.rs, Line 1242 in cac78eb).
Note how there are three interacting threads: the Amalthea Shell thread, the R thread, and the Amalthea IOPub thread. This triangle configuration is bound to have race conditions in the ordering of outbound IOPub messages.
This concerns the status and result messages, as in the test failure above, but it's also possible for other IOPub messages sent by the R thread to lag behind past an Idle boundary. It's problematic because the Jupyter protocol specifies that the output of a request is enclosed within its Busy/Idle status messages. So clients are not required to track the parent header of IOPub messages to match them to already closed requests; they can just assume these are nested in status messages.
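To see why the triangle is racy, here is a self-contained sketch (hypothetical message strings; not ark's code) in which a "Shell" thread and an "R" thread both send into a single consumer. The interleaving depends on which send wins:

```rust
use crossbeam_channel::unbounded;
use std::thread;

fn main() {
    let (tx, rx) = unbounded::<&str>();

    // "Shell" thread: sends Idle once its handler returns.
    let tx_shell = tx.clone();
    let shell = thread::spawn(move || tx_shell.send("status: idle").unwrap());

    // "R" thread: sends the result when it iterates back into ReadConsole.
    let r = thread::spawn(move || tx.send("execute_result").unwrap());

    shell.join().unwrap();
    r.join().unwrap();

    // The IOPub side sees whichever send won the race: on some runs
    // `status: idle` arrives before `execute_result`, exactly the
    // misordering seen in the failing CI run.
    while let Ok(msg) = rx.try_recv() {
        println!("{msg}");
    }
}
```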
So there are two things we can do here:
To fix the status/result ordering, we can simply move the emission of results to the Amalthea socket thread, i.e. into Ark's `execute_request` handler. We already send a response there to unblock the thread, so that's easy.

A more comprehensive fix, one that would ensure the IOPub messages are enclosed in Busy/Idle statuses, would be to move the entire handling of Shell messages to the R thread, which would then own the Shell socket. There would then be only two threads involved, the R/Shell thread and IOPub. This would solve the triangle race conditions and would significantly simplify our asynchronous system. Related goal: RPC: Ark: Switch to blocking requests to avoid concurrent R evaluations (positron#2060).
A simpler fix would be to send IOPub messages from the R thread to the Amalthea Shell thread, which would then relay them to the IOPub thread. This would merge the two flows of IOPub messages and ensure proper ordering, as in the sketch below.
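A hedged sketch of that merged flow (hypothetical `IOPubMsg` type and channels; not ark's actual code). Because the Shell side becomes the only sender into IOPub, Busy, result, and Idle are ordered by construction:

```rust
use crossbeam_channel::unbounded;

// Hypothetical message type, for illustration only.
#[derive(Debug)]
enum IOPubMsg {
    Busy,
    ExecuteResult(String),
    Idle,
}

fn main() {
    // Single channel into the IOPub thread: the Shell thread is now the
    // only sender, so ordering is fixed by construction.
    let (to_iopub, iopub_rx) = unbounded::<IOPubMsg>();
    // Channel on which the R thread hands its outputs back to Shell.
    let (r_tx, r_rx) = unbounded::<IOPubMsg>();

    // "R thread": produces the result, then drops its sender, which
    // closes the channel and signals completion.
    std::thread::spawn(move || {
        r_tx.send(IOPubMsg::ExecuteResult("42".into())).unwrap();
    });

    // "Shell thread" (played by main here): wraps everything the R thread
    // produced between Busy and Idle.
    to_iopub.send(IOPubMsg::Busy).unwrap();
    for msg in r_rx.iter() {
        to_iopub.send(msg).unwrap();
    }
    to_iopub.send(IOPubMsg::Idle).unwrap();
    drop(to_iopub);

    // "IOPub thread": always observes Busy, result, Idle in that order.
    for msg in iopub_rx.iter() {
        println!("{msg:?}");
    }
}
```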
I'm now doubting my analysis in the details box. The three-thread issue arises when there's a continuous flow of messages from threads A and B to thread C. But here the flow of messages from R and Shell to IOPub is nested.