Improve client disconnect handling #2068
Conversation
Very nice work! A question out of curiosity: the updates the server sends are incremental, so we can't afford to skip any? My line of thinking was that maybe clients which are too slow could be supplied with "absolute" updates that redraw everything, just like the initial frame after attaching to a session?
I just realized that some of the messages sent to the client are not renders, but also UI state changes etc. My idea might not work with these.
@har7an - cool work on this! I took a brief glance at the code, as it seems that the "Buffer full" message exposed some underlying issues and now a lot of users are able to report them rather than just having glitches or random crashes. It seems to me that we already have this functionality built-in in the form of the bounded queue here: `zellij/zellij-server/src/os_input_output.rs`, line 346 (as of 2afb355).
Could we not do away with the retry queue implemented in this PR by increasing that queue from 50 to 1000, as you suggested? Or maybe even by turning it into an unbounded queue?
@raphCode - render messages are ANSI instructions which are stateful. We cannot skip any of them, otherwise things will likely get corrupted for the user. We could implement redraws, which would essentially mean to have the client reattach every time they receive a buffer full message. As I mentioned in the original PR that exposed this issue, it's not super trivial to do - and arguably is a patch that doesn't address the underlying problem. The more I think about this, the more I feel an unbounded queue is the solution we're looking for here (possibly with the addition of timing out unresponsive clients after a certain amount of time).
Yeah, I hoped redraws would be easy to implement because that is what happens anyway on attach. Buffering between server and client helps to bridge short periods where the client falls behind, but eventually the buffers are full and then there are three options:
We should do 3., and maybe after a long timeout (not measured in buffered bytes, but in minutes or hours) do 2. I tested what tmux does in this situation. Can you elaborate on how an unbounded queue would fit into the scenario?
Honestly, I thought an unbounded queue would fit here because I didn't realize these stalls can be measured in more than microseconds. You're right then that it's not ideal. I've seen this tmux behavior and honestly I don't like it. At least not the way it's implemented in tmux. It's a bad UX IMO, and for us even now it's much, much smoother. I'd probably opt for some more robust buffering (increasing the number of messages significantly to account for about 10-20 seconds of lag) and then disconnecting the client and having it reattach to get a redraw. But since @har7an is the one working on it now, let's not engage in too much armchair quarterbacking and allow him to choose the best path forward.
Another view: Redraws are comparable to ubiquitous frame dropping during video playback or realtime rendering.
What do you not like about it?
Alright.

```sh
mkfifo /tmp/pipe
while true; do pv --rate-limit 30K /tmp/pipe > /dev/null; done &
script /tmp/pipe
# start / attach zellij here and start command that spams output
```
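To spell out what the snippet does: `script` writes everything the terminal displays into the fifo, and `pv --rate-limit 30K` drains the fifo at only ~30 KB/s, so the session behaves like a client whose output path is artificially slow. Any command that floods stdout should then trigger the back-pressure within seconds.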
Implementor catches up

Hello you two, happy to see I've drawn your attention. :)
Yes we could, but then:
I mean we could throw away that boundary and just make the queue unbounded, fair enough. But I'm not sure what happens when the blocking client has successfully drained all free RAM. I guess it will kill the whole application with it, and I'm totally against that sort of behavior. Like: it's not the other clients' fault. We can bump the queue size to some ridiculously huge amount and just disconnect the client if it's full, but before I do that I want to read up on how crossbeam allocates memory for its queues.

Generally I'm quite happy with the current behavior: disconnect the client, display a message explaining what happened, and then have the user handle it themselves. If they know their server instance is lost beyond hope (maybe they ran @raphCode's "billion cuddles attack") they can at least kill it without hassle. The idea of a full redraw is nice, but as far as I know
Okay, so according to the crossbeam sources a bounded channel is preallocated (as an array), whereas an unbounded channel isn't (linked list). In addition, we can query the length of any channel type. I assume the bounded channel is faster, which is beneficial for us. I mean: if the client is slow already, I see no point in having it use a slower queue for IPC. So I guess using a bounded channel with a large length is totally fine. I'll try to find out just how big one of these messages actually is.
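For illustration, roughly what that looks like with crossbeam-channel - a standalone sketch, not zellij code; the payload type and capacity here are made up:

```rust
use crossbeam_channel::{bounded, unbounded};

fn main() {
    // Bounded: capacity fixed up front, backing storage preallocated as an array.
    let (btx, brx) = bounded::<String>(50);
    // Unbounded: a linked list of blocks, grown on demand.
    let (utx, urx) = unbounded::<String>();

    btx.send("render instructions".to_string()).unwrap();
    utx.send("render instructions".to_string()).unwrap();

    // Both flavors can report how many messages are currently queued,
    // which is what lets the server notice a client falling behind.
    println!("bounded: {} queued, capacity {:?}", brx.len(), btx.capacity());
    println!("unbounded: {} queued", urx.len());
}
```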
On my main PC, I get zellij to sink terminal text output at around 6 MB/s. Would 10 seconds' worth of buffering mean 60 MB, plus struct overhead?
There might be some more overhead given that Zellij is not a lossless pipe. We've gotta do all sorts of manipulations to the raw ANSI data - namely interpreting it and sending it to your screen in a way that will place it in the relevant pane(s) (as well as other small stuff such as moving the cursor between panes, hiding it during rendering so it won't jump around, etc.) But all in all, it shouldn't be significantly larger. I'd be interested in seeing how @har7an 's experiment goes and how much memory usage 20-30 seconds of buffering continuous heavy traffic causes - at least, what's the scale we're talking about.
Previously any additional arguments passed on the command line were ignored. Now they are appended to `cargo run ...` as documented.
and display the last send/recv error to the user instead of a generic "Buffer full" message.
so we will know when an error occurred while trying to send a message to the client. The most likely cause for this is that the client buffer filled up and hence we cannot send any new messages. While we still disconnect the client as before, now we also write a log message that explains the situation.
and log a message before disconnecting it.
that is dynamically allocated on-demand and stores `ServerToClientMsg` in case the regular IPC channel is currently full. This acts as a dynamic buffer that holds messages for a while until the client hopefully catches up. Also write a message to the log to indicate when the client is recognized to be too slow in handling server messages.
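Roughly the pattern this commit describes - an illustrative sketch only, not the actual change; the message variant, field names, and limit handling below are stand-ins:

```rust
use crossbeam_channel::{Sender, TrySendError};

// Stand-in for zellij's real message type, for illustration only.
pub enum ServerToClientMsg {
    Render(String),
}

pub struct ClientSender {
    channel: Sender<ServerToClientMsg>,
    // Stays empty (and therefore allocation-free) as long as the client keeps up.
    retry_queue: Vec<ServerToClientMsg>,
    retry_queue_limit: usize,
}

impl ClientSender {
    /// Returns false once the retry queue overflows, i.e. the client should be disconnected.
    pub fn send_or_buffer(&mut self, msg: ServerToClientMsg) -> bool {
        // Flush previously buffered messages first so ordering is preserved.
        while !self.retry_queue.is_empty() {
            let queued = self.retry_queue.remove(0);
            match self.channel.try_send(queued) {
                Ok(()) => {}
                Err(TrySendError::Full(queued)) => {
                    // Channel still full: put the message back at the front and stop flushing.
                    self.retry_queue.insert(0, queued);
                    break;
                }
                // The client side of the channel is gone; give up on this client.
                Err(TrySendError::Disconnected(_)) => return false,
            }
        }

        if self.retry_queue.is_empty() {
            match self.channel.try_send(msg) {
                Ok(()) => return true,
                Err(TrySendError::Full(msg)) => {
                    log::warn!("client not keeping up, buffering message in retry queue");
                    self.retry_queue.push(msg);
                }
                Err(TrySendError::Disconnected(_)) => return false,
            }
        } else {
            self.retry_queue.push(msg);
        }

        self.retry_queue.len() <= self.retry_queue_limit
    }
}
```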
Force-pushed from bb4617f to 894b5e6
So according to my measurements, bumping the IPC queue from 50 messages to 5000 has no noticeable impact on memory usage at all. I used the following command line to check memory usage for both cases:
after I had started zellij using
via `Vec` and drastically increase the IPC message queue size instead. Measurements didn't show a drastic increase in RAM caused by this, and it is a much easier fix for the problem at hand.
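The shape of that change, in a rough sketch - the constant name and payload type are made up; only the capacity values come from the discussion above:

```rust
use crossbeam_channel::bounded;

fn main() {
    // Previously the client IPC channel held at most 50 messages; per the
    // measurements above, 5000 showed no noticeable memory impact.
    const CLIENT_MSG_QUEUE_CAPACITY: usize = 5000;

    let (sender, receiver) = bounded::<Vec<u8>>(CLIENT_MSG_QUEUE_CAPACITY);
    sender.send(b"\x1b[2J".to_vec()).unwrap(); // e.g. a chunk of ANSI render data
    println!("queued messages: {}", receiver.len());
}
```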
Force-pushed from 894b5e6 to 60b7e3a
This PR includes a few changes to the behavior when the session server disconnects a client as a reaction to the client not handling messages quickly enough. This happens, for example, when the system is under high load or when the terminal application is blocked by some sort of user interaction (hence keeping it from handling server messages). The server will continue sending `ServerToClientMsg` messages to all connected clients via an IPC channel. Currently, if the channel fills up, the client is disconnected and a "Buffer full" error message is displayed to the user.

Now we have a dynamically sized retry queue that stores messages that do not fit into the IPC channel. At the moment the queue is limited to 1000 messages, with the IPC channel itself holding at most 50 messages. This should drastically decrease disconnections caused by temporary flooding of STDOUT or high system load.
The retry queue is implemented as a vector and allocated dynamically. Hence, for clients that are quick enough and never saw a "Buffer full" error before, it will not take up any extra space. Slower clients will allocate the retry queue as needed.
In addition, when the IPC channel is full and messages are stored in the retry queue for the first time, a warning is now logged:
If the client manages to catch up, a corresponding message is logged, too:
If the retry queue fills up, the client is still forcefully disconnected, but logging on the server has now been improved:
And instead of "Buffer full", the client will now see the following message on their terminal:
This improves the UX on #2023, #2041, #2048 and #2063, and may help to mitigate client crashes in some of these cases. It does not, however, explain why the server sometimes becomes unresponsive and locks up. I couldn't find an explanation for this.