
Improve client disconnect handling #2068

Merged · 11 commits merged into zellij-org:main on Jan 22, 2023

Conversation

@har7an (Contributor) commented Jan 8, 2023

This PR includes a few changes to the behavior when the session server disconnects a client in reaction to the client not handling messages quickly enough. This happens, for example, when the system is under high load or the terminal application is blocked by some sort of user interaction (which keeps it from handling server messages). The server keeps sending `ServerToClientMsg` messages to all connected clients via an IPC channel. Currently, if the channel fills up, the client is disconnected and a "Buffer full" error message is displayed to the user.

Now, we have a dynamically sized retry queue that stores messages that do not fit into the IPC channel. At the moment the queue is limited to 1000 messages, with the IPC channel itself holding at most 50 messages. This should drastically decrease disconnections caused by temporary flooding of STDOUT or high system load.

The retry queue is implemented as a vector and allocated dynamically. Hence, on clients that are quick enough and have never seen a "Buffer full" error before, it will not take up any extra space. Slower clients will allocate the retry queue as needed.
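
Roughly, the mechanism looks like the following simplified Rust sketch (illustrative only, not the actual zellij code; the helper name and the generic message type are made up for the example):

use crossbeam::channel::{Sender, TrySendError};

const RETRY_QUEUE_LIMIT: usize = 1000; // mirrors the limit described above

// Try to deliver `msg` to one client, buffering it in `retry_queue` when the
// bounded IPC channel towards that client is currently full.
fn send_or_buffer<T>(
    client_id: u16,
    sender: &Sender<T>,
    retry_queue: &mut Vec<T>,
    msg: T,
) -> Result<(), String> {
    // Flush previously buffered messages first so ordering is preserved.
    while !retry_queue.is_empty() {
        let queued = retry_queue.remove(0);
        match sender.try_send(queued) {
            Ok(()) => continue,
            Err(TrySendError::Full(queued)) => {
                retry_queue.insert(0, queued); // still no room, put it back
                break;
            }
            Err(TrySendError::Disconnected(_)) => {
                return Err(format!("failed to send message to client {client_id}"));
            }
        }
    }

    if retry_queue.is_empty() {
        match sender.try_send(msg) {
            Ok(()) => return Ok(()),
            Err(TrySendError::Full(msg)) => {
                // The real code only warns on the first overflow; simplified here.
                log::warn!("client {client_id} is processing server messages too slow");
                retry_queue.push(msg);
            }
            Err(TrySendError::Disconnected(_)) => {
                return Err(format!("failed to send message to client {client_id}"));
            }
        }
    } else {
        // Never bypass messages that are already waiting in the retry queue.
        retry_queue.push(msg);
    }

    if retry_queue.len() > RETRY_QUEUE_LIMIT {
        return Err(format!("client {client_id} is too slow to handle incoming messages"));
    }
    Ok(())
}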

In addition, when the IPC channel is full and messages are stored in the retry queue for the first time, a warning is now logged:

WARN   |zellij_server::os_input_o| 2023-01-08 08:33:48.696 [main      ] [zellij-server/src/os_input_output.rs:382]: client 1 is processing server messages too slow

If the client manages to catch up, a corresponding message is logged, too:

INFO   |zellij_server::os_input_o| 2023-01-08 08:43:23.394 [main      ] [zellij-server/src/os_input_output.rs:407]: client 1 caught up processing server messages

If the retry queue fills up, the client is still forcefully disconnected, but logging on the server has been improved:

ERROR  |???                      | 2023-01-08 08:34:04.220 [main      ] [zellij-server/src/lib.rs:572]: a non-fatal error occured

Caused by:
    0: client 1 is processing server messages too slow
    1: failed to send message to client 1
    2: failed to send or buffer message for client 1
    3: Client 1 is too slow to handle incoming messages

And instead of "Buffer full", the client will now see the following message on their terminal:

Your zellij client lost connection to the zellij server.

As a safety measure, you have been disconnected from the current zellij session.
However, the session should still exist and none of your data should be lost.

This usually means that your terminal didn't process server messages quick
enough. Maybe your system is currently under high load, or your terminal
isn't performant enough.

There are a few things you can try now:
    - Reattach to your previous session and see if it works out better this
      time: `zellij attach unarmed-snails`
    - Try using a faster terminal. GPU-accelerated terminals such as Kitty
      or Alacritty are cross-platform and known to work well with zellij.

This improves the UX on #2023, #2041, #2048 and #2063, and may help to mitigate client crashes in some of these cases. It does not, however, explain why the server sometimes becomes unresponsive and locks up. I couldn't find an explanation for this.

@raphCode (Contributor)

Very nice work!

A question out of curiosity: the updates the server sends are incremental, so we can't afford to skip any of them?

My line of thinking was that maybe clients which are too slow could be supplied with "absolute" updates that redraw everything, just like the initial frame after attaching to a session?
These absolute updates are not time-critical and don't make sense to pile up, so they can be sent out at whatever pace the client can process.

@raphCode (Contributor)

I just realized that some of the messages sent to the client are not renders, but things like UI state changes. My idea might not work with these.

@imsnif (Member) commented Jan 13, 2023

@har7an - cool work on this!

I took a brief glance at the code, as it seems that the "Buffer full" message exposed some underlying issues and now a lot of users are able to report them rather than just having glitches or random crashes.

It seems to me that we already have this functionality built-in in the form of the bounded queue here:

let (client_buffer_sender, client_buffer_receiver) = channels::bounded(50);

Could we not do away with the retry-queue implemented in this PR by increasing that queue from 50 to 1000 as you suggested? Or maybe even turning it into an unbounded queue?

@imsnif (Member) commented Jan 13, 2023

@raphCode - render messages are ANSI instructions which are stateful. We cannot skip any of them, otherwise things will likely get corrupted for the user.

We could implement redraws, which would essentially mean to have the client reattach every time they receive a buffer full message. As I mentioned in the original PR that exposed this issue, it's not super trivial to do - and arguably is a patch that doesn't address the underlying problem. The more I think about this, the more I feel an unbounded queue is the solution we're looking for here (possibly with the addition of timing out unresponsive clients after a certain amount of time).

@imsnif mentioned this pull request on Jan 13, 2023
@raphCode (Contributor) commented Jan 13, 2023

Yeah, I hoped redraws would be easy to implement because that is what happens anyway on attach.

Buffering between server and client helps to bridge short periods where the client falls behind, but eventually the buffers are full, and then there are three options:

  1. backpressure, stall the server
    this is what happened before: no one noticed because stalls were usually short, until fansari's weird ssh connection came along and everything blocked forever
  2. drop the client
    this is what we are doing now
  3. complete redraw when the client comes back
    tmux does this

We should do 3. and maybe after a long timeout (not measured in buffered bytes, but in minutes or hours) do 2.

I tested what tmux does with `seq 1 10000000 | dd status=progress`: one can pause the tmux client (Ctrl-B, Ctrl-Z), resume it after a while, and the session is instantly back without any intermediate updates.
Also, to be completely sure, I analyzed a `script` transcript of the tmux session: the first half of the numbers is continuous, afterwards there are gaps with full-screen redraws.
Also, sometimes when executing the command inside tmux, one can even SEE the redraws as visual lagging of the numbers updating; think of it as the output framerate being reduced.


Can you elaborate on how an unbounded queue would fit into this scenario?

@imsnif (Member) commented Jan 13, 2023

Honestly, I thought an unbounded queue would fit here because I didn't realize these stalls can be measured in more than microseconds. You're right then that it's not ideal.

I've seen this tmux behavior and honestly I don't like it. At least not the way it's implemented in tmux. It's bad UX IMO, and for us, even now, it's much, much smoother.

I'd probably opt for some more robust buffering (increasing the number of messages significantly to account for about 10-20 seconds of lag) and then disconnecting the client and having it reattach to get a redraw.

But since @har7an is the one working on it now, let's not engage in too much armchair quarterbacking and allow him to choose the best path forward.

@raphCode (Contributor) commented Jan 13, 2023

Another view: Redraws are comparable to ubiquitous frame dropping during video playback or realtime rendering.

I've seen this tmux behavior and honestly I don't like it. At least not the way it's implemented in tmux. It's a bad UX IMO

What do you not like about it?

@imsnif (Member) commented Jan 13, 2023

@raphCode - I'm going to defer the rest of this conversation to @har7an who is the person implementing this. I feel further theoretical discussions will not do us much good here, and will mostly serve to clutter up this PR and make it harder for the implementor to catch up.

@raphCode (Contributor) commented Jan 13, 2023

Alright.
I am just going to leave a test case here that simulates slow clients - think ssh connections over dialup:

mkfifo /tmp/pipe
while true; do pv --rate-limit 30K /tmp/pipe > /dev/null; done &
script /tmp/pipe
# start / attach zellij here and start command that spams output

@har7an (Contributor, Author) commented Jan 13, 2023

Implementor catches up

Hello you two, happy to see I've drawn your attention. :)

Could we not do away with the retry-queue implemented in this PR by increasing that queue from 50 to 1000 as you suggested?

Yes we could, but then:

  1. We wouldn't know when the client becomes slow in the first place (because we can't tell how full a crossbeam queue is). This isn't so bad I guess, because we will notice when the connection drops, but I thought it's a good hint to users to see this logged by the server.
  2. I'm not sure about the implications w.r.t. memory allocations. With the vector I can tell for sure that it isn't allocated as long as the client is quick enough, which I think is good.

I mean we could throw away that boundary and just make the queue unbounded, fair enough. But I'm not sure what happens once the blocking client has successfully drained all free RAM. I guess it will kill the whole application with it, and I'm totally against that sort of behavior. Like: it's not the other clients' fault.

We can bump the queue size to some ridiculously huge amount and just disconnect the client if it's full, but before I do that I want to read up on how crossbeam allocates memory for its queues.

Generally I'm quite happy with the current behavior: disconnect the client, display a message explaining what happened, and then have the user handle it themselves. If they know their server instance is lost beyond hope (maybe they ran @raphCode's "billion cuddles attack") they can at least kill it without hassle.

The idea of a full redraw is nice, but as far as I know Render isn't the only instruction we send to the client. What do we do with the others? Can we just ignore/drop them?

@har7an (Contributor, Author) commented Jan 13, 2023

Okay, so according to the crossbeam sources a bounded channel is preallocated (as an array), whereas an unbounded channel isn't (it's a linked list). In addition, we can query the length of any channel type with the `len` method (I missed that before).

I assume the bounded channel is faster, which is beneficial for us. I mean: if the client is slow already, I see no point in making it use a slower queue for IPC. So I guess using a bounded channel with a large capacity is totally fine. I'll try to find out just how big one of these `ServerToClientMsg` structs is so we get an idea of how many we can allocate without anyone noticing... :)
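
For illustration, here is a tiny self-contained example of the two channel flavors and the `len`/`capacity` queries (plain crossbeam API with arbitrary numbers, not zellij code):

use crossbeam::channel;

fn main() {
    // Bounded channel: the capacity is allocated up front as a fixed-size buffer.
    let (bounded_tx, bounded_rx) = channel::bounded::<u64>(5000);
    // Unbounded channel: grows on demand (a linked list of blocks), no preallocation.
    let (unbounded_tx, _unbounded_rx) = channel::unbounded::<u64>();

    bounded_tx.send(1).unwrap();
    bounded_tx.send(2).unwrap();
    unbounded_tx.send(3).unwrap();

    // Both flavors report how many messages are currently queued.
    println!("bounded: {} of {:?}", bounded_rx.len(), bounded_tx.capacity()); // 2 of Some(5000)
    println!("unbounded queued: {}", unbounded_tx.len()); // 1
}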

@raphCode (Contributor)

account for about 10-20 seconds of lag

On my main PC, I get zellij to sink terminal text output at around 6 MB/s. Would 10 seconds' worth of buffering mean 60 MB, plus struct overhead?

@imsnif (Member) commented Jan 13, 2023

account for about 10-20 seconds of lag

On my main PC, I get zellij to sink terminal text output at around 6 MB/s. Would 10 seconds' worth of buffering mean 60 MB, plus struct overhead?

There might be some more overhead given that Zellij is not a lossless pipe. We've gotta do all sorts of manipulations to the raw ANSI data - namely interpreting it and sending it to your screen in a way that will place it in the relevant pane(s) (as well as other small stuff such as moving the cursor between panes, hiding it during rendering so it won't jump around, etc.)

But all in all, it shouldn't be significantly larger. I'd be interested in seeing how @har7an's experiment goes and how much memory 20-30 seconds of buffering continuous heavy traffic uses - at least, what scale we're talking about.

Commit messages (excerpts):
- Previously any additional arguments passed on the command line were ignored. Now they are appended to `cargo run ...` as documented.
- … and display the last send/recv error to the user instead of a generic "Buffer full" message.
- … so we will know when an error occurred while trying to send a message to the client. The most likely cause for this is that the client buffer filled up and hence we cannot send any new messages. While we still disconnect the client as before, now we also write a log message that explains the situation.
- … and log a message before disconnecting it.
- … that is dynamically allocated on-demand and stores `ServerToClientMsg` in case the regular IPC channel is currently full. This acts as a dynamic buffer to hold messages for a while until the client hopefully catches up. Also write a message to the log to indicate when the client is recognized to be too slow in handling server messages.
@har7an force-pushed the fix/buffer-full-in-server branch from bb4617f to 894b5e6 on January 22, 2023 07:16
@har7an (Contributor, Author) commented Jan 22, 2023

So according to my measurements, bumping the IPC queue from 50 messages to 5000 has no noticeable impact on memory usage at all. I used the following command line to check memory usage for both cases:

$ ps -e -o drs,rss,trs,vsz,cmd | grep target/debug

after I had started zellij using `cargo x run`. So I'm dumping the Vec-based solution I implemented and raising the queue size to 5000 entries. If the client stalls entirely, that should be good for many seconds' worth of buffering before we disconnect them. If we do end up disconnecting the client, the error is now logged and the client displays a pretty message to the user (see above). Case closed for now, I would say, unless either of you has any objections. :)
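
For reference, a minimal sketch of what the final approach boils down to (simplified and self-contained; the real channel, logging and disconnect handling live in the zellij server code):

use crossbeam::channel::{bounded, TrySendError};

fn main() {
    // Was 50 before; 5000 buys many seconds of slack for a stalled client.
    let (client_buffer_sender, _client_buffer_receiver) = bounded::<String>(5000);

    match client_buffer_sender.try_send("render payload".to_string()) {
        Ok(()) => { /* message queued for the client */ }
        Err(TrySendError::Full(_)) => {
            // Only reached once ~5000 messages pile up unprocessed: log the
            // error chain and disconnect the client with the message above.
            eprintln!("client is too slow to handle incoming messages");
        }
        Err(TrySendError::Disconnected(_)) => {
            eprintln!("client hung up");
        }
    }
}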

Commit message (excerpt):
- … via `Vec` and drastically increase the IPC message queue size instead. Measurements didn't show a drastic increase in RAM caused by this, and it is a much easier fix for the problem at hand.
@har7an force-pushed the fix/buffer-full-in-server branch from 894b5e6 to 60b7e3a on January 22, 2023 08:55
@har7an merged commit beddfb7 into zellij-org:main on Jan 22, 2023
@har7an mentioned this pull request on Jan 22, 2023
@har7an deleted the fix/buffer-full-in-server branch on August 28, 2023