
Improve client disconnect handling #2068

Merged · 11 commits merged into zellij-org:main on Jan 22, 2023

Conversation

@har7an (Contributor) commented Jan 8, 2023

This PR includes a few changes to the behavior when the session server disconnects a client in reaction to the client not handling messages quickly enough. This happens, for example, when the system is under high load or the terminal application is blocked by some sort of user interaction (which keeps it from handling server messages). The server keeps sending `ServerToClientMsg` messages to all connected clients via an IPC channel. Currently, if the channel fills up, the client is disconnected and a "Buffer full" error message is displayed to the user.

Now, we have a dynamically sized retry queue that stores messages that do not fit into the IPC channel. At the moment the queue is limited to 1000 messages, with the IPC channel itself holding at most 50 messages. This should drastically decrease disconnections caused by temporary flooding of STDOUT or high system load.

The retry queue is implemented as a vector and allocated dynamically. Hence, on clients that are quick enough and have never seen a "Buffer full" error before, it will not take up any extra space. Slower clients will allocate the retry queue as needed.
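
Roughly, the mechanism looks like the following simplified Rust sketch (illustrative only, not the actual zellij code; the helper name and the generic message type are made up for the example):

use crossbeam::channel::{Sender, TrySendError};

const RETRY_QUEUE_LIMIT: usize = 1000; // mirrors the limit described above

// Try to deliver `msg` to one client, buffering it in `retry_queue` when the
// bounded IPC channel towards that client is currently full.
fn send_or_buffer<T>(
    client_id: u16,
    sender: &Sender<T>,
    retry_queue: &mut Vec<T>,
    msg: T,
) -> Result<(), String> {
    // Flush previously buffered messages first so ordering is preserved.
    while !retry_queue.is_empty() {
        let queued = retry_queue.remove(0);
        match sender.try_send(queued) {
            Ok(()) => continue,
            Err(TrySendError::Full(queued)) => {
                retry_queue.insert(0, queued); // still no room, put it back
                break;
            }
            Err(TrySendError::Disconnected(_)) => {
                return Err(format!("failed to send message to client {client_id}"));
            }
        }
    }

    if retry_queue.is_empty() {
        match sender.try_send(msg) {
            Ok(()) => return Ok(()),
            Err(TrySendError::Full(msg)) => {
                // The real code only warns on the first overflow; simplified here.
                log::warn!("client {client_id} is processing server messages too slow");
                retry_queue.push(msg);
            }
            Err(TrySendError::Disconnected(_)) => {
                return Err(format!("failed to send message to client {client_id}"));
            }
        }
    } else {
        // Never bypass messages that are already waiting in the retry queue.
        retry_queue.push(msg);
    }

    if retry_queue.len() > RETRY_QUEUE_LIMIT {
        return Err(format!("client {client_id} is too slow to handle incoming messages"));
    }
    Ok(())
}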

In addition, when the IPC channel is full and messages are stored in the retry queue for the first time, a warning is now logged:

WARN   |zellij_server::os_input_o| 2023-01-08 08:33:48.696 [main      ] [zellij-server/src/os_input_output.rs:382]: client 1 is processing server messages too slow

If the client manages to catch up, a corresponding message is logged, too:

INFO   |zellij_server::os_input_o| 2023-01-08 08:43:23.394 [main      ] [zellij-server/src/os_input_output.rs:407]: client 1 caught up processing server messages

If the retry queue fills up, the client is still forcefully disconnected, but logging on the server has been improved:

ERROR  |???                      | 2023-01-08 08:34:04.220 [main      ] [zellij-server/src/lib.rs:572]: a non-fatal error occured

Caused by:
    0: client 1 is processing server messages too slow
    1: failed to send message to client 1
    2: failed to send or buffer message for client 1
    3: Client 1 is too slow to handle incoming messages

And instead of "Buffer full", the client will now see the following message on their terminal:

Your zellij client lost connection to the zellij server.

As a safety measure, you have been disconnected from the current zellij session.
However, the session should still exist and none of your data should be lost.

This usually means that your terminal didn't process server messages quick
enough. Maybe your system is currently under high load, or your terminal
isn't performant enough.

There are a few things you can try now:
    - Reattach to your previous session and see if it works out better this
      time: `zellij attach unarmed-snails`
    - Try using a faster terminal. GPU-accelerated terminals such as Kitty
      or Alacritty are cross-platform and known to work well with zellij.

This improves the UX on #2023, #2041, #2048 and #2063, and may help to mitigate client crashes in some of these cases. It does not, however, explain why the server sometimes becomes unresponsive and locks up. I couldn't find an explanation for this.

@raphCode (Contributor)

Very nice work!

A question out of curiosity: the updates the server sends are incremental, so we can't afford to skip any of them?

My line of thinking was that maybe clients which are too slow could be supplied with "absolute" updates that redraw everything, just like the initial frame after attaching to a session?
These absolute updates are not time-critical and don't make sense to pile up, so they can be sent out at whatever pace the client can process.

@raphCode (Contributor)

I just realized that some of the messages sent to the client are not renders, but things like UI state changes. My idea might not work with these.

@imsnif (Member) commented Jan 13, 2023

@har7an - cool work on this!

I took a brief glance at the code, as it seems that the "Buffer full" message exposed some underlying issues and now a lot of users are able to report them rather than just having glitches or random crashes.

It seems to me that we already have this functionality built-in in the form of the bounded queue here:

let (client_buffer_sender, client_buffer_receiver) = channels::bounded(50);

Could we not do away with the retry-queue implemented in this PR by increasing that queue from 50 to 1000 as you suggested? Or maybe even turning it into an unbounded queue?

@imsnif (Member) commented Jan 13, 2023

@raphCode - render messages are ANSI instructions which are stateful. We cannot skip any of them, otherwise things will likely get corrupted for the user.

We could implement redraws, which would essentially mean to have the client reattach every time they receive a buffer full message. As I mentioned in the original PR that exposed this issue, it's not super trivial to do - and arguably is a patch that doesn't address the underlying problem. The more I think about this, the more I feel an unbounded queue is the solution we're looking for here (possibly with the addition of timing out unresponsive clients after a certain amount of time).

@imsnif mentioned this pull request on Jan 13, 2023
@raphCode (Contributor) commented Jan 13, 2023

Yeah, I hoped redraws would be easy to implement because that is what happens anyway on attach.

Buffering between server and client helps to bridge short periods where the client falls behind, but eventually the buffers are full, and then there are three options:

  1. backpressure, stall the server
    this is what happened before: no one noticed because stalls were usually short, until fansari's weird ssh connection came along and everything blocked forever
  2. drop the client
    this is what we are doing now
  3. complete redraw when the client comes back
    tmux does this

We should do 3. and maybe after a long timeout (not measured in buffered bytes, but in minutes or hours) do 2.

I tested what tmux does with `seq 1 10000000 | dd status=progress`: one can pause the tmux client (Ctrl-B, Ctrl-Z), resume it after a while, and the session is instantly back without any intermediate updates.
Also, to be completely sure, I analyzed a `script` transcript of the tmux session: the first half of the numbers is continuous, afterwards there are gaps with full-screen redraws.
Also, sometimes when executing the command inside tmux, one can even SEE the redraws as visual lagging of the numbers updating; think of it as the output framerate being reduced.


Can you elaborate on how an unbounded queue would fit into this scenario?

@imsnif (Member) commented Jan 13, 2023

Honestly, I thought an unbounded queue would fit here because I didn't realize these stalls can be measured in more than microseconds. You're right then that it's not ideal.

I've seen this tmux behavior and honestly I don't like it. At least not the way it's implemented in tmux. It's bad UX IMO, and for us, even now, it's much, much smoother.

I'd probably opt for some more robust buffering (increasing the number of messages significantly to account for about 10-20 seconds of lag) and then disconnecting the client and having it reattach to get a redraw.

But since @har7an is the one working on it now, let's not engage in too much armchair quarterbacking and allow him to choose the best path forward.

@raphCode (Contributor) commented Jan 13, 2023

Another view: Redraws are comparable to ubiquitous frame dropping during video playback or realtime rendering.

I've seen this tmux behavior and honestly I don't like it. At least not the way it's implemented in tmux. It's a bad UX IMO

What do you not like about it?

@imsnif (Member) commented Jan 13, 2023

@raphCode - I'm going to defer the rest of this conversation to @har7an who is the person implementing this. I feel further theoretical discussions will not do us much good here, and will mostly serve to clutter up this PR and make it harder for the implementor to catch up.

@raphCode (Contributor) commented Jan 13, 2023

Alright.
I am just going to leave a test case here that simulates slow clients - think ssh connections over dialup:

mkfifo /tmp/pipe
while true; do pv --rate-limit 30K /tmp/pipe > /dev/null; done &
script /tmp/pipe
# start / attach zellij here and start command that spams output

@har7an (Contributor, Author) commented Jan 13, 2023

Implementor catches up

Hello you two, happy to see I've drawn your attention. :)

Could we not do away with the retry-queue implemented in this PR by increasing that queue from 50 to 1000 as you suggested?

Yes we could, but then:

  1. We wouldn't know when the client becomes slow in the first place (because we can't tell how full a crossbeam queue is). This isn't so bad I guess, because we will notice when the connection drops, but I thought it's a good hint to users to see this logged by the server.
  2. I'm not sure about the implications w.r.t. memory allocations. With the vector I can tell for sure that it isn't allocated as long as the client is quick enough, which I think is good.

I mean we could throw away that boundary and just make the queue unbounded, fair enough. But I'm not sure what happens once the blocking client has successfully drained all free RAM. I guess it will kill the whole application with it, and I'm totally against that sort of behavior. Like: it's not the other clients' fault.

We can bump the queue size to some ridiculously huge amount and just disconnect the client if it's full, but before I do that I want to read up on how crossbeam allocates memory for its queues.

Generally I'm quite happy with the current behavior: disconnect the client, display a message explaining what happened, and then have the user handle it themselves. If they know their server instance is lost beyond hope (maybe they ran @raphCode's "billion cuddles attack") they can at least kill it without hassle.

The idea of a full redraw is nice, but as far as I know Render isn't the only instruction we send to the client. What do we do with the others? Can we just ignore/drop them?

@har7an (Contributor, Author) commented Jan 13, 2023

Okay, so according to the crossbeam sources a bounded channel is preallocated (as an array), whereas an unbounded channel isn't (it's a linked list). In addition, we can query the length of any channel type with the `len` method (I missed that before).

I assume the bounded channel is faster, which is beneficial for us. I mean: if the client is slow already, I see no point in making it use a slower queue for IPC. So I guess using a bounded channel with a large capacity is totally fine. I'll try to find out just how big one of these `ServerToClientMsg` structs is so we get an idea of how many we can allocate without anyone noticing... :)
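
For illustration, here is a tiny self-contained example of the two channel flavors and the `len`/`capacity` queries (plain crossbeam API with arbitrary numbers, not zellij code):

use crossbeam::channel;

fn main() {
    // Bounded channel: the capacity is allocated up front as a fixed-size buffer.
    let (bounded_tx, bounded_rx) = channel::bounded::<u64>(5000);
    // Unbounded channel: grows on demand (a linked list of blocks), no preallocation.
    let (unbounded_tx, _unbounded_rx) = channel::unbounded::<u64>();

    bounded_tx.send(1).unwrap();
    bounded_tx.send(2).unwrap();
    unbounded_tx.send(3).unwrap();

    // Both flavors report how many messages are currently queued.
    println!("bounded: {} of {:?}", bounded_rx.len(), bounded_tx.capacity()); // 2 of Some(5000)
    println!("unbounded queued: {}", unbounded_tx.len()); // 1
}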

@raphCode (Contributor)

account for about 10-20 seconds of lag

On my main PC, I get zellij to sink terminal text output at around 6 MB/s. Would 10 seconds' worth of buffering mean 60 MB, plus struct overhead?

@imsnif (Member) commented Jan 13, 2023

account for about 10-20 seconds of lag

On my main PC, I get zellij to sink terminal text output at around 6 MB/s. Would 10 seconds' worth of buffering mean 60 MB, plus struct overhead?

There might be some more overhead given that Zellij is not a lossless pipe. We've gotta do all sorts of manipulations to the raw ANSI data - namely interpreting it and sending it to your screen in a way that will place it in the relevant pane(s) (as well as other small stuff such as moving the cursor between panes, hiding it during rendering so it won't jump around, etc.)

But all in all, it shouldn't be significantly larger. I'd be interested in seeing how @har7an's experiment goes and how much memory 20-30 seconds of buffering continuous heavy traffic uses - at least, what scale we're talking about.

Commit messages (excerpts):
- Previously any additional arguments passed on the command line were ignored. Now they are appended to `cargo run ...` as documented.
- … and display the last send/recv error to the user instead of a generic "Buffer full" message.
- … so we will know when an error occurred while trying to send a message to the client. The most likely cause for this is that the client buffer filled up and hence we cannot send any new messages. While we still disconnect the client as before, now we also write a log message that explains the situation.
- … and log a message before disconnecting it.
- … that is dynamically allocated on-demand and stores `ServerToClientMsg` in case the regular IPC channel is currently full. This acts as a dynamic buffer to hold messages for a while until the client hopefully catches up. Also write a message to the log to indicate when the client is recognized to be too slow in handling server messages.
@har7an force-pushed the fix/buffer-full-in-server branch from bb4617f to 894b5e6 on January 22, 2023 07:16
@har7an (Contributor, Author) commented Jan 22, 2023

So according to my measurements, bumping the IPC queue from 50 messages to 5000 has no noticeable impact on memory usage at all. I used the following command line to check memory usage for both cases:

$ ps -e -o drs,rss,trs,vsz,cmd | grep target/debug

after I had started zellij using `cargo x run`. So I'm dumping the Vec-based solution I implemented and raising the queue size to 5000 entries. If the client stalls entirely, that should be good for many seconds' worth of buffering before we disconnect them. If we do end up disconnecting the client, the error is now logged and the client displays a pretty message to the user (see above). Case closed for now, I would say, unless either of you has any objections. :)
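
For reference, a minimal sketch of what the final approach boils down to (simplified and self-contained; the real channel, logging and disconnect handling live in the zellij server code):

use crossbeam::channel::{bounded, TrySendError};

fn main() {
    // Was 50 before; 5000 buys many seconds of slack for a stalled client.
    let (client_buffer_sender, _client_buffer_receiver) = bounded::<String>(5000);

    match client_buffer_sender.try_send("render payload".to_string()) {
        Ok(()) => { /* message queued for the client */ }
        Err(TrySendError::Full(_)) => {
            // Only reached once ~5000 messages pile up unprocessed: log the
            // error chain and disconnect the client with the message above.
            eprintln!("client is too slow to handle incoming messages");
        }
        Err(TrySendError::Disconnected(_)) => {
            eprintln!("client hung up");
        }
    }
}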

Commit message (excerpt):
- … via `Vec` and drastically increase the IPC message queue size instead. Measurements didn't show a drastic increase in RAM caused by this, and it is a much easier fix for the problem at hand.
@har7an force-pushed the fix/buffer-full-in-server branch from 894b5e6 to 60b7e3a on January 22, 2023 08:55
@har7an merged commit beddfb7 into zellij-org:main on Jan 22, 2023
@har7an mentioned this pull request on Jan 22, 2023
@har7an deleted the fix/buffer-full-in-server branch on August 28, 2023