Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Preventing false positive alarms of offline mode #9068

Merged
merged 1 commit into from
Mar 2, 2021

Conversation

vzhukovs
Copy link
Contributor

@vzhukovs vzhukovs commented Feb 14, 2021

What it does

The substance of the current change proposal in modification of mechanism, that is responsible for monitoring the state of client activity. The problem lay in the way that there is a base abstract class, which has the timer, that activates the offline mode within a specified period of time, if there isn't any incoming activity from the client. And there is a class, which listens for websocket incoming activity, which also has another timer, that activates within the same period of time as the previous one, minus one second, to make a ping request and receives the pong response. That's how it's implemented in master. The main challenge is when these timers activates, the client has only one second making a ping request and receive the pong response. If the ping request executes more than one second, the client indicates that the application is in offline mode, when, in fact server responses to requests, but with some delay. This is a false positive trigger. The ping request might executes more than one second, but in that case the host operating system is under heavy system load or has limited resources. It is a very common scenario, when IDE runs as containerized application, for example in kubernetes.

To avoid false positive trigger was invited to refuse to use the timer which is located in an abstract class, which is responsible for triggering the online/offline mode. In the meantime, class, that listens to websocket activity to add an ability to listen to the websocket open and close event. Websocket closing triggers by loosing the connection. This might be the disconnecting the server, losing the internet connection on the user's side. When the user is losing the internet connection, reconnecting-websocket library is trying to restore connections and in the process, fire the event about websocket close and this is a clear message, that we can show offline mode. When the reconnecting-websocket restores the connection, it fires event about websocket open and this is an indicator for online mode.

The problem with lengthy ping request can be easily reproduced by opening the big files in the editor, which are more than 6-7 megabytes. This usecase also was reproduced on master, also on Google Cloud Shell Editor and OpenShift platforms.

The current change proposal also provides test covering common scenarios for the websocket behavior and indicating the application state, in short online/offline mode.

Reproduces:

Index: packages/core/src/node/env-variables/env-variables-server.ts
<+>UTF-8
===================================================================
diff --git a/packages/core/src/node/env-variables/env-variables-server.ts b/packages/core/src/node/env-variables/env-variables-server.ts
--- a/packages/core/src/node/env-variables/env-variables-server.ts	(revision 6798167b50e5df4682ec218294a4eb1f3817904f)
+++ b/packages/core/src/node/env-variables/env-variables-server.ts	(date 1613140075235)
@@ -58,9 +58,18 @@
         if (isWindows) {
             key = key.toLowerCase();
         }
+        if (key === 'does_not_matter') {
+            const delay = Math.floor(Math.random() * (3000 - 800)) + 800;
+            console.log(`Emulate ping response with delay: ${delay}ms`);
+            await this.delay(delay);
+        }
         return this.envs[key];
     }
 
+    delay(ms: number): Promise<void> {
+        return new Promise( resolve => setTimeout(resolve, ms) );
+    }
+
     getConfigDirUri(): Promise<string> {
         return this.configDirUri;
     }

Additional context

linked comment: eclipse-che/che#18472 (comment)
fixes: eclipse-che/che#18723

Signed-off-by: Vladyslav Zhukovskyi vzhukovs@redhat.com

How to test

Build and run Theia from the current branch. Create a few big files (it would be enough 7-8 megabytes). Try to reopen them sequentially. Additionally, the host operating system can be a bit overloaded by calling the following command in the terminal:

$ yes > /dev/null &

So there shouldn't any offline marker appeared, because the network connection is not broken.
Screencast: https://drive.google.com/file/d/1xWXQ1uGGqQNJS_miPvEQySDb0aX8HtxN/view?usp=sharing

Review checklist

Reminder for reviewers

@vzhukovs vzhukovs changed the title [WIP] Preventing false positive alarms of offline mode Preventing false positive alarms of offline mode Feb 17, 2021
@vzhukovs vzhukovs marked this pull request as ready for review February 17, 2021 23:48
@vzhukovs vzhukovs self-assigned this Feb 17, 2021
@vzhukovs vzhukovs added bug bugs found in the application ui/ux issues related to user interface / user experience labels Feb 17, 2021
Copy link
Member

@paul-marechal paul-marechal left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there no timeout detection anymore or did I miss it?

edit:

The problem with lengthy ping request can be easily reproduced by opening the big files in the editor, which are more than 6-7 megabytes. This usecase also was reproduced on master, also on Google Cloud Shell Editor and OpenShift platforms.

So we get rid of the timeout detection, and just wait for the connection to be completely closed to mark the status as offline?


@inject(ILogger)
protected readonly logger: ILogger;

constructor(@inject(ConnectionStatusOptions) @optional() protected readonly options: ConnectionStatusOptions = ConnectionStatusOptions.DEFAULT) { }
protected constructor(@inject(ConnectionStatusOptions) @optional() protected readonly options: ConnectionStatusOptions = ConnectionStatusOptions.DEFAULT) { }
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why protected?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Answered in #9068 (comment)

Comment on lines 131 to 139
this.wsConnectionProvider.onSocketOpened(() => {
this.updateStatus(true);
this.schedulePing();
});
this.wsConnectionProvider.onSocketClosed(() => {
this.clearTimeout(this.scheduledPing);
this.updateStatus(false);
});
this.wsConnectionProvider.onIncomingMessageActivity(() => {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Given this, with this change we update the status when a connection is opened, when the connection closes, and when we receive messages. I could not find a place where schedulePing would timeout though. It will set the status to false when an error occurs, but that doesn't account for timeouts IIUC?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It will set the status to false when an error occurs, but that doesn't account for timeouts IIUC?

Yeah, that's the point of this PR. Because timeout doesn't indicate that we have a broken connection, which is displayed as "offline" mode. The main problem is that when "offline" mode is triggered by timeout, the ping request receives a response from the server, then "online" status is triggered immediately after ping request. And this is quite false positive displaying the status, because user is able to call any other requests, e.g. create file, open it in the editor, open the terminal and so on.

@vzhukovs
Copy link
Contributor Author

Hi @marechal-p,

Is there no timeout detection anymore
So we get rid of the timeout detection, and just wait for the connection to be completely closed to mark the status as offline?

Yep, that's true, no timeout detection

Copy link
Contributor

@kittaakos kittaakos left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added a few comments here and there. I will try it in action later today.

public onSocketOpened: Event<void> = this.onSocketOpenedEmitter.event;

protected readonly onSocketClosedEmitter: Emitter<void> = new Emitter();
public onSocketClosed: Event<void> = this.onSocketClosedEmitter.event;
Copy link
Contributor

@kittaakos kittaakos Feb 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

  • public is superfluous.
  • Should be readonly.
  • Nameing: onSocketDidClose.

Same for open.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Renamed the names for the following emitters.

@@ -45,7 +51,11 @@ export class WebSocketConnectionProvider extends AbstractConnectionProvider<WebS
const url = this.createWebSocketUrl(WebSocketChannel.wsPath);
const socket = this.createWebSocket(url);
socket.onerror = console.error;
socket.onopen = () => {
this.onSocketOpenedEmitter.fire(undefined);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please move the calls to protected methods. So one can override and customize.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. There are two new protected methods which are fire the events.

socket.onclose = ({ code, reason }) => {
this.onSocketClosedEmitter.fire(undefined);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should be called after calling channel.close(code, reason); on each channel, not before.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Right, good catch, thanks! Fixed.

@kittaakos
Copy link
Contributor

@vzhukovs, we should note this change for both devs and users in the changelog. Please reference your PR description at least. Thank you!

Comment on lines 159 to 160
this.updateStatus(false);
await this.logger.trace(e);
Copy link
Contributor

@kittaakos kittaakos Feb 19, 2021

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

FYI: This code never happens. Once the WS connection is closed, none of the proxy calls reject but wait until there is a new connection, or forever: #8803 (comment)

I do not mind if you leave it there, but you can simplify with this.logger.error(e) as your code does not do what you want to do. trace level is useless here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Changed from trace to error.

Once the WS connection is closed, none of the proxy calls reject but wait until there is a new connection, or forever: #8803 (comment)

AFAIU, this will happened only on browser environment, right?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

AFAIU, this will happened only on browser environment, right?

In both. The PR was never merged.

@kittaakos
Copy link
Contributor

I like the idea of this change, however, I can still see the false-positive offline status. See attached screencast:

screencast.2021-02-19.14-21-01.mp4

(The screencast is from electron, but the browser version has the same problem. Just get any random 10MB file, open it, scroll in the editor, you will see, that app will "go offline".)

@vzhukovs
Copy link
Contributor Author

@vzhukovs, we should note this change for both devs and users in the changelog. Please reference your PR description at least. Thank you!

@kittaakos, you mean this https://github.com/eclipse-theia/theia/blob/master/CHANGELOG.md ?

@vzhukovs
Copy link
Contributor Author

I like the idea of this change, however, I can still see the false-positive offline status. See attached screencast:

screencast.2021-02-19.14-21-01.mp4
(The screencast is from electron, but the browser version has the same problem. Just get any random 10MB file, open it, scroll in the editor, you will see, that app will "go offline".)

Will try to check this out. Thanks for pointing this!

@vzhukovs
Copy link
Contributor Author

I like the idea of this change, however, I can still see the false-positive offline status. See attached screencast:

screencast.2021-02-19.14-21-01.mp4
(The screencast is from electron, but the browser version has the same problem. Just get any random 10MB file, open it, scroll in the editor, you will see, that app will "go offline".)

@kittaakos, sorry for late response, found that sometimes electron can't rebuild properly, had the similar issue as here: https://community.theia-ide.org/t/electron-start-error/1464/2

Checked this use case on electron build with file size of 20 megabytes and got the following result: https://drive.google.com/file/d/1NIOGpfyT10F0JW7zeYzQVt1Mhdo2YTNF/view?usp=sharing (screencast). Couldn't reproduce this issue.

cc @azatsarynnyy can you also check plz, on your side whether you have such issue?

@vzhukovs
Copy link
Contributor Author

@vzhukovs, we should note this change for both devs and users in the changelog. Please reference your PR description at least. Thank you!

Changelog has been updated.

@azatsarynnyy
Copy link
Member

cc @azatsarynnyy can you also check plz, on your side whether you have such issue?

@vzhukovs can't reproduce it on my side - no false-positive offline status

Peek 2021-02-25 09-37

@azatsarynnyy
Copy link
Member

@vzhukovs can we merge it?

@kittaakos
Copy link
Contributor

@vzhukovs can we merge it?

I want to give it a try again. We can merge it tomorrow if you all are OK with it.

@azatsarynnyy
Copy link
Member

@vzhukovs can we merge it?

I want to give it a try again. We can merge it tomorrow if you all are OK with it.

Sounds good. Thanks @kittaakos!

@kittaakos
Copy link
Contributor

Sounds good. Thanks @kittaakos!

Thank you so much for your patience. I do the verification tomorrow ASAP.

@kittaakos
Copy link
Contributor

Unfortunately, I still see the Offline status. I downloaded and used a 10 MB test file from here: https://www.engineerhammad.com/2015/04/Download-Test-Files.html

screencast.2021-03-02.10-26-11.mp4

I tried with the c848002 reivison.

@vzhukovs
Copy link
Contributor Author

vzhukovs commented Mar 2, 2021

Unfortunately, I still see the Offline status. I downloaded and used a 10 MB test file from here: https://www.engineerhammad.com/2015/04/Download-Test-Files.html

screencast.2021-03-02.10-26-11.mp4
I tried with the c848002 reivison.

Finally, I was able to reproduce it. Also I've just checked this use case on master branch, and there is the same behavior, but what I've noticed, is that logs contain a bunch of records about disposing connection from the server side during the opening a big file:

theia – electron:package json 2021-03-02 13-55-03

And client tries to reconnect to the server side after each disconnection. I've noticed, that this behavior is reproducible only when there is a big project opened in Theia. When there is an empty workspace, there isn't such issue with disconnecting websocket channel. So I can assume, that this is not a regression by this PR. We might register a dedicate issue for this specific use case, which is rare, but anyway. wdyt?

@kittaakos
Copy link
Contributor

So I can assume, that this is not a regression by this PR.

OK, I am fine with it. I thought the purpose of this PR is to fix the false positive offline status issue, which is still present with the proposed changes. I do not want to hold you back, so go ahead if you think it is done.

Signed-off-by: Vladyslav Zhukovskyi <vzhukovs@redhat.com>
@vzhukovs vzhukovs force-pushed the connectionStatusRefactoring branch from 7e382cf to d7dbea3 Compare March 2, 2021 12:23
@vzhukovs
Copy link
Contributor Author

vzhukovs commented Mar 2, 2021

So I can assume, that this is not a regression by this PR.

OK, I am fine with it. I thought the purpose of this PR is to fix the false positive offline status issue, which is still present with the proposed changes. I do not want to hold you back, so go ahead if you think it is done.

Thanks for you help! Just squashed commits into one, waiting for the finished build actions.

@vzhukovs vzhukovs merged commit 48944a1 into master Mar 2, 2021
@vzhukovs vzhukovs deleted the connectionStatusRefactoring branch March 2, 2021 13:31
@github-actions github-actions bot added this to the 1.12.0 milestone Mar 2, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug bugs found in the application ui/ux issues related to user interface / user experience
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Che-Theia shows Offline mode, mistakenly
4 participants