
Git plugin issue after network disconnection #12823

Closed
Tracked by #13192
safisa opened this issue Aug 10, 2023 · 24 comments · Fixed by #13082
Assignees: tsmaeder
Labels: messaging (issues related to messaging), plug-in system (issues related to the plug-in system), websocket (issues related to websockets/messaging)

Comments

@safisa
Contributor

safisa commented Aug 10, 2023

Bug Description:

Hi,
I am using the vscode git plugin (1.77.0). When the network connection is lost and then comes back online, all the plugins are reloaded. In this case, the git plugin's open changes command stops working (double-clicking a changed file in the source control tree, or the context menu command 'open changes').
The error in console log:

2023-08-10T13:57:02.602Z root ERROR Error: No reply handler for reply with id: 116
    at RpcProtocol.handleReply
    at RpcProtocol.handleMessage

Steps to Reproduce:

  1. open a project under git while using the vscode git plugin
  2. change a file
  3. make sure that the source control tree shows this change; double-clicking it should open the diff view
  4. disconnect the network and then reconnect (if you are working on a local machine, use 127.0.0.1 instead of localhost, and in the devtools change the network from 'No throttling' to 'Offline' and then back to 'No throttling')
  5. while disconnected, the status bar should be in offline mode (yellow)
  6. after getting back online, try to open the changes of the file.

Additional Information

  • Operating System: mac/win
  • Theia Version: 1.40
@tsmaeder
Contributor

The double click no longer opening the diff editor happens to me all the time when leaving Theia alone for a little bit.

@tsmaeder
Contributor

When I return to Theia after the machine went to sleep, it looks like Theia is restarting plugins. I'm pretty convinced that can't work without a proper reset of the front end.

@msujew
Member

msujew commented Sep 21, 2023

Tangentially related: #11499. The consensus (I believe) is that we should aim to make the frontend survive a restart of the plugin host.

@tsmaeder
Contributor

tsmaeder commented Sep 21, 2023

@msujew losing application state after waking up the computer is not in line with the expectations our users (rightfully) have from a desktop application. We do disconnection detection via ping timeouts, or listening to socket connections. None of that makes any sense in the Electron case. My point is the plugin host shouldn't restart after lunch.

@tsmaeder
Contributor

Maybe the connection handling needs to be totally different between Electron/Web cases.

@msujew
Member

msujew commented Sep 21, 2023

Sure makes sense for the Electron use case. The issue is talking about a network disconnection though, which can realistically only happen in a browser application.

@msujew msujew added the plug-in system issues related to the plug-in system label Sep 21, 2023
@tsmaeder
Contributor

That is true, but it seems that the computer going to sleep counts as a "network disconnection" in the Electron case.

@tsmaeder
Contributor

@tortmayr suggested adding something like the following in order to reproduce the issue without having to wait a long time:

export const reconnectCommand: Command = {
    id: 'test.reconnect',
    label: 'Test frontend reconnect'
};

// Test command contribution (the wrapping class is implied; the name is illustrative)
@injectable()
export class ReconnectTestContribution implements CommandContribution {

    @inject(WebSocketConnectionProvider)
    protected connectionProvider: WebSocketConnectionProvider;

    registerCommands(commands: CommandRegistry): void {
        commands.registerCommand(reconnectCommand, {
            execute: () => {
                this.connectionProvider.socket.disconnect();
                setTimeout(() => this.connectionProvider.socket.connect(), 500);
            }
        });
    }
}

@tsmaeder
Contributor

A couple of observations:

  1. We have one RPCProtocol instance per main/ext service, which means each service has its own numbering of requests. That is unfortunate for debugging purposes.
  2. RPCProtocol instances are recreated upon reconnect. That in itself does not explain the "no handler for reply" errors: those would only happen if replies to messages sent before the reconnect arrived after the reconnect. That seems unlikely.
  3. To me it feels like the replies are getting routed to the wrong instance of the RPC protocol. That might be the case when some instances are not properly disposed of/disconnected.
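
The request/reply bookkeeping behind that error can be sketched as follows. This is a minimal stand-in, not the actual Theia RpcProtocol code: each request registers a handler under a fresh id, and a reply whose id is missing from the map produces exactly the "No reply handler" error.

```typescript
// Minimal stand-in for the reply bookkeeping in an RPC protocol instance.
// Not the actual Theia RpcProtocol implementation.
class MiniRpcProtocol {
    private nextId = 0;
    private readonly pending = new Map<number, (result: unknown) => void>();

    // Sending a request registers a reply handler under a fresh id.
    sendRequest(send: (msg: { id: number }) => void): Promise<unknown> {
        const id = this.nextId++;
        send({ id });
        return new Promise(resolve => this.pending.set(id, resolve));
    }

    // A reply routed to an instance that never sent the request (e.g. a
    // freshly created instance after a reconnect) finds no handler.
    handleReply(id: number, result: unknown): void {
        const handler = this.pending.get(id);
        if (!handler) {
            throw new Error(`No reply handler for reply with id: ${id}`);
        }
        this.pending.delete(id);
        handler(result);
    }
}
```

A reply delivered to a freshly created instance, as observation 3 suspects, would hit the error branch because that instance's pending map is empty.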

@tsmaeder tsmaeder self-assigned this Oct 21, 2023
@tsmaeder tsmaeder added messaging issues related to messaging websocket issues related to websockets/messaging labels Oct 21, 2023
@tsmaeder
Contributor

By giving the RPC protocol instances a sequential id and sending the sender id with request messages, I was able to establish the following:

  1. Upon reconnecting, a lot of instances of RPCProtocol are created
  2. The messages are sent by "old" RpcProtocol instances (example: "38"), but the replies handled by "new" instances (e.g. "108").

It seems like the UI elements are still connected to the old RPCProtocol instances. Also, it looks like plugins are restarted upon reconnect: I have the message "starting plugins" in the browser log twice.

@tsmaeder
Contributor

I think the problem at the base is this: the back end is written in a way that it can handle a new front end connecting to it. But the front end does not really handle the case where it is connected to a new back end (which is equivalent to recreating the front end from scratch). It looks like the back end treats the closure of the underlying websocket as if a new front end had connected, thus recreating all the connection container modules. The front end, meanwhile, is holding on to the proxies originally created.
Unfortunately, the whole "setting up rpc connections" business is very indirect and "lazy": lots of listening to channels opening and then implicitly doing the right thing. I wish we had a very clear-cut lifecycle, like:

  1. Open a connection from front end to back end
  2. Negotiate a UUID
  3. In the browser case, disconnection of the socket needs to be treated as a possibly exited front end. When the socket connection can't be reestablished by a front end with the same UUID within a certain time frame, we need to take down the connection-specific container & services
  4. In the Electron case, we should explicitly shut down the back end when we exit. If the socket disconnects, we can just reconnect. Conditions like the back end crashing can be detected reliably in this case.
  5. Generally, we need to separate socket connection status from back end/front end status
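
Point 3 could be sketched along these lines. Everything here is illustrative, not existing Theia code; ContextLease and the explicit clock are assumptions to keep the example deterministic: the context survives a socket drop for a grace period and is disposed if no front end with the same UUID reconnects in time.

```typescript
// Illustrative lease on a connection-specific context: a websocket drop
// starts a grace period; a reconnect within the period keeps the context
// alive, otherwise it is considered disposed.
class ContextLease {
    private disposed = false;
    private deadline?: number;

    constructor(private readonly graceMs: number) {}

    // The websocket dropped: start the grace period.
    onSocketClosed(now: number): void {
        if (!this.disposed) {
            this.deadline = now + this.graceMs;
        }
    }

    // A front end with the matching UUID came back: reattach if still alive.
    onReconnect(now: number): boolean {
        this.expireIfDue(now);
        if (this.disposed) {
            return false; // context is gone: the front end must reload the UI
        }
        this.deadline = undefined;
        return true;
    }

    // Dispose the context once the grace period has elapsed. A real
    // implementation would use a timer instead of an explicit clock.
    private expireIfDue(now: number): void {
        if (this.deadline !== undefined && now >= this.deadline) {
            this.disposed = true;
        }
    }
}
```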

@tsmaeder
Contributor

This looks like a lot of scary changes!

@tsmaeder
Contributor

I think basically it looks like this from the front-end's point of view:

  1. Connect to the back end, identify with a unique id
  2. Oh shit, I lost the connection to the back end. Try to reestablish it with my original UUID!
  3. The back end reconnected and told me it still has the same back-end connection module running. Great, nothing more to do
  4. The back end told me it does not have the same connection module running: give up and reload the UI

@tsmaeder
Contributor

From the back end's point of view, it would look like this:

  1. A front end opened a connection and requests a new UUID
  2. I start a whole new connection context for this front end and answer with an ID

or in the case of a reconnect:

  1. A front end opened a connection and requested a reconnect with a UUID
  2. I look up the corresponding context
  3. If I find it, I just connect the connection to that context
  4. If I don't find it, I return an error to this front-end
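
The two flows above could be sketched as follows; names like ConnectionContextRegistry are illustrative, not actual Theia APIs:

```typescript
// Illustrative registry of per-front-end connection contexts, keyed by UUID.
interface ConnectionContext {
    id: string;
    // ...the connection-specific services would live here
}

class ConnectionContextRegistry {
    private nextId = 1;
    private readonly contexts = new Map<string, ConnectionContext>();

    // Initial connect: create a fresh context and hand its id to the front end.
    initialConnect(): ConnectionContext {
        const context: ConnectionContext = { id: `ctx-${this.nextId++}` };
        this.contexts.set(context.id, context);
        return context;
    }

    // Reconnect: look up the context for the presented UUID; if it is gone,
    // signal an error so the front end gives up and reloads the UI.
    reconnect(id: string): ConnectionContext {
        const context = this.contexts.get(id);
        if (!context) {
            throw new Error(`No connection context for id '${id}'`);
        }
        return context;
    }
}
```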

@tsmaeder
Contributor

There is a related PR that has been closed: #11499.

@safisa
Contributor Author

safisa commented Oct 27, 2023

Hi,
Thank you for this detailed investigation...

I think also that this issue is the root cause of many other issues:

  1. the plugin reload affects other plugins on disconnection, and some stop working as expected (not only the git plugin). Moreover, some take a long time to reload, like the Red Hat Java plugin, causing bad UX.
  2. Custom editors and webviews are fully reloaded, causing some events to be broken and not functional after reconnect.
  3. Custom editors will lose any unsaved data!

Thanks

@tsmaeder
Contributor

Plugin reload on reconnect comes from #6159. The trouble with the "clean up state on reconnect" approach is that we have to manually clean up all the places where we retain back-end related state. An approach based on retaining the back end for a while could be done in one place and would be simpler.
I think the difference between reloading the whole front end and reloading all back-end related state is minimal. What does @planger have to say on performance in this case?

@tsmaeder
Contributor

A couple of thoughts on lifecycle:

  1. When a front-end connects to a back end, the back-end services are not created immediately. Instead, they are only created (instantiated) when a front end opens a channel with the path the service is bound to.
  2. Proxies for front-end services are created immediately, but you won't be able to send any requests until the front-end creates a channel on the corresponding path

On the front end, the channels to the back end are opened when we invoke AbstractConnectionFactory.createProxy(). This is usually done through a factory invocation for a service binding for a service identifier; see for example EnvVariablesServer. So the exact timing of the creation of the front-end services (and consequently the "activation" of the back-end services) depends on when someone gets an instance of a service proxy from the front-end inversify container. This creation of services "on demand" makes it very hard to find a point where we can say that all services have been created and are ready for use.
We had some problems in the past with plugin main/ext services where we would call back to main services from the constructor of an ext service and the main side service would be getting requests before it was properly constructed. So not having sufficient life cycle control for services can be a problem.
Note that a proxy for "the other side" of a connection on a particular path can be injected when a service handler is constructed. See the RPCServer interface.

@tsmaeder
Contributor

There are several places of asynchronicity in the system:

  1. Service proxies in the front end are created immediately, but any invocations on the proxy will be deferred until the back end is connected and the channel corresponding to the service has been opened. Opening the channel requires a conversation with the back end and therefore is asynchronous. So in the absence of asynchronous resolution of inversify objects (which we don't use), this asynchronicity cannot be removed.
  2. On the back end, obviously, we need to defer creation of any front-end related objects until a front end actually connects to the back end (see ConnectionContainerModule). Obviously, we cannot get rid of this asynchronicity.
  3. There is then asynchronicity again when we wait for the front end to open a channel on the service path in order to actually create the back-end service object. There are two reasons I can imagine for this deferred action: the first is faster startup. A back-end service might only be needed for a particular view, and as long as that view is not shown, the service proxy (and therefore the back-end service) might not get instantiated. While this argument looks convincing, it's not certain whether we gain anything significant in real life. Only measurements could tell us for sure.
    The second argument has to do with initialization order. As I have mentioned before, some back-end services invoke front-end services upon creation. If we only create back-end services upon opening a service channel, we can be reasonably sure that the front-end service we're calling is fully created. But as experience shows, this does not seem to work out 100%. Only a proper two-phase initialization protocol would ensure that: in the first phase, fully construct the services; in the second phase, run the startup action for the service.
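
The deferral described in point 1 can be sketched with a plain Proxy. This is a simplified assumption about how such proxies behave, not Theia's actual implementation: every method call returns a promise that only resolves once the real target, standing in here for the opened service channel, becomes available.

```typescript
// Sketch: a proxy whose method calls are deferred until the real target
// (representing the opened service channel) becomes available.
function createDeferredProxy<T extends object>(target: Promise<T>): T {
    return new Proxy({} as T, {
        get: (_obj, method) =>
            (...args: unknown[]) =>
                // Defer the actual invocation until the target resolves.
                target.then(t => (t as any)[method](...args))
    });
}
```

Calls made before the "channel" opens simply queue up as pending promises, which is why the front end can hand out proxies immediately.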

@tsmaeder
Contributor

How could such a startup protocol look?

  1. Someone requests a remote service object, usually from a factory that binds a local service object (this might happen on the back end or front end), but for an example, let's assume the front end requests a back-end service. The service is in "not ready" state
  2. The front end requests the service from the back end, for example by opening a channel on the service path
  3. The back end reacts by constructing the service object.
  4. The back end sends a "service ready" message to the front end
  5. The back end asynchronously invokes "init" on the service object. If the init consumes front-end services, it must wait for them to be "ready".
  6. The front end receives the "ready" message and moves the service proxy to "ready" state.
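
A minimal sketch of the construct-then-init split, with illustrative names and an assumed init() hook (not existing Theia API):

```typescript
// Phase 1 constructs the service without side effects; phase 2 runs the
// asynchronous startup action. Callers await ready() before using the
// service, mirroring the "not ready" -> "ready" transition above.
interface TwoPhaseService {
    init(): Promise<void>;
}

class ServiceHandle<T extends TwoPhaseService> {
    private readonly whenReady: Promise<T>;

    constructor(factory: () => T) {
        const service = factory();      // phase 1: construct
        this.whenReady = service.init() // phase 2: initialize
            .then(() => service);
    }

    ready(): Promise<T> {
        return this.whenReady;
    }
}
```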

@tsmaeder
Contributor

Currently, remote services have the option to have a local "client" object that communicates back on the same service channel. IMO, this makes the code unnecessarily confusing: I would strictly separate remote services, which I can call and notify, from handlers, which can be called and notified.

@tsmaeder
Contributor

My concrete, minimal proposal would be this:

  1. We get rid of "reconnect support"
    This would mean that when we lose the connection to the back end, we need to reload the whole UI. There is a certain performance penalty for this, but just loading the front end is not that costly, especially in the presence of many plugins, which will have to be reloaded anyway. In order to make this approach more palatable, I suggest we
  2. Separate websocket connection from back end connection
    If we lose the websocket connection for a bit and then reconnect, we do not shut down the connection container module on the back end, and we do not, for example, restart plugins on the front end. The approach to do this would be to
  3. Keep the connection context around to reconnect to
    We would have a configurable timeout for how long we want the connection container module to stick around when we lose the back end connection. The back-end has a UUID. If the front end reconnects to the same back end id, we just reconnect the websocket and things continue as before.

Of course, the timeout for how long to keep the back end would have to be configurable: in the case of Electron, the timeout would be infinite; if we have a web-based solution that we deploy for a couple of developers in-house, we might want an hour; whereas if we run a public, free service, we might drop the back end immediately upon websocket disconnect. In this way, the only downside relative to today's "reconnect" support would be that we reload the UI more often than before.

"Reconnect" support requires us to be able to selectively drop all state that is related to a particular back end connection context and to rebuild it upon reconnect. Historically, we have not been very good at that. Also, all extenders would have to support "reconnect" in their extensions, as well.
One risk with reconnecting to the same back-end is that the front end might miss some notifications that were sent, but did not reach the front end while disconnected. We could mitigate this risk by keeping a buffer of sent messages when the websocket is disconnected and send them upon reconnect.
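
The buffering mitigation could look roughly like this; a sketch under assumed names, where WireSocket is a stand-in rather than a Theia type:

```typescript
// Sketch: buffer outgoing messages while the websocket is down and flush
// them, in order, once the front end reconnects.
interface WireSocket {
    send(msg: string): void;
}

class BufferingSender {
    private buffer: string[] = [];
    private socket?: WireSocket;

    send(msg: string): void {
        if (this.socket) {
            this.socket.send(msg);
        } else {
            this.buffer.push(msg); // disconnected: retain for later delivery
        }
    }

    // On reconnect, replay everything sent while disconnected, in order.
    attach(socket: WireSocket): void {
        this.socket = socket;
        for (const msg of this.buffer) {
            socket.send(msg);
        }
        this.buffer = [];
    }

    detach(): void {
        this.socket = undefined;
    }
}
```

An unbounded buffer would of course need a cap or a timeout in practice, tied to the same grace period after which the back end gives up on the front end.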

@tsmaeder
Contributor

@vince-fugnitto the last comment is my proposal, basically. Since you had to drop from the dev-call, I'd be interested in your comments.

@tsmaeder
Contributor

An alternative idea, which occurred to me while talking about it in the dev call, is this: if we had a "connection container module" which contains all objects related to a connection, we could do a proper reconnect. Right now, we can inject service proxies into any object in the container. If we had a child container (and the connection to the back end was only known inside that container), we would be forced to explicitly export any objects outside of that container.

tsmaeder added a commit to tsmaeder/theia that referenced this issue Nov 13, 2023
Fixes eclipse-theia#12823

- refactor front end to allow for multiple reconnections
- remove IWebSocket abstractions
- separate front end connections from service channel management
- introduce mechanism to reconnect front end to existing connection
  context based on timeouts

Contributed on behalf of STMicroelectronics

Signed-off-by: Thomas Mäder <t.s.maeder@gmail.com>
tsmaeder added further commits to tsmaeder/theia that referenced this issue on Nov 20 and Dec 13, 2023, with the same commit message.
tsmaeder added a commit that referenced this issue Dec 18, 2023
Fixes #12823 (same commit message as above)
@JonasHelming JonasHelming mentioned this issue Dec 19, 2023