Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Git plugin issue after network disconnection #12823

Closed
Tracked by #13192
safisa opened this issue Aug 10, 2023 · 24 comments · Fixed by #13082
Closed
Tracked by #13192

Git plugin issue after network disconnection #12823

safisa opened this issue Aug 10, 2023 · 24 comments · Fixed by #13082
Assignees
Labels
messaging issues related to messaging plug-in system issues related to the plug-in system websocket issues related to websockets/messaging

Comments

@safisa
Copy link
Contributor

safisa commented Aug 10, 2023

Bug Description:

Hi,
I am using the vscode git plugin (1.77.0). When the network connection is lost, and back online again, all the plugins are reloaded back. In this case, the git plugin open changes command stops working (double click on a changed file in the source control tree or context menu command 'open changes').
The error in console log:

023-08-10T13:57:02.602Z root ERROR Error: No reply handler for reply with id: 116
    at RpcProtocol.handleReply
    at RpcProtocol.handleMessage

Steps to Reproduce:

  1. open a project under git while using the vscode git plugin
  2. do a change in a file
  3. make sure that the source control tree shows this change, and double click on it will open the diff view
  4. make a network disconnection and then reconnect back (if you are working on local machine, then use the 127.0.0.1 instead of loaclhost and using the devtools change the network from 'no throtlling' to 'offline' and then change it back to 'no throtlling')
  5. while it is disconnected the status bar should be in offline mode (yellow)
  6. after getting online back, try to open the changes of the file.

Additional Information

  • Operating System: mac/win
  • Theia Version: 1.40
@tsmaeder
Copy link
Contributor

The double click no longer opening the diff editor happens to me all the time when leaving Theia alone for a little bit.

@tsmaeder
Copy link
Contributor

When I return to Theia after the machine went to sleep, it looks like Theia is restarting plugins. I'm pretty convinced that can't work without a proper reset of the front end.

@msujew
Copy link
Member

msujew commented Sep 21, 2023

Tangentially related: #11499. The consensus (I believe) is that we should aim to make the frontend survive a restart of the plugin host.

@tsmaeder
Copy link
Contributor

tsmaeder commented Sep 21, 2023

@msujew losing application state after waking up the computer is not in line with the expectations our users (rightfully) have from a desktop application. We do disconnection detection via ping timeouts, or listening to socket connections. None of that makes any sense in the Electron case. My point is the plugin host shouldn't restart after lunch.

@tsmaeder
Copy link
Contributor

Maybe the connection handling needs to be totally different between Electron/Web cases.

@msujew
Copy link
Member

msujew commented Sep 21, 2023

Sure makes sense for the Electron use case. The issue is talking about a network disconnection though, which can realistically only happen in a browser application.

@msujew msujew added the plug-in system issues related to the plug-in system label Sep 21, 2023
@tsmaeder
Copy link
Contributor

That is true, but it seems that the computer going to sleep counts as a "network disconnection" in the Electron case.

@tsmaeder
Copy link
Contributor

@tortmayr suggested adding something like the following in order to reproduce the issue without having to wait a long time:

export const reconnectCommand: Command = {
    id: 'test.reconnect',
    label: 'Test frontend reconnect'
}

// Test command contribution

    @inject(WebSocketConnectionProvider)
    protected connectionProvider: WebSocketConnectionProvider;

     registerCommands(commands: CommandRegistry): void {
        commands.registerCommand(reconnectCommand, {
            execute: () => {
                this.connectionProvider.socket.disconnect();
                setTimeout(() => this.connectionProvider.socket.connect(), 500);
            }
        });
    }

@tsmaeder
Copy link
Contributor

A couple of observations:

  1. We have on RPCProtocol instance per main/ext service, which means each service has it's own numbering of requests. That is unfortunate for debugging purposes.
  2. RPCProtocol instances are recreated upon reconnect. That in itself does not explain the "no handler for reply" errors: those would only happen if we get replies for messages that were sent before the reconnect after the reconnect. That seems unlikely.
  3. To me it feels like the replies are getting routed to the wrong instance of the rpc protocol. That might be the case when some instances are not properly disposed of/disconnected.

@tsmaeder tsmaeder self-assigned this Oct 21, 2023
@tsmaeder tsmaeder added messaging issues related to messaging websocket issues related to websockets/messaging labels Oct 21, 2023
@tsmaeder
Copy link
Contributor

By giving the rpc protocol instances an sequential id and sending the sender id with request messages, I was able to establish the following:

  1. Upon reconnecting, a lot of instances of RPCProtocol are created
  2. The messages are sent by "old" RpcProtocol instances (example: "38"), but the replies handled by "new" instances (e.g. "108").

It seems like the UI elements are still connected to the old RPCProtocol instances. Also, it looks like plugins are restarted upon reconnect: I have the message "starting plugins" in the browser log twice.

@tsmaeder
Copy link
Contributor

I think the problem at the base is this: the back-end is written in a way that it can handle a new front-end connecting to it. But the front-end does not really handle the case where it is connected to to a new back-end (which is equivalent to recreating the front-end from scratch). It looks like the back-end treats the closure of the underlying websocket as if a new front end was connected, thus recreating all the connection container modules. The front-end, meanwhile, is holding on to the proxies originally created.
Unfortunately, the whole "setting up rpc connections"-business is very indirect and "lazy" : lots of listening to channels opening and then implicitly doing the right thing. I wish we had a very clear cut lifecycle, like:

  1. Open a connection from front-end to back end
  2. Negotiate a UUID
  3. In the browser case, disconnection of the socket needs to be treated as a possible exit front end. When the socket connection can't be reestablished by a front end with the same UUID in a certain time frame, we need to take down the connection-specific container & services
  4. In the Electron case, we should explicitly shut down the back end when we exit. If the socket disconnects, we can just reconnect. Conditions like the back-end crashing can be detected reliably in this case.
  5. Generally, we need to separate socket connection status from back end/front end status

@tsmaeder
Copy link
Contributor

This looks like a lot of scary changes!

@tsmaeder
Copy link
Contributor

I think basically it looks like this from the front-end's point of view:

  1. Connect to the back end, identify with a unique id
  2. Oh shit, I lost the connection to the back and. Try and reestablish it! Establish it with my original UUID
  3. The back end reconnected and told me it still has the same back end connection module running. Great, nothing more to do
  4. The back end told me it does not have the same connection module running, give up and reload the UI

@tsmaeder
Copy link
Contributor

From the back end's point of view, it would look like this:

  1. A front end opened a connection and requests a new UUID
  2. I start a whole new connection context for this front end and answer with an ID

or in the case of a reconnect:

  1. A front opened a connection and requested a reconnect with a UUID
  2. I look up the corresponding context
  3. If I find it, I just connect the connection to that context
  4. If I don't find it, I return an error to this front-end

@tsmaeder
Copy link
Contributor

There is a related PR that has been closed: #11499.

@safisa
Copy link
Contributor Author

safisa commented Oct 27, 2023

Hi,
Thank you for this detailed investigation...

I think also that this issue is the root cause of many other issues:

  1. the plugin reload affects other plugins on disconnection, and some stop working as expected (not only the git plugin). Moreover, some took time to reload like the redhat java plugin causing bad UX.
  2. Custom editors and webviews are fully reloaded causing some events to be broken and not functional after reconnect.
  3. Custom editors will lose any unsaved data!
    Thanks

@tsmaeder
Copy link
Contributor

Plugin reload on reconnect comes from #6159. The trouble with the "cleanup state on reconnect" is that we have to manually clean up all the places where we retain back-end related state. An approach based on retaining the back-end for a while could be done in one place and would be simpler.
I think the difference between reloading the whole front end an reloading all back-end related state is minimal. What does @planger have to say on performance in this case?

@tsmaeder
Copy link
Contributor

A couple of thoughts on lifecycle:

  1. When a front-end connects to a back end, the back-end services are not created immediately. Instead, they are only created (instantiated) when a front end opens a channel with the path the service is bound to.
  2. Proxies for front-end services are created immediately, but you won't be able to send any requests until the front-end creates a channel on the corresponding path

On the front end, the channels to the back-end are opened when we invoke AbstractConnectionFactory.createProxy(). This is usually done through a factory invocation for a service binding for a service identifier, see for example EnvVariablesServer. So the exact timing for the creation of the front end services (and consequently the "activation" of the back end services) is dependent on whenever someone gets an instance of a service proxy from the front-end inversify container. This creation of services "on demand" makes it very hard to find a place where we can say that all services have been created and are ready for use.
We had some problems in the past with plugin main/ext services where we would call back to main services from the constructor of an ext service and the main side service would be getting requests before it was properly constructed. So not having sufficient life cycle control for services can be a problem.
Note that a proxy for "the other side" of a connection no a particular path can be injected when a service handler is constructed. See the RPCServer interface.

@tsmaeder
Copy link
Contributor

There are several places of asynchronicity in the system:

  1. Service proxies in the front end are created immediately, but any invocations on the proxy will be deferred until the back end is connected and the channel corresponding to the service has been opened. Opening the channel requires a conversation with the back and and therefore is asynchronous. So in the absence of asynchronous resolution of inversify objects (which we don't use), this asynchronicty cannot be removed.
  2. On the back end, obviously, we need to defer creation of any front-end related objects until a front-end actually connects to the back end (see ConnectionContainerModule). Obviously, we cannot get rid of this asynchronicity
  3. There is then asynchronicity again when we wait for the front-end to open a channel on the service path in order to actually create the back-end service object. There are two reasons I can imagine deferred action: the first is faster startup. A back-end service might only be needed for a particular view, and as long as that view is not shown, the service proxy (and therefore the back-end service) might not get instantiated. While this argument looks convincing, it's not sure whether we have any significant gains in real life. Only measurements could tell us for sure.
    The second argument has to do with initialization order. As I have mentioned before, some back end services invoke front end services upon creation. If we only create back end services upon opening a service channel, we can be reasonably sure that the front end service we're calling is fully created. But as experience shows, this does not seem to work out 100%. Only a proper two phase initialization protocol would ensure that: in the first phase, fully construct the services, in the second phase run startup action for the service.

@tsmaeder
Copy link
Contributor

How could such a startup protocol look:

  1. Someone requests a remote service object, usually from a factory that binds a local service object (this might happen on the back-end or front-end), but for an example. let's assume the front end requests a back end service. The service is in "not ready" state
  2. The front requests the service from the back end, for example by opening a channel on the service path
  3. The back end reacts by constructing the service object.
  4. The back end sends a "service ready" message to the front end
  5. The back asynchronously invokes "init" on the service object. If the init consumes front end services, it must wait for them to be "ready".
  6. The front end receives the "ready" message and moves the service proxy to "ready" state.

@tsmaeder
Copy link
Contributor

Currently, remote services have the option to have a local "client" object that communicates back on the same services channel. IMO, this makes the code unnecessarily confusing: I would strictly separate remote services which I can call and notify and handlers which can be called and notified.

@tsmaeder
Copy link
Contributor

My concrete, minimal proposal would be this:

  1. We get rid of "reconnect support"
    This would mean that when we lose the connection to the back end, we need to reload the whole UI. There is a certain performance penalty for this, but just loading the front end is not that costly, especially in the presence of many plugins, which will have to be reloaded anyway. In order to make this approach more palatable, I suggest we
  2. Separate websocket connection from back end connection
    If we lose the websocket connection for a bit and then reconnect, we do not shut down the connection container module on the back end and restart plugins for example on the front end. The approach to do this would be to
  3. Keep the connection context around to reconnect to
    We would have a configurable timeout for how long we want the connection container module to stick around when we lose the back end connection. The back-end has a UUID. If the front end reconnects to the same back end id, we just reconnect the websocket and things continue as before.

Of course, the timeout for how long to keep the back end would have to be configurable: In the case of Electron, the timeout would be infinite, if we have a web-based solution that we deploy for a couple of developers in-house, we might want an hour, whereas if we run a public, free service, we might drop the back-end immediately upon websocket disconnect. In this way, the only downside relative to today's "reconnect" support would be that we reload the UI more often than before.

"Reconnect" support requires us to be able to selectively drop all state that is related to a particular back end connection context and to rebuild it upon reconnect. Historically, we have not been very good at that. Also, all extenders would have to support "reconnect" in their extensions, as well.
One risk with reconnecting to the same back-end is that the front end might miss some notifications that were sent, but did not reach the front end while disconnected. We could mitigate this risk by keeping a buffer of sent messages when the websocket is disconnected and send them upon reconnect.

@tsmaeder
Copy link
Contributor

@vince-fugnitto the last comment is my proposal, basically. Since you had to drop from the dev-call, I'd be interested in your comments.

@tsmaeder
Copy link
Contributor

An alternative idea, which occurred to me while talking about it in the dev call is this: if we had a "connection container module" which contains all objects related to a connection, we could do a proper reconnect. Right now, we can inject service proxies into any object in the container. If we had a child container (and the connection to the back end was only known inside that container), we would be forced to do explicitly export any objects outside of that container.

tsmaeder added a commit to tsmaeder/theia that referenced this issue Nov 13, 2023
Fixes eclipse-theia#12823

- refactor front end to allow for multiple reconnections
- remove IWebsockt abstractions
- separate front end connections from service channel management
- introduce mechanism to reconnect front end to existing connection
  context based on timeouts

Contributed on behalf of STMicroelectronics

Signed-off-by: Thomas Mäder <t.s.maeder@gmail.com>
tsmaeder added a commit to tsmaeder/theia that referenced this issue Nov 20, 2023
Fixes eclipse-theia#12823

- refactor front end to allow for multiple reconnections
- remove IWebsockt abstractions
- separate front end connections from service channel management
- introduce mechanism to reconnect front end to existing connection
  context based on timeouts

Contributed on behalf of STMicroelectronics

Signed-off-by: Thomas Mäder <t.s.maeder@gmail.com>
tsmaeder added a commit to tsmaeder/theia that referenced this issue Nov 20, 2023
Fixes eclipse-theia#12823

- refactor front end to allow for multiple reconnections
- remove IWebsockt abstractions
- separate front end connections from service channel management
- introduce mechanism to reconnect front end to existing connection
  context based on timeouts

Contributed on behalf of STMicroelectronics

Signed-off-by: Thomas Mäder <t.s.maeder@gmail.com>
tsmaeder added a commit to tsmaeder/theia that referenced this issue Dec 13, 2023
Fixes eclipse-theia#12823

- refactor front end to allow for multiple reconnections
- remove IWebsockt abstractions
- separate front end connections from service channel management
- introduce mechanism to reconnect front end to existing connection
  context based on timeouts

Contributed on behalf of STMicroelectronics

Signed-off-by: Thomas Mäder <t.s.maeder@gmail.com>
tsmaeder added a commit that referenced this issue Dec 18, 2023
Fixes #12823

- refactor front end to allow for multiple reconnections
- remove IWebsockt abstractions
- separate front end connections from service channel management
- introduce mechanism to reconnect front end to existing connection
  context based on timeouts

Contributed on behalf of STMicroelectronics

Signed-off-by: Thomas Mäder <t.s.maeder@gmail.com>
@JonasHelming JonasHelming mentioned this issue Dec 19, 2023
63 tasks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
messaging issues related to messaging plug-in system issues related to the plug-in system websocket issues related to websockets/messaging
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants