Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Complete, test JS API reconnects #730

Closed
niloc132 opened this issue Jun 9, 2021 · 2 comments · Fixed by #3502
Closed

Complete, test JS API reconnects #730

niloc132 opened this issue Jun 9, 2021 · 2 comments · Fixed by #3502
Assignees
Labels
Milestone

Comments

@niloc132
Copy link
Member

niloc132 commented Jun 9, 2021

Since replacing the websocket and gwt-rpc serialization with http2/grpc/protobuf, we haven't had an actual server deployed suitable for testing reconnecting a session after some kind of network loss, so the work is incomplete still. Presently we only are testing with the websocket proxy transport (so as to be useful on localhost or in other ssl-less environments, since http2 can't be used in the browser without ssl).

While we're probably interested in websocket proxy reconnects working correctly, it isn't as much of a priority as actual http2 reconnects, so the first step of this task is to get the stack running on a server with ssl and no websockets, then bounce envoy or the network connection to periodically break the streams and make sure that we are able to restore the session, or in cases where the session cannot correctly come back, restore tables which can be re-fetched.

@niloc132 niloc132 added the jsapi label Jun 9, 2021
@niloc132 niloc132 added this to the Backlog milestone Jun 9, 2021
@niloc132 niloc132 self-assigned this Jun 9, 2021
niloc132 added a commit that referenced this issue Aug 19, 2021
Added basic error handling, but not proper reconnects or reauth. Three error types are handled for the main streams or polling calls:

    auth failure which suggests either that the server was restarted or the session timed out somehow - in either case, a new auth is required to resume working
    Code.INTERNAL, which our grpc-web client library will pass in cases where the response doesn't make sense (broken payloads)
    Code.UNKNOWN, which our grpc-web client library will pass in cases where the transport doesn't seem to work (no response before connection died)

A new event was added, temporarily, to inform the UI that we are giving up if one of these happens, and the page needs to be reloaded. This will be removed when we have proper reconnect, and inform the UI again that the reconnect succeeded or failed.

Also modified a few errors being sent from the server to hopefully better reflect the intent of the gRPC status codes, and avoid using "INTERNAL" when "something bad" happens.

The UI will need a few small changes to handle this, and to handle a few cases where API calls can fail but not be registered in the UI.

Fixes #1065, but #730 needs to formally address this.
niloc132 added a commit that referenced this issue Aug 27, 2021
Fetch/h2-based grpc-web clients do not support bidirectional streaming at this time, but the websocket client we are using from https://github.com/improbable-eng/grpc-web/ does. This lets us keep a single websocket open instead of making many calls. The client has support now to abstract how a bidirectional stream is created, either by a single websocket, or by a pair of fetch calls.

Additionally, autocomplete was seeing problems that appeared related to the server-side grpc handling payloads out of order. These methods have been consolidated into a single bidirectional stream to ensure ordering is sane. The server now has support to only implement a bidirectional stream once, and mark the "open"/"next" rpc methods as part of that, and automatically have their requests enqueued to be handled serially.

Other improvements got slipped into this work as well, such as only interacting with autocomplete from the gRPC service and never from DB, so that dependency was cut (as well as the dependency on gRPC utils, which only existed to get jsr305 annotations into various projects).

The client is incomplete in this patch, we plan on continuing this work as part of #730.

Fixes #929
Fixes #834
Fixes #1049
@nbauernfeind
Copy link
Member

Note Devin says that adding an additional container to the docker-compose and restarting that is enough to cause docker to "hiccup" the entire network stack. This causes the UI to give up, but it sounds like an ideal test-bed for implementing the trivial reconnect scenario.

@devinrsmith
Copy link
Member

devinrsmith commented Sep 9, 2021

The data flow is:

crypto -> kafka -> grpc-api.

Restarting just the crypto service container causes the web ui message to display

"Unable to connect: Response closed without grpc-status (Headers only)"

a few seconds after crypto restarts (and I notice that new data is coming in to tables for those few seconds). It's followed by messages on the server of the form

ERROR c-nio-worker-ELG-3-1 | i.d.g.a.ArrowFlightUtil   | DoExchangeMarshaller{4e942be2}: unexpected error; force closing subscription: caused by io.grpc.StatusRuntimeException: CANCELLED: client cancelled

but, probably harmless according to #1205.

niloc132 added a commit that referenced this issue Mar 24, 2023
As long as a user's session on the server remains active, this should
enable a client to reconnect to that session, and continue using
already-exported objects. Some of the work is restored here for #3501
to be finished as well, but the client would need a way to create a new
session without user interaction.

The dh.IdeConnection.HACK_CONNECTION_FAILURE event is now deprecated,
and should be removed. Clients should instead use the
EVENT_DISCONNECTED and EVENT_RECONNECTED events to signal to the user
that an object isn't available. Alternatively, to detect shutdown, use
EVENT_SHUTDOWN.

Also fixes a bug where the last seen log item is replayed incorrectly.

Fixes #730
Fixes #225
This issue was closed.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants