Mitigate Split-brain of Long-lived Connection #526
Comments
I've confirmed that server is sending
But I'm still searching for a way to capture the HTTP/2 GOAWAY frame in gRPC.
As far as I understand, gRPC's HTTP/2 transport layer( So I left a question in grpc/grpc-go, hoping gRPC members can give me a good explanation.
I have discussed this issue with the gRPC community, and I found out that our
Since our Therefore, we might need to add a timer in our
After more research on gRPC server-side streaming usage, I found the KubeCon 2018 talk "Using gRPC for Long-lived and Streaming RPCs" by Eric Anderson (Google), which explains the issues with long-lived RPCs in gRPC and the related improvements. This is what I have concluded based on the above reference.
Therefore, I suggest two options for closing the RPC connection.
Option 1 is the "graceful" and "suggested" way to resolve this issue, but I think option 2 is more suitable for our use case. Because Yorkie is used for real-time collaboration, sync sensitivity between peers is very important. Therefore, noticing the split-brain issue and closing the connection as soon as possible matters more than a graceful close with a long interval. This
To conclude:
We need to reconsider this issue because we changed our RPC layer from gRPC to Connect.
Description:
Recently, we introduced a sharded cluster mode to support production environments.

However, there is an issue with long-lived connections such as the `WatchDocument` RPC: the connection becomes split-brained when the backend host set changes (a server is added or removed). To get more context about this issue, follow the links below:

Currently, we mitigate this issue by forcefully closing connections with Envoy's `stream_idle_timeout`, so that when a connection stays idle for a while (1 min) due to split-brain, it is forcefully closed and rerouted to the proper backend server.

But this is not a perfect solution, because there is a period (about 1 min or less) between the moment the connection becomes split-brained and the moment it is reestablished after the forceful close. During this period, users cannot receive any change notifications via `WatchDocument`, which decreases sync sensitivity between peers.

To solve this issue, we need a graceful and instant way to reestablish the connection when split-brain occurs.
Since gRPC is based on HTTP/2, we can use the HTTP/2 `GOAWAY` frame to gracefully close the connection. As RFC 7540 defines, the `GOAWAY` frame is used to initiate a graceful connection close.

We can use gRPC's `MAX_CONNECTION_AGE` to send a GOAWAY frame when a connection reaches its maximum age (this is what gRPC suggests for load balancing long-lived connections).

Moreover, we can use Envoy's `close_connections_on_host_set_change` to close connections instantly and gracefully. This option drains connections when the backend host set changes, and the drain sequence sends an HTTP/2 GOAWAY to terminate the connection.

But GOAWAY is not a signal to close the connection instantly. It only tells the client not to send additional requests to the server (grpc/grpc-java#8770), so we need to handle connection closure on the client side.

When the client receives a GOAWAY frame from the server, it needs to reset the connection and reestablish it.
So the overall sequence will look something like this:

`close_connections_on_host_set_change` is set in the Envoy proxy.

This process will provide an instant and graceful way to close connections, and completely resolve the `WatchDocument` split-brain issue.

We need to implement a GOAWAY handler on the client side, in the Go SDK, the JS SDK, and so on.
I'm currently looking for a way to implement this in Go, and I will post updates in the comments below.
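
To make the client-side behavior described above more concrete, here is a minimal Go sketch of a watch loop that reopens the stream once the server ends it. It is only an illustration under assumptions: `WatchStream`, `OpenFunc`, `watchWithReconnect`, `fakeStream`, and `runDemo` are hypothetical names, not part of the Yorkie SDK or gRPC. The sketch also does not capture the GOAWAY frame itself. It assumes the client observes the stream ending (modeled here as `io.EOF`) after the connection is drained, and reacts by opening a new stream so that the proxy can route it to the current backend.

```go
package main

import (
	"context"
	"errors"
	"fmt"
	"io"
)

// WatchStream is a hypothetical stand-in for a server-streaming RPC such as
// WatchDocument. Recv returns the next event, or an error once the stream ends.
type WatchStream interface {
	Recv() (string, error)
}

// OpenFunc is a hypothetical factory that issues the streaming RPC again.
type OpenFunc func(ctx context.Context) (WatchStream, error)

var errOpenLimit = errors.New("open limit reached")

// watchWithReconnect passes events to handle and reopens the stream whenever
// it ends with io.EOF (assumed here to model a graceful close after a drain).
// It stops on context cancellation, on any other error, or after maxOpens
// streams, and returns the number of streams opened.
func watchWithReconnect(ctx context.Context, open OpenFunc, maxOpens int, handle func(string)) (int, error) {
	opens := 0
	for opens < maxOpens {
		if err := ctx.Err(); err != nil {
			return opens, err
		}
		stream, err := open(ctx)
		if err != nil {
			return opens, fmt.Errorf("open stream: %w", err)
		}
		opens++
		for {
			ev, err := stream.Recv()
			if errors.Is(err, io.EOF) {
				break // stream ended gracefully: reopen it
			}
			if err != nil {
				return opens, fmt.Errorf("receive: %w", err)
			}
			handle(ev)
		}
	}
	return opens, errOpenLimit
}

// fakeStream replays fixed events and then reports io.EOF.
type fakeStream struct{ events []string }

func (s *fakeStream) Recv() (string, error) {
	if len(s.events) == 0 {
		return "", io.EOF
	}
	ev := s.events[0]
	s.events = s.events[1:]
	return ev, nil
}

// runDemo simulates a stream that ends after one event, followed by a
// reopened stream that delivers the remaining events.
func runDemo() ([]string, int) {
	backends := [][]string{{"change-1"}, {"change-2", "change-3"}}
	next := 0
	open := func(ctx context.Context) (WatchStream, error) {
		s := &fakeStream{events: backends[next]}
		next++
		return s, nil
	}
	var received []string
	opens, _ := watchWithReconnect(context.Background(), open, len(backends), func(ev string) {
		received = append(received, ev)
	})
	return received, opens
}

func main() {
	received, opens := runDemo()
	fmt.Println(received, opens) // prints: [change-1 change-2 change-3] 2
}
```

A real client would probably add backoff between attempts and would not cap the number of reopens. The cap exists here only so that the demo terminates.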
Why:
To completely resolve the decreased sync sensitivity between peers caused by the `WatchDocument` split-brain issue.