
clientv3: wait for ConnectNotify before sending RPCs #8601

Merged 2 commits into etcd-io:master from the notify branch on Sep 27, 2017

Conversation

gyuho (Contributor) commented Sep 25, 2017

With slow CPU, gRPC can lag behind with RPCs being sent before
calling 'Up', returning 'no address available' on the first try.

Fix #8581 and many other test failures since #8545.

Full explanation coming...

@gyuho gyuho force-pushed the notify branch 4 times, most recently from 4b725dd to 09a43e5 on September 25, 2017 09:47
@gyuho gyuho changed the title [DO NOT MERGE] clientv3: wait for ConnectNotify before sending RPCs clientv3: wait for ConnectNotify before sending RPCs Sep 25, 2017
gyuho (Contributor, Author) commented Sep 25, 2017

@xiang90 @fanminshi

The health balancer in #8545 adds a method balancer.next to force an endpoint switch, by notifying an empty grpc.Address and draining all current connections in gRPC. But the gRPC routine that runs after notify is asynchronous with balancer.next(). On a slow machine, it is possible that:

  1. An RPC is sent before a connection has been established via balancer.Up.
  2. gRPC returns the error "no address available".
  3. clientv3 newRetryWrapper calls balancer.next().
  4. An address is pinned before balancer.next() has completed.
  5. balancer.next() triggers gRPC to drain all connections (including pinned ones).
  6. The RPC fails with "grpc: the connection is drained".
  7. clientv3 newRetryWrapper receives "grpc: the connection is closing" and exits.

This happens because balancer.next() is called before the first connection has been established, and the gRPC routine then drains all connections while the clientv3 balancer pins an address. This patch makes sure the connection is ready before sending an RPC, and returns context errors on <-context.Context.Done(). A minimal sketch of the wait follows.
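For illustration, here is a minimal, self-contained sketch of the wait-before-RPC idea. It is not the actual clientv3/retry.go code; connectNotifier, waitForConnection, and fakeBalancer are hypothetical names, and returning the context error (rather than grpc.ErrClientConnClosing) follows the review discussion below.

```go
package main

import (
	"context"
	"fmt"
	"time"
)

// connectNotifier models the single balancer method this patch waits on:
// ConnectNotify returns a channel that is closed once an address is Up.
type connectNotifier interface {
	ConnectNotify() <-chan struct{}
}

// waitForConnection blocks until the balancer reports an established
// connection, or until the RPC context or the client context is done,
// in which case the corresponding context error is returned.
func waitForConnection(rpcCtx, clientCtx context.Context, b connectNotifier) error {
	select {
	case <-b.ConnectNotify():
		return nil
	case <-rpcCtx.Done():
		return rpcCtx.Err()
	case <-clientCtx.Done():
		return clientCtx.Err()
	}
}

// fakeBalancer stands in for the clientv3 balancer: its notify channel is
// closed after a short delay, as if balancer.Up had just been called.
type fakeBalancer struct{ ch chan struct{} }

func (f *fakeBalancer) ConnectNotify() <-chan struct{} { return f.ch }

func main() {
	b := &fakeBalancer{ch: make(chan struct{})}
	go func() {
		time.Sleep(50 * time.Millisecond)
		close(b.ch) // connection established; RPCs are now safe to send
	}()

	rpcCtx, cancel := context.WithTimeout(context.Background(), time.Second)
	defer cancel()

	if err := waitForConnection(rpcCtx, context.Background(), b); err != nil {
		fmt.Println("not connected:", err)
		return
	}
	fmt.Println("connection ready; sending the RPC cannot race with Up")
}
```

Without such a wait, step 1 of the list above can fire before Up is called, which is exactly the race the failing tests were hitting.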

@@ -442,7 +442,7 @@ func TestKVGetErrConnClosed(t *testing.T) {
 	go func() {
 		defer close(donec)
 		_, err := cli.Get(context.TODO(), "foo")
-		if err != nil && err != grpc.ErrClientConnClosing {
+		if err != nil && err != context.DeadlineExceeded && err != grpc.ErrClientConnClosing {
 			t.Fatalf("expected %v, got %v", grpc.ErrClientConnClosing, err)

Contributor (review comment):
this needs to be similar to line 795?

Contributor (review comment):
Also, this seems wrong: the user does not pass in a deadline (context.TODO()), so why would DeadlineExceeded be returned? Is a dialing deadline exceeded in this case, or what?
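As a side note on this question: context.TODO() itself carries no deadline, so any DeadlineExceeded seen by this Get would have to originate elsewhere (for example, from the dialing deadline the reviewer mentions). A quick standard-library check:

```go
package main

import (
	"context"
	"fmt"
)

func main() {
	// context.TODO() has no deadline attached, so context.DeadlineExceeded
	// can never be produced by this context itself.
	_, hasDeadline := context.TODO().Deadline()
	fmt.Println(hasDeadline) // false
}
```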

@@ -791,8 +791,8 @@ func TestKVGetStoppedServerAndClose(t *testing.T) {
 	// this Get fails and triggers an asynchronous connection retry
 	_, err := cli.Get(ctx, "abc")
 	cancel()
-	if !strings.Contains(err.Error(), "context deadline") {
+	if err != nil && err != context.DeadlineExceeded && err != grpc.ErrClientConnClosing {
 		t.Fatal(err)

Contributor (review comment):
We should return deadline exceeded in this case.
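To spell out the expectation: when the Get's context does carry a deadline and that deadline passes, ctx.Err() is exactly context.DeadlineExceeded, so the assertion could check for that single error. A small sketch, not the test code itself:

```go
package main

import (
	"context"
	"fmt"
	"time"
)

func main() {
	ctx, cancel := context.WithTimeout(context.Background(), time.Millisecond)
	defer cancel()
	<-ctx.Done()

	// Once the deadline has passed, the context reports exactly
	// context.DeadlineExceeded, which is the error the reviewer expects
	// the Get in this test to surface.
	fmt.Println(ctx.Err() == context.DeadlineExceeded) // true
}
```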

xiang90 (Contributor) commented Sep 25, 2017

I feel we should not mix the deadline-exceeded and client-closing errors together.

select {
case <-c.balancer.ConnectNotify():
case <-rpcCtx.Done():
	return grpc.ErrClientConnClosing

Contributor (review comment):
We should just return rpcCtx.Err() here, no?

case <-rpcCtx.Done():
	return grpc.ErrClientConnClosing
case <-c.ctx.Done():
	return grpc.ErrClientConnClosing

Contributor (review comment):
same here

@gyuho gyuho added the WIP label Sep 25, 2017
@gyuho gyuho force-pushed the notify branch 6 times, most recently from 3ec5fac to 2619d7f on September 26, 2017 19:57
@gyuho gyuho removed the WIP label Sep 26, 2017
gyuho (Contributor, Author) commented Sep 26, 2017

@xiang90 PTAL. Context errors and client closing errors are mixed only for grpc-proxy tests.

With slow CPU, gRPC can lag behind with RPCs being sent before
calling 'Up', returning 'no address available' on the first try.

Signed-off-by: Gyu-Ho Lee <gyuhox@gmail.com>
Signed-off-by: Gyu-Ho Lee <gyuhox@gmail.com>
xiang90 (Contributor) commented Sep 27, 2017

lgtm

@gyuho gyuho merged commit 398e6ba into etcd-io:master Sep 27, 2017
@gyuho gyuho deleted the notify branch September 27, 2017 21:22
codecov-io commented:
Codecov Report

❗ No coverage uploaded for pull request base (master@b6b4898).
The diff coverage is 100%.


@@            Coverage Diff            @@
##             master    #8601   +/-   ##
=========================================
  Coverage          ?   75.81%           
=========================================
  Files             ?      358           
  Lines             ?    29368           
  Branches          ?        0           
=========================================
  Hits              ?    22265           
  Misses            ?     5546           
  Partials          ?     1557
Impacted Files        Coverage Δ
clientv3/retry.go     61.57% <100%> (ø)

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Last update b6b4898...6368159.
