
clientv3: when a endpoint is unavailable, switch it in retry #8327

Closed
wants to merge 1 commit

Conversation

HardySimpson
Contributor

clientv3: balancer.go and retry.go

when a endpoint is unavailable, switch it in retry

fix: #8326

Contributor

@heyitsanthony heyitsanthony left a comment

this needs tests; dangerously broken as-is

@@ -36,14 +36,22 @@ func (c *Client) newRetryWrapper() retryRpcFunc {
eErr := rpctypes.Error(err)
// always stop retry on etcd errors
if _, ok := eErr.(rpctypes.EtcdError); ok {
return err
if grpc.Code(err) != codes.Unavailable {
Contributor

This isn't safe. Retrying will violate at-most-once semantics for Put/Del/Txn since a request timed out error could be returned even if the request makes it through raft. This can happen if there's a partition between the time of submitting a raft proposal and commit.

Contributor Author

code changed

b.updateAddrs(allEpsWithoutPinAddr)


// at most sleep 10 seconds here
Contributor

depending on sleeps is slow and unreliable; the balancer probably needs to directly reason about graylisted endpoints
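For illustration only, a minimal sketch of what graylisting could look like (the graylist type and its methods are hypothetical, not part of etcd's balancer): endpoints that just failed are kept out of rotation for a cooldown window, instead of relying on sleeps in the retry path.

import (
	"sync"
	"time"
)

// graylist keeps recently-failed endpoints out of rotation for a cooldown period.
type graylist struct {
	mu       sync.Mutex
	until    map[string]time.Time // endpoint -> time it becomes eligible again
	cooldown time.Duration
}

func newGraylist(cooldown time.Duration) *graylist {
	return &graylist{until: make(map[string]time.Time), cooldown: cooldown}
}

// fail records that ep just failed and graylists it for the cooldown window.
func (g *graylist) fail(ep string) {
	g.mu.Lock()
	defer g.mu.Unlock()
	g.until[ep] = time.Now().Add(g.cooldown)
}

// pick returns the first endpoint whose cooldown has expired; if every endpoint
// is graylisted it falls back to the first one rather than blocking.
func (g *graylist) pick(eps []string) string {
	if len(eps) == 0 {
		return ""
	}
	g.mu.Lock()
	defer g.mu.Unlock()
	now := time.Now()
	for _, ep := range eps {
		if now.After(g.until[ep]) {
			return ep
		}
	}
	return eps[0]
}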

Contributor Author

code also changed

@xiang90
Contributor

xiang90 commented Jul 28, 2017

@heyitsanthony @fanminshi

I thought our balancer already knows how to switch endpoints on failures/timeouts. No?

@heyitsanthony
Contributor

@xiang90 no, it's still an open issue. It will only detect/switch on transport errors, not errors returned by etcd.

@xiang90
Contributor

xiang90 commented Jul 28, 2017

transport errors, not errors returned by etcd.

isn't a timeout a transport error?

@heyitsanthony
Contributor

@xiang90 transport error between the client and server, not transport errors between peers

@xiang90
Contributor

xiang90 commented Jul 29, 2017

@heyitsanthony

If that is the case, then this should be done on the server side; the server will cut off the connection if it is partitioned.

@heyitsanthony
Contributor

@xiang90 I don't think that's enough; the client could retry the same endpoint without graylisting.

@HardySimpson
Contributor Author

HardySimpson commented Jul 29, 2017

@heyitsanthony I agree; a server-side switch is not enough, since the server cannot tell whether the client's request actually succeeded, failed, or timed out

@HardySimpson
Contributor Author

@heyitsanthony

about the previous code comments:

  • the sleep is slow and not reliable, so I changed it to a sync.Cond-based approach in the code

// at most switch one time per 30 seconds
// can be securely called by concurrently
c.SwitchEndpoint()
}
Contributor Author

@HardySimpson HardySimpson Jul 29, 2017

  • regarding the at-most-once semantics of Put/Txn: only switch endpoints here, but do not retry. Let this request fail and hope that a later RPC succeeds

@@ -44,6 +52,7 @@ func (c *Client) newRetryWrapper() retryRpcFunc {
return err
Contributor Author

  • But if the client hits a gRPC-level request timeout, that also violates the at-most-once semantics and is not OK

Contributor

these are only returned before the request is sent over the wire afaik

Contributor Author

yes, I searched the docs; see the comments below

@HardySimpson HardySimpson force-pushed the switch-on-retry branch 4 times, most recently from 4c29508 to 7398596 on July 29, 2017 07:47
Contributor

@heyitsanthony heyitsanthony left a comment

this is not adequately tested

@@ -44,6 +52,7 @@ func (c *Client) newRetryWrapper() retryRpcFunc {
return err
Contributor

these are only returned before the request is sent over the wire afaik


b.mu.Lock()
for (prevPinAddr == b.pinAddr) {
b.pinAddrCond.Wait()
Contributor

there is no way for the ctx attached to the rpc to cancel and break out of this wait
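For reference, one common workaround is sketched below, assuming pinAddrCond is created with sync.NewCond(&b.mu) as discussed later in this thread; waitForSwitch is a hypothetical helper, not code from this PR. A goroutine broadcasts when the ctx is cancelled (taking the lock so the wakeup cannot be missed), and the waiter re-checks ctx after every wakeup.

// waitForSwitch blocks until the balancer pins a different address or ctx is done.
func waitForSwitch(ctx context.Context, b *simpleBalancer, prevPinAddr string) error {
	done := make(chan struct{})
	defer close(done)
	go func() {
		select {
		case <-ctx.Done():
			b.mu.Lock() // hold the lock so Broadcast cannot race the waiter below
			b.pinAddrCond.Broadcast()
			b.mu.Unlock()
		case <-done:
		}
	}()

	b.mu.Lock()
	defer b.mu.Unlock()
	for prevPinAddr == b.pinAddr {
		if ctx.Err() != nil {
			return ctx.Err()
		}
		b.pinAddrCond.Wait() // releases b.mu while waiting
	}
	return nil
}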

Contributor Author

same as the previous comment: the cond is tied to sb.mu

}


b.mu.Lock()
Contributor

this will deadlock because b.mu.Lock isn't released on the Wait; anything that tries to update pinAddrCond will wait forever to acquire the lock

Contributor Author

I forgot to initialize the cond; added it now
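For completeness, a one-line sketch of the initialization being referred to, assuming the cond is meant to pair with the balancer's b.mu shown in the snippets above; Wait only releases a mutex if the Cond was built on that same mutex.

// in the simpleBalancer constructor (sketch)
b.pinAddrCond = sync.NewCond(&b.mu) // Wait() now unlocks/relocks b.mu while blocked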

@@ -44,6 +52,7 @@ func (c *Client) newRetryWrapper() retryRpcFunc {
return err
}


Contributor

please don't add blank lines if there's no new code attached

Contributor Author

OK, fixed

// try to request another endpoint at next rpc
// leave this fail
// already do anti-too-frequently switch
// at most switch one time per 30 seconds
Contributor

the implementation details of the balancer policy shouldn't be included in the comment at the callsite since if the balancer code changes, this comment will have to be updated (which probably won't happen)

Contributor Author

yes, I deleted all those comments and renamed the function to TrySwitchEndpoint, which is a better name for what it does


b.mu.RUnlock()

if lastSwitchTime.Add(30 * time.Second).After(tNow) {
Contributor

This needs to be endpoint graylisting; a single timeout isn't enough. Suppose there are 5 members with 2 partitioned. The client could end up bouncing between both partitioned endpoints for a while.

Contributor Author

@HardySimpson HardySimpson Jul 31, 2017

A graylist has one problem: when should we put a graylisted endpoint back into normal rotation? In the client we don't have enough information to decide.

I think a good approach is to keep a counter on each endpoint; every time we switch, we connect to the endpoint with the smallest retry count. All counters would be reset whenever updateEndpoints happens, periodically.

Still, this would change a lot of code in the current balancer, and it may cause some imbalance between multiple clients: after a while, most clients might attach to one healthy member and leave a just-recovered member without clients.

The current algorithm is random choice: once a client finds a healthy endpoint, it sticks to it. Random choice is close to balanced, since a client stays on a healthy member once it finds one, but it does cause "bouncing between both partitioned endpoints" for a while before eventually working well.

Contributor Author

@HardySimpson HardySimpson Jul 31, 2017

Maybe an even better way is to choose the endpoint based on both randomness and a retry count: for example, we give every endpoint 100 points, and every time a switch happens we subtract one point.

The switch then selects by weighted probability. If there are 3 members, their points might look like this:

member   points   choose probability
etcd-0   100      100/(100+80+60) = 42%
etcd-1    80       80/(100+80+60) = 33%
etcd-2    60       60/(100+80+60) = 25%

If this is OK, it should be implemented in another PR.
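A rough sketch of the weighted-random pick described above; pickWeighted is a hypothetical helper for illustration, not code from this PR.

import "math/rand"

// pickWeighted chooses an endpoint with probability proportional to its points,
// e.g. {"etcd-0": 100, "etcd-1": 80, "etcd-2": 60} gives roughly 42%/33%/25%.
func pickWeighted(points map[string]int) string {
	total := 0
	for _, p := range points {
		total += p
	}
	if total <= 0 {
		return ""
	}
	n := rand.Intn(total) // 0 <= n < total
	for ep, p := range points {
		if n < p {
			return ep
		}
		n -= p
	}
	return "" // unreachable: n always falls inside some endpoint's share
}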

@HardySimpson HardySimpson force-pushed the switch-on-retry branch 5 times, most recently from 1a3ef27 to fc2fcd5 on July 31, 2017 02:12
@HardySimpson
Contributor Author

HardySimpson commented Jul 31, 2017

About retrying: I found this in gRPC's docs:

    // FailedPrecondition indicates operation was rejected because the
    // system is not in a state required for the operation's execution.
    // For example, directory to be deleted may be non-empty, an rmdir
    // operation is applied to a non-directory, etc.
    //
    // A litmus test that may help a service implementor in deciding
    // between FailedPrecondition, Aborted, and Unavailable:
    //  (a) Use Unavailable if the client can retry just the failing call.
    //  (b) Use Aborted if the client should retry at a higher-level
    //      (e.g., restarting a read-modify-write sequence).
    //  (c) Use FailedPrecondition if the client should not retry until
    //      the system state has been explicitly fixed.  E.g., if an "rmdir"
    //      fails because the directory is non-empty, FailedPrecondition
    //      should be returned since the client should not retry unless
    //      they have first fixed up the directory by deleting files from it.
    //  (d) Use FailedPrecondition if the client performs conditional
    //      REST Get/Update/Delete on a resource and the resource on the
    //      server does not match the condition. E.g., conflicting
    //      read-modify-write on the same resource.
    FailedPrecondition Code = 9

    // OutOfRange means operation was attempted past the valid range.
    // E.g., seeking or reading past end of file.
    //
    // Unlike InvalidArgument, this error indicates a problem that may
    // be fixed if the system state changes. For example, a 32-bit file
    // system will generate InvalidArgument if asked to read at an
    // offset that is not in the range [0,2^32-1], but it will generate
    // OutOfRange if asked to read from an offset past the current
    // file size.
    //
    // There is a fair bit of overlap between FailedPrecondition and
    // OutOfRange.  We recommend using OutOfRange (the more specific
    // error) when it applies so that callers who are iterating through
    // a space can easily look for an OutOfRange error to detect when
    // they are done.
    OutOfRange Code = 11

As it says, on codes.Unavailable the client should retry, and on codes.FailedPrecondition the client should not retry.

But in etcd's current errors:

	ErrGRPCNoLeader                   = grpc.Errorf(codes.Unavailable, "etcdserver: no leader")
	ErrGRPCNotLeader                  = grpc.Errorf(codes.Unavailable, "etcdserver: not leader")
	ErrGRPCNotCapable                 = grpc.Errorf(codes.Unavailable, "etcdserver: not capable")
	ErrGRPCStopped                    = grpc.Errorf(codes.Unavailable, "etcdserver: server stopped")
	ErrGRPCTimeout                    = grpc.Errorf(codes.Unavailable, "etcdserver: request timed out")
	ErrGRPCTimeoutDueToLeaderFail     = grpc.Errorf(codes.Unavailable, "etcdserver: request timed out, possibly due to previous leader failure")
	ErrGRPCTimeoutDueToConnectionLost = grpc.Errorf(codes.Unavailable, "etcdserver: request timed out, possibly due to connection lost")
	ErrGRPCUnhealthy                  = grpc.Errorf(codes.Unavailable, "etcdserver: unhealthy cluster")

So, should we re-classify these errors?
Put all the timeout errors under codes.FailedPrecondition (and do not retry them), and leave the no-leader errors as codes.Unavailable.

and the retry code would look like this:

func (c *Client) newRetryWrapper() retryRpcFunc {
	return func(rpcCtx context.Context, f rpcFunc) error {
		for {
			err := f(rpcCtx)
			if err == nil {
				return nil
			}

			if grpc.Code(err) != codes.Unavailable || {{ some timeout judge }} {
				c.TrySwitchEndpoint()
			}
			// only retry if unavailable
			if grpc.Code(err) != codes.Unavailable {
				return err
			}
			select {
			case <-c.balancer.ConnectNotify():
			case <-rpcCtx.Done():
				return rpcCtx.Err()
			case <-c.ctx.Done():
				return c.ctx.Err()
			}
		}
	}
}

@HardySimpson HardySimpson force-pushed the switch-on-retry branch 2 times, most recently from 9da2545 to 6c4e421 on August 2, 2017 01:50
@HardySimpson HardySimpson changed the title from "when a endpoint is unavailable, switch it in retry" to "clientv3: when a endpoint is unavailable, switch it in retry" on Aug 2, 2017
@HardySimpson HardySimpson force-pushed the switch-on-retry branch 4 times, most recently from 1e1ca75 to 8421b28 on August 3, 2017 07:34
@HardySimpson HardySimpson force-pushed the switch-on-retry branch 13 times, most recently from f5910ea to a621ee2 on August 8, 2017 02:22
@HardySimpson
Contributor Author

@heyitsanthony I tried for a week to implement a good switch; the best I can achieve is in this PR.

However, I found that because the balancer code is now async rather than sync, I cannot write a stable unit test. The unit test's output currently looks like this:

"D:\Program Files (x86)\JetBrains\IntelliJ IDEA Community Edition 2017.1\bin\runnerw.exe" C:/Go\bin\go.exe test -v -timeout 30s github.com/coreos/etcd/clientv3 -run ^TestBalancerTrySwitch$
get from notify addr:  [{localhost:2379 <nil>} {localhost:22379 <nil>} {localhost:32379 <nil>}]
get from notify addr:  [{localhost:2379 <nil>}]
-----
get from notify addr:  [{localhost:22379 <nil>} {localhost:32379 <nil>}]
get from notify addr:  [{localhost:2379 <nil>}]
get from notify addr:  [{localhost:22379 <nil>} {localhost:32379 <nil>}]
get from notify addr:  [{localhost:22379 <nil>} {localhost:32379 <nil>}]
get from notify addr:  [{localhost:22379 <nil>}]
get from notify addr:  [{localhost:22379 <nil>} {localhost:32379 <nil>} {localhost:2379 <nil>}]
get from notify addr:  [{localhost:22379 <nil>}]
-----
get from notify addr:  [{localhost:32379 <nil>} {localhost:2379 <nil>}]
get from notify addr:  [{localhost:22379 <nil>}]
get from notify addr:  [{localhost:32379 <nil>} {localhost:2379 <nil>}]
get from notify addr:  [{localhost:32379 <nil>} {localhost:2379 <nil>} {localhost:22379 <nil>}]
get from notify addr:  [{localhost:32379 <nil>}]
get from notify addr:  [{localhost:32379 <nil>} {localhost:2379 <nil>} {localhost:22379 <nil>}]
get from notify addr:  [{localhost:32379 <nil>}]
-----
get from notify addr:  [{localhost:2379 <nil>} {localhost:22379 <nil>}]
get from notify addr:  [{localhost:32379 <nil>}]
get from notify addr:  [{localhost:2379 <nil>} {localhost:22379 <nil>}]
get from notify addr:  [{localhost:2379 <nil>}]
get from notify addr:  [{localhost:2379 <nil>} {localhost:22379 <nil>}]
get from notify addr:  [{localhost:2379 <nil>}]
get from notify addr:  [{localhost:2379 <nil>} {localhost:22379 <nil>} {localhost:32379 <nil>}]
get from notify addr:  [{localhost:2379 <nil>}]
-----
get from notify addr:  [{localhost:22379 <nil>} {localhost:32379 <nil>}]
get from notify addr:  [{localhost:2379 <nil>}]
get from notify addr:  [{localhost:22379 <nil>} {localhost:32379 <nil>}]
get from notify addr:  [{localhost:22379 <nil>} {localhost:32379 <nil>}]
get from notify addr:  [{localhost:22379 <nil>} {localhost:32379 <nil>}]
get from notify addr:  [{localhost:22379 <nil>} {localhost:32379 <nil>} {localhost:2379 <nil>}]
get from notify addr:  [{localhost:32379 <nil>}]
-----

It does work, though: with sb.trySwitchEndpoint(0), i.e. a switch duration of 0 seconds, it actually performs the switch.

So, what's next? Could you solve this problem properly in the next etcd version? We really need it.

@heyitsanthony
Contributor

@HardySimpson
Puts should follow at-most-once semantics. gRPC, not etcd, is returning Unavailable codes even if it's possible the RPC request was already transmitted to the server. Error codes like ErrGRPCNoLeader should be retried in all cases, ErrGRPCTimeout shouldn't be retried for puts, etc. I'm not really interested in reclassifying all the error codes at this time since it'll likely break older clients.

I'll try to put some time into this. The balancer doesn't really need synchronous endpoint switching as much as the client should eventually switch endpoints if RPCs begin returning unavailable responses so that it can at least make forward progress.
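To make that concrete, here is a hedged sketch of the distinction being described (shouldRetry is a hypothetical helper, only illustrating the idea; the error description comes from the etcd errors quoted earlier in this thread): errors like "no leader" are safe to retry for any RPC, while the "request timed out" family is only safe for read-only RPCs, since the write may already have gone through raft.

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/codes"
)

// shouldRetry reports whether a failed RPC can be retried without risking a
// double-applied write.
func shouldRetry(err error, readOnly bool) bool {
	if grpc.Code(err) != codes.Unavailable {
		return false // non-Unavailable etcd errors always stop the retry loop
	}
	if readOnly {
		return true // Range/Get is idempotent, so a retry cannot double-apply
	}
	// For Put/Delete/Txn, only retry when the error guarantees the request
	// never entered raft; a timeout gives no such guarantee.
	return grpc.ErrorDesc(err) == "etcdserver: no leader"
}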

when a endpoint is unavailable, switch it in retry

fix: etcd-io#8326
@HardySimpson
Contributor Author

@heyitsanthony, I have refactored some code: it now checks whether the last pinAddr equals the current pinAddr, which acts something like a graylist, and I added a unit test to verify the correctness. So maybe this PR can be merged?

// TrySwitchEndpoint try to make balancer change it's fix endpoint.
// It may do nothing while their is only one endpoint, or switch has just happened
// It is called when the endpoint now is not available, like network partition happens
func (c *Client) TrySwitchEndpoint() {
Contributor

I don't think this should be a public interface; the user shouldn't have to give hints to the client to switch endpoints. Just call c.balancer.trySwitchEndpoint in retry.go?

@@ -237,3 +238,82 @@ func (kcl *killConnListener) close() {
close(kcl.stopc)
kcl.wg.Wait()
}

func TestBalancerTrySwitch(t *testing.T) {
Contributor

The mock grpc notifications this test uses could diverge from grpc behavior and the test itself is rather complicated. A clientv3/integration/kv_test.go test would be better instead since what matters is the Get failing over:

func TestKVGetFailoverPartition(t *testing.T) {
        defer testutil.AfterTest(t)
        clus := integration.NewClusterV3(t, &integration.ClusterConfig{Size: 3})
        defer clus.Terminate(t)

        cli, err := clientv3.New(clientv3.Config{
                Endpoints:   clus.Client(0).Endpoints(),
                DialTimeout: time.Second,
        })
        testutil.AssertNil(t, err)
        defer cli.Close()

        // force to connect to member 0
        _, err = cli.Get(context.TODO(), "abc")
        testutil.AssertNil(t, err)

        // permit failover to 1
        cli.SetEndpoints(append(cli.Endpoints(), clus.Client(1).Endpoints()...)...)

        // partition member 0 from rest of cluster
        clus.Members[0].InjectPartition(t, clus.Members[1:])

        // may have triggered a leader election; expect timeouts
        timeout := clus.Members[1].ServerConfig.ReqTimeout()
        // member 0 fails to get consensus for the l-read, balancer switches
        ctx, cancel := context.WithTimeout(context.TODO(), 3*timeout)
        _, err = cli.Get(ctx, "abc")
        cancel()
        testutil.AssertNil(t, err)
}

@@ -310,7 +323,9 @@ func (b *simpleBalancer) Get(ctx context.Context, opts grpc.BalancerGetOptions)
return grpc.Address{Addr: addr}, func() {}, nil
}

func (b *simpleBalancer) Notify() <-chan []grpc.Address { return b.notifyCh }
func (b *simpleBalancer) Notify() <-chan []grpc.Address {
Contributor

don't change

// notify client that a connection is up
b.readyOnce.Do(func() { close(b.readyc) })
b.readyOnce.Do(func() {
Contributor

don't change

close(sb.downc)
go sb.updateNotifyLoop()
return sb
}

func (b *simpleBalancer) Start(target string, config grpc.BalancerConfig) error { return nil }
func (b *simpleBalancer) Start(target string, config grpc.BalancerConfig) error {
Contributor

don't change


func() {
b.mu.Lock()
defer b.mu.Unlock()
Contributor

deferring will deadlock

  1. acquire b.mu.Lock
  2. block on b.updateAddrsC <- struct{}{}
  3. updateNotifyLoop tries to acquire b.mu.RLock and deadlocks.

Contributor Author

OK, put it outside the func()
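A fragment-level sketch of the reordering being agreed on here, using the field names shown in this PR's balancer (the surrounding function body is elided): mutate state while holding b.mu, unlock, and only then signal updateAddrsC so that updateNotifyLoop can acquire b.mu.RLock and drain the channel.

b.mu.Lock()
// ... update b.addrs / pinned-address bookkeeping here ...
b.mu.Unlock() // release before signalling; updateNotifyLoop needs b.mu.RLock

b.updateAddrsC <- struct{}{} // the notify loop can now run and receive this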

b.pinAddrCond.Wait()
}

b.host2ep[prevPinAddr] = prevEp
Contributor

This could corrupt the balancer endpoint list.

  1. start trySwitchEndpoint, delete prevPinAddr "a" from the addr list
  2. some other goroutine calls client.SetEndpoints("b", "c")
  3. trySwitchEndpoint gets notification from pinAddrCond, puts prevEp back into endpoint list.
  4. client.Endpoints() == {"a", "b", "c"}

Contributor Author

OK, I think this is the main reason this implementation isn't good. A better way, I think, is not to modify the balancer's endpoint list, but to erase the pinAddr and then put (endpoints - pinAddr) directly onto notifyCh, waiting for grpc to Up() one of them. Is this OK?
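A sketch of that alternative, again only illustrative, using the old grpc.Balancer API and the field names from this PR (notifyAllButPinned is a hypothetical name): leave host2ep untouched and publish every address except the pinned one on notifyCh, so grpc tears down the pinned connection and dials, then Up()s, one of the others.

func (b *simpleBalancer) notifyAllButPinned() {
	b.mu.RLock()
	addrs := make([]grpc.Address, 0, len(b.addrs))
	for _, a := range b.addrs {
		if a.Addr != b.pinAddr {
			addrs = append(addrs, a)
		}
	}
	b.mu.RUnlock()

	if len(addrs) == 0 {
		return // single endpoint: nothing to switch to
	}
	b.notifyCh <- addrs // grpc will drop the pinned conn and dial the rest
}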

@gyuho
Contributor

gyuho commented Sep 21, 2017

@HardySimpson Closing in favor of #8545.
Thanks!

@gyuho gyuho closed this Sep 21, 2017