
Clarify/define clientv3 retry logic (fix broken retries) #8691

Closed
gyuho opened this issue Oct 12, 2017 · 9 comments
gyuho commented Oct 12, 2017

Now that retry logic is such a critical part of our balancer logic, we need to document clearly when the Go client retries its RPCs; currently we have no such documentation, other than some comments in clientv3/retry.go. This should also be helpful for other client language bindings.

release-3.2, as of f1d7dd8

Mutable and immutable RPCs share the same error handling logic (sketched in Go below):

  1. if no error is returned, then no retry
  2. if the error is of rpctypes.EtcdError type, then no retry
    • e.g. rpctypes.ErrEmptyKey, rpctypes.ErrNoSpace, rpctypes.ErrTimeout
  3. if the error is of grpc/status.(*statusError) type and its error code is codes.Unavailable (meaning the service is not currently available), then retry

rpctypes.EtcdError and grpc/status.(*statusError) errors are mutually exclusive.
This only works with grpc/grpc-go v1.2.1.
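For illustration, here is a minimal sketch of this decision in Go; shouldRetry is a hypothetical helper written against a recent grpc-go status API, not the actual clientv3 implementation:

```go
package retry

import (
	"github.com/coreos/etcd/etcdserver/api/v3rpc/rpctypes"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// shouldRetry mirrors the three rules above (hypothetical helper).
func shouldRetry(err error) bool {
	if err == nil {
		return false // 1. success: nothing to retry
	}
	if _, ok := err.(rpctypes.EtcdError); ok {
		return false // 2. etcd server error (e.g. ErrEmptyKey): not transient
	}
	// 3. retry only when gRPC reports the service as unavailable
	if s, ok := status.FromError(err); ok && s.Code() == codes.Unavailable {
		return true
	}
	return false
}
```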

master branch, as of 764a0f7

grpc/grpc-go was upgraded to >v1.6.x, which changed the error handling behavior.

During network disconnections, the error clientv3.ErrNoAddrAvilable can occur; its error code is codes.Unavailable, so it should be retried (it is equivalent to grpc.Errorf(codes.Unavailable, "there is no address available")).

However, due to the change in grpc-go, transport.ErrStreamDrain can also be returned. etcd mutable RPCs must not be retried on transport.ErrStreamDrain; they should be retried only on clientv3.ErrNoAddrAvilable. This was fixed via #8335 (fixing "put at most once"; not in 3.2).

In addition, with the health checking balancer (#8545), the retry error handling logic is now as follows (see the sketch after this list):

  1. if no error is returned, then no retry (regardless of RPC type)
  2. for immutable requests, if the error is of grpc/status.(*statusError) type and its error code is codes.Unavailable, then mark unhealthy, switch endpoints, wait for a connection notify, and retry
  3. for mutable requests, if the error is of grpc/status.(*statusError) type, its error code is codes.Unavailable, and the error message is "there is no address available", then mark unhealthy, switch endpoints, wait for a connection notify, and retry
  4. for immutable requests, if the error is of rpctypes.EtcdError type, then mark unhealthy, switch endpoints, and exit
    • e.g. rpctypes.ErrEmptyKey, rpctypes.ErrNoSpace, rpctypes.ErrTimeout
  5. for mutable requests, if the error is of rpctypes.EtcdError type, then mark unhealthy, switch endpoints, and exit (though proper handling is still missing)
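For illustration, a sketch of this combined decision; shouldRetryRPC is a hypothetical helper (the endpoint-switching and health-marking side effects are omitted), not the actual clientv3 code:

```go
package retry

import (
	"github.com/coreos/etcd/etcdserver/api/v3rpc/rpctypes"
	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

// shouldRetryRPC mirrors rules 1-5 above (hypothetical helper); marking
// the endpoint unhealthy and switching endpoints happen in the balancer
// and are not shown here.
func shouldRetryRPC(err error, mutable bool) bool {
	if err == nil {
		return false // 1. success, regardless of RPC type
	}
	if _, ok := err.(rpctypes.EtcdError); ok {
		return false // 4./5. etcd server error: exit instead of retrying
	}
	s, ok := status.FromError(err)
	if !ok || s.Code() != codes.Unavailable {
		return false
	}
	if !mutable {
		return true // 2. immutable RPCs retry on any Unavailable error
	}
	// 3. mutable RPCs retry only when the request never reached a server
	return s.Message() == "there is no address available"
}
```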

TODOs


xiang90 commented Oct 24, 2017

is this fixed?


gyuho commented Oct 24, 2017

Yes.

gyuho closed this as completed Oct 24, 2017

devnev commented Dec 5, 2017

Hi. The Watch client still contains a FailFast=false with a TODO to switch it to FailFast=true. Was that meant to be part of this issue, or is there an issue tracking that elsewhere?


gyuho commented Dec 5, 2017

@devnev We've disabled FailFast=true to all RPCs, replaced by an internal, manual retry on the clientv3 side.

We've covered some network fault cases in #8711, if that helps.

For watch, clientv3 (>=v3.2.10) should be able to switch and retry when a server fails.


devnev commented Dec 5, 2017

We've disabled FailFast=true to all RPCs

@gyuho that is incorrect, see https://github.com/coreos/etcd/blob/master/clientv3/watch.go#L778


devnev commented Dec 5, 2017

A search of the codebase also reveals a FailFast=false in the grpcproxy.


gyuho commented Dec 5, 2017

@devnev Indeed.

I meant to say

We've disabled FailFast=false

I somehow got confused.

So we use FailFast=false for watch, but FailFast=true for the other RPCs, which disables gRPC-side retry. Note that watch has a for-loop for manual retries, since we assume gRPC's retry logic is not stable enough.
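For illustration, a minimal sketch of what this per-RPC split looks like with grpc-go's FailFast call option; demoFailFast is a hypothetical helper, not actual etcd code:

```go
package retry

import (
	"context"

	pb "github.com/coreos/etcd/etcdserver/etcdserverpb"
	"google.golang.org/grpc"
)

// demoFailFast is a hypothetical helper showing the split described above:
// unary RPCs fail fast so clientv3's own retry loop can react, while the
// Watch stream waits for a ready connection.
func demoFailFast(ctx context.Context, conn *grpc.ClientConn) error {
	kv := pb.NewKVClient(conn)
	// FailFast(true): return codes.Unavailable immediately when no
	// endpoint is reachable, instead of blocking.
	if _, err := kv.Range(ctx, &pb.RangeRequest{Key: []byte("foo")}, grpc.FailFast(true)); err != nil {
		return err
	}
	w := pb.NewWatchClient(conn)
	// FailFast(false): stream creation blocks until a connection is
	// ready, which is why a broken watch can hang with no error.
	_, err := w.Watch(ctx, grpc.FailFast(false))
	return err
}
```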


devnev commented Dec 5, 2017

For watches in particular, FailFast=false is problematic, as they usually do not have RPC timeouts. In our case the connections are failing for extended periods of time, but we cannot set a timeout on the request because it is a watch. However, because of FailFast=false, we also get no indication that the watch is in fact broken.


gyuho commented Dec 5, 2017

The etcd watch API is not meant for detecting connection issues; disconnects are handled in the client balancer layer. We've added HTTP/2 keepalive and client balancer health checking (only available in >= v3.2.10).

Please try the HTTP/2 keepalive ping. If it still doesn't work, please open a new issue.
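For reference, a minimal sketch of enabling the keepalive ping through clientv3.Config; the field names assume an etcd client >= v3.2.10, and the durations are placeholder values:

```go
package main

import (
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

func main() {
	// DialKeepAliveTime/DialKeepAliveTimeout enable HTTP/2 keepalive
	// pings so dead connections are detected even on idle watches.
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:            []string{"localhost:2379"},
		DialTimeout:          5 * time.Second,
		DialKeepAliveTime:    10 * time.Second, // ping after 10s of inactivity
		DialKeepAliveTimeout: 3 * time.Second,  // drop the connection if no ack within 3s
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()
}
```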
