
clientv3: Lessor can abort the main loop and keep failing #7880

Closed
yudai opened this issue May 5, 2017 · 8 comments

yudai (Contributor) commented May 5, 2017

Issue Observed

lessor.KeepAlive() keeps returning errors after server-side trouble.

In my case, after some trouble, it kept returning the error "leases keep alive halted: rpc error: code = 11 desc = EOF" for every KeepAlive() call.

Issue Detail

I have a private gRPC proxy between clients and the etcd server, which performs some authentication on every request.

After something happened (still investigating what), clients started getting this error from lessor.KeepAlive() and never recovered.

lessor.recvKeepAliveLoop(), which is the main loop of the lessor, returns when it gets an error that IsHaltErr() considers a halt error. Once it has returned, there is no main loop running (and lessor.donec is closed in the deferred func), so KeepAlive() returns with the error.

IsHaltErr() returns true for context cancelation or timeout, which makes sense. However, for gRPC errors, only codes.Unavailable and codes.Internal are considered retryable, so any other gRPC error returned by a server makes the function return true, even when the error is actually retryable.

In my case, the gRPC error with code 11 (OutOfRange) was considered critical enough to halt, and the main loop returned; even after everything recovered, the client kept failing on KeepAlive().
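
For reference, here is a minimal sketch of the halt/retry decision described above (not the exact etcd source; the package name and comments are mine): the user's own context errors halt the loop, and only codes.Unavailable and codes.Internal are retried, so any other server-side code, including code 11, ends the keep-alive loop.

```go
// Minimal sketch of the halt/retry decision described above; not the exact
// etcd source, just the logic as I understand it.
package sketch

import (
	"context"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

func isHaltErr(ctx context.Context, err error) bool {
	// The user's own cancelation or timeout always halts the loop.
	if ctx != nil && ctx.Err() != nil {
		return true
	}
	if err == nil {
		return false
	}
	// Only Unavailable and Internal are treated as retryable; every other
	// gRPC code (e.g. 1/Canceled from the proxy, or 11 in my case) halts.
	code := status.Code(err)
	return code != codes.Unavailable && code != codes.Internal
}
```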

You can reproduce the same issue by using the grpcproxy. It sometimes returns context.Canceled, which grpc-go surfaces as a code 1 error.

The main server code in v3api might return halt errors as well, but I'm not sure about this.

Suggested Resolution

It's probably hard to enumerate all recoverable errors in IsHaltErr(), and this makes it hard for third parties to implement proxies for etcd (even the official grpcproxy has this issue).

I therefore think the main loop should not return on any server-side error. The current client library is supposed to be resilient to server-side errors (it retries automatically), so the loop should return only on the user's context cancel (or timeout).

I'm not sure whether it's better to simply remove the gRPC error conditions from IsHaltErr(), which is also used by some other methods, or to add a new function like IsUserCancel(err).
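
To illustrate, here is a hypothetical sketch of what such an IsUserCancel(err) helper could look like; the name and shape are only my suggestion, not an existing clientv3 API:

```go
package sketch

import "context"

// IsUserCancel would report whether err comes from the caller's own context
// (cancel or timeout) rather than from the server, so the keep-alive loop
// could halt only in that case and keep retrying everything else.
// Hypothetical helper; not part of clientv3 today.
func IsUserCancel(err error) bool {
	return err == context.Canceled || err == context.DeadlineExceeded
}
```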

I'll create a PR if this should be resolved on the client side.
Otherwise, I think the grpcproxy should at least be fixed so that it does not return context errors.

Appendix: Workaround

  • Recreate the client (see the sketch below).
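
A rough sketch of the workaround, assuming the current 3.1 client as-is: when KeepAlive starts failing with the halt error, drop the client and create a new one. The endpoint, TTL, and retry interval below are placeholders.

```go
package main

import (
	"context"
	"errors"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

// keepLeaseAlive grants a lease and keeps it alive until the stream dies or
// ctx is canceled. The caller recreates the client when this returns an error.
func keepLeaseAlive(ctx context.Context) error {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		return err
	}
	defer cli.Close()

	lease, err := cli.Grant(ctx, 10) // 10s TTL, placeholder
	if err != nil {
		return err
	}
	ch, err := cli.KeepAlive(ctx, lease.ID)
	if err != nil {
		return err // e.g. "leases keep alive halted: ..." once the loop is gone
	}
	for range ch {
		// keep-alive responses arrive here until the channel is closed,
		// i.e. until the lessor's loop halts or ctx is canceled
	}
	if ctx.Err() != nil {
		return ctx.Err()
	}
	return errors.New("keep-alive channel closed")
}

func main() {
	ctx := context.Background()
	for {
		if err := keepLeaseAlive(ctx); err != nil {
			log.Println("keep-alive failed, recreating client:", err)
		}
		time.Sleep(time.Second) // placeholder backoff before recreating
	}
}
```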
xiang90 (Contributor) commented May 5, 2017

@heyitsanthony

I read our doc on the Lease API; we actually say nothing about the client behavior: https://github.com/coreos/etcd/blob/master/clientv3/lease.go#L91-L111.

We also say nothing about the client behavior for haltErr: https://github.com/coreos/etcd/blob/master/clientv3/lease.go#L76-L81.

We probably can change this behavior without breaking our "contract".

heyitsanthony (Contributor) commented:

OK, I can change the interface for that other one.

yudai (Contributor, Author) commented May 5, 2017

Update: the comment below was incorrect. Watch() recreates the watchGrpcStream when it's not running.

Watch might have the same issue as well. watchGrpcStream.run() can return on an error from Recv().

https://github.com/coreos/etcd/blob/master/clientv3/watch.go#L484-L487

yudai (Contributor, Author) commented May 5, 2017

Created a PR anyway.

I can send another PR to backport the same fix to the release-3.1 branch if needed (yudai@8a4bb15).

xiang90 (Contributor) commented May 9, 2017

Fixed by #7890

xiang90 closed this as completed May 9, 2017
yudai (Contributor, Author) commented May 9, 2017

@xiang90 Just a question: is there any page available online that explains the maintenance cycle of releases? I'm wondering whether this fix will be backported in a 3.1.8 release (or whether the next release will be 3.2 with no 3.1.8). Since 3.2 seems to include some backward-incompatible changes, we are thinking we might need to maintain the 3.1 release ourselves for a while if 3.1.7 turns out to be the final 3.1.x version.

gyuho (Contributor) commented May 9, 2017

@yudai We have a release policy here: https://github.com/coreos/etcd/blob/master/Documentation/dev-internal/release.md#patch-version-release, and we follow http://semver.org as much as we can.

Your patch will be in our next 3.2.0-rc.1+ (planned for this Friday), but it will likely not be backported to 3.1.x, since we have already made some breaking changes in recvKeepAlive(resp *pb.LeaseKeepAliveResponse).

/cc @heyitsanthony

yudai (Contributor, Author) commented May 9, 2017

@gyuho Thanks for the reference.
OK, maybe we will maintain our own 3.1-based branch for a while.
