
clientv3: Lessor can abort the main loop and keep failing #7880

Closed
yudai opened this issue May 5, 2017 · 8 comments

yudai (Contributor) commented May 5, 2017

Issue Observed

lessor.KeepAlive() keeps returning errors after server-side trouble.

In my case, after some trouble, it kept returning the error "leases keep alive halted: rpc error: code = 11 desc = EOF" for every KeepAlive() call.

Issue Detail

I have a private gRPC proxy between clients and the etcd server, which performs some authentication on every request.

After something happened (still investigating what), clients started getting this error from lessor.KeepAlive() and never recovered.

lessor.recvKeepAliveLoop(), which is the main loop of the lessor, returns when it gets an error that IsHaltErr() considers a halt error. Once it has returned, there is no main loop running (and lessor.donec is closed in the deferred func), so KeepAlive() returns with the error.

IsHaltErr() returns true for context cancelation or timeout, which makes sense. However, for gRPC errors, only codes.Unavailable and codes.Internal are considered retryable, so any other gRPC error returned by a server makes the function return true, even when the error is actually retryable.

In my case, the gRPC error with code 11 (OutOfRange) was considered critical enough to halt, and the main loop returned; even after everything recovered, the client kept failing on KeepAlive().
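
For reference, here is a minimal sketch of the halt/retry decision described above (not the exact etcd source; the package name and comments are mine): the user's own context errors halt the loop, and only codes.Unavailable and codes.Internal are retried, so any other server-side code, including code 11, ends the keep-alive loop.

```go
// Minimal sketch of the halt/retry decision described above; not the exact
// etcd source, just the logic as I understand it.
package sketch

import (
	"context"

	"google.golang.org/grpc/codes"
	"google.golang.org/grpc/status"
)

func isHaltErr(ctx context.Context, err error) bool {
	// The user's own cancelation or timeout always halts the loop.
	if ctx != nil && ctx.Err() != nil {
		return true
	}
	if err == nil {
		return false
	}
	// Only Unavailable and Internal are treated as retryable; every other
	// gRPC code (e.g. 1/Canceled from the proxy, or 11 in my case) halts.
	code := status.Code(err)
	return code != codes.Unavailable && code != codes.Internal
}
```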

You can reproduce the same issue by using the grpcproxy. It sometimes returns context.Canceled, which grpc-go surfaces as a code 1 error.

The main server code in v3api might return halt errors as well, but I'm not sure about this.

Suggested Resolution

It's probably hard to enumerate all recoverable errors in IsHaltErr(), and this makes it hard for third parties to implement proxies for etcd (even the official grpcproxy has this issue).

I therefore think the main loop should not return on any server-side error. The current client library is supposed to be resilient to server-side errors (it retries automatically), so the loop should return only on the user's context cancel (or timeout).

I'm not sure whether it's better to simply remove the gRPC error conditions from IsHaltErr(), which is also used by some other methods, or to add a new function like IsUserCancel(err).
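
To illustrate, here is a hypothetical sketch of what such an IsUserCancel(err) helper could look like; the name and shape are only my suggestion, not an existing clientv3 API:

```go
package sketch

import "context"

// IsUserCancel would report whether err comes from the caller's own context
// (cancel or timeout) rather than from the server, so the keep-alive loop
// could halt only in that case and keep retrying everything else.
// Hypothetical helper; not part of clientv3 today.
func IsUserCancel(err error) bool {
	return err == context.Canceled || err == context.DeadlineExceeded
}
```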

I'll create a PR if this should be resolved on the client side.
Otherwise, I think the grpcproxy should at least be fixed so that it does not return context errors.

Appendix: Workaround

  • Recreate the client (see the sketch below).
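
A rough sketch of the workaround, assuming the current 3.1 client as-is: when KeepAlive starts failing with the halt error, drop the client and create a new one. The endpoint, TTL, and retry interval below are placeholders.

```go
package main

import (
	"context"
	"errors"
	"log"
	"time"

	"github.com/coreos/etcd/clientv3"
)

// keepLeaseAlive grants a lease and keeps it alive until the stream dies or
// ctx is canceled. The caller recreates the client when this returns an error.
func keepLeaseAlive(ctx context.Context) error {
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"127.0.0.1:2379"}, // placeholder endpoint
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		return err
	}
	defer cli.Close()

	lease, err := cli.Grant(ctx, 10) // 10s TTL, placeholder
	if err != nil {
		return err
	}
	ch, err := cli.KeepAlive(ctx, lease.ID)
	if err != nil {
		return err // e.g. "leases keep alive halted: ..." once the loop is gone
	}
	for range ch {
		// keep-alive responses arrive here until the channel is closed,
		// i.e. until the lessor's loop halts or ctx is canceled
	}
	if ctx.Err() != nil {
		return ctx.Err()
	}
	return errors.New("keep-alive channel closed")
}

func main() {
	ctx := context.Background()
	for {
		if err := keepLeaseAlive(ctx); err != nil {
			log.Println("keep-alive failed, recreating client:", err)
		}
		time.Sleep(time.Second) // placeholder backoff before recreating
	}
}
```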
xiang90 (Contributor) commented May 5, 2017

@heyitsanthony

I read our doc on the Lease API; we actually say nothing about the client behavior: https://github.com/coreos/etcd/blob/master/clientv3/lease.go#L91-L111.

We also say nothing about the client behavior for haltErr: https://github.com/coreos/etcd/blob/master/clientv3/lease.go#L76-L81.

We probably can change this behavior without breaking our "contract".

heyitsanthony (Contributor) commented:

OK, I can change the interface for that other one.

yudai (Contributor, Author) commented May 5, 2017

Update: the comment below was incorrect. Watch() recreates the watchGrpcStream when it's not running.

Watch might have the same issue as well. watchGrpcStream.run() can return on an error from Recv().

https://github.com/coreos/etcd/blob/master/clientv3/watch.go#L484-L487

yudai (Contributor, Author) commented May 5, 2017

Created a PR anyway.

I can send another PR to backport the same fix to the release-3.1 branch if needed (yudai@8a4bb15).

xiang90 (Contributor) commented May 9, 2017

Fixed by #7890

xiang90 closed this as completed May 9, 2017
yudai (Contributor, Author) commented May 9, 2017

@xiang90 Just a question: is there any page available online that explains the maintenance cycle of releases? I'm wondering whether this fix will be backported in a 3.1.8 release (or whether the next release will be 3.2 with no 3.1.8). Since 3.2 seems to include some backward-incompatible changes, we are thinking we might need to maintain the 3.1 release ourselves for a while if 3.1.7 turns out to be the final 3.1.x version.

gyuho (Contributor) commented May 9, 2017

@yudai We have a release policy here: https://github.com/coreos/etcd/blob/master/Documentation/dev-internal/release.md#patch-version-release, and we follow http://semver.org as much as we can.

Your patch will be in our next 3.2.0-rc.1+ (planned for this Friday), but it will likely not be backported to 3.1.x, since we have already made some breaking changes in recvKeepAlive(resp *pb.LeaseKeepAliveResponse).

/cc @heyitsanthony

yudai (Contributor, Author) commented May 9, 2017

@gyuho Thanks for the reference.
OK, maybe we will maintain our own 3.1-based branch for a while.
