Remove infinite loop in doSerialize #10218
Conversation
Welcome @horkhe! It looks like this is your first PR to etcd-io/etcd 🎉🎉
Hi @horkhe. Thanks for your PR. I'm waiting for an etcd-io member to verify that this patch is reasonable to test. If it is, they should reply accordingly. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Codecov Report
@@            Coverage Diff            @@
##           master   #10218     +/-   ##
==========================================
- Coverage   71.64%    71.5%   -0.15%
==========================================
  Files         390      390
  Lines       36369    36367       -2
==========================================
- Hits        26056    26003      -53
- Misses       8493     8548      +55
+ Partials     1820     1816       -4
Continue to review full report at Codecov.
/lgtm
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: jingyih. If they are not already assigned, you can assign the PR to them by writing the assign command. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing the approve command.
I would like someone to explore more whether we can retry correctly on the server side. If there is no easy way to do it, we can let the client handle it. @gyuho Only people in the maintainer list should be able to put the lgtm label. It seems that everyone in the org can do this today, which is not what we want.
@xiang90 Yeah, think
@xiang90 oh, I see, we actually separate the approve and lgtm labels like k8s does... then it is fine...
@xiang90 do you want me to investigate possible server-side retries? Honestly, I do not understand why you need to check whether the version is different at all. Unless you are shooting for the case where a role had some permissions revoked. In that case
I looked a bit more and it turned out that
@horkhe thanks for your PR. It seems that continuing in the case of In addition, the version number validation in this line (https://github.com/etcd-io/etcd/pull/10218/files#diff-a5a4bca15b031f18356513fe1382c3c7L560) should be performed because
@mitake my question is why does it matter what the version is? IMHO, as long as this range check passes, it should be irrelevant what the role version was when the session was authenticated, should it not?
By the way, we checked: the client does not re-authenticate on
@horkhe sorry for my late reply.
The motivation of the version number validation is a little bit complicated. I'd like to share it with you, hopefully this weekend. Is this ok?
@mitake sure, whenever it is convenient for you. Now that we are running a patched etcd version we are not in a rush :-).
@horkhe Sorry for my late reply. I'd like to explain why the version number validation is required. If every operation is executed in a serialized manner, nothing bad will happen. Basically the state machine of etcd executes its commands in a serialized manner, so no problems seem to be possible. However, we cannot have a guarantee about the order of steps 2 and 3. This is because, to avoid auth overhead (especially bcrypt's hash value calculation), etcd executes step 2 outside of the state machine loop. So X's token can be stale in step 3 if the authorization in step 1 is executed in parallel with step 2. To avoid serving requests with a token based on a stale auth store state, the version number check is required. The staleness can be checked by comparing the version number in the token (the number of the auth store's state when the token was generated) and the current version number of the auth store. If they are different, it means the token is invalid. Probably the best way of handling this is simply rejecting a request with a stale token and letting the client authenticate again. What do you think?
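(A minimal sketch of the staleness check described above; the names here are illustrative only and are not the actual etcd identifiers.)

```go
// tokenIsFresh reports whether a request's auth token was issued against the
// auth store's current state. Illustrative names, not etcd's real API.
func tokenIsFresh(tokenRevision, authStoreRevision uint64) bool {
	// The token records the auth store's state number at Authenticate() time;
	// any later auth config change bumps the store's number, so a mismatch
	// means the token is stale and the client should re-authenticate.
	return tokenRevision == authStoreRevision
}
```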
@mitake if I got you right, the point of version checking is to reject access to data via sessions authenticated with old passwords? Well, in that case version checking is indeed needed. The solution proposed in this PR solves the first part: it rejects requests made via sessions with stale auth tokens, but the client does not handle those errors internally, it just returns them to end users. So the client code also needs to be updated. I suggest you guys develop the solution on the client side. I do not know the code well enough for that.
@horkhe yes, but you should not remove the validation starting from this line: https://github.com/etcd-io/etcd/pull/10218/files#diff-a5a4bca15b031f18356513fe1382c3c7L559
@mitake I can put it back. But honestly I see no point. Consider this: the version of the store changes a nanosecond after the second check, how is it different? The second check only adds a handful of reads to be rejected, those that happen between the first and the second check, within an interval of a few milliseconds. I would say: why bother. Any complication, albeit small, is still a complication... but it is your call and I will put it back.
Once chk(ai) fails with auth.ErrAuthOldRevision it will always fail, regardless of how many times you retry. So the error is better returned, to fail the pending request and make the client re-authenticate.
(branch updated from a420892 to 91e583c)
@mitake are there any more comments that I need to address?
@horkhe sorry for my late reply. Yeah, it is a very corner case. But I think having the validation mechanism will make reasoning safer because it keeps the linearizable semantics. It makes the behavior of etcd predictable; e.g. a combination of an auth config update and network partitioning can otherwise allow reading keys with stale permissions (the schedule would be like this: 1. the client issues Authenticate() successfully, 2. an etcd node is partitioned from the majority of nodes, 3. the auth config is updated on the majority of the nodes, 4. the client issues a serializable read to the partitioned node). In addition, its runtime cost is very cheap. I'll review your recent change later. Sorry for keeping you waiting.
@horkhe sorry for my very late reply... I think this can be merged. The failed CI wouldn't be related to this change. I reran Travis.
@mitake thank you.
CI failure is not related. So merging. |
@xiang90 thanks a lot. Well, now the client code needs to be fixed to reconnect on
@horkhe Can you fix it directly?
@xiang90 sorry, but I am not familiar with that part of the code, it has been a while since I looked into it, and I have moved on since then.
Thank you very much @mitake. |
Problem
If authentication is enabled, then changing permissions of a role while there are active readers/writers authenticated with that role can cause etcd nodes to consume 100% of CPU.
Root cause
Consider the following function:
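(The original code block is missing here; what follows is a rough sketch of the loop under discussion, paraphrased from the description below rather than copied from the etcd source. Helper and field names such as AuthInfoFromCtx, ai.Revision, and s.authStore.Revision() are assumptions.)

```go
// Sketch only: a serialized (local) read goes through chk before get() is called.
func (s *EtcdServer) doSerialize(ctx context.Context, chk func(*auth.AuthInfo) error, get func()) error {
	for {
		// ai comes from the request's auth token and carries the version number
		// recorded when the connection was authenticated; re-reading it from the
		// context yields the same value, so it never changes inside this loop.
		ai, err := s.AuthInfoFromCtx(ctx)
		if err != nil {
			return err
		}
		if ai == nil {
			ai = &auth.AuthInfo{} // request without credentials
		}
		if err = chk(ai); err != nil {
			if err == auth.ErrAuthOldRevision {
				continue // ai is stale and stays stale: this retry can never succeed
			}
			return err
		}
		get() // serve the serialized request
		// "second check": return only if the auth store has not moved on since
		// the token was issued; otherwise go around again.
		if ai.Revision == 0 || ai.Revision == s.authStore.Revision() {
			return nil
		}
	}
}
```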
Synopsis: Once chk(ai) fails with auth.ErrAuthOldRevision it will always fail with the same error regardless of how many times you retry.
Details: ai is retrieved from the incoming request auth token and has the version number of the role at the time when the connection was authenticated, so while in this loop this number never changes. On the other hand, chk returns auth.ErrAuthOldRevision when the ai version number is not the same as the role version in s.authStore. The role version is incremented every time the role is updated. So the ai version is constant and the role version from s.authStore can only increase, hence the check will always fail.
Solution
Just return the error and let the client deal with it by re-authenticating.
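(In terms of the sketch above, the change amounts to dropping the retry loop and propagating the error; again a sketch under the same assumed names, not the exact diff of this PR.)

```go
// Sketch of the behavior after this change: no retry loop, errors are propagated.
func (s *EtcdServer) doSerialize(ctx context.Context, chk func(*auth.AuthInfo) error, get func()) error {
	ai, err := s.AuthInfoFromCtx(ctx)
	if err != nil {
		return err
	}
	if ai == nil {
		ai = &auth.AuthInfo{} // request without credentials
	}
	if err = chk(ai); err != nil {
		// auth.ErrAuthOldRevision included: the pending request fails and the
		// client is expected to re-authenticate and retry on its own.
		return err
	}
	get() // serve the serialized request
	return nil
}
```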