Remove infinite loop in doSerialize #10218
Conversation
Welcome @horkhe! It looks like this is your first PR to etcd-io/etcd 🎉🎉
Hi @horkhe. Thanks for your PR. I'm waiting for an etcd-io member to verify that this patch is reasonable to test. If it is, they should reply accordingly. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
Codecov Report
@@            Coverage Diff            @@
##           master   #10218     +/-   ##
==========================================
- Coverage   71.64%    71.5%   -0.15%
==========================================
  Files         390      390
  Lines       36369    36367       -2
==========================================
- Hits        26056    26003      -53
- Misses       8493     8548      +55
+ Partials     1820     1816       -4
Continue to review full report at Codecov.
/lgtm
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull request has been approved by: jingyih. If they are not already assigned, you can assign the PR to them by writing the assign command. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing the approve command.
I would like someone to explore more whether we can retry correctly on the server side. If there is no easy way to do it, we can let the client handle it. @gyuho Only people in the maintainer list should be able to put the lgtm label. It seems that everyone in the org can do this today, which is not what we want.
@xiang90 Yeah, think
@xiang90 oh, I see, we actually separate the approve and lgtm labels like k8s does... then it is fine...
@xiang90 do you want me to investigate possible server-side retries? Honestly, I do not understand why you need to check whether the version is different at all. Unless you are shooting for the case where a role had some permissions revoked. In that case
I looked a bit more and it turned out that
@horkhe thanks for your PR. It seems that continuing in the case of In addition, the version number validation in this line (https://github.com/etcd-io/etcd/pull/10218/files#diff-a5a4bca15b031f18356513fe1382c3c7L560) should be performed because
@mitake my question is why does it matter what the version is? IMHO, as long as this range check passes, it should be irrelevant what the role version was when the session was authenticated, should it not?
By the way, we checked: the client does not re-authenticate on
@horkhe sorry for my late reply.
The motivation of the version number validation is a little bit complicated. I'd like to share it with you, hopefully this weekend. Is this ok?
@mitake sure, whenever it is convenient for you. Now that we are running a patched etcd version we are not in a rush :-).
@horkhe Sorry for my late reply. I'd like to explain why the version number validation is required. If every operation is executed in a serialized manner, nothing bad will happen. Basically the state machine of etcd executes its commands in a serialized manner, so no problems seem to be possible. However, we cannot have a guarantee about the order of steps 2 and 3. This is because, to avoid auth overhead (especially bcrypt's hash value calculation), etcd executes step 2 outside of the state machine loop. So X's token can be stale in step 3 if the authorization in step 1 is executed in parallel with step 2. To avoid serving requests with a token based on a stale auth store state, the version number check is required. The staleness can be checked by comparing the version number in the token (the number of the auth store's state when the token was generated) and the current version number of the auth store. If they are different, it means the token is invalid. Probably the best way of handling this is simply rejecting a request with a stale token and letting the client authenticate again. What do you think?
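(A minimal sketch of the staleness check described above; the names here are illustrative only and are not the actual etcd identifiers.)

```go
// tokenIsFresh reports whether a request's auth token was issued against the
// auth store's current state. Illustrative names, not etcd's real API.
func tokenIsFresh(tokenRevision, authStoreRevision uint64) bool {
	// The token records the auth store's state number at Authenticate() time;
	// any later auth config change bumps the store's number, so a mismatch
	// means the token is stale and the client should re-authenticate.
	return tokenRevision == authStoreRevision
}
```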
@mitake if I got you right, the point of version checking is to reject access to data via sessions authenticated with old passwords? Well, in that case version checking is indeed needed. The solution proposed in this PR solves the first part: it rejects requests made via sessions with stale auth tokens, but the client does not handle those errors internally, it just returns them to end users. So the client code also needs to be updated. I suggest you guys develop the solution on the client side. I do not know the code well enough for that.
@horkhe yes, but you should not remove the validation starting from this line: https://github.com/etcd-io/etcd/pull/10218/files#diff-a5a4bca15b031f18356513fe1382c3c7L559
@mitake I can put it back. But honestly I see no point. Consider this: the version of the store changes a nanosecond after the second check, how is it different? The second check only adds a handful of reads to be rejected, those that happen between the first and the second check, within an interval of a few milliseconds. I would say: why bother. Any complication, albeit small, is still a complication... but it is your call and I will put it back.
Once chk(ai) fails with auth.ErrAuthOldRevision it will always fail, regardless of how many times you retry. So the error is better returned, to fail the pending request and make the client re-authenticate.
(branch updated from a420892 to 91e583c)
@mitake are there any more comments that I need to address?
@horkhe sorry for my late reply. Yeah, it is a very corner case. But I think having the validation mechanism will make reasoning safer because it keeps the linearizable semantics. It makes the behavior of etcd predictable; e.g. a combination of an auth config update and network partitioning can otherwise allow reading keys with stale permissions (the schedule would be like this: 1. the client issues Authenticate() successfully, 2. an etcd node is partitioned from the majority of nodes, 3. the auth config is updated on the majority of the nodes, 4. the client issues a serializable read to the partitioned node). In addition, its runtime cost is very cheap. I'll review your recent change later. Sorry for keeping you waiting.
@horkhe sorry for my very late reply... I think this can be merged. The failed CI wouldn't be related to this change. I reran Travis.
@mitake thank you.
CI failure is not related. So merging. |
@xiang90 thanks a lot. Well, now the client code needs to be fixed to reconnect on
@horkhe Can you fix it directly?
@xiang90 sorry, but I am not familiar with that part of the code, it has been a while since I looked into it, and I have moved on since then.
Thank you very much @mitake. |
Problem
If authentication is enabled, then changing permissions of a role while there are active readers/writers authenticated with that role can cause etcd nodes to consume 100% of CPU.
Root cause
Consider the following function:
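(The original code block is missing here; what follows is a rough sketch of the loop under discussion, paraphrased from the description below rather than copied from the etcd source. Helper and field names such as AuthInfoFromCtx, ai.Revision, and s.authStore.Revision() are assumptions.)

```go
// Sketch only: a serialized (local) read goes through chk before get() is called.
func (s *EtcdServer) doSerialize(ctx context.Context, chk func(*auth.AuthInfo) error, get func()) error {
	for {
		// ai comes from the request's auth token and carries the version number
		// recorded when the connection was authenticated; re-reading it from the
		// context yields the same value, so it never changes inside this loop.
		ai, err := s.AuthInfoFromCtx(ctx)
		if err != nil {
			return err
		}
		if ai == nil {
			ai = &auth.AuthInfo{} // request without credentials
		}
		if err = chk(ai); err != nil {
			if err == auth.ErrAuthOldRevision {
				continue // ai is stale and stays stale: this retry can never succeed
			}
			return err
		}
		get() // serve the serialized request
		// "second check": return only if the auth store has not moved on since
		// the token was issued; otherwise go around again.
		if ai.Revision == 0 || ai.Revision == s.authStore.Revision() {
			return nil
		}
	}
}
```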
Synopsis: Once chk(ai) fails with auth.ErrAuthOldRevision it will always fail with the same error regardless of how many times you retry.
Details: ai is retrieved from the incoming request auth token and has the version number of the role at the time when the connection was authenticated, so while in this loop this number never changes. On the other hand, chk returns auth.ErrAuthOldRevision when the ai version number is not the same as the role version in s.authStore. The role version is incremented every time the role is updated. So the ai version is constant and the role version from s.authStore can only increase, hence the check will always fail.
Solution
Just return the error and let the client deal with it by re-authenticating.
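(In terms of the sketch above, the change amounts to dropping the retry loop and propagating the error; again a sketch under the same assumed names, not the exact diff of this PR.)

```go
// Sketch of the behavior after this change: no retry loop, errors are propagated.
func (s *EtcdServer) doSerialize(ctx context.Context, chk func(*auth.AuthInfo) error, get func()) error {
	ai, err := s.AuthInfoFromCtx(ctx)
	if err != nil {
		return err
	}
	if ai == nil {
		ai = &auth.AuthInfo{} // request without credentials
	}
	if err = chk(ai); err != nil {
		// auth.ErrAuthOldRevision included: the pending request fails and the
		// client is expected to re-authenticate and retry on its own.
		return err
	}
	get() // serve the serialized request
	return nil
}
```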