After re-Adding a Lost Node, etcdserver reports "invalid auth token" #9629

Closed
aauren opened this issue Apr 25, 2018 · 13 comments

aauren commented Apr 25, 2018

I was simulating loss of the etcd data directory on one of our etcd test clusters. In this cluster we have HTTPS set up and authentication turned on. Since certs and authentication are required, assume that the following environment variables are present and configured correctly for all of the etcdctl commands below (unless specifically overridden): ETCDCTL_CERT, ETCDCTL_KEY, ETCDCTL_USER (which is set to the root user and password), ETCDCTL_ENDPOINTS and ETCDCTL_API=3.
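
For reference, a rough sketch of that environment (the cert/key paths and password here are hypothetical placeholders, not our real values):

export ETCDCTL_API=3
export ETCDCTL_CERT=/etc/etcd/ssl/client.crt    # hypothetical path
export ETCDCTL_KEY=/etc/etcd/ssl/client.key     # hypothetical path
export ETCDCTL_USER=root:example-password       # root user and password
export ETCDCTL_ENDPOINTS=https://etcd1.test.domain.com:2379,https://etcd2.test.domain.com:2379,https://etcd3.test.domain.com:2379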

The steps I was using are as follows:

rc-service etcd stop  # We use Gentoo with OpenRC
etcdctl member remove ae7f7a301bd595af  # this is the member ID for etcd2
rm -rf /srv/etcd/member
etcdctl member add etcd2.test.domain.com --peer-urls="https://etcd2.test.domain.com:2380"
rc-service etcd start

The service starts correctly and everything looks good in the logs (at first). etcdctl endpoint status shows the following:

https://etcd1.test.domain.com:2379, ae7f7a301bd595af, 3.3.3, 272 MB, true, 4, 5518017
https://etcd2.test.domain.com:2379, 7556dd1882bcc76e, 3.3.3, 272 MB, false, 4, 5518018
https://etcd3.test.domain.com:2379, de8bb1c19d2a1228, 3.3.3, 272 MB, false, 4, 5518019

However, if I actually try to execute anything against etcd2 I get the following error both from etcdctl and in the logs:

# ETCDCTL_ENDPOINTS="https://etcd2.test.domain.com:2379" etcdctl endpoint health
https://etcd2.test.domain.com:2379 is unhealthy: failed to commit proposal: etcdserver: invalid auth token
Error: unhealthy cluster
# ETCDCTL_ENDPOINTS="https://etcd2.test.domain.com:2379" etcdctl user list  
Error: etcdserver: invalid auth token

The etcd error logs repeatedly display entries like the following:

2018-04-25 21:52:12.805624 W | auth: invalid auth token: hhrtGzNzDcIIDYUQ.5518690
2018-04-25 21:52:12.927572 W | auth: invalid auth token: XPVPhlDxIJYKjWLA.5518691
2018-04-25 21:52:13.063212 W | auth: invalid auth token: JSJApVTTUsSEhqSW.5518692
2018-04-25 21:52:13.198602 W | auth: invalid auth token: UgsgTmIddpPVyvQF.5518693

Running the above commands against either of the other two nodes in the cluster succeeds and displays the correct results.

It seems to me that when the new node comes up, although it is able to sync data down from the other two etcd nodes left in the cluster, the authentication credentials somehow aren't being synced correctly, so any request that requires authentication (like the health check or user list) fails to authenticate.

I've attached a scrubbed log from the offending node.
etcd.err.zip

Let me know if there is any other information that I can provide to help get this figured out/resolved.


aauren commented Apr 25, 2018

Just a quick update: we have several nodes that authenticate using only certificates instead of a specific user/password. Those hosts appear to be unaffected by this authentication problem, so it appears that only the user/password authentication tokens stop working.
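
For comparison, those cert-only clients invoke etcdctl roughly like this (paths are hypothetical, the cert CN maps to an etcd user), with no ETCDCTL_USER set:

ETCDCTL_API=3 etcdctl --endpoints=https://etcd2.test.domain.com:2379 \
  --cacert=/etc/etcd/ssl/ca.crt --cert=/etc/etcd/ssl/client.crt --key=/etc/etcd/ssl/client.key \
  endpoint health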


mitake commented Apr 26, 2018

@aauren thanks for your report. The problem comes from the stateful nature of the simple token. Simple tokens aren't replicated during membership changes, so they are invalid on the rejoined node.

Is it possible for you to try the JWT token provider: https://github.com/coreos/etcd/blob/master/Documentation/op-guide/configuration.md#--auth-token ? JWT is stateless, so the problem won't happen.
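
A rough sketch of a JWT setup (the key paths are hypothetical, and the key pair must be available on every member):

# generate an RSA key pair used to sign and verify tokens
openssl genrsa -out /etc/etcd/jwt-token.key 2048
openssl rsa -in /etc/etcd/jwt-token.key -pubout -out /etc/etcd/jwt-token.pub

# start etcd with the jwt provider instead of the default simple tokens
etcd --auth-token 'jwt,pub-key=/etc/etcd/jwt-token.pub,priv-key=/etc/etcd/jwt-token.key,sign-method=RS256' ...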


aauren commented Apr 26, 2018

Thanks for your response @mitake! While JWT is certainly something that we can look into in the future, we have already deployed clusters with simple token authentication. If one of them loses its data directory, is there any workaround to allow it to authenticate again with simple token authentication like the other nodes in the cluster do?

Also, do you know if this caveat to simple token authentication is somewhere in the documentation?


zyf0330 commented Jan 23, 2019

I hit this problem when upgrading etcd from v3.3.4 to v3.3.11.

@paulcaskey

This problem hit me as well. JWT is not something my organization supports. We do SSL certs, so that is feasible, but it adds a whole other level of complexity at scale to the simple problem of needing some sort of authentication to the etcd cluster from hundreds or thousands of clients.

With this known issue (auth breaking - permanently - if you replace a node), plus the missing feature of plain auth working over TLS in etcd v3, this really is a landmine in any real enterprise deployment. This needs to be more clearly documented. The whole plain-auth feature needs a giant asterisk -- it's really only a feature for testing or small-scale development. "DO try this at home -- ONLY."

Many system designers are adding an NGINX layer on top of etcd v3 just to solve this problem. The gRPC-proxy is another good workaround, at least as a TLS termination point getting you crypto over your WAN links, say, and being in-the-clear only on the LAN to clients within a data center. Still, this is rapidly becoming "not good enough" in today's tightening security climate.
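
As a rough sketch of that pattern (the addresses and paths are hypothetical), the proxy talks TLS to the etcd cluster and serves clients in the clear on the local network:

etcd grpc-proxy start \
  --endpoints=https://etcd1.test.domain.com:2379,https://etcd2.test.domain.com:2379,https://etcd3.test.domain.com:2379 \
  --listen-addr=127.0.0.1:23790 \
  --cacert=/etc/etcd/ssl/ca.crt --cert=/etc/etcd/ssl/client.crt --key=/etc/etcd/ssl/client.key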


snowdusk commented Aug 23, 2019

I disabled auth and then re-enabled auth to solve this problem.


mitake commented Aug 26, 2019

@aauren @zyf0330 @paulcaskey sorry for my late reply, I missed your comments... I'll enhance the docs to describe the problem. Probably @wangsong93's workaround would be good for testing purposes, so I'll add it too.

@mitake mitake self-assigned this Aug 26, 2019

zyf0330 commented Aug 27, 2019

I tried re-enabling auth like @wangsong93 said, but it didn't help.


Sheph commented Sep 10, 2019

I had a similar problem; try restarting the re-added node once right after it's up, and it'll work fine then.


aauren commented Sep 10, 2019

My guess is that you experienced a different problem than the one here if restarting or re-adding the node worked for you. Like @mitake said originally, the simple auth token information isn't synced during membership changes, so the only way to re-synchronize it to a node that doesn't have it is to add the node to the cluster, disable authentication, set the password on the user again, and then re-enable authentication.

This is the only procedure that I've found to work consistently when this problem is encountered.
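
Roughly, run the following with root credentials (etcdctl user passwd prompts for the new password):

etcdctl auth disable
etcdctl user passwd root    # repeat for any other affected users
etcdctl auth enable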


zyf0330 commented Sep 11, 2019

Sorry, my situation turned out not to be this problem. Actually, some watchers with expired tokens were causing that log.


stale bot commented Apr 6, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@johnxiaohe

I had similar problem, try restarting re-added node once right after it's up, it'll work fine then

YES. Restarting the re-added node worked.
