After re-Adding a Lost Node, etcdserver reports "invalid auth token" #9629

Closed
aauren opened this issue Apr 25, 2018 · 13 comments

aauren commented Apr 25, 2018

I was simulating loss of the etcd data directory on one of our etcd test clusters. In this cluster we have HTTPS set up and authentication turned on. Since certs and authentication are required, assume that the following environment variables are present and configured correctly for all of the etcdctl commands below (unless specifically overridden): ETCDCTL_CERT, ETCDCTL_KEY, ETCDCTL_USER (which is set to the root user and password), ETCDCTL_ENDPOINTS and ETCDCTL_API=3.
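
For reference, a rough sketch of that environment (the cert/key paths and password here are hypothetical placeholders, not our real values):

export ETCDCTL_API=3
export ETCDCTL_CERT=/etc/etcd/ssl/client.crt    # hypothetical path
export ETCDCTL_KEY=/etc/etcd/ssl/client.key     # hypothetical path
export ETCDCTL_USER=root:example-password       # root user and password
export ETCDCTL_ENDPOINTS=https://etcd1.test.domain.com:2379,https://etcd2.test.domain.com:2379,https://etcd3.test.domain.com:2379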

The steps I was using are as follows:

rc-service etcd stop  # We use Gentoo with OpenRC
etcdctl member remove ae7f7a301bd595af  # this is the member ID for etcd2
rm -rf /srv/etcd/member
etcdctl member add etcd2.test.domain.com --peer-urls="https://etcd2.test.domain.com:2380"
rc-service etcd start

The service starts correctly and everything looks good in the logs (at first). etcdctl endpoint status shows the following:

https://etcd1.test.domain.com:2379, ae7f7a301bd595af, 3.3.3, 272 MB, true, 4, 5518017
https://etcd2.test.domain.com:2379, 7556dd1882bcc76e, 3.3.3, 272 MB, false, 4, 5518018
https://etcd3.test.domain.com:2379, de8bb1c19d2a1228, 3.3.3, 272 MB, false, 4, 5518019

However, if I actually try to execute anything against etcd2 I get the following error both from etcdctl and in the logs:

# ETCDCTL_ENDPOINTS="https://etcd2.test.domain.com:2379" etcdctl endpoint health
https://etcd2.test.domain.com:2379 is unhealthy: failed to commit proposal: etcdserver: invalid auth token
Error: unhealthy cluster
# ETCDCTL_ENDPOINTS="https://etcd2.test.domain.com:2379" etcdctl user list  
Error: etcdserver: invalid auth token

The etcd error logs repeatedly display entries like the following:

2018-04-25 21:52:12.805624 W | auth: invalid auth token: hhrtGzNzDcIIDYUQ.5518690
2018-04-25 21:52:12.927572 W | auth: invalid auth token: XPVPhlDxIJYKjWLA.5518691
2018-04-25 21:52:13.063212 W | auth: invalid auth token: JSJApVTTUsSEhqSW.5518692
2018-04-25 21:52:13.198602 W | auth: invalid auth token: UgsgTmIddpPVyvQF.5518693

Running the above commands against either of the other two nodes in the cluster succeeds and displays the correct results.

It seems to me that when the new node comes up, although it is able to sync data down from the other two etcd nodes left in the cluster, the authentication credentials somehow aren't being synced correctly, so any request that requires authentication (like the health check or user list) fails to authenticate.

I've attached a scrubbed log from the offending node.
etcd.err.zip

Let me know if there is any other information that I can provide to help get this figured out/resolved.


aauren commented Apr 25, 2018

Just a quick update: we have several nodes that authenticate using only certificates instead of a specific user/password. Those hosts appear to be unaffected by this authentication problem, so it appears that only the user/password authentication tokens stop working.
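
For comparison, those cert-only clients invoke etcdctl roughly like this (paths are hypothetical, the cert CN maps to an etcd user), with no ETCDCTL_USER set:

ETCDCTL_API=3 etcdctl --endpoints=https://etcd2.test.domain.com:2379 \
  --cacert=/etc/etcd/ssl/ca.crt --cert=/etc/etcd/ssl/client.crt --key=/etc/etcd/ssl/client.key \
  endpoint health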


mitake commented Apr 26, 2018

@aauren thanks for your report. The problem comes from the stateful nature of the simple token. Simple tokens aren't replicated during membership changes, so they are invalid on the rejoined node.

Is it possible for you to try the JWT token provider: https://github.com/coreos/etcd/blob/master/Documentation/op-guide/configuration.md#--auth-token ? JWT is stateless, so the problem won't happen.
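
A rough sketch of a JWT setup (the key paths are hypothetical, and the key pair must be available on every member):

# generate an RSA key pair used to sign and verify tokens
openssl genrsa -out /etc/etcd/jwt-token.key 2048
openssl rsa -in /etc/etcd/jwt-token.key -pubout -out /etc/etcd/jwt-token.pub

# start etcd with the jwt provider instead of the default simple tokens
etcd --auth-token 'jwt,pub-key=/etc/etcd/jwt-token.pub,priv-key=/etc/etcd/jwt-token.key,sign-method=RS256' ...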


aauren commented Apr 26, 2018

Thanks for your response @mitake! While JWT is certainly something that we can look into in the future, we have already deployed clusters with simple token authentication. If one of them loses its data directory, is there any workaround to allow it to authenticate again with simple token authentication like the other nodes in the cluster do?

Also, do you know if this caveat to simple token authentication is somewhere in the documentation?


zyf0330 commented Jan 23, 2019

I hit this problem when upgrading etcd from v3.3.4 to v3.3.11.

@paulcaskey

This problem hit me as well. JWT is not something my organization supports. We do SSL certs, so that is feasible, but it adds a whole other level of complexity at scale to the simple problem of needing some sort of authentication to the etcd cluster from hundreds or thousands of clients.

With this known issue (auth breaking - permanently - if you replace a node), plus the missing feature of plain auth working over TLS in etcd v3, this really is a landmine in any real enterprise deployment. This needs to be more clearly documented. The whole plain-auth feature needs a giant asterisk -- it's really only a feature for testing or small-scale development. "DO try this at home -- ONLY."

Many system designers are adding an NGINX layer on top of etcd v3 just to solve this problem. The gRPC-proxy is another good workaround, at least as a TLS termination point getting you crypto over your WAN links, say, and being in-the-clear only on the LAN to clients within a data center. Still, this is rapidly becoming "not good enough" in today's tightening security climate.
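
As a rough sketch of that pattern (the addresses and paths are hypothetical), the proxy talks TLS to the etcd cluster and serves clients in the clear on the local network:

etcd grpc-proxy start \
  --endpoints=https://etcd1.test.domain.com:2379,https://etcd2.test.domain.com:2379,https://etcd3.test.domain.com:2379 \
  --listen-addr=127.0.0.1:23790 \
  --cacert=/etc/etcd/ssl/ca.crt --cert=/etc/etcd/ssl/client.crt --key=/etc/etcd/ssl/client.key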


snowdusk commented Aug 23, 2019

I disabled auth and then re-enabled auth to solve this problem.


mitake commented Aug 26, 2019

@aauren @zyf0330 @paulcaskey sorry for my late reply, I missed your comments... I'll enhance the docs to describe the problem. Probably @wangsong93's workaround would be good for testing purposes, so I'll add it too.

@mitake mitake self-assigned this Aug 26, 2019

zyf0330 commented Aug 27, 2019

I tried re-enabling auth like @wangsong93 said, but it didn't help.


Sheph commented Sep 10, 2019

I had a similar problem; try restarting the re-added node once right after it's up, and it'll work fine then.


aauren commented Sep 10, 2019

My guess is that you experienced a different problem than the one here if restarting or re-adding the node worked for you. Like @mitake said originally, the simple auth token information isn't synced during membership changes, so the only way to re-synchronize it to a node that doesn't have it is to add the node to the cluster, disable authentication, set the password on the user again, and then re-enable authentication.

This is the only procedure that I've found to work consistently when this problem is encountered.
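
Roughly, run the following with root credentials (etcdctl user passwd prompts for the new password):

etcdctl auth disable
etcdctl user passwd root    # repeat for any other affected users
etcdctl auth enable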


zyf0330 commented Sep 11, 2019

Sorry, my situation turned out not to be this problem. Actually, some watchers with expired tokens were causing that log.


stale bot commented Apr 6, 2020

This issue has been automatically marked as stale because it has not had recent activity. It will be closed if no further activity occurs. Thank you for your contributions.

@johnxiaohe

I had similar problem, try restarting re-added node once right after it's up, it'll work fine then

YES. Restarting the re-added node worked.
