How to force node go to ready state, after remove client state dir #2550

Closed
tantra35 opened this issue Apr 11, 2017 · 3 comments

Comments

@tantra35
Contributor

tantra35 commented Apr 11, 2017

Nomad version

0.5.6

Issue

After a failed upgrade from Nomad 0.5.4 to 0.5.6 on some of our hosts, Nomad stopped working on those nodes. We decided to clean up the Nomad client state dir (we simply removed it from the file system) and relaunch the Nomad agent. But the agent can't join the working cluster because of the following errors in the log:

Apr 12 00:14:52 monitor1 nomad[3226]: * RPC failed to server 192.168.30.5:4647: rpc error: rpc error: node secret ID does not match. Not registering node.
Apr 12 00:14:52 monitor1 nomad[3226]: * RPC failed to server 192.168.30.2:4647: rpc error: rpc error: node secret ID does not match. Not registering node.
Apr 12 00:14:52 monitor1 nomad[3226]: * RPC failed to server 192.168.30.6:4647: rpc error: rpc error: node secret ID does not match. Not registering node.
Apr 12 00:14:52 monitor1 nomad[3226]: * RPC failed to server 192.168.30.1:4647: rpc error: node secret ID does not match. Not registering node.
Apr 12 00:14:52 monitor1 nomad[3226]: * RPC failed to server 192.168.31.220:4647: rpc error: rpc error: node secret ID does not match. Not registering node.
Apr 12 00:14:52 monitor1 nomad[3226]: * RPC failed to server 192.168.30.4:4647: rpc error: failed to get conn: dial tcp 192.168.30.4:4647: getsockopt: connection refused
Apr 12 00:14:52 monitor1 nomad[3226]: client: registration failure: 7 error(s) occurred:#012#012* RPC failed to server 192.168.30.3:4647: rpc error: failed to get conn: dial tcp 192.168.30.3:4647: getsockopt: connection refused#012* RPC failed to server 192.168.30.5:4647: rpc error: rpc error: node secret ID does not match. Not registering node.#012* RPC failed to server 192.168.30.2:4647: rpc error: rpc error: node secret ID does not match. Not registering node.#012* RPC failed to server 192.168.30.6:4647: rpc error: rpc error: node secret ID does not match. Not registering node.#012* RPC failed to server 192.168.30.1:4647: rpc error: node secret ID does not match. Not registering node.#012* RPC failed to server 192.168.31.220:4647: rpc error: rpc error: node secret ID does not match. Not registering node.#012* RPC failed to server 192.168.30.4:4647: rpc error: failed to get conn: dial tcp 192.168.30.4:4647: getsockopt: connection refused
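
The `#012` sequences in the last log entry look like syslog's octal escape for a newline, which is why the seven errors appear run together. A minimal sketch for making such a line readable, assuming that escaping; the helper name `decode_syslog_newlines` is hypothetical and not part of Nomad:

```shell
# Hypothetical helper: expand syslog's "#012" escape back into real newlines.
# Reads log text on stdin and writes the expanded text to stdout.
decode_syslog_newlines() {
  awk '{ gsub(/#012/, "\n"); print }'
}

# Usage (the log path is an assumption; adjust it for your system):
#   grep 'registration failure' /var/log/syslog | decode_syslog_newlines
```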

and the node is marked as down, without any chance of going back to the ready state:

root@social:/home/ruslan# nomad node-status
00000000  test  vol-h-docker-02  ceph   false  ready
a50ce082  test  server6          ceph   false  ready
a3e6b08b  test  monitor1         ceph   false  down
439a2f5a  test  graphite         ceph   false  ready
41b521c8  test  vol-h-docker-01  ceph   false  ready
ec475f0a  test  social           ceph   false  ready
1e6111fb  test  server2          ceph   false  ready
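
To pick the down nodes out of output shaped like the listing above, a small filter can help. This is only a sketch: it assumes the status is the last whitespace-separated column and the node name is the third, as in the listing, and the function name `down_nodes` is hypothetical:

```shell
# Hypothetical helper: print "<short node ID> <name>" for every row whose
# last column is "down". Reads `nomad node-status` style output on stdin.
down_nodes() {
  awk '$NF == "down" { print $1, $3 }'
}

# Usage:
#   nomad node-status | down_nodes
```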

As I understand it, since GH-2277 nodes have persistent IDs, but secret IDs are not persistent, because they can be cleared by removing the Nomad agent state dir (as in our case). So the Nomad servers, which remember the persistent node ID, think that a buggy node is trying to register and reject it. Nomad has no command that forces the servers to forget about down nodes and so gives such a node a chance to re-register (it seems we just have to wait until Nomad garbage-collects the down nodes, but this takes time). What can we do in this situation?

@tantra35
Contributor Author

tantra35 commented Apr 13, 2017

As a workaround, we found that the new Nomad agent client config parameter no_host_uuid can be used.
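
A minimal sketch of what that workaround might look like in the agent configuration. The file path is hypothetical. The value `true` is an assumption based on the Nomad client documentation, which describes `no_host_uuid = true` as generating a random node ID instead of one derived from the host's UUID:

```shell
# Write a client config fragment with no_host_uuid set.
# CONF is a hypothetical path; point it at your agent's config directory.
CONF="${CONF:-./client.hcl}"

cat > "$CONF" <<'EOF'
client {
  enabled      = true
  # Assumption: true asks the client to generate a random node ID rather
  # than reuse the host-derived one that the servers still remember.
  no_host_uuid = true
}
EOF
```

The agent has to be restarted with this file in its config path before the setting takes effect.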

@dadgar
Contributor

dadgar commented Apr 17, 2017

@tantra35 Yeah that is a bit tricky. What you can do is stop the node and wait for nomad to detect it as dead (30 seconds) and then issue a GC which will clear knowledge of that node from the servers. You can do that as follows:

$ curl -XPUT http://127.0.0.1:4646/v1/system/gc
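
The same call can be wrapped in a small script for running against another agent address. This is a sketch, not an official tool: the function names are hypothetical, the default address is the one from the command above, and the 30-second wait comes from the comment above:

```shell
# Build the system GC endpoint URL for a given agent HTTP address.
nomad_gc_url() {
  printf '%s/v1/system/gc\n' "${1:-http://127.0.0.1:4646}"
}

# Ask the servers to run a garbage collection via the agent's HTTP API.
# -f makes curl fail on HTTP errors; -sS hides progress but keeps error messages.
force_nomad_gc() {
  curl -fsS -X PUT "$(nomad_gc_url "$1")"
}

# Usage, after stopping the Nomad agent on the affected node:
#   sleep 30                                # give the servers time to mark the node down
#   force_nomad_gc http://127.0.0.1:4646
```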

@dadgar dadgar closed this as completed Apr 17, 2017
@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 14, 2022