How to force node go to ready state, after remove client state dir #2550

tantra35 · 2017-04-11T22:25:55Z

Nomad version

0.5.6

Issue

After a failed upgrade from Nomad 0.5.4 to 0.5.6 on some of our hosts, Nomad was broken on those nodes (it did not work at all). So we decided to clean up the Nomad client state dir (we simply removed it from the file system) and relaunch the Nomad agent. But it cannot join the working cluster because of the following errors in the log:

Apr 12 00:14:52 monitor1 nomad[3226]: * RPC failed to server 192.168.30.5:4647: rpc error: rpc error: node secret ID does not match. Not registering node.
Apr 12 00:14:52 monitor1 nomad[3226]: * RPC failed to server 192.168.30.2:4647: rpc error: rpc error: node secret ID does not match. Not registering node.
Apr 12 00:14:52 monitor1 nomad[3226]: * RPC failed to server 192.168.30.6:4647: rpc error: rpc error: node secret ID does not match. Not registering node.
Apr 12 00:14:52 monitor1 nomad[3226]: * RPC failed to server 192.168.30.1:4647: rpc error: node secret ID does not match. Not registering node.
Apr 12 00:14:52 monitor1 nomad[3226]: * RPC failed to server 192.168.31.220:4647: rpc error: rpc error: node secret ID does not match. Not registering node.
Apr 12 00:14:52 monitor1 nomad[3226]: * RPC failed to server 192.168.30.4:4647: rpc error: failed to get conn: dial tcp 192.168.30.4:4647: getsockopt: connection refused
Apr 12 00:14:52 monitor1 nomad[3226]: client: registration failure: 7 error(s) occurred:#012#012* RPC failed to server 192.168.30.3:4647: rpc error: failed to get conn: dial tcp 192.168.30.3:4647: getsockopt: connection refused#012* RPC failed to server 192.168.30.5:4647: rpc error: rpc error: node secret ID does not match. Not registering node.#012* RPC failed to server 192.168.30.2:4647: rpc error: rpc error: node secret ID does not match. Not registering node.#012* RPC failed to server 192.168.30.6:4647: rpc error: rpc error: node secret ID does not match. Not registering node.#012* RPC failed to server 192.168.30.1:4647: rpc error: node secret ID does not match. Not registering node.#012* RPC failed to server 192.168.31.220:4647: rpc error: rpc error: node secret ID does not match. Not registering node.#012* RPC failed to server 192.168.30.4:4647: rpc error: failed to get conn: dial tcp 192.168.30.4:4647: getsockopt: connection refused

and the node is marked as down, with no way to return to the ready state:

root@social:/home/ruslan# nomad node-status
00000000  test  vol-h-docker-02  ceph   false  ready
a50ce082  test  server6          ceph   false  ready
a3e6b08b  test  monitor1         ceph   false  down
439a2f5a  test  graphite         ceph   false  ready
41b521c8  test  vol-h-docker-01  ceph   false  ready
ec475f0a  test  social           ceph   false  ready
1e6111fb  test  server2          ceph   false  ready

As I understand it, because of GH-2277 nodes now have persistent IDs, but secret IDs are not persistent: they are lost when the Nomad agent state dir is removed (our case). So the Nomad servers think that a bogus node is trying to register under a node ID they already remember, and reject it. Nomad has no command that forces the servers to forget about down nodes and so gives the node a chance to re-register (it seems we just have to wait until Nomad garbage-collects the down nodes, but this takes time). What can we do in this situation?


tantra35 · 2017-04-13T10:25:01Z

As a workaround, we found that the new Nomad agent client config parameter no_host_uuid works here.
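For readers who want to try this workaround, here is a minimal sketch of a client config fragment that sets the option. It assumes `no_host_uuid` belongs in the `client` block and that setting it to `true` makes the agent generate a random node ID instead of deriving one from the host, so a node whose state dir was wiped would register as a new node. The helper name and the file path are hypothetical and do not come from the report.

```shell
#!/bin/sh
# Sketch: write a client config fragment that sets no_host_uuid.
# Assumption: the agent loads this file through its usual -config path.

write_no_host_uuid_config() {
    # $1: destination file for the fragment
    cat > "$1" <<'EOF'
client {
  enabled      = true
  no_host_uuid = true
}
EOF
}

# Usage (hypothetical path), followed by an agent restart:
#   write_no_host_uuid_config /etc/nomad.d/no-host-uuid.hcl
```

The fragment is kept in its own file so it can be removed again once the stale node record is gone from the servers.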

dadgar · 2017-04-17T23:59:50Z

@tantra35 Yeah that is a bit tricky. What you can do is stop the node and wait for nomad to detect it as dead (30 seconds) and then issue a GC which will clear knowledge of that node from the servers. You can do that as follows:

$ curl -XPUT http://127.0.0.1:4646/v1/system/gc
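The stop, wait, and GC sequence can be wrapped in a small script. This is a sketch only: the helper names are made up for illustration, `NOMAD_ADDR` is assumed to point at an agent's HTTP address (defaulting to the one in the curl command above), and the 30-second wait is the figure quoted above, not a value read from the cluster.

```shell
#!/bin/sh
# Sketch of the stop / wait / GC sequence described above.

NOMAD_ADDR="${NOMAD_ADDR:-http://127.0.0.1:4646}"

# Build the URL of the garbage-collection endpoint,
# tolerating a trailing slash in NOMAD_ADDR.
gc_url() {
    printf '%s/v1/system/gc\n' "${NOMAD_ADDR%/}"
}

# Ask the servers to run a GC, which clears nodes already marked down.
# -f makes curl return non-zero on an HTTP error status.
force_gc() {
    curl -fsS -X PUT "$(gc_url)"
}

# Usage, after stopping the agent on the broken node with whatever
# your init system provides:
#   sleep 30 && force_gc
#   nomad node-status    # the down node should no longer be listed
```

Once the old record is gone, restarting the agent should let the node register again with its new secret ID.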

github-actions · 2022-12-14T02:17:21Z

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

dadgar closed this as completed Apr 17, 2017

github-actions bot locked as resolved and limited conversation to collaborators Dec 14, 2022
