
Seems that after a leader failure, some agents may forget about all servers except the leader, which leads those nodes to miss their heartbeats #4666

Closed
tantra35 opened this issue Sep 11, 2018 · 11 comments · Fixed by #5654

Comments

@tantra35
Contributor

tantra35 commented Sep 11, 2018

Nomad version

Nomad v0.8.4 (dbee1d7)

Issue

We see this situation from time to time, when one of the Nomad servers hangs due to hardware failure or network connectivity problems. The last time it happened was due to an AWS failure: one of the Nomad servers hung, and that server was the leader:

2018/09/06 10:40:03 [WARN] raft: Rejecting vote request from 172.29.4.194:4647 since we have a leader: 172.29.20.70:4647
2018/09/06 10:40:03 [WARN] raft: Heartbeat timeout from "172.29.20.70:4647" reached, starting election

From these log lines we conclude that the node with IP 172.29.20.70 was the leader. While the AWS autoscaler shut down the failed node and brought up a new one (this process took about 5-7 minutes), the remaining two servers elected a new leader, but 2 Nomad agents forgot about the remaining servers and kept communicating only with the failed leader. Only after 15-20 seconds did those agents see the server list and start sending heartbeats to the proper RPC server, but this was enough time for the Nomad server (the new leader) to conclude that these agents had missed their heartbeat TTL and to begin rescheduling all allocations placed on those agents. The following logs demonstrate the situation:

2018-09-06T10:41:00.460+0300 [DEBUG] plugin.nomad: 2018/09/06 10:41:00 [ERR] plugin: plugin server: accept unix /tmp/plugin320636499: use of closed network connection
2018/09/06 10:41:00 [ERR] yamux: Failed to write header: write unix @->/tmp/plugin185990052: write: broken pipe
2018-09-06T10:41:00.461+0300 [DEBUG] plugin: plugin process exited: path=/opt/nomad/nomad_0.8.4/nomad
2018-09-06T10:41:00.459+0300 [DEBUG] plugin: plugin process exited: path=/opt/nomad/nomad_0.8.4/nomad
2018-09-06T10:41:00.459+0300 [DEBUG] plugin.nomad: 2018/09/06 10:41:00 [ERR] plugin: plugin server: accept unix /tmp/plugin622419670: use of closed network connection
2018/09/06 10:41:00.433332 [WARN] client: heartbeat missed (request took 15.898862ms). Heartbeat TTL was 13.446301173s and heartbeated after 26.197845989s
2018/09/06 10:41:00.416017 [ERR] client: heartbeating failed. Retrying in 1.473438241s: failed to update status: rpc error: failed to get conn: rpc error: lead thread didn't get connection
2018/09/06 10:41:00.414357 [ERR] nomad: "Node.UpdateStatus" RPC failed to server 172.29.20.70:4647: rpc error: failed to get conn: rpc error: lead thread didn't get connection
2018/09/06 10:41:00.414170 [ERR] nomad: "Alloc.GetAllocs" RPC failed to server 172.29.20.70:4647: rpc error: failed to get conn: rpc error: lead thread didn't get connection
2018/09/06 10:41:00.414207 [ERR] client: failed to query updated allocations: rpc error: failed to get conn: rpc error: lead thread didn't get connection
2018/09/06 10:41:00.414226 [ERR] nomad: "ACL.ResolveToken" RPC failed to server 172.29.20.70:4647: rpc error: failed to get conn: rpc error: lead thread didn't get connection
2018/09/06 10:41:00.414242 [WARN] client: failed to resolve token, using expired cached value: rpc error: failed to get conn: rpc error: lead thread didn't get connection
2018/09/06 10:40:50.409838 [WARN] client: failed to resolve token, using expired cached value: rpc error: failed to get conn: dial tcp 172.29.20.70:4647: i/o timeout
2018/09/06 10:40:50.410347 [ERR] nomad: "Node.UpdateStatus" RPC failed to server 172.29.20.70:4647: rpc error: failed to get conn: rpc error: lead thread didn't get connection
2018/09/06 10:40:50.409804 [ERR] nomad: "ACL.ResolveToken" RPC failed to server 172.29.20.70:4647: rpc error: failed to get conn: dial tcp 172.29.20.70:4647: i/o timeout
2018/09/06 10:40:50.410172 [ERR] nomad: "ACL.ResolveToken" RPC failed to server 172.29.20.70:4647: rpc error: failed to get conn: rpc error: lead thread didn't get connection
2018/09/06 10:40:50.410185 [WARN] client: failed to resolve token, using expired cached value: rpc error: failed to get conn: rpc error: lead thread didn't get connection
2018/09/06 10:40:50.410380 [ERR] client: heartbeating failed. Retrying in 1.847214154s: failed to update status: rpc error: failed to get conn: rpc error: lead thread didn't get connection
2018/09/06 10:40:34.229367 [WARN] client: failed to resolve token, using expired cached value: rpc error: EOF
2018/09/06 10:40:34.228986 [ERR] nomad: "Node.UpdateStatus" RPC failed to server 172.29.20.70:4647: rpc error: EOF
2018/09/06 10:40:34 [ERR] yamux: keepalive failed: i/o deadline reached
2018/09/06 10:40:34.229006 [ERR] client: heartbeating failed. Retrying in 1.398842761s: failed to update status: rpc error: EOF
2018/09/06 10:40:34.229353 [ERR] nomad: "ACL.ResolveToken" RPC failed to server 172.29.20.70:4647: rpc error: EOF

As you can see, no server rotation was performed after any of the failed RPCs (so we conclude that the agent forgot about all servers except 172.29.20.70), even though this should happen per the code at https://github.com/hashicorp/nomad/blob/master/client/rpc.go#L76
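For reference, the rotation we expected looks roughly like the following sketch (a simplified illustration, not the actual Nomad source; the real logic lives in client/servers/manager.go). Note that if the list has degraded to a single server, rotation becomes a no-op, which matches the behavior we appear to be hitting:

// Simplified sketch (not the real Nomad code) of the rotation expected after
// a failed RPC: the failing server is moved to the back of the list so the
// next attempt targets a different server.
package main

import (
	"fmt"
	"sync"
)

type serverList struct {
	mu      sync.Mutex
	servers []string // RPC addresses, e.g. "172.29.20.70:4647"
}

// front returns the server the next RPC should be sent to.
func (l *serverList) front() string {
	l.mu.Lock()
	defer l.mu.Unlock()
	if len(l.servers) == 0 {
		return ""
	}
	return l.servers[0]
}

// notifyFailed rotates the failed server to the back of the list.
func (l *serverList) notifyFailed(addr string) {
	l.mu.Lock()
	defer l.mu.Unlock()
	for i, s := range l.servers {
		if s == addr {
			l.servers = append(append(l.servers[:i:i], l.servers[i+1:]...), addr)
			return
		}
	}
}

func main() {
	l := &serverList{servers: []string{"172.29.20.70:4647", "172.29.4.194:4647"}}
	failed := l.front()
	l.notifyFailed(failed) // simulate an RPC failure against the front server
	fmt.Println(l.front()) // now 172.29.4.194:4647
}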

For now we think a potential reason could be the unconditional replacement of the server list in the following code: https://github.com/hashicorp/nomad/blob/master/client/client.go#L1509

Our hypothesis is this: before fully hanging, the leader forgot about the other servers and provided this info to some Nomad clients (these agents were unlucky and had made their RPC requests to the faulty server); after that the Nomad server fully hung and stopped processing RPC requests, so the unlucky clients had no chance to communicate with the healthy servers. But it is strange that Consul discovery doesn't help, and it is also strange that the gap between the following error records

2018/09/06 10:40:34.228986 [ERR] nomad: "Node.UpdateStatus" RPC failed to server 172.29.20.70:4647: rpc error: EOF

and

2018/09/06 10:40:50.410380 [ERR] client: heartbeating failed. Retrying in 1.847214154s: failed to update status: rpc error: failed to get conn: rpc error: lead thread didn't get connection

is more than 15 seconds, while in theory this gap should be very small.

@dadgar
Contributor

dadgar commented Sep 11, 2018

Can you share the server logs for 172.29.20.70 around the time of the issue and more of the client logs?
Also, can you share the client's config?

@tantra35
Contributor Author

tantra35 commented Sep 11, 2018

These are the only logs from 172.29.20.70 for that day:

    2018/09/06 09:31:31 [INFO] raft: Snapshot to 535205 complete
    2018/09/06 09:31:28 [INFO] raft: Compacting logs from 516773 to 524965
    2018/09/06 09:31:28 [INFO] snapshot: reaping snapshot /var/lib/nomad/server/raft/snapshots/421-518807-1536043917836
    2018/09/06 09:31:28 [INFO] snapshot: Creating new snapshot at /var/lib/nomad/server/raft/snapshots/422-535205-1536215488580.tmp
    2018/09/06 09:31:28 [INFO] raft: Starting snapshot up to 535205
    2018/09/06 09:05:45 [ERR] yamux: keepalive failed: session shutdown

and complete silence at the moment of the crash. Maybe the server's logs didn't reach the collector, but this is all we have; AWS forcibly terminated that server and its volume was lost.

Our client config is split across several files:

client.hcl

region = "atf01"
datacenter = "test"
data_dir = "/var/lib/nomad/"
bind_addr = "0.0.0.0" # the default
disable_update_check = true

enable_syslog = false
log_level = "WARN"

leave_on_terminate = false

client
{
  enabled = true
  max_kill_timeout = "10m"
  node_class = "dockerworker"

  options = {
    "docker.auth.config" = "/root/.docker/config.json"
    "driver.raw_exec.enable" = true
  }

  reserved
  {
    cpu = 100
    memory = 256
  }
}

acl.hcl

acl
{
  enabled = true
}

vault.hcl

vault {
  enabled = true
  address = "https://vault.service.consul:8200"
}

consul.hcl

consul
{
  token = "xxxxxxx-xxxxx-xxx-xxx-xxxxxxxxxxxx"
}

advertise.hcl

advertise
{
        http = "172.29.0.29"
        rpc = "172.29.0.29"
        serf = "172.29.0.29"
}

replicator.hcl

client
{
  meta
  {
    "replicator_worker_pool" = "dockerworker-t2"
  }
}

Also I remember that the same situation happened on another cluster, at the moment when we started debugging the Nomad leader server with the Delve debugger (so it hung for a long time), but we haven't tried to reproduce it that way.

@tantra35
Contributor Author

tantra35 commented Sep 14, 2018

@dadgar this situation can be reproduced if you hang a Nomad server with a debugger; in our case this is 100% reproducible (we do dlv attach <nomad server pid>), and it doesn't depend on which server hangs, leader or not (so our hypothesis about a special leader influence is wrong)

For example, on a test Vagrant stand where we hung one of the Nomad servers (servers and clients run in separate VMs), we see the following in the client logs:

2018/09/15 01:21:13 [ERR] yamux: keepalive failed: i/o deadline reached
2018/09/15 01:21:13.330954 [ERR] nomad: "Node.UpdateStatus" RPC failed to server 192.168.142.101:4647: rpc error: EOF
2018/09/15 01:21:13.330998 [ERR] client: heartbeating failed. Retrying in 1.938277957s: failed to update status: rpc error: EOF
2018/09/15 01:21:53 [ERR] yamux: keepalive failed: i/o deadline reached
2018/09/15 01:21:53.344845 [INFO] client.consul: discovered following Servers: 192.168.142.102:4647(DC: ),192.168.142.103:4647(DC: ),192.168.142.101:4647(DC: )
2018/09/15 01:21:53.344877 [INFO] manager: setting new servers list: 192.168.142.101:4647(DC: ),192.168.142.102:4647(DC: ),192.168.142.103:4647(DC: ), old was: 192.168.142.101:4647(DC: test),192.168.142.102:4647(DC: test),192.168.142.103:4647(DC: test)
2018/09/15 01:21:53.345358 [INFO] manager: setting new servers list: 192.168.142.101:4647(DC: test),192.168.142.102:4647(DC: test),192.168.142.103:4647(DC: test), old was: 192.168.142.101:4647(DC: ),192.168.142.102:4647(DC: ),192.168.142.103:4647(DC: )
2018/09/15 01:22:04.908463 [WARN] client: heartbeat missed (request took 1.900011ms). Heartbeat TTL was 12.468564142s and heartbeated after 11.563112784s
2018/09/15 01:22:04.908502 [INFO] manager: setting new servers list: 192.168.142.102:4647(DC: test),192.168.142.103:4647(DC: test), old was: 192.168.142.101:4647(DC: test),192.168.142.102:4647(DC: test),192.168.142.103:4647(DC: test)
2018/09/15 01:31:41.560745 [INFO] manager: setting new servers list: 192.168.142.101:4647(DC: test),192.168.142.102:4647(DC: test),192.168.142.103:4647(DC: test), old was: 192.168.142.102:4647(DC: test),192.168.142.103:4647(DC: test)

As you can see, the RPC error happens at 2018/09/15 01:21:13, but Consul discovery only completed at 2018/09/15 01:21:53, 40 seconds later, which is abnormal. At the same moment on the leader (this was a 100% working node) we see:

Sep 15 01:21:20 consulnomad-2 nomad.sh[1596]:     2018/09/15 01:21:20 [INFO] memberlist: Suspect consulnomad-1.global has failed, no acks received
Sep 15 01:21:20 consulnomad-2 nomad.sh[1596]:     2018/09/15 01:21:20.049775 [INFO] nomad: removing server consulnomad-1.global (Addr: 192.168.142.101:4647) (DC: test)
Sep 15 01:21:26 consulnomad-2 nomad.sh[1596]:     2018/09/15 01:21:26.535135 [WARN] nomad.heartbeat: node '4838d20d-2dbe-86f5-30ee-eedb75d15f50' TTL expired
Sep 15 01:21:27 consulnomad-2 nomad.sh[1596]:     2018/09/15 01:21:27 [ERR] raft: Failed to AppendEntries to {Voter 192.168.142.101:4647 192.168.142.101:4647}: read tcp 192.168.142.102:58188->192.168.142.101:4647: i/o timeout
Sep 15 01:21:28 consulnomad-2 nomad.sh[1596]:     2018/09/15 01:21:28 [ERR] raft: Failed to heartbeat to 192.168.142.101:4647: read tcp 192.168.142.102:58192->192.168.142.101:4647: i/o timeout
Sep 15 01:21:37 consulnomad-2 nomad.sh[1596]:     2018/09/15 01:21:37.741499 [WARN] nomad.heartbeat: node '06ac6820-d2d4-10d6-6015-ab88058f9e69' TTL expired
Sep 15 01:21:37 consulnomad-2 nomad.sh[1596]:     2018/09/15 01:21:37 [ERR] raft: Failed to AppendEntries to {Voter 192.168.142.101:4647 192.168.142.101:4647}: read tcp 192.168.142.102:58202->192.168.142.101:4647: i/o timeout
Sep 15 01:21:38 consulnomad-2 nomad.sh[1596]:     2018/09/15 01:21:38 [ERR] raft: Failed to heartbeat to 192.168.142.101:4647: read tcp 192.168.142.102:58204->192.168.142.101:4647: i/o timeout
Sep 15 01:21:46 consulnomad-2 nomad.sh[1596]:     2018/09/15 01:21:46.677738 [WARN] nomad.heartbeat: node '1412a607-fbf6-5c11-153b-0171f0a651a6' TTL expired
Sep 15 01:21:47 consulnomad-2 nomad.sh[1596]:     2018/09/15 01:21:47 [ERR] raft: Failed to AppendEntries to {Voter 192.168.142.101:4647 192.168.142.101:4647}: read tcp 192.168.142.102:58214->192.168.142.101:4647: i/o timeout
Sep 15 01:21:48 consulnomad-2 nomad.sh[1596]:     2018/09/15 01:21:48 [ERR] raft: Failed to heartbeat to 192.168.142.101:4647: read tcp 192.168.142.102:58216->192.168.142.101:4647: i/o timeout
Sep 15 01:21:58 consulnomad-2 nomad.sh[1596]:     2018/09/15 01:21:58 [ERR] raft: Failed to AppendEntries to {Voter 192.168.142.101:4647 192.168.142.101:4647}: read tcp 192.168.142.102:58226->192.168.142.101:4647: i/o timeout
Sep 15 01:21:59 consulnomad-2 nomad.sh[1596]:     2018/09/15 01:21:59 [ERR] raft: Failed to heartbeat to 192.168.142.101:4647: read tcp 192.168.142.102:58228->192.168.142.101:4647: i/o timeout
Sep 15 01:22:05 consulnomad-2 nomad.sh[1596]:     2018/09/15 01:22:05 [INFO] serf: attempting reconnect to consulnomad-1.global 192.168.142.101:4648

3 nodes lost their TTL (on the test stand we have only 4 nodes).

@tantra35
Contributor Author

tantra35 commented Sep 15, 2018

I think the reason for this is the lock held by consulDiscoveryImpl, which may be held for a very long time if communication with a Nomad server hangs in the RPC used to get peers during Consul discovery. Also, the hung server must be the first one returned by the Peers RPC.
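To illustrate the suspicion (a hypothetical simplification, not the real Nomad code): if the heartbeat path needs a lock that the Consul discovery routine holds across a blocking peers RPC to the hung server, heartbeating stalls until that RPC times out:

// Hypothetical simplification of the suspected problem: discovery holds the
// heartbeat lock across a slow RPC to a hung server, so the heartbeat
// goroutine blocks far longer than the heartbeat TTL.
package main

import (
	"fmt"
	"sync"
	"time"
)

var heartbeatLock sync.Mutex

// peersRPC stands in for the peers request made during Consul discovery; when
// the first returned server is hung it only fails after a long i/o timeout.
func peersRPC() {
	time.Sleep(3 * time.Second) // shortened stand-in for a ~30s i/o timeout
}

func consulDiscovery() {
	heartbeatLock.Lock() // held for the whole discovery attempt
	defer heartbeatLock.Unlock()
	peersRPC()
}

func heartbeat() {
	start := time.Now()
	heartbeatLock.Lock() // blocks behind consulDiscovery
	defer heartbeatLock.Unlock()
	fmt.Printf("heartbeat sent after %s; the server-side TTL may already be expired\n", time.Since(start))
}

func main() {
	go consulDiscovery()
	time.Sleep(100 * time.Millisecond) // let discovery grab the lock first
	heartbeat()
}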

@tantra35
Contributor Author

tantra35 commented Sep 17, 2018

@dadgar The problem was more complex than simply removing the heartbeatLock from consulDiscovery, so we made a PR to solve it: #4688

The key ideas are:

  1. Prevent triggerDiscovery if we have healthy servers (the new server list should be obtained only from heartbeats when we have a quorum of healthy servers).

  2. Change the logic of SetServers in the Manager (client/servers/manager.go) so that it only replaces the server list if the list has really changed (before, the client forgot about the health of its servers after each heartbeat); see the sketch below.
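A minimal sketch of the second idea, under the assumption that the manager stores plain addresses (the real PR works with the client/servers types):

// Sketch of idea 2 (assumed shape, not the exact code from PR #4688):
// SetServers only swaps in the new list when the membership really changed,
// so a heartbeat that reports the same servers no longer resets their state.
package main

import (
	"fmt"
	"math/rand"
	"sort"
	"sync"
)

type Manager struct {
	mu      sync.Mutex
	servers []string
}

// sameMembers reports whether both lists contain the same addresses,
// ignoring order.
func sameMembers(a, b []string) bool {
	if len(a) != len(b) {
		return false
	}
	as, bs := append([]string(nil), a...), append([]string(nil), b...)
	sort.Strings(as)
	sort.Strings(bs)
	for i := range as {
		if as[i] != bs[i] {
			return false
		}
	}
	return true
}

// SetServers ignores updates that carry the same set of servers.
func (m *Manager) SetServers(in []string) {
	m.mu.Lock()
	defer m.mu.Unlock()
	if sameMembers(m.servers, in) {
		return // nothing changed: keep the current order and health state
	}
	m.servers = append([]string(nil), in...)
	rand.Shuffle(len(m.servers), func(i, j int) {
		m.servers[i], m.servers[j] = m.servers[j], m.servers[i]
	})
}

func main() {
	m := &Manager{}
	m.SetServers([]string{"a:4647", "b:4647", "c:4647"})
	m.SetServers([]string{"c:4647", "b:4647", "a:4647"}) // same set: no-op
	fmt.Println(m.servers)
}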

After these changes we can't reproduce this issue on the test stand. But this PR solves only one part of the problem initially described and doesn't explain why there was only one server left in the client's communications. As we added additional logging at the INFO level, we hope that sooner or later we will see the crux of the problem and why there was only 1 server in the list.

We also found that auto-rebalancing of Nomad servers on the client is never launched: the method func (m *Manager) Start() is not called anywhere. Is this normal?

@a-zagaevskiy
Contributor

It seems that we also have this issue. It probably started after we upgraded Nomad to v0.8.4. Because of it, tasks constantly restart on random clients when one of the servers is shut down.
Is there any activity to fix it? It's a very serious problem for us.

@tantra35
Contributor Author

tantra35 commented Dec 7, 2018

@AlexanderZagaevskiy as a workaround you may increase the heartbeat_grace parameter, for example to 1m, or apply the patch from our pull request
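For reference, heartbeat_grace is a server-side setting; assuming a server config layout similar to our client one, the workaround is just:

server.hcl

server
{
  enabled = true
  heartbeat_grace = "1m"
}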

@a-zagaevskiy
Contributor

@tantra35 Our cluster is very demanding on fast fault detection, so the heartbeat_grace solution doesn't fit us. Moreover, PR #4688 didn't help.

Nevertheless, a close look at the issue and @tantra35's PR allowed us to fix it.
The main problems are:

  1. There is a bug with the omitted datacenter of servers discovered via Consul. (PR client: fix omitted datacenter of a server discovered via Consul #4997)
  2. Yamux sessions are kept alive longer than the default heartbeat TTL (30s vs 10s). This leads to protracted RPC calls to the failed server, so the client misses its heartbeat check.
  3. The server list is shuffled after each successful node update and each discovery via Consul. Hence the client might call the failed server more than once in a row and miss its heartbeat check.

As a temporary solution, we decreased the keep-alive interval for yamux sessions, shuffled the server list only when the incoming list differed from the already known one, and turned auto-rebalancing of Nomad servers on.
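For context, the keep-alive settings live in the yamux session config; a minimal sketch of the kind of change we mean (illustrative values and wiring, not the exact patch):

// Sketch of lowering the yamux keep-alive interval so a dead connection is
// detected faster than the client heartbeat TTL.
package rpcconn

import (
	"net"
	"time"

	"github.com/hashicorp/yamux"
)

func newSession(conn net.Conn) (*yamux.Session, error) {
	conf := yamux.DefaultConfig()
	conf.EnableKeepAlive = true
	conf.KeepAliveInterval = 5 * time.Second      // default is 30s, longer than a ~10s heartbeat TTL
	conf.ConnectionWriteTimeout = 5 * time.Second // fail writes to a hung server sooner
	return yamux.Client(conn, conf)
}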

@tantra35
Contributor Author

@AlexanderZagaevskiy I agree with your timing settings for yamux, but why do you have such frequent periods of non-responsive servers? Do you have any telemetry?

@a-zagaevskiy
Contributor

Our QA engineers were just checking the fault tolerance feature and turned servers off or unplugged the network on purpose. It's one of our test cases.

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 24, 2022