Nomad reallocated all jobs when one server lost connection in cluster #3840

Closed
atillamas opened this issue Feb 6, 2018 · 7 comments · Fixed by #3890

Comments

@atillamas

If filing a bug please include the following:

Nomad version

Nomad v0.7.1 (0b295d3)

Operating system and Environment details

"Ubuntu 16.04.2 LTS"

Issue

We are running a cluster with 3 Nomad servers and 3 clients in AWS eu-west-1, one per AZ.
According to the logs, one node seems to have lost its connection to the other two servers, and this caused the cluster to reallocate all the tasks: first those on 1 worker node, and then 5 minutes later everything on 2 worker nodes, causing some downtime for allocations that were placed only on those 2 nodes.

No nodes were terminated; it just seems like it was a network hiccup.

Reproduction steps

Nope

Nomad Server logs (if appropriate)

server1.eu-west-1.compute.amazonaws.com.nomad_logs.txt
server3.eu-west-1.compute.amazonaws.com.nomad_logs.txt
server2.eu-west-1.compute.amazonaws.com.nomad_logs.txt

Nomad Client logs (if appropriate)

Nothing interesting, just lots of:
client.gc: marking allocation c529a7e1-e5e9-2d6c-20de-405e9f10ce6a for GC

@schmichael
Member

Yikes! Sounds like AWS's network had a lot of issues. This behavior is to be expected during severe network problems: whenever a client node is unable to heartbeat to a quorum of servers for a period of time, the servers will consider it lost and reschedule its allocations on nodes that are still able to heartbeat.

There are a couple things you can do to try to prevent service outages during network issues like this:

  • Higher service redundancy - raising the count on your services (and adding client nodes if necessary) will hopefully keep a sufficient number of allocations on healthy nodes for all jobs.
  • Spread allocations across availability zones with a distinct_property constraint on platform.aws.placement.availability-zone (a sketch of such a constraint follows this list).
  • Raise the heartbeat_grace parameter on your servers to mark client nodes as lost more slowly. The default setting is fairly low to try to detect node failures as quickly as possible.
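
For reference, a minimal sketch of what that distinct_property constraint could look like in a job file (the job, group, and task names and the Docker image are placeholders, not taken from this issue):

  job "web" {
    datacenters = ["dc1"]

    group "app" {
      # Aim for one allocation per availability zone.
      count = 3

      constraint {
        operator  = "distinct_property"
        attribute = "${attr.platform.aws.placement.availability-zone}"
      }

      task "server" {
        driver = "docker"

        config {
          image = "nginx:1.13"
        }

        resources {
          cpu    = 100
          memory = 128
        }
      }
    }
  }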

I hope this helps but please reopen if you think there's an issue I'm missing. Thank you for including extremely helpful logs!

@atillamas
Author

atillamas commented Feb 7, 2018

@schmichael
Hi. Thanks for your reply; however, this just raises more questions for me.
In the logs I cannot find where it loses a quorum of servers. It loses the leader and re-elects a new one.

All allocations were placed in different AZs, since there was 1 client node in each AZ (by Auto Scaling group rule), and since I'm using the distinct_hosts constraint this leads to only one allocation in every AZ. Since I would want my services to come back up in the event of losing an entire AZ, that wouldn't happen if I set a constraint on the AZ.
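
A minimal sketch of that kind of group-level constraint (the group name and count are placeholders, not from the actual jobs):

  group "app" {
    count = 3

    # At most one allocation of this task group per client node.
    constraint {
      operator = "distinct_hosts"
      value    = "true"
    }
  }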

When I've played around with Nomad and killed a quorum of servers, destroying the cluster, the clients have always continued to run their allocations until I killed them manually via Docker. But now Nomad killed all allocations on one client node, and 5 minutes later all allocations on the other 2 client nodes (66% of cluster capacity, thus causing downtime; we haven't scaled it to survive the loss of 2 AZs). It doesn't feel like something it should do.

I cannot find anything in the logs indicating that the client-server connection failed. Is that something that is logged?

@schmichael
Member

In the logs I cannot find where it loses a quorum of servers. It loses the leader and re-elects a new one.

Great question! We need to document these logs or something. This line indicates leadership was lost:

Feb 06 16:59:10 2018/02/06 16:59:10.665298 [INFO] nomad: cluster leadership lost

These lines indicate the cluster trying to elect a new leader, having a bit of a hard time, but ultimately succeeding:

Feb 06 16:59:11 2018/02/06 16:59:11 [WARN] raft: Heartbeat timeout from "" reached, starting election
Feb 06 16:59:11 2018/02/06 16:59:11 [INFO] raft: Node at 10.0.21.253:4647 [Candidate] entering Candidate state in term 37
Feb 06 16:59:13 2018/02/06 16:59:13 [WARN] raft: Election timeout reached, restarting election
Feb 06 16:59:13 2018/02/06 16:59:13 [INFO] raft: Node at 10.0.21.253:4647 [Candidate] entering Candidate state in term 38
Feb 06 16:59:13 2018/02/06 16:59:13 [INFO] raft: Duplicate RequestVote for same term: 38
Feb 06 16:59:13 2018/02/06 16:59:13 [INFO] raft: Node at 10.0.21.253:4647 [Follower] entering Follower state (Leader: "10.0.31.24:4647")

This line indicates a client node was lost (meaning its allocations are considered lost and will be replaced on healthy nodes):

Feb 06 16:59:35 2018/02/06 16:59:35.133789 [WARN] nomad.heartbeat: node 'f85226c6-69c7-57f3-4b91-9949ee986714' TTL expired
...
Feb 06 17:04:10 2018/02/06 17:04:10.155017 [WARN] nomad.heartbeat: node '53bb70ea-757e-85fa-5e33-5d95b7d253aa' TTL expired
Feb 06 17:04:10 2018/02/06 17:04:10.155051 [WARN] nomad.heartbeat: node 'f534814b-2a73-ae30-7596-9803ae968747' TTL expired

So server3 considered all of your client nodes lost.

All allocations were placed in different AZs, since there was 1 client node in each AZ (by Auto Scaling group rule), and since I'm using the distinct_hosts constraint this leads to only one allocation in every AZ.

Sounds good!

But now Nomad killed all allocations on one client node, and 5 minutes later all allocations on the other 2 client nodes (66% of cluster capacity, thus causing downtime; we haven't scaled it to survive the loss of 2 AZs).

Hm, the lost nodes should not have terminated their running allocations until they reconnected to the servers. Perhaps they were able to reconnect briefly, just long enough to be told to stop allocations, but weren't able to maintain their connection due to network issues? That would mean that when another client node became lost, its allocations may not have found a placement.

Do you have the IPs for each of the servers you posted? That would make it easier to read the logs and understand which server was the leader at each point and what their view of the cluster state was.

Client logs may also be useful, especially if you have DEBUG logging enabled. Without debug logging I'm not sure we'll have enough information to reconstruct a timeline, but it's possible.

My best guess from the logs pasted is that the intermittent network issues created the worst possible conditions for the cluster: nodes would be lost, yet unable to maintain a stable connection to a quorum of servers so that the lost allocations could be rescheduled. Do you have any idea of the scope of this network issue? Did Amazon post an update? Do you have any other services to correlate errors against?

It's definitely possible that Nomad didn't behave optimally, but I'm afraid I can't determine that from the logs presented.

Raising that heartbeat_grace setting may avoid this issue in the future by simply not treating nodes as lost during intermittent network issues. The tradeoff is that if a node really is down, it will take that much longer to migrate its work elsewhere. I think that tradeoff makes sense for your cluster topology.
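
A minimal sketch of what that could look like in the server agent configuration (the value shown is only illustrative, not a recommendation from this thread):

  server {
    enabled          = true
    bootstrap_expect = 3

    # Default is 10s; a larger window tolerates brief network blips,
    # at the cost of slower rescheduling when a node really is down.
    heartbeat_grace = "1m"
  }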

@atillamas
Author

@schmichael

Hi. Thank you for your detailed response.
Since Consul is running on the same nodes, I'm attaching both the Nomad and Consul logs for each of the 3 servers and each of the 3 clients (the corresponding IPs are shown in the log files), so that you can hopefully get a better insight into what actually happened. I sanitized the logs, removing service names and everything from some time before and after the incident.

My best guess from the logs pasted is that the intermittent network issues created the worst possible conditions for the cluster: nodes would be lost, yet unable to maintain a stable connection to a quorum of servers so that the lost allocations could be rescheduled. Do you have any idea of the scope of this network issue? Did Amazon post an update? Do you have any other services to correlate errors against?

No, this is the strange thing: nothing else was affected. Not our testing/staging environments running in the same AZs; containers contacting DBs and each other didn't signal anything. No info about a disturbance on AWS. Only Nomad decided to kill off (almost) everything.

ec2-34-243-167-122.eu-west-1.compute.amazonaws.com-consul-stdout.log
ec2-34-243-167-122.eu-west-1.compute.amazonaws.com-nomad-server.log
ec2-34-244-93-33.eu-west-1.compute.amazonaws.com-consul-stdout.log
ec2-34-244-93-33.eu-west-1.compute.amazonaws.com-nomad-server.log
ec2-34-244-109-178.eu-west-1.compute.amazonaws.com-consul-stdout.log
ec2-34-244-109-178.eu-west-1.compute.amazonaws.com-nomad-client.log
ec2-34-245-6-67.eu-west-1.compute.amazonaws.com-consul-stdout.log
ec2-34-245-6-67.eu-west-1.compute.amazonaws.com-nomad-client.log
ec2-34-253-56-129.eu-west-1.compute.amazonaws.com-consul-stdout.log
ec2-34-253-56-129.eu-west-1.compute.amazonaws.com-nomad-server.log
ec2-54-171-228-65.eu-west-1.compute.amazonaws.com-consul-stdout.log
ec2-54-171-228-65.eu-west-1.compute.amazonaws.com-nomad-client.log

@schmichael
Member

No, this is the strange thing: nothing else was affected. Not our testing/staging environments running in the same AZs; containers contacting DBs and each other didn't signal anything. No info about a disturbance on AWS. Only Nomad decided to kill off (almost) everything.

Oh, that is disturbing. Thanks for taking the time to post detailed logs! We'll try to dig in and see what happened.

@alxark

alxark commented Mar 12, 2018

I'm facing similar problems now. During an outage and loss of leadership I'm getting a reallocation of all services. I'm using 2 DCs in Germany + 1 in France. This is probably a network issue, but I don't think we should get full reallocations.

@github-actions

github-actions bot commented Dec 3, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 3, 2022