
[0.7] all evaluations blocked due to "failed to dequeue evaluation: eval broker disabled" #3333

Closed
jippi opened this issue Oct 6, 2017 · 11 comments

Comments


jippi commented Oct 6, 2017

Nomad version

Nomad v0.7.0-beta1

Operating system and Environment details

Distributor ID:	Ubuntu
Description:	Ubuntu 14.04.5 LTS
Release:	14.04
Codename:	trusty

Issue

All evaluations in my cluster were stuck in the pending state.

Tailing the Nomad server output, I saw many errors like 2017/10/06 08:37:25.583510 [ERR] worker: failed to dequeue evaluation: eval broker disabled.

Restarting the server (which was the leader) caused the whole backlog of evaluations to be processed correctly right away, once a new leader was elected.

Reproduction steps

N/A - not sure how it got into this state

Nomad Server logs (if appropriate)

N/A - let me know if there is anything you need, the instance is running in debug mode

Nomad Client logs (if appropriate)

N/A - let me know if there is anything you need, the instance is running in debug mode


dadgar commented Oct 6, 2017

@jippi Did you have a leader? That error is a sign that there was no cluster leader.


jippi commented Oct 6, 2017

@dadgar the cluster showed an active leader; the log in the OP is from that leader.

Querying raft at the time also pointed to this node as the active leader, and state was otherwise being updated when I submitted evaluations and jobs, both with and without the "AllowStale" query option.
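For anyone else debugging this, the difference between the stale and non-stale checks above matters: a stale read can be answered from any server's local state, while a consistent read needs a working leader. A minimal sketch of that check in Go, using the Nomad HTTP API's `/v1/status/leader` endpoint and its `stale` query parameter (the stub server and function names here are invented for illustration, not part of Nomad):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// newStubAgent stands in for a real Nomad server so the sketch is
// self-contained; it answers /v1/status/leader with a fixed address.
func newStubAgent(leader string) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "%q", leader)
	}))
}

// leaderAddr asks a Nomad agent who the current leader is. With
// stale=true any server may answer from its local state; without it,
// the request requires a functioning leader to succeed.
func leaderAddr(agent string, stale bool) (string, error) {
	url := agent + "/v1/status/leader"
	if stale {
		url += "?stale="
	}
	resp, err := http.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	stub := newStubAgent("10.30.142.211:4647")
	defer stub.Close()

	leader, err := leaderAddr(stub.URL, true)
	fmt.Println(leader, err)
}
```

Comparing the answers from each server, with and without `stale`, can reveal a cluster where followers still believe in a leader that is no longer functioning.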


dadgar commented Oct 12, 2017

@jippi Do you have server logs from this period? Had this happened before the beta? Nothing has changed that would affect that system for a while now!


jippi commented Oct 13, 2017

@dadgar it's the same cluster I shared logs for in another issue on Gitter, so the logs should be in there. The error happened about an hour before I opened this ticket.


dadgar commented Oct 30, 2017

@jippi Hey, I had a chance to read through your logs. As suspected, there are leadership transitions causing the evaluation broker to be disabled. The leader is the only server handing out work to the other schedulers, so when there is no leader, the eval broker is disabled and you get those logs.

    2017/10/05 17:08:38 [DEBUG] raft: Failed to contact 10.30.11.196:4647 in 2.34604883s
    2017/10/05 17:08:39 [DEBUG] raft: Failed to contact 10.30.11.196:4647 in 2.794493545s
    2017/10/05 17:08:39 [DEBUG] raft-net: 10.30.142.211:4647 accepted connection from: 10.30.11.196:44442
    2017/10/05 17:08:39 [WARN] raft: Rejecting vote request from 10.30.11.196:4647 since we have a leader: 10.30.142.211:4647
    2017/10/05 17:08:39 [DEBUG] raft: Failed to contact 10.30.11.196:4647 in 3.273116054s
    2017/10/05 17:08:39 [ERR] raft: peer {Voter 10.30.11.196:4647 10.30.11.196:4647} has newer term, stopping replication
    2017/10/05 17:08:39 [INFO] raft: Node at 10.30.142.211:4647 [Follower] entering Follower state (Leader: "")
    2017/10/05 17:08:39.524351 [INFO] nomad: cluster leadership lost
    2017/10/05 17:08:39 [INFO] raft: aborting pipeline replication to peer {Voter 10.30.95.94:4647 10.30.95.94:4647}
    2017/10/05 17:08:39.524764 [ERR] worker: failed to dequeue evaluation: eval broker disabled
    2017/10/05 17:08:39 [DEBUG] memberlist: TCP connection from=10.30.11.196:36238

I am not sure if you were having an unusual network event, but the timing between the servers is excessively large. Latency between servers should be around 10ms.

@dadgar dadgar closed this as completed Oct 30, 2017

jippi commented Oct 31, 2017

@dadgar we run each Nomad server in a separate AZ; other than that, there isn't anything special about the setup. We've only seen this once in the last 8 months, so it might have been a network fluke somewhere. Thank you!

josegonzalez commented

We just saw this happen two days in a row. The symptom is stalled deployments without anything appearing to be blocked. To fix this, we've been:

  • killing existing deployments
  • restarting the nomad leader

It's not clear whether restarting the leader without killing deployments would be enough on its own; we've been killing them as a safety precaution. The next time this appears, we'll try just restarting the leader.

Is there anything we can do to further diagnose this issue? /cc @jippi in case there was something in the logs you saw that stuck out.


dadgar commented Dec 7, 2017

@josegonzalez I would suggest you look at the logs to see what the latency is between servers and whether you have a leader. This error is only emitted when there is no leader.

josegonzalez commented

@dadgar the weird part is that there was an election at this time and then things hit the fan:

    2017/12/07 10:46:26.293963 [ERR] worker: failed to dequeue evaluation: rpc error: eval broker disabled
    2017/12/07 10:46:26.294074 [ERR] worker: failed to dequeue evaluation: rpc error: eval broker disabled
    2017/12/07 10:46:26.294151 [ERR] worker: failed to dequeue evaluation: rpc error: eval broker disabled
    2017/12/07 10:46:26.294249 [ERR] worker: failed to dequeue evaluation: rpc error: eval broker disabled
    2017/12/07 10:46:26.468349 [DEBUG] http: Request /v1/agent/health?type=client (137.632µs)
    2017/12/07 10:46:27 [DEBUG] raft-net: 10.0.1.149:4647 accepted connection from: 10.0.75.59:35894
    2017/12/07 10:46:27.695653 [DEBUG] http: Request /v1/agent/health?type=server (1.184760745s)
    2017/12/07 10:46:29 [DEBUG] memberlist: TCP connection from=127.0.0.1:33378
    2017/12/07 10:46:29 [WARN] raft: Rejecting vote request from 10.0.129.25:4647 since we have a leader: 10.0.75.59:4647
    2017/12/07 10:46:30.004126 [DEBUG] http: Request /v1/metrics?format=prometheus (2.158589ms)
    2017/12/07 10:46:30.113547 [DEBUG] http: Request /v1/nodes?index=968045&stale= (5m0.755567381s)
    2017/12/07 10:46:30 [WARN] raft: Heartbeat timeout from "" reached, starting election
    2017/12/07 10:46:30 [INFO] raft: Node at 10.0.1.149:4647 [Candidate] entering Candidate state in term 216
    2017/12/07 10:46:30 [DEBUG] raft: Votes needed: 3
    2017/12/07 10:46:30 [DEBUG] raft: Vote granted from 10.0.1.149:4647 in term 216. Tally: 1
    2017/12/07 10:46:30 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.129.25:4647 10.0.129.25:4647}: EOF
    2017/12/07 10:46:30 [DEBUG] raft: Vote granted from 10.0.75.59:4647 in term 216. Tally: 2
    2017/12/07 10:46:30 [INFO] raft: Duplicate RequestVote for same term: 216
    2017/12/07 10:46:31 [WARN] raft: Election timeout reached, restarting election
    2017/12/07 10:46:31 [INFO] raft: Node at 10.0.1.149:4647 [Candidate] entering Candidate state in term 217
    2017/12/07 10:46:31 [DEBUG] raft: Votes needed: 3
    2017/12/07 10:46:31 [DEBUG] raft: Vote granted from 10.0.1.149:4647 in term 217. Tally: 1
    2017/12/07 10:46:31 [DEBUG] raft: Vote granted from 10.0.75.59:4647 in term 217. Tally: 2
    2017/12/07 10:46:31 [DEBUG] raft: Vote granted from 10.0.129.25:4647 in term 217. Tally: 3
    2017/12/07 10:46:31 [INFO] raft: Election won. Tally: 3
    2017/12/07 10:46:31 [INFO] raft: Node at 10.0.1.149:4647 [Leader] entering Leader state
    2017/12/07 10:46:31 [INFO] raft: Added peer 10.0.145.169:4647, starting replication
    2017/12/07 10:46:31 [INFO] raft: Added peer 10.0.75.59:4647, starting replication
    2017/12/07 10:46:31 [INFO] raft: Added peer 10.0.129.25:4647, starting replication
    2017/12/07 10:46:31.886931 [INFO] nomad: cluster leadership acquired
    2017/12/07 10:46:31 [INFO] raft: pipelining replication to peer {Voter 10.0.75.59:4647 10.0.75.59:4647}
    2017/12/07 10:46:31 [WARN] raft: AppendEntries to {Voter 10.0.129.25:4647 10.0.129.25:4647} rejected, sending older logs (next: 1003300)
    2017/12/07 10:46:31 [INFO] raft: pipelining replication to peer {Voter 10.0.129.25:4647 10.0.129.25:4647}
    2017/12/07 10:46:31.891692 [ERR] worker: failed to dequeue evaluation: rpc error: rpc error: eval broker disabled
    2017/12/07 10:46:32 [WARN] raft: Failed to contact 10.0.145.169:4647 in 503.033546ms

That's from the node that became the new leader. I ended up restarting that server, and my subsequent deployments succeeded.

preetapan commented

@josegonzalez We have a fix for the Nomad leader loop in PR #3402 that will be in the 0.7.1 release. It fixes a bug in the leader loop that could cause the leader to be stuck until restarted. From your logs, it appears you may have run into this. We can keep watching this ticket for a while; if you are able to upgrade once 0.7.1 is released, we should see this stop happening.


github-actions bot commented Dec 5, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 5, 2022