
[0.7] all evaluations blocked due to "failed to dequeue evaluation: eval broker disabled" #3333

Closed
jippi opened this issue Oct 6, 2017 · 11 comments

Comments


jippi commented Oct 6, 2017

Nomad version

Nomad v0.7.0-beta1

Operating system and Environment details

Distributor ID:	Ubuntu
Description:	Ubuntu 14.04.5 LTS
Release:	14.04
Codename:	trusty

Issue

All evaluations in my cluster were stuck in the pending state.

Tailing the Nomad server output, I saw many errors like 2017/10/06 08:37:25.583510 [ERR] worker: failed to dequeue evaluation: eval broker disabled.

Restarting the server (which was the leader) caused the whole backlog of evaluations to be processed correctly right away, once a new leader was elected.

Reproduction steps

N/A - not sure how it got into this state

Nomad Server logs (if appropriate)

N/A - let me know if there is anything you need, the instance is running in debug mode

Nomad Client logs (if appropriate)

N/A - let me know if there is anything you need, the instance is running in debug mode


dadgar commented Oct 6, 2017

@jippi Did you have a leader? That error is a sign that there was no cluster leader.


jippi commented Oct 6, 2017

@dadgar the cluster showed an active leader; the log in the OP is from that leader.

Querying raft at the time also pointed to this node as the active leader, and state was otherwise being updated when I submitted evaluations and jobs, both with and without the "AllowStale" query option.
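For anyone else debugging this, the difference between the stale and non-stale checks above matters: a stale read can be answered from any server's local state, while a consistent read needs a working leader. A minimal sketch of that check in Go, using the Nomad HTTP API's `/v1/status/leader` endpoint and its `stale` query parameter (the stub server and function names here are invented for illustration, not part of Nomad):

```go
package main

import (
	"fmt"
	"io"
	"net/http"
	"net/http/httptest"
)

// newStubAgent stands in for a real Nomad server so the sketch is
// self-contained; it answers /v1/status/leader with a fixed address.
func newStubAgent(leader string) *httptest.Server {
	return httptest.NewServer(http.HandlerFunc(func(w http.ResponseWriter, r *http.Request) {
		fmt.Fprintf(w, "%q", leader)
	}))
}

// leaderAddr asks a Nomad agent who the current leader is. With
// stale=true any server may answer from its local state; without it,
// the request requires a functioning leader to succeed.
func leaderAddr(agent string, stale bool) (string, error) {
	url := agent + "/v1/status/leader"
	if stale {
		url += "?stale="
	}
	resp, err := http.Get(url)
	if err != nil {
		return "", err
	}
	defer resp.Body.Close()
	body, err := io.ReadAll(resp.Body)
	return string(body), err
}

func main() {
	stub := newStubAgent("10.30.142.211:4647")
	defer stub.Close()

	leader, err := leaderAddr(stub.URL, true)
	fmt.Println(leader, err)
}
```

Comparing the answers from each server, with and without `stale`, can reveal a cluster where followers still believe in a leader that is no longer functioning.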


dadgar commented Oct 12, 2017

@jippi Do you have server logs from this period? Had this happened before the beta? Nothing has changed that would affect that system for a while now!


jippi commented Oct 13, 2017

@dadgar it's the same cluster I shared logs for in another issue on Gitter, so the logs should be in there. The error happened about an hour before I opened this ticket.


dadgar commented Oct 30, 2017

@jippi Hey, I had a chance to read through your logs. As suspected, there are leadership transitions causing the evaluation broker to be disabled. The leader is the only server handing out work to the other schedulers, so when there is no leader, the eval broker is disabled and you get those logs.

    2017/10/05 17:08:38 [DEBUG] raft: Failed to contact 10.30.11.196:4647 in 2.34604883s
    2017/10/05 17:08:39 [DEBUG] raft: Failed to contact 10.30.11.196:4647 in 2.794493545s
    2017/10/05 17:08:39 [DEBUG] raft-net: 10.30.142.211:4647 accepted connection from: 10.30.11.196:44442
    2017/10/05 17:08:39 [WARN] raft: Rejecting vote request from 10.30.11.196:4647 since we have a leader: 10.30.142.211:4647
    2017/10/05 17:08:39 [DEBUG] raft: Failed to contact 10.30.11.196:4647 in 3.273116054s
    2017/10/05 17:08:39 [ERR] raft: peer {Voter 10.30.11.196:4647 10.30.11.196:4647} has newer term, stopping replication
    2017/10/05 17:08:39 [INFO] raft: Node at 10.30.142.211:4647 [Follower] entering Follower state (Leader: "")
    2017/10/05 17:08:39.524351 [INFO] nomad: cluster leadership lost
    2017/10/05 17:08:39 [INFO] raft: aborting pipeline replication to peer {Voter 10.30.95.94:4647 10.30.95.94:4647}
    2017/10/05 17:08:39.524764 [ERR] worker: failed to dequeue evaluation: eval broker disabled
    2017/10/05 17:08:39 [DEBUG] memberlist: TCP connection from=10.30.11.196:36238

I am not sure if you were having an unusual network event, but the timing between the servers is excessively large. Latency between servers should be around 10ms.

@dadgar dadgar closed this as completed Oct 30, 2017

jippi commented Oct 31, 2017

@dadgar we run each Nomad server in a separate AZ; other than that, there isn't anything special about the setup. We've only seen this once in the last 8 months, so it might have been a network fluke somewhere. Thank you!

josegonzalez commented

We just saw this happen two days in a row. The symptom is stalled deployments without anything appearing to be blocked. To fix this, we've been:

  • killing existing deployments
  • restarting the nomad leader

It's not clear whether restarting the leader without killing deployments would be enough on its own; we've been killing them as a safety precaution. The next time this appears, we'll try just restarting the leader.

Is there anything we can do to further diagnose this issue? /cc @jippi in case there was something in the logs you saw that stuck out.


dadgar commented Dec 7, 2017

@josegonzalez I would suggest you look at the logs to see what the latency is between servers and whether you have a leader. This error is only emitted when there is no leader.

josegonzalez commented

@dadgar the weird part is that there was an election at this time and then things hit the fan:

    2017/12/07 10:46:26.293963 [ERR] worker: failed to dequeue evaluation: rpc error: eval broker disabled
    2017/12/07 10:46:26.294074 [ERR] worker: failed to dequeue evaluation: rpc error: eval broker disabled
    2017/12/07 10:46:26.294151 [ERR] worker: failed to dequeue evaluation: rpc error: eval broker disabled
    2017/12/07 10:46:26.294249 [ERR] worker: failed to dequeue evaluation: rpc error: eval broker disabled
    2017/12/07 10:46:26.468349 [DEBUG] http: Request /v1/agent/health?type=client (137.632µs)
    2017/12/07 10:46:27 [DEBUG] raft-net: 10.0.1.149:4647 accepted connection from: 10.0.75.59:35894
    2017/12/07 10:46:27.695653 [DEBUG] http: Request /v1/agent/health?type=server (1.184760745s)
    2017/12/07 10:46:29 [DEBUG] memberlist: TCP connection from=127.0.0.1:33378
    2017/12/07 10:46:29 [WARN] raft: Rejecting vote request from 10.0.129.25:4647 since we have a leader: 10.0.75.59:4647
    2017/12/07 10:46:30.004126 [DEBUG] http: Request /v1/metrics?format=prometheus (2.158589ms)
    2017/12/07 10:46:30.113547 [DEBUG] http: Request /v1/nodes?index=968045&stale= (5m0.755567381s)
    2017/12/07 10:46:30 [WARN] raft: Heartbeat timeout from "" reached, starting election
    2017/12/07 10:46:30 [INFO] raft: Node at 10.0.1.149:4647 [Candidate] entering Candidate state in term 216
    2017/12/07 10:46:30 [DEBUG] raft: Votes needed: 3
    2017/12/07 10:46:30 [DEBUG] raft: Vote granted from 10.0.1.149:4647 in term 216. Tally: 1
    2017/12/07 10:46:30 [ERR] raft: Failed to make RequestVote RPC to {Voter 10.0.129.25:4647 10.0.129.25:4647}: EOF
    2017/12/07 10:46:30 [DEBUG] raft: Vote granted from 10.0.75.59:4647 in term 216. Tally: 2
    2017/12/07 10:46:30 [INFO] raft: Duplicate RequestVote for same term: 216
    2017/12/07 10:46:31 [WARN] raft: Election timeout reached, restarting election
    2017/12/07 10:46:31 [INFO] raft: Node at 10.0.1.149:4647 [Candidate] entering Candidate state in term 217
    2017/12/07 10:46:31 [DEBUG] raft: Votes needed: 3
    2017/12/07 10:46:31 [DEBUG] raft: Vote granted from 10.0.1.149:4647 in term 217. Tally: 1
    2017/12/07 10:46:31 [DEBUG] raft: Vote granted from 10.0.75.59:4647 in term 217. Tally: 2
    2017/12/07 10:46:31 [DEBUG] raft: Vote granted from 10.0.129.25:4647 in term 217. Tally: 3
    2017/12/07 10:46:31 [INFO] raft: Election won. Tally: 3
    2017/12/07 10:46:31 [INFO] raft: Node at 10.0.1.149:4647 [Leader] entering Leader state
    2017/12/07 10:46:31 [INFO] raft: Added peer 10.0.145.169:4647, starting replication
    2017/12/07 10:46:31 [INFO] raft: Added peer 10.0.75.59:4647, starting replication
    2017/12/07 10:46:31 [INFO] raft: Added peer 10.0.129.25:4647, starting replication
    2017/12/07 10:46:31.886931 [INFO] nomad: cluster leadership acquired
    2017/12/07 10:46:31 [INFO] raft: pipelining replication to peer {Voter 10.0.75.59:4647 10.0.75.59:4647}
    2017/12/07 10:46:31 [WARN] raft: AppendEntries to {Voter 10.0.129.25:4647 10.0.129.25:4647} rejected, sending older logs (next: 1003300)
    2017/12/07 10:46:31 [INFO] raft: pipelining replication to peer {Voter 10.0.129.25:4647 10.0.129.25:4647}
    2017/12/07 10:46:31.891692 [ERR] worker: failed to dequeue evaluation: rpc error: rpc error: eval broker disabled
    2017/12/07 10:46:32 [WARN] raft: Failed to contact 10.0.145.169:4647 in 503.033546ms

That's from the node that became the new leader. I ended up restarting that server, and my subsequent deployments succeeded.

preetapan commented

@josegonzalez We have a fix for the Nomad leader loop in PR #3402 that will be in the 0.7.1 release. It fixes a bug in the leader loop that could cause the leader to be stuck until restarted. From your logs, it appears you may have run into this. We can keep watching this ticket for a while; if you are able to upgrade once 0.7.1 is released, we should see this stop happening.


github-actions bot commented Dec 5, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 5, 2022