Periodic jobs "non-tracked" after server restart #2829

Closed
carlpett opened this issue Jul 13, 2017 · 1 comment · Fixed by #2959


@carlpett

Nomad version

Nomad v0.5.6

Operating system and Environment details

CentOS 7, 3 server nodes

Issue

About two days ago, one of the server nodes in our cluster panicked and exited, and was subsequently restarted. Since then, some of our periodic jobs have not been running. The server logs contain many lines like these:

[ERR] nomad.periodic: force run of periodic job "consul-snapshot" failed: can't force run non-tracked job consul-snapshot
[ERR] nomad: failed to establish leadership: force run of periodic job "consul-snapshot" failed: can't force run non-tracked job consul-snapshot

As well as these:

[ERR] nomad.periodic: failed to dispatch job "logstash-curator": timed out enqueuing operation
[ERR] nomad.client: alloc update failed: timed out enqueuing operation
(Roughly one of these appears for every 100 of the lines above.)

What are my options here? Should I just remove the jobs and reschedule them? They had been working for at least several months until two days ago.

Server logs

These are the last few log lines from the node that crashed. I'm not sure whether they are related:

2017/07/11 14:42:12.495355 [ERR] nomad: failed to establish leadership: force run of periodic job "consul-snapshot" failed: can't force run non-tracked job consul-snapshot
2017/07/11 14:42:46.146890 [INFO] fingerprint.consul: consul agent is unavailable
2017/07/11 14:42:46 [WARN] raft: Failed to contact quorum of nodes, stepping down
2017/07/11 14:42:46 [INFO] raft: Node at 192.168.123.154:4647 [Follower] entering Follower state (Leader: "")
2017/07/11 14:42:46.210437 [ERR] nomad.client: Register failed: node is not the leader
2017/07/11 14:42:46.210478 [ERR] client: registration failure: node is not the leader
2017/07/11 14:42:46.210421 [INFO] nomad: cluster leadership lost
2017/07/11 14:42:46 [INFO] raft: aborting pipeline replication to peer {Voter 192.168.123.118:4647 192.168.123.118:4647}
2017/07/11 14:42:46 [INFO] raft: aborting pipeline replication to peer {Voter 192.168.123.116:4647 192.168.123.116:4647}
2017/07/11 14:42:46.213334 [ERR] worker: failed to dequeue evaluation: eval broker disabled
panic: close of closed channel
goroutine 176795521 [running]:
github.com/hashicorp/nomad/nomad.(*PeriodicDispatch).run(0xc42039d3e0)
/opt/gopath/src/github.com/hashicorp/nomad/nomad/periodic.go:325 +0x221
created by github.com/hashicorp/nomad/nomad.(*PeriodicDispatch).Start
/opt/gopath/src/github.com/hashicorp/nomad/nomad/periodic.go:171 +0x71

The job consul-snapshot is a periodic parameterized job. I'm guessing one of the parameterized versions has been crashing for longer than that, since we do not seem to have any snapshots from that Consul cluster.
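
For context on the panic in the trace above: in Go, calling close on a channel that is already closed panics with exactly the message "close of closed channel". The sketch below is not Nomad's code. The names doubleClose and safeStopper are hypothetical and only illustrate the failure mode and one common guard (sync.Once), assuming a stop path that can be reached more than once, for example across repeated leadership changes. Whether the eventual fix takes this approach is not something this issue states.

```go
package main

import (
	"fmt"
	"sync"
)

// doubleClose closes the same channel twice and reports whether the
// second close panicked. It reproduces the failure mode only.
func doubleClose() (panicked bool) {
	defer func() {
		if r := recover(); r != nil {
			panicked = true
		}
	}()
	ch := make(chan struct{})
	close(ch)
	close(ch) // panics: close of closed channel
	return false
}

// safeStopper is a hypothetical type (not Nomad's PeriodicDispatch)
// whose Stop method may be called any number of times.
type safeStopper struct {
	stopCh chan struct{}
	once   sync.Once
}

func newSafeStopper() *safeStopper {
	return &safeStopper{stopCh: make(chan struct{})}
}

// Stop closes stopCh at most once, so repeated calls cannot panic.
func (s *safeStopper) Stop() {
	s.once.Do(func() { close(s.stopCh) })
}

// Stopped reports whether Stop has been called, without blocking.
func (s *safeStopper) Stopped() bool {
	select {
	case <-s.stopCh:
		return true
	default:
		return false
	}
}

func main() {
	fmt.Println("double close panicked:", doubleClose())

	s := newSafeStopper()
	s.Stop()
	s.Stop() // second call is a no-op
	fmt.Println("stopped:", s.Stopped())
}
```

An unrecovered panic in any goroutine terminates the whole process, which is consistent with the server exiting right after the trace.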

@github-actions

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Dec 10, 2022