
When one of the servers is rebooted, Nomad can decide that some client nodes' TTLs have expired and reallocate their allocations #3035

Closed
tantra35 opened this issue Aug 16, 2017 · 1 comment · Fixed by #3890

Comments

@tantra35
Contributor

Nomad version

Nomad v0.6.0

Issue

If the machine hosting a Nomad server is rebooted, Nomad can wrongly decide that client nodes' TTLs have expired.
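
To make the mechanism concrete, here is a toy Go model of server-side heartbeat tracking. All names are hypothetical; this is not Nomad's actual internals. It only illustrates that each client node must heartbeat again before its server-side TTL timer fires, and that a firing timer is indistinguishable from a dead node even when the client is healthy and the server simply stopped processing heartbeats (e.g. during a reboot or leader election):

```go
package main

import (
	"fmt"
	"sync"
	"time"
)

// heartbeatTracker is a toy model of the server side of node heartbeats:
// every client must call resetTTL again before its timer fires, otherwise
// onExpire runs and the node is treated as down.
type heartbeatTracker struct {
	mu     sync.Mutex
	timers map[string]*time.Timer // node ID -> pending TTL timer
}

func newHeartbeatTracker() *heartbeatTracker {
	return &heartbeatTracker{timers: make(map[string]*time.Timer)}
}

// resetTTL records a heartbeat from nodeID, pushing its expiry out by ttl.
func (h *heartbeatTracker) resetTTL(nodeID string, ttl time.Duration, onExpire func(string)) {
	h.mu.Lock()
	defer h.mu.Unlock()
	if t, ok := h.timers[nodeID]; ok {
		t.Reset(ttl)
		return
	}
	h.timers[nodeID] = time.AfterFunc(ttl, func() { onExpire(nodeID) })
}

func main() {
	tr := newHeartbeatTracker()
	onExpire := func(id string) {
		fmt.Printf("[WARN] heartbeat: node '%s' TTL expired -> allocs get rescheduled\n", id)
	}

	// The node heartbeats once; then the server stops processing
	// heartbeats (reboot / leader change). The timer fires even though
	// the client itself is perfectly healthy.
	tr.resetTTL("08fba0d2", 100*time.Millisecond, onExpire)
	time.Sleep(200 * time.Millisecond)
}
```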

For example, after rebooting one server node (it was the leader at the time), we saw the following on the current leader:

Aug 16 16:23:53 vol-cl-control-03 nomad[8553]:     2017/08/16 16:23:53.197919 [WARN] nomad.heartbeat: node '08fba0d2-1ee9-a43a-5ade-76d99fc76028' TTL expired
Aug 16 16:23:53 vol-cl-control-03 nomad[8553]:     2017/08/16 16:23:53 [INFO] memberlist: Suspect vol-cl-control-02.global has failed, no acks received
Aug 16 16:23:53 vol-cl-control-03 nomad[8553]:     2017/08/16 16:23:53 [DEBUG] raft: Failed to contact 172.16.9.89:4647 in 37.430211249s
Aug 16 16:23:53 vol-cl-control-03 nomad[8553]:     2017/08/16 16:23:53.335928 [DEBUG] worker: dequeued evaluation 299e62e8-8fde-67f5-f09b-3e149507e3b3
Aug 16 16:23:53 vol-cl-control-03 nomad[8553]:     2017/08/16 16:23:53.336096 [DEBUG] worker: dequeued evaluation 05d16754-2cc1-e0c0-bf74-0d2783b5c328
Aug 16 16:23:53 vol-cl-control-03 nomad[8553]:     2017/08/16 16:23:53.336137 [DEBUG] sched: <Eval '299e62e8-8fde-67f5-f09b-3e149507e3b3' JobID: 'haproxy'>: Total changes: (place 1) (destr
Aug 16 16:23:53 vol-cl-control-03 nomad[8553]: Desired Changes for "haproxy": (place 1) (inplace 0) (destructive 0) (stop 1) (migrate 0) (ignore 1) (canary 0)
Aug 16 16:23:53 vol-cl-control-03 nomad[8553]:     2017/08/16 16:23:53.336224 [DEBUG] sched: <Eval '05d16754-2cc1-e0c0-bf74-0d2783b5c328' JobID: 'townshipDynamoTeamServer'>: Total changes:
Aug 16 16:23:53 vol-cl-control-03 nomad[8553]: Desired Changes for "townshipDynamoTeamServer": (place 1) (inplace 0) (destructive 0) (stop 1) (migrate 0) (ignore 0) (canary 0)
Aug 16 16:23:53 vol-cl-control-03 nomad[8553]:     2017/08/16 16:23:53.416086 [DEBUG] worker: submitted plan for evaluation 299e62e8-8fde-67f5-f09b-3e149507e3b3
Aug 16 16:23:53 vol-cl-control-03 nomad[8553]:     2017/08/16 16:23:53.416139 [DEBUG] sched: <Eval '299e62e8-8fde-67f5-f09b-3e149507e3b3' JobID: 'haproxy'>: setting status to complete
Aug 16 16:23:53 vol-cl-control-03 nomad[8553]:     2017/08/16 16:23:53.499625 [DEBUG] worker: updated evaluation <Eval '299e62e8-8fde-67f5-f09b-3e149507e3b3' JobID: 'haproxy'>
Aug 16 16:23:53 vol-cl-control-03 nomad[8553]:     2017/08/16 16:23:53.499653 [DEBUG] worker: ack for evaluation 299e62e8-8fde-67f5-f09b-3e149507e3b3
Aug 16 16:23:53 vol-cl-control-03 nomad[8553]:     2017/08/16 16:23:53.499673 [DEBUG] worker: submitted plan for evaluation 05d16754-2cc1-e0c0-bf74-0d2783b5c328
Aug 16 16:23:53 vol-cl-control-03 nomad[8553]:     2017/08/16 16:23:53.499687 [DEBUG] sched: <Eval '05d16754-2cc1-e0c0-bf74-0d2783b5c328' JobID: 'townshipDynamoTeamServer'>: setting status
Aug 16 16:23:53 vol-cl-control-03 nomad[8553]:     2017/08/16 16:23:53.565375 [DEBUG] worker: updated evaluation <Eval '05d16754-2cc1-e0c0-bf74-0d2783b5c328' JobID: 'townshipDynamoTeamServ
Aug 16 16:23:53 vol-cl-control-03 nomad[8553]:     2017/08/16 16:23:53.565400 [DEBUG] worker: ack for evaluation 05d16754-2cc1-e0c0-bf74-0d2783b5c328
Aug 16 16:23:53 vol-cl-control-03 nomad[8553]:     2017/08/16 16:23:53 [DEBUG] raft: Failed to contact 172.16.9.89:4647 in 37.878762738s

As you can see, Nomad logs Aug 16 16:23:53 vol-cl-control-03 nomad[8553]: 2017/08/16 16:23:53.197919 [WARN] nomad.heartbeat: node '08fba0d2-1ee9-a43a-5ade-76d99fc76028' TTL expired, but this decision is wrong: the client node with ID 08fba0d2-1ee9-a43a-5ade-76d99fc76028 was healthy and working fine the whole time. As a result of this wrong decision, Nomad began reallocating the allocations placed on that node, which is entirely unnecessary.
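
For what it's worth, a plausible mitigation (and, I would assume, roughly the direction the eventual fix in #3890 took, though I have not verified its exact mechanism) is for a newly elected leader to reinitialize every known node's heartbeat timer with a generous failover TTL instead of honoring deadlines inherited from the previous leader, so healthy clients get time to re-establish heartbeats before anything is marked down. Extending the toy heartbeatTracker sketch above (again, hypothetical names, not Nomad's code):

```go
// establishLeadership would run when this server wins an election. It
// discards any inherited notion of node deadlines and arms a fresh,
// longer failover TTL per node, giving clients that were heartbeating
// the old leader a grace window to re-heartbeat here before being
// marked down.
func (h *heartbeatTracker) establishLeadership(nodeIDs []string, failoverTTL time.Duration, onExpire func(string)) {
	h.mu.Lock()
	defer h.mu.Unlock()
	for _, id := range nodeIDs {
		id := id // capture per-iteration value for the closure
		if t, ok := h.timers[id]; ok {
			t.Stop()
		}
		h.timers[id] = time.AfterFunc(failoverTTL, func() { onExpire(id) })
	}
}
```

Separately, Nomad's server stanza exposes a heartbeat_grace setting that pads the heartbeat TTL to absorb network and processing delays, though padding alone would not cover a multi-second leader outage like the one logged above.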

github-actions bot commented Dec 3, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

github-actions bot locked as resolved and limited conversation to collaborators Dec 3, 2022