
Allocation stopped immediately after start #14850

Closed
valodzka opened this issue Oct 7, 2022 · 8 comments
Assignees
Labels
stage/accepted Confirmed, and intend to work on. No timeline committment though. theme/core theme/scheduling type/bug
Milestone

Comments

@valodzka
Contributor

valodzka commented Oct 7, 2022

Nomad version

1.3.6

Issue

Sometimes during a deployment I see that a freshly started allocation is immediately stopped by Nomad. During the same deployment, multiple allocations with an identical alloc index (name) are created. For example, all of these allocs share the same Eval ID and Job Version (full alloc status here: https://gist.github.com/valodzka/231f3942203a5f39528b52241905a06e):

Alloc 1:

ID                  = d1421efd-6b25-295a-4832-348c09955a8f
Name                = bk.client[35]
Client Status       = complete
Desired Description = alloc not needed due to job update
....

Recent Events:
Time                  Type      Description
2022-10-07T15:09:45Z  Killing   Sent interrupt. Waiting 2m0s before force killing
2022-10-07T15:09:45Z  Received  Task received by client

Alloc 2:

ID                  = f278d901-b4d0-8b7b-8f20-4f1c46a36a10
Name                = bk.client[24]
Client Status       = running

Recent Events:
Time                  Type        Description
2022-10-07T15:10:00Z  Started     Task started by client
2022-10-07T15:09:57Z  Task Setup  Building Task Directory
2022-10-07T15:09:45Z  Received    Task received by client

Placement Metrics
  * Constraint "${meta.only_low_cpu_tasks} = false": 1 nodes excluded by filter
  * Resources exhausted on 31 nodes
  * Dimension "network: port collision" exhausted on 1 nodes
  * Dimension "cpu" exhausted on 13 nodes
  * Dimension "memory" exhausted on 17 nodes

Alloc 3:

ID                  = 246dec93-d65e-9498-1d69-816d727083d7
Name                = bk.client[24]
Client Status       = running

Recent Events:
Time                  Type        Description
2022-10-07T15:09:57Z  Started     Task started by client
2022-10-07T15:09:57Z  Task Setup  Building Task Directory
2022-10-07T15:09:45Z  Received    Task received by client

What is strange here:

  1. Allocs 2 and 3 have an identical alloc index/name (with the same job version and no canaries used)
  2. Alloc 1 was killed the same second it was started
  3. Alloc 1's "Desired Description" is "alloc not needed due to job update", but I'm sure the job definition wasn't updated
  4. Alloc 2 shows Dimension "network: port collision" exhausted on 1 nodes in its placement metrics. Not sure whether it's relevant here, but it might be related.

Issue #12797 might be relevant.

Expected Result

  1. An alloc shouldn't be killed without a reason
  2. No allocs with the same job version should share the same alloc index

Nomad Server/Client logs

I checked the logs for both the client and the server and didn't find anything relevant.

@jrasell
Member

jrasell commented Oct 10, 2022

Hi @valodzka and thanks for raising this issue.

Alloc 2 and 3 have identical alloc index/name

This certainly looks identical to #12797, as you pointed out, so cross-linking from this issue is useful to provide additional context when that item of work gets roadmapped.

alloc not needed due to job update

This message is used during the allocation reconciliation phase, which is triggered when the scheduler receives an updated job specification. It would therefore, on the surface, seem that something has triggered an update, and I wonder if you could share more information about the deployment process you use for Nomad jobs?

I checked the logs for both the client and the server and didn't find anything relevant.

If you have these available from both the leader and the client where the allocation was placed and immediately stopped, it would provide some useful additional context. There may also be some non-obvious messages that give an idea of what is happening.

@jrasell jrasell self-assigned this Oct 10, 2022
@jrasell jrasell added this to Needs Triage in Nomad - Community Issues Triage via automation Oct 10, 2022
@jrasell jrasell moved this from Needs Triage to Triaging in Nomad - Community Issues Triage Oct 10, 2022
@valodzka
Contributor Author

valodzka commented Oct 10, 2022

It would therefore on the surface seem that something has triggered an update and I wonder if you could share more information about the deployment process you use for Nomad jobs?

We use a simple automated system that calls flock <lock-file> nomad job run <job-file> for deployments; flock is used to block concurrent deployments. So I'm pretty sure there weren't any updates. Also, as you can see in the full alloc status, the Job Version remained the same (1654), and if something had modified the job, the version would have been increased.
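
For completeness, the job history command can confirm that no new job version was created around that time; a quick check, assuming the job name is bk as the alloc names suggest:

# Show recent versions of the job and, with -p, the diff between versions.
# If nothing was updated, the newest version listed stays at 1654.
nomad job history -p bk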

If you have these available from both the leader and the client where the allocation was placed and immediately stopped, it would provide some useful additional context.

Leader:

2022-10-07T15:09:44.828Z [INFO]  nomad: plan for node rejected, refer to https://www.nomadproject.io/s/port-plan-failure for more information: node_id=608e656f-e3b9-6ea0-ea99-c4dfc6bc02e6 reason="reserved alloc port collision: collision when reserving port for alloc 2ac62f30-a072-d797-8023-b6bf089f2ac0: [port 30261 already in use]" eval_id=6cdc78f2-8fa1-cc00-193f-45e923472f6a namespace=default

Client:

2022-10-07T15:09:55.478Z [INFO]  client.driver_mgr.docker: stopped container: container_id=9e6d72ce5f5eced40e72a8f8fd88ccd6b8d54007394901680591407fd8089ab8 driver=docker
2022-10-07T15:09:55.814Z [INFO]  agent: (runner) stopping
2022-10-07T15:09:55.814Z [INFO]  client.gc: marking allocation for GC: alloc_id=75f2d3c4-946e-3977-83d3-d074a12102b4
2022-10-07T15:09:55.816Z [INFO]  agent: (runner) received finish
2022-10-07T15:09:55.906Z [INFO]  client.gc: marking allocation for GC: alloc_id=d1421efd-6b25-295a-4832-348c09955a8f
2022-10-07T15:10:03.655Z [INFO]  client.driver_mgr.docker: stopped container: container_id=7ff101994a24366c00b9184b6555426efb33c56f0186c22152dda0fb9c754df1 driver=docker
2022-10-07T15:10:03.684Z [INFO]  agent: (runner) stopping
2022-10-07T15:10:03.684Z [INFO]  client.gc: marking allocation for GC: alloc_id=6256dfb2-a253-d644-70fe-45cfb4867171
2022-10-07T15:10:03.684Z [INFO]  agent: (runner) received finish
2022-10-07T15:10:03.697Z [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=5175e773-5813-7577-1031-25154e11b043 task=java @module=logmon path=/opt/nomad/data/alloc/5175e773-5813-7577-1031-25154e11>
2022-10-07T15:10:03.698Z [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=5175e773-5813-7577-1031-25154e11b043 task=java @module=logmon path=/opt/nomad/data/alloc/5175e773-5813-7577-1031-25154e11>
2022-10-07T15:10:03.711Z [INFO]  agent: (runner) creating new runner (dry: false, once: false)
2022-10-07T15:10:03.711Z [INFO]  agent: (runner) creating watcher
2022-10-07T15:10:03.712Z [INFO]  agent: (runner) starting
2022-10-07T15:10:03.893Z [INFO]  client.driver_mgr.docker: created container: driver=docker container_id=84127937ca67958655da48a14e8c8204d4ab843a22c33d46e6e04617732cb5f2
2022-10-07T15:10:04.015Z [INFO]  client.driver_mgr.docker: started container: driver=docker container_id=84127937ca67958655da48a14e8c8204d4ab843a22c33d46e6e04617732cb5f2

@valodzka
Contributor Author

The issue appeared again. Unfortunately, the allocations were garbage collected before I could inspect them, but in the leader log I see the same message (and it's the first time it has appeared in the log since 2022-10-07T15:09:44.828Z), so it might be related:

2022-10-10T18:39:31.503Z [INFO]  nomad: plan for node rejected, refer to https://www.nomadproject.io/s/port-plan-failure for more information: node_id=e725aa89-2bad-bf30-9b7a-3c704722ff40 reason="reserved alloc port collision: collision when reserving port for alloc 8a8c641a-e7b8-dc95-ba80-d3bb5be517f5: [port 26558 already in use]" eval_id=5055fde0-6985-da02-e7db-bb10ac57e819 namespace=default

valodzka pushed a commit to valodzka/nomad-vagrant-lab that referenced this issue Oct 17, 2022
@valodzka
Contributor Author

Hi @jrasell, I tried to create a reproducer for this issue with Vagrant, and it turns out it's relatively easy to reproduce. Steps:

git clone -b issue_14850 git@github.com:valodzka/nomad-vagrant-lab.git
cd nomad-vagrant-lab
vagrant up
vagrant ssh nomad-a-1    
bash /vagrant/launch-a-1.sh
# repeat the ssh/launch steps for nomad-a-2 and nomad-a-3
bash -x reproduce_14850.sh
# wait a few dozen iterations

nomad-vagrant-lab is a basic Nomad/Consul cluster template without any tweaks. reproduce_14850.sh runs deployments of a simple job with 50 allocations, each of them using 11 dynamic ports. I tried twice, and on my machine (an old MacBook Pro) it took around 20-30 deployments to trigger the issue.
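
The loop has roughly this shape (the actual reproduce_14850.sh in the repo is authoritative; the job file path and the variable bump below are illustrative assumptions):

# Repeatedly re-deploy the 50-allocation job (11 dynamic ports per alloc),
# changing a dummy HCL2 variable each run so every run creates a new job
# version and forces the scheduler to replace the allocations.
for i in $(seq 1 30); do
  NOMAD_VAR_iteration="$i" nomad job run /vagrant/test.nomad
  sleep 90   # give the deployment time to settle before the next run
done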

@jrasell
Member

jrasell commented Oct 17, 2022

Hi @valodzka and thanks for the additional details and context.

nomad: plan for node rejected, refer to https://www.nomadproject.io/s/port-plan-failure for more information: node_id=e725aa89-2bad-bf30-9b7a-3c704722ff40 reason="reserved alloc port collision: collision when reserving port for alloc 8a8c641a-e7b8-dc95-ba80-d3bb5be517f5: [port 26558 already in use]" eval_id=5055fde0-6985-da02-e7db-bb10ac57e819 namespace=default

This is unfortunately a known bug which, despite significant engineering effort, has yet to be resolved. The documentation link within the log line has more information on this and on workarounds for it.

create a reproducer for this issue with Vagrant, and it turns out it's relatively easy to reproduce

This is great, and thanks for taking the time to create it. It will prove very useful further down the line to the engineers looking into this issue.

I am going to move this issue from initial triage onto our backlog now.

@jrasell jrasell added theme/core theme/scheduling stage/accepted Confirmed, and intend to work on. No timeline committment though. and removed stage/waiting-reply labels Oct 17, 2022
@jrasell jrasell moved this from Triaging to Needs Roadmapping in Nomad - Community Issues Triage Oct 17, 2022
@jrasell jrasell removed their assignment Oct 17, 2022
@valodzka
Contributor Author

This is unfortunately a known bug which, despite significant engineering effort, has yet to be resolved. The documentation link within the log line has more information on this and on workarounds for it.

@jrasell thank you for the reply. I have already read the documentation about this log message, and if I understood correctly it is about a different bug. It states that "it is possible for these log lines to occur infrequently due to normal cluster conditions", and that it is a problem only if there are "repeated log lines". I see this log message only once every few days. Is this considered repeated? Would the workaround with a plan rejection tracker that remembers plans only for 5 minutes help in this case?
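
For reference, the workaround discussed in that documentation is the plan_rejection_tracker block in the server agent configuration; a minimal sketch of the snippet to merge into the existing server block, with values that are illustrative rather than a recommendation:

# Print the HCL to merge into the server agent configuration:
cat <<'EOF'
server {
  plan_rejection_tracker {
    enabled        = true
    node_threshold = 100
    node_window    = "5m"   # how long rejected plans are remembered
  }
}
EOF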

@tgross tgross self-assigned this Jan 24, 2023
@tgross
Member

tgross commented Jan 24, 2023

Hi @valodzka, sorry about the delayed response on this. I've picked this problem back up and your reproduction is going to be very helpful. In particular this line may be a vital clue for us, as one of the larger Nomad users who's hit this issue also has {"SchedulerAlgorithm":"spread"} set. I'll report back here once I know more.

I see this log message only once every few days. Is this considered repeated? Would the workaround with a plan rejection tracker that remembers plans only for 5 minutes help in this case?

The behavior we've seen from some users is that a particular node gets "stuck" in a state where the planner consistently rejects plans for a given job, until the client is restarted. If the retried allocation gets placed successfully on the next plan, you may not be hitting the specific issue that the plan rejection tracker is intended to fix.
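
As an aside, whether a cluster is running with the spread scheduler algorithm mentioned above can be checked from the CLI; a quick check, assuming a reasonably recent Nomad version:

# Show the current operator scheduler configuration, including SchedulerAlgorithm:
nomad operator scheduler get-config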

@tgross
Member

tgross commented Mar 9, 2023

This will be closed by #16401, which will ship in Nomad 1.5.1 (plus backports).

@tgross tgross closed this as completed Mar 9, 2023
Nomad - Community Issues Triage automation moved this from Needs Roadmapping to Done Mar 9, 2023
@tgross tgross added this to the 1.5.1 milestone Mar 9, 2023