
Allocation stopped immediately after start #14850

Closed
valodzka opened this issue Oct 7, 2022 · 8 comments
Assignees
Labels
stage/accepted Confirmed, and intend to work on. No timeline committment though. theme/core theme/scheduling type/bug
Milestone

Comments

@valodzka
Contributor

valodzka commented Oct 7, 2022

Nomad version

1.3.6

Issue

Sometimes during a deployment I see that a freshly started allocation is immediately stopped by Nomad. During the same deployment, multiple allocations with an identical alloc index (name) are created. For example, all of these allocs share the same Eval ID and Job Version (full alloc status here: https://gist.github.com/valodzka/231f3942203a5f39528b52241905a06e):

Alloc 1:

ID                  = d1421efd-6b25-295a-4832-348c09955a8f
Name                = bk.client[35]
Client Status       = complete
Desired Description = alloc not needed due to job update
....

Recent Events:
Time                  Type      Description
2022-10-07T15:09:45Z  Killing   Sent interrupt. Waiting 2m0s before force killing
2022-10-07T15:09:45Z  Received  Task received by client

Alloc 2:

ID                  = f278d901-b4d0-8b7b-8f20-4f1c46a36a10
Name                = bk.client[24]
Client Status       = running

Recent Events:
Time                  Type        Description
2022-10-07T15:10:00Z  Started     Task started by client
2022-10-07T15:09:57Z  Task Setup  Building Task Directory
2022-10-07T15:09:45Z  Received    Task received by client

Placement Metrics
  * Constraint "${meta.only_low_cpu_tasks} = false": 1 nodes excluded by filter
  * Resources exhausted on 31 nodes
  * Dimension "network: port collision" exhausted on 1 nodes
  * Dimension "cpu" exhausted on 13 nodes
  * Dimension "memory" exhausted on 17 nodes

Alloc 3:

ID                  = 246dec93-d65e-9498-1d69-816d727083d7
Name                = bk.client[24]
Client Status       = running

Recent Events:
Time                  Type        Description
2022-10-07T15:09:57Z  Started     Task started by client
2022-10-07T15:09:57Z  Task Setup  Building Task Directory
2022-10-07T15:09:45Z  Received    Task received by client

What is strange here:

  1. Allocs 2 and 3 have an identical alloc index/name (with the same job version and no canaries used)
  2. Alloc 1 was killed the same second it was started
  3. Alloc 1's "Desired Description" is "alloc not needed due to job update", but I'm sure the job definition wasn't updated
  4. Alloc 2 shows Dimension "network: port collision" exhausted on 1 nodes in its placement metrics. Not sure whether it's relevant here, but it might be related.

Issue #12797 might be relevant.

Expected Result

  1. An alloc shouldn't be killed without a reason
  2. No allocs with the same job version should share the same alloc index

Nomad Server/Client logs

I checked the logs for both the client and the server and didn't find anything relevant.

@jrasell
Member

jrasell commented Oct 10, 2022

Hi @valodzka and thanks for raising this issue.

Alloc 2 and 3 have identical alloc index/name

This certainly looks identical to #12797, as you pointed out, so cross-linking from this issue is useful to provide additional context when that item of work gets roadmapped.

alloc not needed due to job update

This message is used during the allocation reconciliation phase, which is triggered when the scheduler receives an updated job specification. It would therefore, on the surface, seem that something has triggered an update, and I wonder if you could share more information about the deployment process you use for Nomad jobs?

I checked the logs for both the client and the server and didn't find anything relevant.

If you have these available from both the leader and the client where the allocation was placed and immediately stopped, it would provide some useful additional context. There may also be some non-obvious messages that give an idea of what is happening.

@jrasell jrasell self-assigned this Oct 10, 2022
@jrasell jrasell added this to Needs Triage in Nomad - Community Issues Triage via automation Oct 10, 2022
@jrasell jrasell moved this from Needs Triage to Triaging in Nomad - Community Issues Triage Oct 10, 2022
@valodzka
Contributor Author

valodzka commented Oct 10, 2022

It would therefore on the surface seem that something has triggered an update and I wonder if you could share more information about the deployment process you use for Nomad jobs?

We use a simple automated system that calls flock <lock-file> nomad job run <job-file> for deployments; flock is used to block concurrent deployments. So I'm pretty sure there weren't any updates. Also, as you can see in the full alloc status, the Job Version remained the same (1654), and if something had modified the job, the version would have been increased.
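
For completeness, the job history command can confirm that no new job version was created around that time; a quick check, assuming the job name is bk as the alloc names suggest:

# Show recent versions of the job and, with -p, the diff between versions.
# If nothing was updated, the newest version listed stays at 1654.
nomad job history -p bk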

If you have these available from both the leader and the client where the allocation was placed and immediately stopped, it would provide some useful additional context.

Leader:

2022-10-07T15:09:44.828Z [INFO]  nomad: plan for node rejected, refer to https://www.nomadproject.io/s/port-plan-failure for more information: node_id=608e656f-e3b9-6ea0-ea99-c4dfc6bc02e6 reason="reserved alloc port collision: collision when reserving port for alloc 2ac62f30-a072-d797-8023-b6bf089f2ac0: [port 30261 already in use]" eval_id=6cdc78f2-8fa1-cc00-193f-45e923472f6a namespace=default

Client:

2022-10-07T15:09:55.478Z [INFO]  client.driver_mgr.docker: stopped container: container_id=9e6d72ce5f5eced40e72a8f8fd88ccd6b8d54007394901680591407fd8089ab8 driver=docker
2022-10-07T15:09:55.814Z [INFO]  agent: (runner) stopping
2022-10-07T15:09:55.814Z [INFO]  client.gc: marking allocation for GC: alloc_id=75f2d3c4-946e-3977-83d3-d074a12102b4
2022-10-07T15:09:55.816Z [INFO]  agent: (runner) received finish
2022-10-07T15:09:55.906Z [INFO]  client.gc: marking allocation for GC: alloc_id=d1421efd-6b25-295a-4832-348c09955a8f
2022-10-07T15:10:03.655Z [INFO]  client.driver_mgr.docker: stopped container: container_id=7ff101994a24366c00b9184b6555426efb33c56f0186c22152dda0fb9c754df1 driver=docker
2022-10-07T15:10:03.684Z [INFO]  agent: (runner) stopping
2022-10-07T15:10:03.684Z [INFO]  client.gc: marking allocation for GC: alloc_id=6256dfb2-a253-d644-70fe-45cfb4867171
2022-10-07T15:10:03.684Z [INFO]  agent: (runner) received finish
2022-10-07T15:10:03.697Z [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=5175e773-5813-7577-1031-25154e11b043 task=java @module=logmon path=/opt/nomad/data/alloc/5175e773-5813-7577-1031-25154e11>
2022-10-07T15:10:03.698Z [INFO]  client.alloc_runner.task_runner.task_hook.logmon.nomad: opening fifo: alloc_id=5175e773-5813-7577-1031-25154e11b043 task=java @module=logmon path=/opt/nomad/data/alloc/5175e773-5813-7577-1031-25154e11>
2022-10-07T15:10:03.711Z [INFO]  agent: (runner) creating new runner (dry: false, once: false)
2022-10-07T15:10:03.711Z [INFO]  agent: (runner) creating watcher
2022-10-07T15:10:03.712Z [INFO]  agent: (runner) starting
2022-10-07T15:10:03.893Z [INFO]  client.driver_mgr.docker: created container: driver=docker container_id=84127937ca67958655da48a14e8c8204d4ab843a22c33d46e6e04617732cb5f2
2022-10-07T15:10:04.015Z [INFO]  client.driver_mgr.docker: started container: driver=docker container_id=84127937ca67958655da48a14e8c8204d4ab843a22c33d46e6e04617732cb5f2

@valodzka
Contributor Author

The issue appeared again. Unfortunately, the allocations were garbage collected before I could inspect them, but in the leader log I see the same message (and it's the first time it has appeared in the log since 2022-10-07T15:09:44.828Z), so it might be related:

2022-10-10T18:39:31.503Z [INFO]  nomad: plan for node rejected, refer to https://www.nomadproject.io/s/port-plan-failure for more information: node_id=e725aa89-2bad-bf30-9b7a-3c704722ff40 reason="reserved alloc port collision: collision when reserving port for alloc 8a8c641a-e7b8-dc95-ba80-d3bb5be517f5: [port 26558 already in use]" eval_id=5055fde0-6985-da02-e7db-bb10ac57e819 namespace=default

valodzka pushed a commit to valodzka/nomad-vagrant-lab that referenced this issue Oct 17, 2022
@valodzka
Contributor Author

Hi @jrasell, I tried to create a reproducer for this issue with Vagrant, and it turns out it's relatively easy to reproduce. Steps:

git clone -b issue_14850 git@github.com:valodzka/nomad-vagrant-lab.git
cd nomad-vagrant-lab
vagrant up
vagrant ssh nomad-a-1    
bash /vagrant/launch-a-1.sh
# repeat the ssh/launch steps for nomad-a-2 and nomad-a-3
bash -x reproduce_14850.sh
# wait a few dozen iterations

nomad-vagrant-lab is a basic Nomad/Consul cluster template without any tweaks. reproduce_14850.sh runs deployments of a simple job with 50 allocations, each of them using 11 dynamic ports. I tried twice, and on my machine (an old MacBook Pro) it took around 20-30 deployments to trigger the issue.
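
The loop has roughly this shape (the actual reproduce_14850.sh in the repo is authoritative; the job file path and the variable bump below are illustrative assumptions):

# Repeatedly re-deploy the 50-allocation job (11 dynamic ports per alloc),
# changing a dummy HCL2 variable each run so every run creates a new job
# version and forces the scheduler to replace the allocations.
for i in $(seq 1 30); do
  NOMAD_VAR_iteration="$i" nomad job run /vagrant/test.nomad
  sleep 90   # give the deployment time to settle before the next run
done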

@jrasell
Member

jrasell commented Oct 17, 2022

Hi @valodzka and thanks for the additional details and context.

nomad: plan for node rejected, refer to https://www.nomadproject.io/s/port-plan-failure for more information: node_id=e725aa89-2bad-bf30-9b7a-3c704722ff40 reason="reserved alloc port collision: collision when reserving port for alloc 8a8c641a-e7b8-dc95-ba80-d3bb5be517f5: [port 26558 already in use]" eval_id=5055fde0-6985-da02-e7db-bb10ac57e819 namespace=default

This is unfortunately a known bug which, despite significant engineering effort, has yet to be resolved. The documentation link within the log line has more information on this and on workarounds for it.

create a reproducer for this issue with Vagrant, and it turns out it's relatively easy to reproduce

This is great, and thanks for taking the time to create it. It will prove very useful further down the line to the engineers looking into this issue.

I am going to move this issue from initial triage onto our backlog now.

@jrasell jrasell added theme/core theme/scheduling stage/accepted Confirmed, and intend to work on. No timeline committment though. and removed stage/waiting-reply labels Oct 17, 2022
@jrasell jrasell moved this from Triaging to Needs Roadmapping in Nomad - Community Issues Triage Oct 17, 2022
@jrasell jrasell removed their assignment Oct 17, 2022
@valodzka
Contributor Author

This is unfortunately a known bug which, despite significant engineering effort, has yet to be resolved. The documentation link within the log line has more information on this and on workarounds for it.

@jrasell thank you for the reply. I have already read the documentation about this log message, and if I understood correctly it is about a different bug. It states that "it is possible for these log lines to occur infrequently due to normal cluster conditions", and that it is a problem only if there are "repeated log lines". I see this log message only once every few days. Is this considered repeated? Would the workaround with a plan rejection tracker that remembers plans only for 5 minutes help in this case?
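
For reference, the workaround discussed in that documentation is the plan_rejection_tracker block in the server agent configuration; a minimal sketch of the snippet to merge into the existing server block, with values that are illustrative rather than a recommendation:

# Print the HCL to merge into the server agent configuration:
cat <<'EOF'
server {
  plan_rejection_tracker {
    enabled        = true
    node_threshold = 100
    node_window    = "5m"   # how long rejected plans are remembered
  }
}
EOF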

@tgross tgross self-assigned this Jan 24, 2023
@tgross
Member

tgross commented Jan 24, 2023

Hi @valodzka, sorry about the delayed response on this. I've picked this problem back up and your reproduction is going to be very helpful. In particular this line may be a vital clue for us, as one of the larger Nomad users who's hit this issue also has {"SchedulerAlgorithm":"spread"} set. I'll report back here once I know more.

I see this log message only once every few days. Is this considered repeated? Would the workaround with a plan rejection tracker that remembers plans only for 5 minutes help in this case?

The behavior we've seen from some users is that a particular node gets "stuck" in a state where the planner consistently rejects plans for a given job, until the client is restarted. If the retried allocation gets placed successfully on the next plan, you may not be hitting the specific issue that the plan rejection tracker is intended to fix.
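
As an aside, whether a cluster is running with the spread scheduler algorithm mentioned above can be checked from the CLI; a quick check, assuming a reasonably recent Nomad version:

# Show the current operator scheduler configuration, including SchedulerAlgorithm:
nomad operator scheduler get-config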

@tgross
Member

tgross commented Mar 9, 2023

This will be closed by #16401, which will ship in Nomad 1.5.1 (plus backports).

@tgross tgross closed this as completed Mar 9, 2023
Nomad - Community Issues Triage automation moved this from Needs Roadmapping to Done Mar 9, 2023
@tgross tgross added this to the 1.5.1 milestone Mar 9, 2023