NOMAD_ALLOC_INDEX is not always unique within a single service job version #10727

Closed
dpn opened this issue Jun 9, 2021 · 15 comments · Fixed by #18873
Labels
hcc/cst Admin - internal · stage/accepted Confirmed, and intend to work on. No timeline commitment though. · theme/deployments · theme/scheduling · type/bug

Comments


dpn commented Jun 9, 2021

Nomad version

/ # nomad version
Nomad v0.12.10 (6b50c40dc5fc045282ff2a6f978ba7850e43d0d2)

Operating system and Environment details

CentOS Linux 7 (Core)
3.10.0-1160.24.1.el7.x86_64

Issue

According to #6830, NOMAD_ALLOC_INDEX is supposed to be unique within a given job version. However, we have discovered a case in one of our clusters where allocs appear to be on the same job version yet have duplicate NOMAD_ALLOC_INDEXs:

/ # nomad alloc status a511e040 | grep -E "Job Version|^Name"
Name                = a-job.a-task[8]
Job Version         = 2
/ # nomad alloc status 86c0f7a1 | grep -E "Job Version|^Name"
Name                = a-job.a-task[8]
Job Version         = 2
/ #

This was discovered when our Prometheus metrics exporter complained about attempting to ship duplicate metrics. Of note, this job has a task count of 50, and only 4 out of those 50 are duplicated:

/ # nomad job status a-job | tail -n 50 | awk '{print $1}' | xargs -I {} nomad alloc status {} | grep ^Name | sort | uniq -d
Name                 = a-job.a-task[13]
Name                 = a-job.a-task[15]
Name                 = a-job.a-task[16]
Name                 = a-job.a-task[8]
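
(For reference, here is a minimal sketch of doing the same duplicate check programmatically via the official Nomad Go API, github.com/hashicorp/nomad/api. The job ID "a-job" is a placeholder, and the exact stub field names are assumptions based on the current API, so treat this as an illustration rather than a drop-in tool.)

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// List allocations for the job; "a-job" is a placeholder ID.
	allocs, _, err := client.Jobs().Allocations("a-job", false, nil)
	if err != nil {
		log.Fatal(err)
	}

	// Count running allocations per (job version, alloc name); any count > 1
	// means NOMAD_ALLOC_INDEX is duplicated within that version.
	seen := map[string]int{}
	for _, a := range allocs {
		if a.ClientStatus != "running" {
			continue
		}
		seen[fmt.Sprintf("v%d %s", a.JobVersion, a.Name)]++
	}
	for key, n := range seen {
		if n > 1 {
			fmt.Printf("duplicate: %s (%d allocations)\n", key, n)
		}
	}
}
```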

Interestingly enough, we had the job owner redeploy in an attempt to fix this situation, and we see the exact same NOMAD_ALLOC_INDEXs duplicated:

/ # nomad alloc status 8711bd71 | grep -E "Job Version|^Name"
Name                = a-job.a-task[8]
Job Version         = 3
/ # nomad alloc status df217f1b | grep -E "Job Version|^Name"
Name                = a-job.a-task[8]
Job Version         = 3
/ #

/ # nomad job status a-job | grep running | tail -n 50 | awk '{print $1}' | xargs -I {} nomad alloc status {} | grep -E "Job Version|^Name" | sort | uniq -d
Job Version         = 3
Name                = a-job.a-task[13]
Name                = a-job.a-task[15]
Name                = a-job.a-task[16]
Name                = a-job.a-task[8]

Reproduction steps

Man, I wish I could tell you. We just upgraded the cluster from 0.11.4 to 0.12.10 earlier today, so it's probably related to that... We upgraded the server cluster, then did a rolling upgrade of the clients. My guess is that the deployment prior to the cluster upgrade failed, and that the rolling client restart caused alloc migrations which started up the replacement allocs on the "new" job version, but so far I haven't had the bandwidth to attempt a reproduction.

Edit: Scratch that. Deployment history seems fine:

/ # nomad job history a-job
Version     = 2
Stable      = true
Submit Date = 2021-06-08T21:57:04Z

Version     = 1
Stable      = true
Submit Date = 2021-06-08T21:32:10Z

Version     = 0
Stable      = true
Submit Date = 2021-03-23T22:46:13Z
/ #

So we put a hack into our exporter to stop the bleeding for the night. But this also means I can grab logs or anything interesting that will help on your end! Let me know what you need and I'll try to get it!

Expected Result

NOMAD_ALLOC_INDEX should be unique within a job version

Actual Result

NOMAD_ALLOC_INDEX is not unique within a job version

Job file (if appropriate)

Nomad Server logs (if appropriate)

Nothing appears relevant here

Nomad Client logs (if appropriate)

I didn't see anything relevant in here, but let me know if you'd like me to collect them.

@dpn dpn added the type/bug label Jun 9, 2021

tgross commented Jun 9, 2021

Hi @dpn! The relevant text here from #6830 (emphasis added):

The index is unique within a given version of a job, but canaries or failed tasks in a deployment may reuse the index

It's not clear to me from what you've provided what the state history is for the allocations where you're seeing the reused indexes. Especially given you rolled the clients, it's entirely possible you have allocations that got marked lost and were rescheduled, deployments kicked off, etc. Can you provide the jobspec (especially the update and reschedule blocks, but otherwise redacted if necessary), and as much as you can about the job status history?

So we put a hack into our exporter to stop the bleeding for the night.

I'm glad you figured out a hackaround, but you really shouldn't be relying on the uniqueness of the allocation index. The allocation ID is a UUID and is the canonical way to refer to an allocation.
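
(As a rough illustration of that advice, here is a minimal, hypothetical exporter-side sketch that keys metrics on the allocation UUID rather than the index. Only the NOMAD_ALLOC_ID and NOMAD_ALLOC_INDEX environment variable names come from Nomad's task runtime; the metric itself is made up.)

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	allocID := os.Getenv("NOMAD_ALLOC_ID")       // UUID, canonical and unique per allocation
	allocIndex := os.Getenv("NOMAD_ALLOC_INDEX") // may be reused by other allocations

	// Use the UUID as the identifying label; keep the index only as
	// informational metadata, never as the uniqueness key.
	fmt.Printf("my_metric{alloc_id=%q,alloc_index=%q} 1\n", allocID, allocIndex)
}
```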

@tgross tgross added this to Needs Triage in Nomad - Community Issues Triage via automation Jun 9, 2021
@tgross tgross moved this from Needs Triage to In Progress in Nomad - Community Issues Triage Jun 9, 2021

tgross commented Nov 8, 2021

Doing some issue cleanup and it looks like we never heard back on this one. Going to close it out.

@tgross tgross closed this as completed Nov 8, 2021
@tgross tgross removed this from In Progress in Nomad - Community Issues Triage Nov 8, 2021

dpn commented Jan 11, 2022

Sorry @tgross for some of these "fire and forget" reports. We're a little understaffed and I just haven't had the bandwidth to follow up on these :(


dpn commented Jan 25, 2022

Welp, our monitoring team escalated another reproduction of this issue to me, so it looks like this issue is back on the menu!

I'm glad you figured out a hackaround, but you really shouldn't be relying on the uniqueness of the allocation index. The allocation ID is a UUID and is the canonical way to refer to an allocation.

Yep, totally understood, and this is what I've been telling our customers. Unfortunately it's a heavy lift for them to reconfigure every job that relies on the documented behavior, so they're looking for an upstream fix.

To kick things off, we're on a newer build than was originally reported, although the rest of the conditions still hold:

/ # nomad version
Nomad v1.0.6 (592cd4565bf726408a03482da2c9fd8a3a1015cf)

Status for the suspect job:

ID            = REDACTED
Name          = REDACTED
Submit Date   = 2022-01-24T12:41:21-08:00
Type          = service
Priority      = 50
Datacenters   = REDACTED
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group             Queued  Starting  Running  Failed  Complete  Lost
REDACTED               0       0         3        300     1405      0

Latest Deployment
ID          = b04e3235
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group             Auto Revert  Desired  Placed  Healthy  Unhealthy
REDACTED               true         3        3       3        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created     Modified
c528e0cc  68b33ec7  REDACTED    467      run      running   32m29s ago  31m43s ago
0b1a9d76  fddad6ef  REDACTED    467      run      running   34m42s ago  32m31s ago
f127f273  e7d6f817  REDACTED    467      run      running   35m41s ago  34m44s ago
fa796022  e7d6f817  REDACTED    466      stop     complete  2d18h ago   35m39s ago
be0a849d  68b33ec7  REDACTED    466      stop     complete  2d19h ago   32m27s ago
49b88729  544ff56d  REDACTED    466      stop     complete  2d19h ago   34m41s ago

So, we have 3 allocations, no deployments in progress, the previous deployment went out successfully, no failed tasks for this or the previous job version in the history, and canaries are not enabled.

However, when we inspect those 3 running allocations we see an allocation index of 0 is being reused which seems to conflict with the docs:

alloc-c528e0cc.txt:Name                = REDACTED[1]
alloc-c528e0cc.txt:Job Version         = 467
alloc-0b1a9d76.txt:Name                = REDACTED[0]
alloc-0b1a9d76.txt:Job Version         = 467
alloc-f127f273.txt:Name                = REDACTED[0]
alloc-f127f273.txt:Job Version         = 467

One other interesting tidbit we've discovered is that the allocations from the previous version of the job are displaying the same behavior:

alloc-fa796022.txt:Name                 = REDACTED[0]
alloc-fa796022.txt:Job Version          = 466
alloc-be0a849d.txt:Name                 = REDACTED[1]
alloc-be0a849d.txt:Job Version          = 466
alloc-49b88729.txt:Name                 = REDACTED[0]
alloc-49b88729.txt:Job Version          = 466

The rest of the relevant information can be found in the following files:

alloc-0b1a9d76.txt
alloc-49b88729.txt
alloc-be0a849d.txt
alloc-c528e0cc.txt
alloc-f127f273.txt
alloc-fa796022.txt
job-history.txt
job-inspect.txt
job-status.txt

Hopefully this makes sense; please let me know if there's anything else I can gather. Thanks!


tgross commented Jan 26, 2022

Ok, reopening this as that's definitely not the correct behavior assuming no failures or canaries. Note that you are using a fairly old version of Nomad, one that won't be getting backported bug and security fixes after 1.3.0 goes out in a couple of months, so even if we figure out what's wrong here you may end up needing to upgrade to a newer version to get the fix (or backport the patch yourself).

The rest of the relevant information can be found in the following files:

Interesting. The output of nomad deployment status -verbose b04e3235 might be useful to see a more detailed history of that deployment.

@tgross tgross reopened this Jan 26, 2022
@tgross tgross added this to Needs Triage in Nomad - Community Issues Triage via automation Jan 26, 2022
@tgross tgross self-assigned this Jan 26, 2022
@tgross tgross moved this from Needs Triage to In Progress in Nomad - Community Issues Triage Jan 26, 2022

dpn commented Jan 26, 2022

Perfect, thank you so much! Yep, we're aware of the old version; we're a few versions behind but we upgrade at a quarterly cadence, and we have no expectation that you'll provide a backport if a fix becomes available.

That deployment has been GC'd at this point, but I've had the team kick another and attached dumps from this one:

alloc-5ea975aa.txt
alloc-40fc3de8.txt
alloc-6759ed8e.txt
alloc-a028d78e.txt
alloc-d17e6f0a.txt
alloc-d593259d.txt
deployment-9e522f99.txt
job-history.txt
job-status.txt


tgross commented Jan 27, 2022

Judging by the version numbers on that job-status.txt, it looks like this is all the same job? Does it happen with every deployment? If so, would it be possible to share the jobspec (even a redacted one might help)? My first hypothesis would be that there's a specific update and reschedule combination that can initially cause the issue, after which it stays "stuck" like that on subsequent updates. If we had a jobspec, that might help us reproduce it under test.


dpn commented Jan 27, 2022

Judging by the version numbers on that job-status.txt, it looks like this is all the same job? Does it happen with every deployment?

Yep, nailed it. On the few jobs we've seen this on, it seems that once a job is in this state it continues for some time. I'm not sure if they've ever become "fixed", but I can dig a bit and maybe shed some light on that.

My first hypothesis would be that there's a specific update and reschedule combination that can initially cause the issue and then it's "stuck" like that on subsequent updates.

Seems plausible from here. We generate our specs via some internal tooling; hopefully the JSON rendering works for you:

job-spec.json.txt

Also the job inspect is attached to my earlier post if that helps fill in any gaps from this spec.


dpn commented Jan 27, 2022

...I'm not sure if they've ever become "fixed", but I can dig a bit and maybe shed some light on that.

Interestingly enough, the job from the original report is still in this state- so it's managed to survive a version upgrade:

# This is showing all the allocations with the same names: duplicated ALLOC_INDEXs
» nomad job status REDACTED | grep running | grep REDACTED | awk '{print $1}' | xargs -I {} nomad alloc status {} | grep ^Name | sort | uniq -d
Name                = REDACTED[13]
Name                = REDACTED[15]
Name                = REDACTED[16]
Name                = REDACTED[8]

This is a completely different job from the one we've been looking at over the past few days. It's running on a different Nomad cluster, although both of these clusters in question are federated together. Would you be interested in the same dumps for this one?

Edit: And thanks for all of your help on this-- really appreciate your time.


tgross commented Feb 2, 2022

Interestingly, @DerekStrickland and I were chatting about this bit of code for a project he's working on, and he noticed that the code that stops allocations prefers to pick the ones with the highest "name index", but it specifically has code to handle the possibility that we could end up with duplicate names (reconcile.go#L844-L876). I'll see if I can work up a test case that "leaks" a name this way.

On the few jobs we've seen this on it seems that once the job is in this state it does continue for some time- I'm not sure if they've ever become "fixed", but I can dig a bit and maybe shed some light on that.

One hypothesis I have is that if the task group count drops below the name index, it'll correct itself. But obviously that's not a good workaround in general and it's totally useless in this case because the problematic name index is [0]!
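
(A toy sketch of that intuition, not Nomad's actual index-assignment code: assuming replacements take the lowest unused indexes below the group count, a leaked high index gets retired once the count drops below it, while a duplicated [0] never does.)

```go
package main

import "fmt"

// nextIndexes picks up to n unused indexes in [0, count), lowest first.
// used maps index -> how many running allocations currently hold it; any
// value greater than 1 is exactly the duplicate situation in this issue.
func nextIndexes(used map[int]int, count, n int) []int {
	var out []int
	for i := 0; i < count && len(out) < n; i++ {
		if used[i] == 0 {
			out = append(out, i)
		}
	}
	return out
}

func main() {
	// Two allocations both hold index 0 and one holds index 1, with count = 3.
	used := map[int]int{0: 2, 1: 1}
	// Only index 2 is free, so if two replacements are needed, one of them is
	// forced to reuse an index that is already taken, perpetuating the duplicate.
	fmt.Println(nextIndexes(used, 3, 2)) // prints [2]
}
```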

This is a completely different job from the one we've been looking at over the past few days. It's running on a different Nomad cluster, although both of these clusters in question are federated together. Would you be interested in the same dumps for this one?

I think we're good so far with what you've provided, thanks!


dpn commented Feb 3, 2022

Ahh wonderful news! Thanks again for digging in we really appreciate it 🙏


valodzka commented Apr 27, 2022

One hypothesis I have is that if the task group count drops below the name index, it'll correct itself.

I can confirm that this works, but it's a very inconvenient workaround.

@tgross tgross moved this from In Progress to Needs Roadmapping in Nomad - Community Issues Triage Jun 6, 2022
@tgross tgross removed their assignment Jun 6, 2022

tgross commented Jun 6, 2022

I wanted to follow up on this. I think trying to whack-a-mole the problem isn't going to get us the results we want when the design itself isn't really equipped to solve this problem definitively. The only way to strictly enforce this would be to reject plans that have overlapping alloc indexes at the plan apply step. This would increase the rate of plan rejections, but it could also make a few other features, like per-alloc volumes for CSI, operate more nicely.
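
(For illustration only, here is a hypothetical version of that check over a plan's proposed allocations. The types and function are made up; this is not Nomad's plan applier, just a sketch of the idea.)

```go
package main

import (
	"errors"
	"fmt"
)

// ProposedAlloc is a stand-in for the allocation data a plan would carry.
type ProposedAlloc struct {
	JobID      string
	JobVersion uint64
	Name       string // e.g. "a-job.a-task[8]"
}

// validateUniqueNames rejects a set of allocations if any (job, version, name)
// tuple appears more than once, i.e. an overlapping alloc index.
func validateUniqueNames(allocs []ProposedAlloc) error {
	seen := map[string]bool{}
	for _, a := range allocs {
		key := fmt.Sprintf("%s@v%d:%s", a.JobID, a.JobVersion, a.Name)
		if seen[key] {
			return errors.New("plan rejected: duplicate allocation name " + a.Name)
		}
		seen[key] = true
	}
	return nil
}

func main() {
	plan := []ProposedAlloc{
		{"a-job", 3, "a-job.a-task[8]"},
		{"a-job", 3, "a-job.a-task[8]"}, // duplicate index within the same version
	}
	fmt.Println(validateUniqueNames(plan))
}
```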

I'm going to mark this for further discussion and roadmapping.

@tgross tgross self-assigned this Feb 2, 2023
@tgross tgross added stage/accepted Confirmed, and intend to work on. No timeline commitment though. theme/scheduling and removed stage/needs-discussion labels Feb 2, 2023
@tgross tgross changed the title NOMAD_ALLOC_INDEX is not always unique within a single job version NOMAD_ALLOC_INDEX is not always unique within a single service job version Feb 8, 2023
@mikenomitch mikenomitch added the hcc/cst Admin - internal label May 22, 2023
@jrasell jrasell self-assigned this Aug 3, 2023

jrasell commented Oct 17, 2023

Hello everyone, I just wanted to provide an update to this issue as I have been spending a good amount of time on it recently and have made progress.

Our first approach of rejecting job plans within the plan applier (linked PR) seemed fine from a code standpoint, but when we discussed it further internally we realised it could create un-schedulable jobs. That ruled out the approach, and the same applies to raising errors during reconciliation or scheduling when duplicate allocation indexes are found. We therefore need to fix the bug itself rather than reject any occurrences we find.

I have been working on understanding the code, which has allowed me to create a reproduction of this error. I can now see how we get into this state and hope to locate the source of the bug soon.

valodzka commented

Nice to hear. Since fix #16401 I've seen this issue much more rarely, but it still occurs in some cases.
