NOMAD_ALLOC_INDEX is not always unique within a single service job version #10727

Closed
dpn opened this issue Jun 9, 2021 · 15 comments · Fixed by #18873
Labels
hcc/cst Admin - internal · stage/accepted Confirmed, and intend to work on. No timeline commitment though. · theme/deployments · theme/scheduling · type/bug

Comments


dpn commented Jun 9, 2021

Nomad version

/ # nomad version
Nomad v0.12.10 (6b50c40dc5fc045282ff2a6f978ba7850e43d0d2)

Operating system and Environment details

CentOS Linux 7 (Core)
3.10.0-1160.24.1.el7.x86_64

Issue

According to #6830, NOMAD_ALLOC_INDEX is supposed to be unique within a given job version. However, we have discovered a case in one of our clusters where allocs appear to be on the same job version yet have duplicate NOMAD_ALLOC_INDEXs:

/ # nomad alloc status a511e040 | grep -E "Job Version|^Name"
Name                = a-job.a-task[8]
Job Version         = 2
/ # nomad alloc status 86c0f7a1 | grep -E "Job Version|^Name"
Name                = a-job.a-task[8]
Job Version         = 2
/ #

This was discovered when our Prometheus metrics exporter complained about attempting to ship duplicate metrics. Of note, this job has a task count of 50, and only 4 out of those 50 are duplicated:

/ # nomad job status a-job | tail -n 50 | awk '{print $1}' | xargs -I {} nomad alloc status {} | grep ^Name | sort | uniq -d
Name                 = a-job.a-task[13]
Name                 = a-job.a-task[15]
Name                 = a-job.a-task[16]
Name                 = a-job.a-task[8]
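
(For reference, here is a minimal sketch of doing the same duplicate check programmatically via the official Nomad Go API, github.com/hashicorp/nomad/api. The job ID "a-job" is a placeholder, and the exact stub field names are assumptions based on the current API, so treat this as an illustration rather than a drop-in tool.)

```go
package main

import (
	"fmt"
	"log"

	"github.com/hashicorp/nomad/api"
)

func main() {
	client, err := api.NewClient(api.DefaultConfig())
	if err != nil {
		log.Fatal(err)
	}

	// List allocations for the job; "a-job" is a placeholder ID.
	allocs, _, err := client.Jobs().Allocations("a-job", false, nil)
	if err != nil {
		log.Fatal(err)
	}

	// Count running allocations per (job version, alloc name); any count > 1
	// means NOMAD_ALLOC_INDEX is duplicated within that version.
	seen := map[string]int{}
	for _, a := range allocs {
		if a.ClientStatus != "running" {
			continue
		}
		seen[fmt.Sprintf("v%d %s", a.JobVersion, a.Name)]++
	}
	for key, n := range seen {
		if n > 1 {
			fmt.Printf("duplicate: %s (%d allocations)\n", key, n)
		}
	}
}
```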

Interestingly enough, we had the job owner redeploy in an attempt to fix this situation, and we see the exact same NOMAD_ALLOC_INDEXs duplicated:

/ # nomad alloc status 8711bd71 | grep -E "Job Version|^Name"
Name                = a-job.a-task[8]
Job Version         = 3
/ # nomad alloc status df217f1b | grep -E "Job Version|^Name"
Name                = a-job.a-task[8]
Job Version         = 3
/ #

/ # nomad job status a-job | grep running | tail -n 50 | awk '{print $1}' | xargs -I {} nomad alloc status {} | grep -E "Job Version|^Name" | sort | uniq -d
Job Version         = 3
Name                = a-job.a-task[13]
Name                = a-job.a-task[15]
Name                = a-job.a-task[16]
Name                = a-job.a-task[8]

Reproduction steps

Man, I wish I could tell you. We just upgraded the cluster from 0.11.4 to 0.12.10 earlier today, so it's probably related to that... We upgraded the server cluster, then did a rolling upgrade of the clients. My guess is that the deployment prior to the cluster upgrade failed, and that the rolling client restart caused alloc migrations which started up the replacement allocs on the "new" job version, but so far I haven't had the bandwidth to attempt a reproduction.

Edit: Scratch that. Deployment history seems fine:

/ # nomad job history a-job
Version     = 2
Stable      = true
Submit Date = 2021-06-08T21:57:04Z

Version     = 1
Stable      = true
Submit Date = 2021-06-08T21:32:10Z

Version     = 0
Stable      = true
Submit Date = 2021-03-23T22:46:13Z
/ #

So we put a hack into our exporter to stop the bleeding for the night. But this also means I can grab logs or anything interesting that will help on your end! Let me know what you need and I'll try to get it!

Expected Result

NOMAD_ALLOC_INDEX should be unique within a job version

Actual Result

NOMAD_ALLOC_INDEX is not unique within a job version

Job file (if appropriate)

Nomad Server logs (if appropriate)

Nothing appears relevant here

Nomad Client logs (if appropriate)

I didn't see anything relevant in here, but let me know if you'd like me to collect them.

@dpn dpn added the type/bug label Jun 9, 2021

tgross commented Jun 9, 2021

Hi @dpn! The relevant text here from #6830 (emphasis added):

The index is unique within a given version of a job, but canaries or failed tasks in a deployment may reuse the index

It's not clear to me from what you've provided what the state history is for the allocations where you're seeing the reused indexes. Especially given you rolled the clients, it's entirely possible you have allocations that got marked lost and were rescheduled, deployments kicked off, etc. Can you provide the jobspec (especially the update and reschedule blocks, but otherwise redacted if necessary), and as much as you can about the job status history?

So we put a hack into our exporter to stop the bleeding for the night.

I'm glad you figured out a hackaround, but you really shouldn't be relying on the uniqueness of the allocation index. The allocation ID is a UUID and is the canonical way to refer to an allocation.
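
(As a rough illustration of that advice, here is a minimal, hypothetical exporter-side sketch that keys metrics on the allocation UUID rather than the index. Only the NOMAD_ALLOC_ID and NOMAD_ALLOC_INDEX environment variable names come from Nomad's task runtime; the metric itself is made up.)

```go
package main

import (
	"fmt"
	"os"
)

func main() {
	allocID := os.Getenv("NOMAD_ALLOC_ID")       // UUID, canonical and unique per allocation
	allocIndex := os.Getenv("NOMAD_ALLOC_INDEX") // may be reused by other allocations

	// Use the UUID as the identifying label; keep the index only as
	// informational metadata, never as the uniqueness key.
	fmt.Printf("my_metric{alloc_id=%q,alloc_index=%q} 1\n", allocID, allocIndex)
}
```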

@tgross tgross added this to Needs Triage in Nomad - Community Issues Triage via automation Jun 9, 2021
@tgross tgross moved this from Needs Triage to In Progress in Nomad - Community Issues Triage Jun 9, 2021

tgross commented Nov 8, 2021

Doing some issue cleanup and it looks like we never heard back on this one. Going to close it out.

@tgross tgross closed this as completed Nov 8, 2021
@tgross tgross removed this from In Progress in Nomad - Community Issues Triage Nov 8, 2021

dpn commented Jan 11, 2022

Sorry @tgross for some of these "fire and forget" reports. We're a little understaffed and I just haven't had the bandwidth to follow up on these :(


dpn commented Jan 25, 2022

Welp, our monitoring team escalated another reproduction of this issue to me, so it looks like this issue is back on the menu!

I'm glad you figured out a hackaround, but you really shouldn't be relying on the uniqueness of the allocation index. The allocation ID is a UUID and is the canonical way to refer to an allocation.

Yep, totally understood, and this is what I've been telling our customers. Unfortunately it's a heavy lift for them to reconfigure every job that relies on the documented behavior, so they're looking for an upstream fix.

To kick things off, we're on a newer build than was originally reported, although the rest of the conditions still hold:

/ # nomad version
Nomad v1.0.6 (592cd4565bf726408a03482da2c9fd8a3a1015cf)

Status for the suspect job:

ID            = REDACTED
Name          = REDACTED
Submit Date   = 2022-01-24T12:41:21-08:00
Type          = service
Priority      = 50
Datacenters   = REDACTED
Namespace     = default
Status        = running
Periodic      = false
Parameterized = false

Summary
Task Group             Queued  Starting  Running  Failed  Complete  Lost
REDACTED               0       0         3        300     1405      0

Latest Deployment
ID          = b04e3235
Status      = successful
Description = Deployment completed successfully

Deployed
Task Group             Auto Revert  Desired  Placed  Healthy  Unhealthy
REDACTED               true         3        3       3        0

Allocations
ID        Node ID   Task Group  Version  Desired  Status    Created     Modified
c528e0cc  68b33ec7  REDACTED    467      run      running   32m29s ago  31m43s ago
0b1a9d76  fddad6ef  REDACTED    467      run      running   34m42s ago  32m31s ago
f127f273  e7d6f817  REDACTED    467      run      running   35m41s ago  34m44s ago
fa796022  e7d6f817  REDACTED    466      stop     complete  2d18h ago   35m39s ago
be0a849d  68b33ec7  REDACTED    466      stop     complete  2d19h ago   32m27s ago
49b88729  544ff56d  REDACTED    466      stop     complete  2d19h ago   34m41s ago

So, we have 3 allocations, no deployments in progress, the previous deployment went out successfully, no failed tasks for this or the previous job version in the history, and canaries are not enabled.

However, when we inspect those 3 running allocations we see an allocation index of 0 is being reused which seems to conflict with the docs:

alloc-c528e0cc.txt:Name                = REDACTED[1]
alloc-c528e0cc.txt:Job Version         = 467
alloc-0b1a9d76.txt:Name                = REDACTED[0]
alloc-0b1a9d76.txt:Job Version         = 467
alloc-f127f273.txt:Name                = REDACTED[0]
alloc-f127f273.txt:Job Version         = 467

One other interesting tidbit we've discovered is that the allocations from the previous version of the job are displaying the same behavior:

alloc-fa796022.txt:Name                 = REDACTED[0]
alloc-fa796022.txt:Job Version          = 466
alloc-be0a849d.txt:Name                 = REDACTED[1]
alloc-be0a849d.txt:Job Version          = 466
alloc-49b88729.txt:Name                 = REDACTED[0]
alloc-49b88729.txt:Job Version          = 466

The rest of the relevant information can be found in the following files:

alloc-0b1a9d76.txt
alloc-49b88729.txt
alloc-be0a849d.txt
alloc-c528e0cc.txt
alloc-f127f273.txt
alloc-fa796022.txt
job-history.txt
job-inspect.txt
job-status.txt

Hopefully this makes sense; please let me know if there's anything else I can gather. Thanks!


tgross commented Jan 26, 2022

Ok, reopening this as that's definitely not the correct behavior assuming no failures or canaries. Note that you are using a fairly old version of Nomad, one that won't be getting backported bug and security fixes after 1.3.0 goes out in a couple of months, so even if we figure out what's wrong here you may end up needing to upgrade to a newer version to get the fix (or backport the patch yourself).

The rest of the relevant information can be found in the following files:

Interesting. The output of nomad deployment status -verbose b04e3235 might be useful to see a more detailed history of that deployment.

@tgross tgross reopened this Jan 26, 2022
@tgross tgross added this to Needs Triage in Nomad - Community Issues Triage via automation Jan 26, 2022
@tgross tgross self-assigned this Jan 26, 2022
@tgross tgross moved this from Needs Triage to In Progress in Nomad - Community Issues Triage Jan 26, 2022

dpn commented Jan 26, 2022

Perfect, thank you so much! Yep, we're aware of the old version; we're a few versions behind but we upgrade at a quarterly cadence, and we have no expectation that you'll provide a backport if a fix becomes available.

That deployment has been GC'd at this point, but I've had the team kick another and attached dumps from this one:

alloc-5ea975aa.txt
alloc-40fc3de8.txt
alloc-6759ed8e.txt
alloc-a028d78e.txt
alloc-d17e6f0a.txt
alloc-d593259d.txt
deployment-9e522f99.txt
job-history.txt
job-status.txt


tgross commented Jan 27, 2022

Judging by the version numbers on that job-status.txt, it looks like this is all the same job? Does it happen with every deployment? If so, would it be possible to share the jobspec (even a redacted one might help)? My first hypothesis would be that there's a specific update and reschedule combination that can initially cause the issue, after which it stays "stuck" like that on subsequent updates. If we had a jobspec, that might help us reproduce it under test.


dpn commented Jan 27, 2022

Judging by the version numbers on that job-status.txt, it looks like this is all the same job? Does it happen with every deployment?

Yep, nailed it. On the few jobs we've seen this on, it seems that once a job is in this state it continues for some time. I'm not sure if they've ever become "fixed", but I can dig a bit and maybe shed some light on that.

My first hypothesis would be that there's a specific update and reschedule combination that can initially cause the issue and then it's "stuck" like that on subsequent updates.

Seems plausible from here. We generate our specs via some internal tooling; hopefully the JSON rendering works for you:

job-spec.json.txt

Also the job inspect is attached to my earlier post if that helps fill in any gaps from this spec.


dpn commented Jan 27, 2022

...I'm not sure if they've ever become "fixed", but I can dig a bit and maybe shed some light on that.

Interestingly enough, the job from the original report is still in this state- so it's managed to survive a version upgrade:

# This is showing all the allocations with the same names: duplicated ALLOC_INDEXs
» nomad job status REDACTED | grep running | grep REDACTED | awk '{print $1}' | xargs -I {} nomad alloc status {} | grep ^Name | sort | uniq -d
Name                = REDACTED[13]
Name                = REDACTED[15]
Name                = REDACTED[16]
Name                = REDACTED[8]

This is a completely different job from the one we've been looking at over the past few days. It's running on a different Nomad cluster, although both of these clusters in question are federated together. Would you be interested in the same dumps for this one?

Edit: And thanks for all of your help on this-- really appreciate your time.


tgross commented Feb 2, 2022

Interestingly, @DerekStrickland and I were chatting about this bit of code for a project he's working on, and he noticed that the code that stops allocations prefers to pick the ones with the highest "name index", but it specifically has code to handle the possibility that we could end up with duplicate names (reconcile.go#L844-L876). I'll see if I can work up a test case that "leaks" a name this way.

On the few jobs we've seen this on it seems that once the job is in this state it does continue for some time- I'm not sure if they've ever become "fixed", but I can dig a bit and maybe shed some light on that.

One hypothesis I have is that if the task group count drops below the name index, it'll correct itself. But obviously that's not a good workaround in general and it's totally useless in this case because the problematic name index is [0]!
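
(A toy sketch of that intuition, not Nomad's actual index-assignment code: assuming replacements take the lowest unused indexes below the group count, a leaked high index gets retired once the count drops below it, while a duplicated [0] never does.)

```go
package main

import "fmt"

// nextIndexes picks up to n unused indexes in [0, count), lowest first.
// used maps index -> how many running allocations currently hold it; any
// value greater than 1 is exactly the duplicate situation in this issue.
func nextIndexes(used map[int]int, count, n int) []int {
	var out []int
	for i := 0; i < count && len(out) < n; i++ {
		if used[i] == 0 {
			out = append(out, i)
		}
	}
	return out
}

func main() {
	// Two allocations both hold index 0 and one holds index 1, with count = 3.
	used := map[int]int{0: 2, 1: 1}
	// Only index 2 is free, so if two replacements are needed, one of them is
	// forced to reuse an index that is already taken, perpetuating the duplicate.
	fmt.Println(nextIndexes(used, 3, 2)) // prints [2]
}
```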

This is a completely different job from the one we've been looking at over the past few days. It's running on a different Nomad cluster, although both of these clusters in question are federated together. Would you be interested in the same dumps for this one?

I think we're good so far with what you've provided, thanks!


dpn commented Feb 3, 2022

Ahh wonderful news! Thanks again for digging in we really appreciate it 🙏


valodzka commented Apr 27, 2022

One hypothesis I have is that if the task group count drops below the name index, it'll correct itself.

I can confirm that this works, but it's a very inconvenient workaround.

@tgross tgross moved this from In Progress to Needs Roadmapping in Nomad - Community Issues Triage Jun 6, 2022
@tgross tgross removed their assignment Jun 6, 2022

tgross commented Jun 6, 2022

I wanted to follow up on this. I think trying to whack-a-mole the problem isn't going to get us the results we want when the design itself isn't really equipped to solve this problem definitively. The only way to strictly enforce this would be to reject plans that have overlapping alloc indexes at the plan apply step. This would increase the rate of plan rejections, but it could also make a few other features, like per-alloc volumes for CSI, operate more nicely.
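
(For illustration only, here is a hypothetical version of that check over a plan's proposed allocations. The types and function are made up; this is not Nomad's plan applier, just a sketch of the idea.)

```go
package main

import (
	"errors"
	"fmt"
)

// ProposedAlloc is a stand-in for the allocation data a plan would carry.
type ProposedAlloc struct {
	JobID      string
	JobVersion uint64
	Name       string // e.g. "a-job.a-task[8]"
}

// validateUniqueNames rejects a set of allocations if any (job, version, name)
// tuple appears more than once, i.e. an overlapping alloc index.
func validateUniqueNames(allocs []ProposedAlloc) error {
	seen := map[string]bool{}
	for _, a := range allocs {
		key := fmt.Sprintf("%s@v%d:%s", a.JobID, a.JobVersion, a.Name)
		if seen[key] {
			return errors.New("plan rejected: duplicate allocation name " + a.Name)
		}
		seen[key] = true
	}
	return nil
}

func main() {
	plan := []ProposedAlloc{
		{"a-job", 3, "a-job.a-task[8]"},
		{"a-job", 3, "a-job.a-task[8]"}, // duplicate index within the same version
	}
	fmt.Println(validateUniqueNames(plan))
}
```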

I'm going to mark this for further discussion and roadmapping.

@tgross tgross self-assigned this Feb 2, 2023
@tgross tgross added stage/accepted Confirmed, and intend to work on. No timeline commitment though. theme/scheduling and removed stage/needs-discussion labels Feb 2, 2023
@tgross tgross changed the title NOMAD_ALLOC_INDEX is not always unique within a single job version NOMAD_ALLOC_INDEX is not always unique within a single service job version Feb 8, 2023
@mikenomitch mikenomitch added the hcc/cst Admin - internal label May 22, 2023
@jrasell jrasell self-assigned this Aug 3, 2023

jrasell commented Oct 17, 2023

Hello everyone, I just wanted to provide an update to this issue as I have been spending a good amount of time on it recently and have made progress.

Our first approach of rejecting job plans within the plan applier (linked PR) seemed fine from a code standpoint, but when we discussed it further internally we realised it could create un-schedulable jobs. That ruled out the approach, and the same applies to raising errors during reconciliation or scheduling when duplicate allocation indexes are found. We therefore need to fix the bug itself rather than reject any occurrences we find.

I have been working on understanding the code, which has allowed me to create a reproduction of this error. I can now see how we get into this state and hope to locate the source of the bug soon.

valodzka commented

Nice to hear. Since fix #16401 I've seen this issue much more rarely, but it still occurs in some cases.
