Changing DeviceIDs after job updated in allocation #7696

Closed
xsikor opened this issue Apr 12, 2020 · 2 comments · Fixed by #7762

Comments

xsikor commented Apr 12, 2020

Nomad version

Nomad v0.10.4 and older

Operating system and Environment details

5.4.30-1-MANJARO
4.15.0-91-generic #92-Ubuntu
Any

Issue

DeviceIDs change after a job is updated, even though the previous allocation is still running and no real change has been made to it.
This was observed with an NVIDIA device and the qemu driver.
The scheduler then misbehaves, because it tries to create a new allocation on a DeviceID that is already in use.

Reproduction steps

Create a new job with Resources.Devices != nil
Update any field of this job in a way that keeps the current allocation (an in-place update)
The DeviceID is changed by GenericStack

Nomad server logs

//log debug info from https://github.com/hashicorp/nomad/blob/master/scheduler/rank.go#L189
task request resources:
&{Name:nvidia/gpu/p102 Count:1 Constraints:[] Affinities:[]}
create offer for &{Name:nvidia/gpu/p102 Count:1 Constraints:[] Affinities:[]}
device offer &{Vendor:nvidia Type:gpu Name:p102 DeviceIDs:[0c:00.0]}
addReserved reqID {Vendor:nvidia Type:gpu Name:p102} devInst &{Device:0xc000820af0 Instances:map[04:00.0:0 05:00.0:0 06:00.0:1 07:00.0:1 08:00.0:1 09:00.0:1 0a:00.0:1 0b:00.0:0 0c:00.0:0 0d:00.0:0]} deviceIds [0c:00.0]
//nomad logger
[DEBUG] worker: submitted plan for evaluation: eval_id=1d118d67-4f98-fcad-a425-f8cc0288b2dc
[DEBUG] worker.service_sched: setting eval status: eval_id=1d118d67-4f98-fcad-a425-f8cc0288b2dc job_id=8.74.8.71e3d7dc-b3ff-e526-3ecf-758b745de200 namespace=default status=complete

/////////Many others


//log debug info from https://github.com/hashicorp/nomad/blob/master/scheduler/rank.go#L189
task request resources:
&{Name:nvidia/gpu/p102 Count:1 Constraints:[] Affinities:[]}
create offer for &{Name:nvidia/gpu/p102 Count:1 Constraints:[] Affinities:[]}
device offer &{Vendor:nvidia Type:gpu Name:p102 DeviceIDs:[0b:00.0]}
addReserved reqID {Vendor:nvidia Type:gpu Name:p102} devInst &{Device:0xc000820af0 Instances:map[04:00.0:0 05:00.0:0 06:00.0:1 07:00.0:1 08:00.0:1 09:00.0:1 0a:00.0:1 0b:00.0:0 0c:00.0:0 0d:00.0:0]} deviceIds [0b:00.0]
[DEBUG] worker.service_sched: reconciled current state with desired state: eval_id=66725242-1f5b-1286-ed28-64e709a011ef job_id=8.74.8.71e3d7dc-b3ff-e526-3ecf-758b745de200 namespace=default results="Total changes: (place 0) (destructive 0) (inplace 1) (stop 0)
//nomad logger
Created Deployment: "5b23ed79-804a-9a1f-e017-4253bc36d2da"
Deployment Update for ID "10ca81e7-eb09-b443-6efc-75b3f240ca3a": Status "cancelled"; Description "Cancelled due to newer version of job"
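
One way to read the two `addReserved` lines above: the `Instances` map reports several device IDs with a usage count of 0, so the ID the running allocation already holds is only one of several candidates when a new offer is built. The sketch below is an illustration using the values from the log; `freeInstances` is a hypothetical helper, not a function from the Nomad code base, and the assumption that a count of 0 means "free" comes from reading the log, not from the scheduler source.

```go
package main

import (
	"fmt"
	"sort"
)

// freeInstances is a hypothetical helper (not part of Nomad). It returns,
// in sorted order, every device ID whose usage count is zero.
func freeInstances(usage map[string]int) []string {
	var free []string
	for id, used := range usage {
		if used == 0 {
			free = append(free, id)
		}
	}
	sort.Strings(free)
	return free
}

func main() {
	// Usage counts copied from the Instances map in the log above.
	usage := map[string]int{
		"04:00.0": 0, "05:00.0": 0, "06:00.0": 1, "07:00.0": 1,
		"08:00.0": 1, "09:00.0": 1, "0a:00.0": 1, "0b:00.0": 0,
		"0c:00.0": 0, "0d:00.0": 0,
	}
	// Both 0c:00.0 (first offer) and 0b:00.0 (second offer) are candidates.
	fmt.Println(freeInstances(usage))
}
```

If the offer is picked from this set without consulting the existing allocation, nothing ties the in-place update to the ID it was originally given.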

Fix

Restore the device resources from the existing allocation during the update, the same way network resources are already restored here:
https://github.com/hashicorp/nomad/blob/master/scheduler/util.go#L899
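
A minimal sketch of the idea, assuming simplified stand-in types: `AllocatedDevice`, `TaskResources`, and `restoreFromExisting` are hypothetical names for illustration and do not reproduce Nomad's actual structs or the code in `util.go`. It only shows the intended behavior, which is that an in-place update keeps the device IDs the running allocation already holds.

```go
package main

import "fmt"

// AllocatedDevice is a hypothetical, simplified device assignment.
type AllocatedDevice struct {
	Vendor, Type, Name string
	DeviceIDs          []string
}

// TaskResources is a hypothetical container for what was assigned to one task.
type TaskResources struct {
	Networks []string
	Devices  []AllocatedDevice
}

// restoreFromExisting copies the network and device assignments of the
// running allocation into the freshly computed resources, task by task.
// Tasks that do not exist in the running allocation are left untouched.
func restoreFromExisting(existing, updated map[string]*TaskResources) {
	for task, res := range updated {
		prev, ok := existing[task]
		if !ok {
			continue
		}
		res.Networks = prev.Networks
		res.Devices = prev.Devices
	}
}

func main() {
	gpu := func(id string) []AllocatedDevice {
		return []AllocatedDevice{{Vendor: "nvidia", Type: "gpu", Name: "p102", DeviceIDs: []string{id}}}
	}
	existing := map[string]*TaskResources{"vm": {Devices: gpu("0c:00.0")}}
	updated := map[string]*TaskResources{"vm": {Devices: gpu("0b:00.0")}}

	restoreFromExisting(existing, updated)
	fmt.Println(updated["vm"].Devices[0].DeviceIDs) // prints [0c:00.0]
}
```

The task name `vm` is made up; the two device IDs are the ones offered in the logs above.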

@notnoop notnoop added this to Needs Triage in Nomad - Community Issues Triage via automation Apr 13, 2020
@notnoop notnoop moved this from Needs Triage to Triaged in Nomad - Community Issues Triage Apr 13, 2020
xsikor commented Apr 17, 2020

Pushed a fix here: #7697

github-actions bot commented Nov 8, 2022

I'm going to lock this issue because it has been closed for 120 days ⏳. This helps our maintainers find and focus on the active issues.
If you have found a problem that seems similar to this, please open a new issue and complete the issue template so we can capture all the details necessary to investigate further.

@github-actions github-actions bot locked as resolved and limited conversation to collaborators Nov 8, 2022