add gpu uuid to the fingerprinted attributes #11

Closed
shumin1027 opened this issue Nov 24, 2022 · 10 comments · Fixed by hashicorp/nomad#15455

@shumin1027

I want to use the GPU UUID when configuring an affinity or constraint, but there is no such attribute among the fingerprinted attributes. How can this be achieved?

Fingerprinted Attributes

@lgfa29
Contributor

lgfa29 commented Dec 1, 2022

Hi @shumin1027 👋

I believe the UUID is just missing from the documentation as the fingerprinter does report the UUID.

Could you check if the GPU UUID is present when you run nomad node status -verbose <node ID>?

@lgfa29 lgfa29 self-assigned this Dec 1, 2022
@lgfa29 lgfa29 added the documentation Improvements or additions to documentation label Dec 1, 2022
@shumin1027
Author

@lgfa29
Thank you.
It's not that the documentation is missing. All fingerprinted attributes are defined here:

const (
	// Attribute names and units for reporting Fingerprint output
	MemoryAttr          = "memory"
	PowerAttr           = "power"
	BAR1Attr            = "bar1"
	DriverVersionAttr   = "driver_version"
	CoresClockAttr      = "cores_clock"
	MemoryClockAttr     = "memory_clock"
	PCIBandwidthAttr    = "pci_bandwidth"
	DisplayStateAttr    = "display_state"
	PersistenceModeAttr = "persistence_mode"
)

And here is how the attribute data is populated. As you can see, there is indeed no GPU UUID:

Attributes: attributesFromFingerprintDeviceData(deviceList[0]),

func attributesFromFingerprintDeviceData(d *nvml.FingerprintDeviceData) map[string]*structs.Attribute {
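
To make the point above concrete, here is a minimal sketch of how per-device fingerprint data maps onto named attributes. It is not the plugin's actual code: `fingerprintDevice`, `attributesFor`, and `UUIDAttr` are hypothetical names introduced for illustration, and the values are simplified to plain strings, whereas the real function returns `map[string]*structs.Attribute`.

```go
package main

import "fmt"

// fingerprintDevice is a hypothetical, simplified stand-in for
// nvml.FingerprintDeviceData. Only a few fields are modelled.
type fingerprintDevice struct {
	UUID          string
	DeviceName    string
	MemoryMiB     uint64
	DriverVersion string
}

const (
	// These two names appear in the plugin's const block quoted above.
	MemoryAttr        = "memory"
	DriverVersionAttr = "driver_version"
	// UUIDAttr is an assumption: the plugin defines no such attribute.
	UUIDAttr = "uuid"
)

// attributesFor builds the attribute map for one device. Any field that is
// not copied into the map here is invisible to constraints and affinities.
func attributesFor(d fingerprintDevice) map[string]string {
	return map[string]string{
		MemoryAttr:        fmt.Sprintf("%d MiB", d.MemoryMiB),
		DriverVersionAttr: d.DriverVersion,
		UUIDAttr:          d.UUID,
	}
}

func main() {
	attrs := attributesFor(fingerprintDevice{
		UUID:          "GPU-aaaa",
		DeviceName:    "Tesla K80",
		MemoryMiB:     11441,
		DriverVersion: "470.0",
	})
	fmt.Println(attrs)
}
```

The sample UUID, memory size, and driver version are made-up values, not output from a real node.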

@shumin1027
Author

@lgfa29
By the way, I tried to solve this problem by adding the GPU UUID to the fingerprinted attributes here:

func attributesFromFingerprintDeviceData(d *nvml.FingerprintDeviceData) map[string]*structs.Attribute {

But I found a new problem: with a dual-GPU device such as the NVIDIA Tesla K80, there is a conflict:

Devices: devices,
// Assumption made that devices with the same DeviceName have the same
// attributes like amount of memory, power, bar1memory etc
Attributes: attributesFromFingerprintDeviceData(deviceList[0]),

Two GPUs with different UUIDs are treated as one device with the same fingerprinted attributes.

I still don't know how to solve this problem elegantly.
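
The conflict can be reproduced in a small, self-contained sketch. It assumes a simplified model in which devices are grouped by name and each group's attributes are taken from its first member, mirroring the `deviceList[0]` line quoted above. The types `gpu` and `deviceGroup` and the function `groupByName` are hypothetical and do not exist in the plugin.

```go
package main

import "fmt"

// gpu is a hypothetical, minimal view of one fingerprinted GPU.
type gpu struct {
	UUID string
	Name string
}

// deviceGroup is a hypothetical stand-in for a device group: many device
// IDs, but a single attribute map shared by the whole group.
type deviceGroup struct {
	Name       string
	IDs        []string
	Attributes map[string]string
}

// groupByName groups GPUs by name and takes the group's attributes from the
// first GPU seen, so a per-device value such as a UUID is only correct for
// that first member.
func groupByName(gpus []gpu) []deviceGroup {
	index := map[string]int{}
	var groups []deviceGroup
	for _, g := range gpus {
		i, ok := index[g.Name]
		if !ok {
			i = len(groups)
			index[g.Name] = i
			groups = append(groups, deviceGroup{
				Name:       g.Name,
				Attributes: map[string]string{"uuid": g.UUID},
			})
		}
		groups[i].IDs = append(groups[i].IDs, g.UUID)
	}
	return groups
}

func main() {
	// Two halves of a dual-GPU card share a name but have different UUIDs.
	groups := groupByName([]gpu{
		{UUID: "GPU-aaaa", Name: "Tesla K80"},
		{UUID: "GPU-bbbb", Name: "Tesla K80"},
	})
	for _, grp := range groups {
		fmt.Println(grp.Name, grp.IDs, grp.Attributes["uuid"])
	}
}
```

In this model the second GPU's UUID survives only in the group's ID list, never in the shared attributes, which is why a group-level `uuid` attribute cannot distinguish the two devices.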

@lgfa29
Contributor

lgfa29 commented Dec 2, 2022

Oh, you're right @shumin1027, I misunderstood the code. I don't think it's possible to fix this at the plugin level; it seems to be a limitation within Nomad.

I opened hashicorp/nomad#15455 to try and fix this. I'm building custom binaries for you to test if you have the chance and I will post them here once they're ready.

@lgfa29
Contributor

lgfa29 commented Dec 2, 2022

@shumin1027
Author

@lgfa29 It's great, I will continue to test

@lgfa29
Contributor

lgfa29 commented Dec 5, 2022

Nice! If the fix works for you, feel free to close this issue. I expect the PR to be merged soon and released in the next version of Nomad.

@ruspaul013

ruspaul013 commented Jun 19, 2023

Hello @shumin1027, @lgfa29 !
I tried using device.ids as a constraint in a job file, and every time I get random GPUs instead of the ones whose UUIDs I set. Here is the link to the Nomad forum post I made, which has more details about the issue.

Thank you!

@lgfa29
Contributor

lgfa29 commented Jun 19, 2023

Hi @ruspaul013 👋

I answered in your post, but TL;DR: yes, you need to upgrade your Nomad clients to a version that includes this change.

@ruspaul013

Hello @lgfa29 👋
Thank you so much for the answer. I posted an update on the forum.
