
support mig devices #53

Merged
merged 9 commits into from
Aug 22, 2024

shoenig
Member

@shoenig shoenig commented Aug 21, 2024

Incorporates previous PRs by @attachmentgenie and @isidentical while fixing a couple of bugs and adding MIG-specific tests to the mock driver implementation.

  • Add rudimentary support for MIGs
  • Record utilizations from parent device
  • feat: switch to _v2 calls for memory infra and reintroducing ecc info. Adding some enhanced error catching for nvml queries
  • chore: bumping to go 1.22
  • chore: introduce MIG support
  • driver: fixup more MIG feature bugs

Closes #3
Closes #27
Closes #40

@shoenig
Member Author

shoenig commented Aug 21, 2024

spot check:

$ nvidia-smi -L
GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-d8d7e984-7a52-6ad0-676c-222f4be482b9)
GPU 1: NVIDIA A100-SXM4-40GB (UUID: GPU-658c3ec8-b842-8f3d-96b8-e919d699d17f)
GPU 2: NVIDIA A100-SXM4-40GB (UUID: GPU-143664cc-e9e6-dea8-c036-a112f4f4cf2e)
GPU 3: NVIDIA A100-SXM4-40GB (UUID: GPU-41c37062-3644-9e82-f1e5-4cf9832c5294)
GPU 4: NVIDIA A100-SXM4-40GB (UUID: GPU-b2210207-2f27-05df-a93d-8d060db4eabb)
GPU 5: NVIDIA A100-SXM4-40GB (UUID: GPU-6109dd5c-d378-24ee-f344-79b537574ea6)
  MIG 1g.5gb      Device  0: (UUID: MIG-8c4c5a27-e508-5556-b2ea-7e549465741e)
  MIG 1g.5gb      Device  1: (UUID: MIG-1d89abd8-d10a-5361-bd06-77a807048472)
  MIG 1g.5gb      Device  2: (UUID: MIG-c6b63f1b-a006-5803-b4f4-cc1b2867b34b)
  MIG 1g.5gb      Device  3: (UUID: MIG-0f03f623-f28d-5831-bc51-f450dad99b8c)
  MIG 1g.5gb      Device  4: (UUID: MIG-5915594e-2939-5009-89d2-cc81c198bd6f)
  MIG 1g.5gb      Device  5: (UUID: MIG-108eed9d-d296-5037-b840-0a54cccdc3fe)
  MIG 1g.5gb      Device  6: (UUID: MIG-705077ed-ed26-5909-aacd-4842d47b8c76)
GPU 6: NVIDIA A100-SXM4-40GB (UUID: GPU-c2d35f3f-98bc-7b5e-bf23-5b82ebd9a30e)
GPU 7: NVIDIA A100-SXM4-40GB (UUID: GPU-c7488922-4511-39b7-102b-552925241e95)
  MIG 3g.20gb     Device  0: (UUID: MIG-3dbd382d-dc25-5c9f-b925-e2b3c1b1513b)
  MIG 2g.10gb     Device  1: (UUID: MIG-d7128c1b-5510-54ea-9b5f-96975eb31882)
  MIG 1g.5gb      Device  2: (UUID: MIG-d12b4c16-9fd6-522c-8232-3e45bb4cdbfe)
  MIG 1g.5gb      Device  3: (UUID: MIG-7bca62a9-3c21-5ee3-99ab-008cd3f47af5)
$ nomad node status 52 | grep '^nvidia/gpu'
nvidia/gpu/NVIDIA A100-SXM4-40GB MIG 1g.5gb[MIG-0f03f623-f28d-5831-bc51-f450dad99b8c]   <none>
nvidia/gpu/NVIDIA A100-SXM4-40GB MIG 1g.5gb[MIG-108eed9d-d296-5037-b840-0a54cccdc3fe]   <none>
nvidia/gpu/NVIDIA A100-SXM4-40GB MIG 1g.5gb[MIG-1d89abd8-d10a-5361-bd06-77a807048472]   <none>
nvidia/gpu/NVIDIA A100-SXM4-40GB MIG 1g.5gb[MIG-5915594e-2939-5009-89d2-cc81c198bd6f]   <none>
nvidia/gpu/NVIDIA A100-SXM4-40GB MIG 1g.5gb[MIG-705077ed-ed26-5909-aacd-4842d47b8c76]   <none>
nvidia/gpu/NVIDIA A100-SXM4-40GB MIG 1g.5gb[MIG-7bca62a9-3c21-5ee3-99ab-008cd3f47af5]   <none>
nvidia/gpu/NVIDIA A100-SXM4-40GB MIG 1g.5gb[MIG-8c4c5a27-e508-5556-b2ea-7e549465741e]   <none>
nvidia/gpu/NVIDIA A100-SXM4-40GB MIG 1g.5gb[MIG-c6b63f1b-a006-5803-b4f4-cc1b2867b34b]   <none>
nvidia/gpu/NVIDIA A100-SXM4-40GB MIG 1g.5gb[MIG-d12b4c16-9fd6-522c-8232-3e45bb4cdbfe]   <none>
nvidia/gpu/NVIDIA A100-SXM4-40GB MIG 2g.10gb[MIG-d7128c1b-5510-54ea-9b5f-96975eb31882]  <none>
nvidia/gpu/NVIDIA A100-SXM4-40GB MIG 3g.20gb[MIG-3dbd382d-dc25-5c9f-b925-e2b3c1b1513b]  <none>
nvidia/gpu/NVIDIA A100-SXM4-40GB[GPU-143664cc-e9e6-dea8-c036-a112f4f4cf2e]              633 / 40960 MiB
nvidia/gpu/NVIDIA A100-SXM4-40GB[GPU-41c37062-3644-9e82-f1e5-4cf9832c5294]              633 / 40960 MiB
nvidia/gpu/NVIDIA A100-SXM4-40GB[GPU-658c3ec8-b842-8f3d-96b8-e919d699d17f]              633 / 40960 MiB
nvidia/gpu/NVIDIA A100-SXM4-40GB[GPU-d8d7e984-7a52-6ad0-676c-222f4be482b9]              633 / 40960 MiB
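
To make the spot check easier to read: in the `nvidia-smi -L` listing, GPU 5 and GPU 7 each have MIG instances indented beneath them (7 and 4 respectively), which accounts for the 11 MIG entries Nomad reports. The following is a minimal sketch, not code from this PR, that groups MIG lines under the GPU line preceding them. The `gpu` type and `parseList` function are hypothetical names, and the parsing assumes the line format shown above.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// uuidRe extracts the UUID from a line such as
// "GPU 0: NVIDIA A100-SXM4-40GB (UUID: GPU-d8d7e984-...)".
var uuidRe = regexp.MustCompile(`\(UUID: ([^)]+)\)`)

// gpu is a hypothetical record: one physical GPU plus the MIG
// instances listed beneath it.
type gpu struct {
	UUID string
	MIGs []string
}

// parseList groups each MIG line under the most recent GPU line.
// Lines without a UUID are ignored.
func parseList(out string) []gpu {
	var gpus []gpu
	sc := bufio.NewScanner(strings.NewReader(out))
	for sc.Scan() {
		line := strings.TrimSpace(sc.Text())
		m := uuidRe.FindStringSubmatch(line)
		if m == nil {
			continue
		}
		switch {
		case strings.HasPrefix(line, "GPU "):
			gpus = append(gpus, gpu{UUID: m[1]})
		case strings.HasPrefix(line, "MIG ") && len(gpus) > 0:
			last := &gpus[len(gpus)-1]
			last.MIGs = append(last.MIGs, m[1])
		}
	}
	return gpus
}

func main() {
	// Two lines per case, copied from the listing above.
	out := `GPU 6: NVIDIA A100-SXM4-40GB (UUID: GPU-c2d35f3f-98bc-7b5e-bf23-5b82ebd9a30e)
GPU 7: NVIDIA A100-SXM4-40GB (UUID: GPU-c7488922-4511-39b7-102b-552925241e95)
  MIG 3g.20gb     Device  0: (UUID: MIG-3dbd382d-dc25-5c9f-b925-e2b3c1b1513b)
  MIG 2g.10gb     Device  1: (UUID: MIG-d7128c1b-5510-54ea-9b5f-96975eb31882)`
	for _, g := range parseList(out) {
		fmt.Printf("%s: %d MIG instance(s)\n", g.UUID, len(g.MIGs))
	}
}
```

A GPU with a non-empty `MIGs` slice corresponds to what the diff calls a `parent` device; the plugin itself queries NVML rather than parsing CLI output.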

@shoenig shoenig marked this pull request as ready for review August 21, 2024 18:13
@shoenig shoenig requested a review from a team as a code owner August 21, 2024 18:13
Member

@schmichael schmichael left a comment

🎉

README.md (outdated, resolved)
Comment on lines +181 to +188
// A30/A100 MIG devices have no stats.
//
// https://docs.nvidia.com/datacenter/tesla/mig-user-guide/#telemetry
//
// Is this fixed on H100 or later? Maybe?
if mode == mig || mode == parent {
	continue
}
Member

Would it be safe to attempt to call DeviceInfoAndStatusByUUID and log/continue on error? I'd just hate for this to be something NVIDIA fixes in a driver update and then our plugin languishes for months without support because we don't even try.

Member Author

If we log, we end up emitting a log line for each MIG device every period. In the worst case that's 7 MIG devices on each of 8 GPUs (56 lines) every 30 seconds, which is a lot of log spam on the hope that NVIDIA will fix their stuff.
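
For reference, one possible middle ground between the two positions is to attempt the query but warn at most once per device. This is only a sketch under stated assumptions, not what the merged code does (the diff skips `mig` and `parent` devices outright). `onceLogger`, `newOnceLogger`, and `Warn` are hypothetical names, and the `logf` callback stands in for whatever logger the plugin uses.

```go
package main

import (
	"fmt"
	"sync"
)

// onceLogger remembers which device UUIDs have already produced a
// warning, so a recurring stats loop emits at most one line per device.
type onceLogger struct {
	mu   sync.Mutex
	seen map[string]struct{}
	logf func(msg string) // hypothetical sink for log lines
}

func newOnceLogger(logf func(msg string)) *onceLogger {
	return &onceLogger{seen: make(map[string]struct{}), logf: logf}
}

// Warn logs the first time a UUID is seen and reports whether it logged.
func (l *onceLogger) Warn(uuid string, err error) bool {
	l.mu.Lock()
	defer l.mu.Unlock()
	if _, ok := l.seen[uuid]; ok {
		return false
	}
	l.seen[uuid] = struct{}{}
	l.logf(fmt.Sprintf("stats unavailable for device %s: %v", uuid, err))
	return true
}

func main() {
	logger := newOnceLogger(func(msg string) { fmt.Println(msg) })
	err := fmt.Errorf("not supported")

	// Simulate three collection periods over two MIG devices;
	// only the first period produces output.
	for period := 0; period < 3; period++ {
		for _, uuid := range []string{"MIG-a", "MIG-b"} {
			logger.Warn(uuid, err)
		}
	}
}
```

The trade-off is extra state that lives for the life of the plugin, and a device that starts reporting stats later would still need a separate code path to be picked up.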

}
utzEncU := uint(utzEnc)
memUsedU := mem.Used / (1 << 20)
Member

Comment and/or a const for 1 << 20
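
A minimal sketch of what that suggestion could look like, assuming the memory value is a byte count held in a `uint64`. The names `bytesPerMiB` and `bytesToMiB` are hypothetical and not taken from the merged code.

```go
package main

import "fmt"

// bytesPerMiB is the number of bytes in one mebibyte (2^20 = 1,048,576).
const bytesPerMiB = 1 << 20

// bytesToMiB converts a byte count to whole MiB, truncating any remainder.
func bytesToMiB(b uint64) uint64 {
	return b / bytesPerMiB
}

func main() {
	// A 40960 MiB card, matching the totals in the spot check.
	total := uint64(40960) * bytesPerMiB
	fmt.Println(bytesToMiB(total))
}
```

Naming the constant documents the unit at the call site, so `mem.Used / bytesPerMiB` reads as a bytes-to-MiB conversion without a separate comment.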

shoenig and others added 2 commits August 22, 2024 07:52
Co-authored-by: Michael Schurter <michael.schurter@gmail.com>
@shoenig shoenig merged commit 92c14e6 into main Aug 22, 2024
10 checks passed
@shoenig shoenig deleted the support-mig-devices branch August 22, 2024 13:06
Successfully merging this pull request may close these issues.

Nomad plugin nvidia-gpu does not detect multi-instance GPUs
4 participants