
Not showing warning when using a taken GPU #18364

Open
ruspaul013 opened this issue Aug 30, 2023 · 3 comments
Labels
hcc/jira stage/accepted Confirmed, and intend to work on. No timeline committment though. theme/devices theme/scheduling type/bug

Comments

@ruspaul013

Nomad version

nomad server: Nomad v1.6.1 + patch with #18141
nomad client: Nomad v1.6.1 + patch with #18141

Operating system and Environment details

Plugin "nomad-driver-podman" v0.5.1
Plugin "nomad-device-nvidia" v1.0.0

Issue

I created a patch with the solution provided in #18141 to test on our cluster. While testing, I discovered that if one or more GPUs are already used by other jobs, Nomad will not give a warning about this and will place the job without using those GPUs.

Reproduction steps

  1. Create a job file that uses multiple GPUs, with a constraint based on their UUIDs.
  2. Create a second job file that uses one or more of the GPUs already used by the first job.

Expected Result

Nomad throws a warning like WARNING: Failed to place all allocations.

Actual Result

The job is placed on the client with no warning.

Job file (if appropriate)

Job file 1:

job "test-2070-2" {
  datacenters = ["dc1"]
  group "test-2070-2" {

    restart {
        attempts = 0
    }
    count = 1
    task "test-2070-2" {
        driver = "podman"
        config {
            image = "image_with_gpu"
        }
        resources {
            cpu = 2650
            memory = 8192
            device "nvidia/gpu" {
                count = 2

                constraint {
                    attribute = "${device.model}"
                    value     = "NVIDIA GeForce RTX 2070 SUPER"
                }

                constraint {
                    attribute = "${device.ids}"
                    operator  = "set_contains"
                    value     = "GPU-9b5df054-6f08-f35c-9c4c-5709b19efea5,GPU-1846fc5f-8c71-bfab-00e1-9c190dd88ed7"
                }

            }
        }
    }
  }
}

Job file 2:

job "test-2070-2" {
  datacenters = ["dc1"]
  group "test-2070-2" {

    restart {
        attempts = 0
    }
    count = 1
    task "test-2070-2" {
        driver = "podman"
        config {
            image = "image_with_gpu"
        }
        resources {
            cpu = 2650
            memory = 8192
            device "nvidia/gpu" {
                count = 1

                constraint {
                    attribute = "${device.model}"
                    value     = "NVIDIA GeForce RTX 2070 SUPER"
                }

                constraint {
                    attribute = "${device.ids}"
                    operator  = "set_contains"
                    value     = "GPU-9b5df054-6f08-f35c-9c4c-5709b19efea5"
                }

            }
        }
    }
  }
}
tgross added this to Needs Triage in Nomad - Community Issues Triage via automation Sep 8, 2023
tgross (Member) commented Sep 8, 2023

@ruspaul013 I'm surprised that job spec doesn't simply return a validation error, but the constraint block belongs under job, group, or task, not under resources.device. Can you verify this is not working after that's corrected?

tgross moved this from Needs Triage to Triaging in Nomad - Community Issues Triage Sep 8, 2023
tgross self-assigned this Sep 8, 2023
ruspaul013 (Author) commented

Hello @tgross, thank you for the suggestion. I tried it, but now I get an error even if I do not use a constraint block for the IDs. I cannot place a simple job that uses a GPU.

resources {
    cpu    = 3200*4
    memory = 8192
    device "nvidia/gpu" {
        count = 1
    }
}

constraint {
    attribute = "${device.model}"
    value     = "NVIDIA GeForce RTX 4090"
}

and I get this warning:

Scheduler dry-run:
- WARNING: Failed to place all allocations.
  Task Group "paulr_gpu_test" (failed to place 1 allocation):
    * Constraint "${device.model} = NVIDIA GeForce RTX 4090": 2 nodes excluded by filter

"but the constraint block belongs under job, group, or task, not under resources.device"

I know the constraint block does not belong there, but the example in the device block documentation says otherwise.

lgfa29 added the theme/scheduling and stage/accepted labels Oct 5, 2023
lgfa29 (Contributor) commented Oct 5, 2023

Yeah, that's something I often forget but you can set constraint and affinity at the device level 😅
https://developer.hashicorp.com/nomad/docs/job-specification/device#constraint

But setting the constraint only filters out nodes from scheduling. Taking a look at the code I think the problem is that AllocsFit doesn't take current device usage into account.

Devices are not considered comparable resources, so they're not filtered here:

// Check that the node resources (after subtracting reserved) are a
// super set of those that are being allocated
available := node.ComparableResources()
available.Subtract(node.ComparableReservedResources())
if superset, dimension := available.Superset(used); !superset {
	return false, dimension, used, nil
}

And this part of the code only looks for device oversubscription among the allocs being scheduled, so it ignores the ones already running on the node:

// Check devices
if checkDevices {
	accounter := NewDeviceAccounter(node)
	if accounter.AddAllocs(allocs) {
		return false, "device oversubscribed", used, nil
	}
}
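
To make that gap concrete, below is a toy Go model of the accounting involved. This is not Nomad's actual DeviceAccounter; the types and names are simplified stand-ins, and the GPU IDs are abbreviated from the job files above. It shows how adding only the allocs being scheduled misses a device already claimed by a running alloc, while seeding the accounter with the node's existing allocs first surfaces the collision:

package main

import "fmt"

// deviceAccounter is a simplified stand-in for Nomad's DeviceAccounter:
// it tracks how many times each device instance (by ID) has been claimed.
type deviceAccounter struct {
	capacity map[string]int // slots available per device on the node
	used     map[string]int // slots claimed so far
}

func newDeviceAccounter(capacity map[string]int) *deviceAccounter {
	return &deviceAccounter{capacity: capacity, used: map[string]int{}}
}

// addAllocs claims the given device IDs and reports whether any device
// ends up oversubscribed (mirroring the boolean contract of AddAllocs
// in the snippet above).
func (d *deviceAccounter) addAllocs(deviceIDs []string) bool {
	collision := false
	for _, id := range deviceIDs {
		d.used[id]++
		if d.used[id] > d.capacity[id] {
			collision = true
		}
	}
	return collision
}

func main() {
	// One node with two GPUs, one slot each (IDs abbreviated).
	capacity := map[string]int{"GPU-9b5df054": 1, "GPU-1846fc5f": 1}

	running := []string{"GPU-9b5df054", "GPU-1846fc5f"} // job 1, already placed
	proposed := []string{"GPU-9b5df054"}                // job 2, being scheduled

	// Current behavior per the snippet above: only the proposed allocs
	// are added, so no oversubscription is detected.
	cur := newDeviceAccounter(capacity)
	fmt.Println("proposed only:", cur.addAllocs(proposed)) // false

	// Sketch of a fix: seed the accounter with the node's running
	// allocs first, then add the proposed ones.
	fixed := newDeviceAccounter(capacity)
	fixed.addAllocs(running)
	fmt.Println("running + proposed:", fixed.addAllocs(proposed)) // true
}

In Nomad itself, a fix along these lines would presumably mean giving the AllocsFit device check visibility into the allocations already placed on the node before calling AddAllocs, per the diagnosis above.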

lgfa29 moved this from Triaging to Needs Roadmapping in Nomad - Community Issues Triage Oct 5, 2023
tgross removed their assignment Oct 30, 2023