Cannot set GPUs using UUID as constraint #18112

Closed
ruspaul013 opened this issue Aug 1, 2023 · 6 comments · Fixed by #18141

Comments

@ruspaul013

Nomad version

nomad server: Nomad v1.5.6
nomad client: Nomad v1.5.6

Operating system and Environment details

Plugin "nomad-driver-podman" v0.5.0
Plugin "nomad-device-nvidia" v1.0.0

Issue

As mentioned in #15455, I can use GPU UUIDs as an attribute to set constraints. But when I request specific GPUs based on their UUIDs, I still get random GPUs.

Reproduction steps

Create a job file that requests multiple GPUs and sets a constraint based on their UUIDs.

Expected Result

Running nvidia-smi -L inside the container should show the GPUs with the UUIDs specified in the job file.

Actual Result

Running nvidia-smi -L inside the container shows random GPUs instead.

Job file (if appropriate)

job "test-2070-2" {
  datacenters = ["dc1"]
  group "test-2070-2" {

    restart {
        attempts = 0
    }
    count = 1
    task "test-2070-2" {
        driver = "podman"
        config {
            image = "image_with_gpu"
        }
        resources {
            cpu = 2650
            memory = 8192
            device "nvidia/gpu" {
                count = 2

                constraint {
                    attribute = "${device.model}"
                    value     = "NVIDIA GeForce RTX 2070 SUPER"
                }

                constraint {
                    attribute = "${device.ids}"
                    operator  = "set_contains"
                    value     = "GPU-9b5df054-6f08-f35c-9c4c-5709b19efea5,GPU-1846fc5f-8c71-bfab-00e1-9c190dd88ed7"
                }

            }
        }
    }
  }
}

results:

[root@481a2da8e0a9 /]# nvidia-smi -L
GPU 0: NVIDIA GeForce RTX 2070 SUPER (UUID: GPU-d7574813-0b3f-ee8f-39fc-2b48f9dff169)
GPU 1: NVIDIA GeForce RTX 2070 SUPER (UUID: GPU-9b5df054-6f08-f35c-9c4c-5709b19efea5)
@tgross tgross added this to Needs Triage in Nomad - Community Issues Triage via automation Aug 1, 2023
@tgross tgross self-assigned this Aug 1, 2023
@tgross tgross moved this from Needs Triage to Triaging in Nomad - Community Issues Triage Aug 1, 2023
@tgross
Member

tgross commented Aug 1, 2023

Hi @ruspaul013! Constraints only inform the scheduler to place the allocations on a client that has those GPU IDs in its fingerprint. Once the allocation has been placed, it's up to the task driver / device driver to select specific GPU devices on the host.

Can you confirm that the client your workloads are being placed on has the UUID you want, but that there just happen to be other GPUs on the same host and the selection is apparently random among them?

@ruspaul013
Author

ruspaul013 commented Aug 1, 2023

Hello @tgross. Thank you for the reply! Yes, the jobs are being placed on the client I want.

@tgross
Member

tgross commented Aug 1, 2023

Ok, it looks like the Reserve API should be able to pass specific device instance IDs down to the device driver, but we're probably not doing that in the client's device hook. Let me have a chat with some folks internally about what the right design is there.
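
For context, the Reserve hook on the device plugin side looks roughly like the sketch below (an abbreviated, approximate rendering of the interface in github.com/hashicorp/nomad/plugins/device; the response types are stubbed out here, and the real interface also embeds the base plugin methods). The point is that Reserve already accepts a list of specific instance IDs, so the missing piece is making sure the right IDs are chosen before it is called:

package device

import (
    "context"
    "time"
)

// Placeholder stand-ins for the real response types in
// github.com/hashicorp/nomad/plugins/device (details elided here).
type (
    FingerprintResponse  struct{}
    ContainerReservation struct{}
    StatsResponse        struct{}
)

// Abbreviated, approximate sketch of the device plugin interface.
type DevicePlugin interface {
    // Fingerprint streams the detected device instances (and their IDs)
    // to the Nomad client; these IDs are what the scheduler constrains on.
    Fingerprint(ctx context.Context) (<-chan *FingerprintResponse, error)

    // Reserve receives the specific device instance IDs chosen for an
    // allocation and returns mount/env instructions for the task driver.
    Reserve(deviceIDs []string) (*ContainerReservation, error)

    // Stats streams per-device statistics at the given interval.
    Stats(ctx context.Context, interval time.Duration) (<-chan *StatsResponse, error)
}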

tgross added a commit that referenced this issue Aug 3, 2023
When the scheduler assigns a device instance, it iterates over the feasible
devices and then picks the first instance with availability. If the jobspec uses
a constraint on device ID, this can lead to buggy/surprising behavior where the
node's device matches the constraint but then the individual device instance
does not.

Add a second filter based on the `${device.ids}` constraint after selecting a
node's device to ensure the device instance ID falls within the constraint as
well.

Fixes: #18112
@tgross tgross moved this from Triaging to In Progress in Nomad - Community Issues Triage Aug 3, 2023
@tgross
Member

tgross commented Aug 3, 2023

Ok, I've been able to reproduce this fairly easily and discovered that the problem is actually in the scheduler and not in the client's device hook (which is good because that's much easier to fix!). The scheduler's AssignDevice block selects the right device on a node but then picks the first device instance that's free. We need to apply the ${device.ids} constraint at that point as well.
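
As a minimal sketch of that idea (this is not the actual #18141 patch; the constraint struct and the instanceSatisfies helper are simplified stand-ins, and at the instance level a ${device.ids} set_contains constraint is read as "this instance's ID must be one of the listed IDs"):

package main

import (
    "fmt"
    "strings"
)

// Simplified stand-in for a jobspec device constraint; the real scheduler
// works with its own constraint structs and the full operator set.
type constraint struct {
    Attribute string // e.g. "${device.ids}"
    Operator  string // e.g. "set_contains"
    Value     string // comma-separated IDs from the jobspec
}

// instanceSatisfies reports whether a single device instance ID is allowed
// by any ${device.ids} set_contains constraints; constraints on other
// attributes are ignored, since only the instance-ID check needs this
// second pass.
func instanceSatisfies(id string, constraints []constraint) bool {
    for _, c := range constraints {
        if c.Attribute != "${device.ids}" || c.Operator != "set_contains" {
            continue
        }
        allowed := false
        for _, v := range strings.Split(c.Value, ",") {
            if strings.TrimSpace(v) == id {
                allowed = true
                break
            }
        }
        if !allowed {
            return false
        }
    }
    return true
}

func main() {
    constraints := []constraint{{
        Attribute: "${device.ids}",
        Operator:  "set_contains",
        Value:     "d50d2264-309e-11ee-b684-6b2b671ed775,daaab330-309e-11ee-99af-c7712ad76156",
    }}

    // Free instances on the node's matched device, in fingerprint order.
    // Without the second filter the first one would be assigned even
    // though it is not in the constraint.
    free := []string{
        "01a1dd44-320f-11ee-afa4-af7e178c4562",
        "daaab330-309e-11ee-99af-c7712ad76156",
    }

    for _, id := range free {
        if instanceSatisfies(id, constraints) {
            fmt.Println("assign instance", id)
            return
        }
    }
    fmt.Println("no free instance satisfies the ${device.ids} constraint")
}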

To reproduce, I used the following plugin config for our example-fs-device plugin:

plugin "example-fs-device" {
  config {
    dir = "/home/tim/fake-devices"
    list_period = "1s"
    unhealthy_perm = "-rw-rw-rw-"
  }
}

With 15 "device instances":

$ ls ~/tmp/fake-devices
01a1dd44-320f-11ee-afa4-af7e178c4562  034a90a0-320f-11ee-958b-57f5b1d9a6af
02056f30-320f-11ee-8964-2b1f777119c3  03798de2-320f-11ee-9c20-bf96edeb8a38
02350b96-320f-11ee-9586-abdd81e46ffa  03a71622-320f-11ee-9dbe-d70311b5ac9d
0262680c-320f-11ee-b8af-df93f1946025  03d35ae8-320f-11ee-8347-dfcd4da522c6
0290ef42-320f-11ee-9a4b-ab420cd87f26  d50d2264-309e-11ee-b684-6b2b671ed775
02bc8c24-320f-11ee-92f1-a79210183eb7  d9c4576e-309e-11ee-a516-1323547c76ab
02ea22ec-320f-11ee-9429-a3d7dc51e6d5  daaab330-309e-11ee-99af-c7712ad76156
0314f300-320f-11ee-a82d-e3f8314586d5

And using the following jobspec:

job "example" {

  group "web" {

    network {
      mode = "bridge"
      port "www" {
        to = 8001
      }
    }

    task "http" {

      driver = "docker"

      config {
        image   = "busybox:1"
        command = "httpd"
        args    = ["-vv", "-f", "-p", "8001", "-h", "/local"]
        ports   = ["www"]
      }

      resources {
        cpu    = 128
        memory = 100

        device "nomad/file/mock" {
          count = 1

          constraint {
            attribute = "${device.ids}"
            operator  = "set_contains"
            value     = "d50d2264-309e-11ee-b684-6b2b671ed775,daaab330-309e-11ee-99af-c7712ad76156"
          }
        }

      }

      template {
        data        = "<html>hello, world</html>"
        destination = "local/index.html"
      }

    }
  }
}

This will fail to get the correct device IDs most of the time (depending on random selection of the 15 devices).

With the patch in #18141 this now works reliably:

$ nomad alloc status 1816
...

Device Stats
nomad/file/mock[daaab330-309e-11ee-99af-c7712ad76156]  3 bytes

@tgross tgross added this to the 1.6.x milestone Aug 3, 2023
@ruspaul013
Author

Thank you @tgross! I will try a local build of Nomad with your changes in the next few days.

Nomad - Community Issues Triage automation moved this from In Progress to Done Aug 3, 2023
@tgross
Member

tgross commented Aug 3, 2023

#18141 has been merged and will ship in the next regular release of Nomad 1.6.x, with backports to 1.5.x and 1.4.x.
