
cannot unmarshal string into Go struct field PluginCommandLineFlags.flags.plugin.deviceListStrategy of type []string #410

xuzimianxzm opened this issue Jun 2, 2023 · 15 comments

@xuzimianxzm

I think the following configuration has an issue: the field deviceListStrategy is an array, but a string is provided here, so the nvidia-device-plugin-ctr init container fails when it starts.

cat << EOF > /tmp/dp-example-config0.yaml
version: v1
flags:
  migStrategy: "none"
  failOnInitError: true
  nvidiaDriverRoot: "/"
  plugin:
    passDeviceSpecs: false
    deviceListStrategy: envvar
    deviceIDStrategy: uuid
EOF
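
For reference, a minimal sketch of the same config with deviceListStrategy written as a list, which matches the []string field the plugin expects (this is the form that later comments in this thread confirm works):

version: v1
flags:
  migStrategy: "none"
  failOnInitError: true
  nvidiaDriverRoot: "/"
  plugin:
    passDeviceSpecs: false
    deviceListStrategy:   # list form rather than a plain string
      - envvar
    deviceIDStrategy: uuid
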
@xuzimianxzm (Author)

And the other related issue: the gpu-feature-discovery-init init container requires the deviceListStrategy field to be a string, not an array.

unable to load config: unable to finalize config: unable to parse config file: error parsing config file: unmarshal error: error unmarshaling JSON: while decoding JSON: json: cannot unmarshal array into Go struct field PluginCommandLineFlags.flags.plugin.deviceListStrategy of type string

@elezar (Member) commented Jun 2, 2023

Thanks @xuzimianxzm. The deviceListStrategy config option was updated from a string to a list late in the Device Plugin's v0.14.0 release cycle, and it seems the changes were never propagated to gpu-feature-discovery. This explains the error you're seeing in your second comment.

It also seems that we didn't implement a custom unmarshaller for deviceListStrategy when extending it in the device plugin, so the old string form is rejected instead of being accepted for backwards compatibility.

cc @cdesiniotis

Update: I have reproduced the failure in a unit test here: https://gitlab.com/nvidia/kubernetes/device-plugin/-/merge_requests/294 and we will work on getting a fix released.

@elezar (Member) commented Jun 2, 2023

As a workaround, could you specify the deviceListStrategy using the DEVICE_LIST_STRATEGY envvar instead?
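
For anyone deploying the plugin as a raw DaemonSet (rather than via Helm), a minimal sketch of this workaround, assuming the container is the nvidia-device-plugin-ctr container mentioned above and the rest of the spec stays unchanged:

env:
  - name: DEVICE_LIST_STRATEGY
    value: envvar

The idea is that the strategy is then read from the environment rather than from the config-file field that triggers the unmarshalling error.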

@ndacic commented Jun 29, 2023

@elezar what do you mean? I am facing the same issue. I am deploying it as a DaemonSet with Flux, not using Helm. Should I create a DEVICE_LIST_STRATEGY env variable for the container, set its value to envvar, and exclude deviceListStrategy: "envvar" from the config map?

@alekc commented Jul 31, 2023

@ndacic this is how I solved it:

nodeSelector:
  nvidia.com/gpu.present: "true"
config:
  map:
    default: |-
      version: v1
      flags:
        migStrategy: "none"
        failOnInitError: true
        nvidiaDriverRoot: "/"
        plugin:
          passDeviceSpecs: false
          deviceListStrategy:
            - envvar
          deviceIDStrategy: uuid
      sharing:
        timeSlicing:
          renameByDefault: false
          resources:
          - name: nvidia.com/gpu
            replicas: 10
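
If you are using the Helm chart, a values snippet like the one above is applied in the usual way; a sketch, assuming the same chart name and namespace as the install command quoted later in this thread:

helm upgrade -i nvdp nvdp/nvidia-device-plugin --namespace kube-system -f values.yaml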

@elezar (Member) commented Aug 1, 2023

This issue should be addressed in the v0.14.1 release.

@ndacic please let me know if bumping the version does not address your issue so that I can better document the workaround.
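
For the Helm-based installs discussed in this thread, picking up the fix is a matter of bumping the chart version, e.g. (a sketch based on the install command quoted further down):

helm upgrade -i nvdp nvdp/nvidia-device-plugin --version=0.14.1 --namespace kube-system -f values.yaml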

@erikschul

@elezar This is still a problem with version 0.14.3.

It fails with the official example:

version: v1
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 10

but it works with the example given above. Thanks @alekc
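
Putting the two together, a sketch of a time-slicing config that sidesteps the problem by spelling out deviceListStrategy as a list (assembled from the official example above and @alekc's values; an assumption, not an officially documented example):

version: v1
flags:
  plugin:
    deviceListStrategy:
      - envvar
sharing:
  timeSlicing:
    resources:
    - name: nvidia.com/gpu
      replicas: 10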

@PrakChandra

@elezar
I am using version 0.15.0

I need to set replicas to 1 so that I can have full access to the GPU node's resources.

My config looks like this:

      version: v1
      flags:
        migStrategy: none
      sharing:
        mps:
          default_active_thread_percentage: 10
          resources:
            - name: nvidia.com/gpu
              replicas: 2

So my g4dn.2xlarge instance provides 40 SMs, but with replicas set to 2 each pod only gets 20 SMs.

When I install version 0.15.0 I get the following error:
[screenshot]

Could you please suggest how and where I can configure this replica count as 1 so that I do not get the error?

@klueska (Contributor) commented May 21, 2024

It's not clear what you hope to accomplish by enabling MPS but setting its replicas to 1. If we allowed you to set replicas to 1, then you would get an MPS server started for the GPU, but only be able to connect 1 workload/pod to it (i.e. no sharing would be possible).

Can you please elaborate on exactly what your expectations are for using MPS? It sounds like maybe time-slicing is more what you are looking for. Either that, or (as I suggested before), maybe you want a way to limit the memory of each workload, but allow them all to share the same compute.

Please clarify what your expectations are. Just saying you want a way to "set replicas to 1" doesn't tell us anything, because that is a disallowed configuration for the reason mentioned above.
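
For reference, the time-slicing alternative mentioned here uses the same sharing stanza quoted earlier in this thread; a sketch with two time-sliced replicas (the replica count is illustrative):

version: v1
sharing:
  timeSlicing:
    resources:
      - name: nvidia.com/gpu
        replicas: 2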

@PrakChandra commented May 21, 2024

@klueska

I provisioned an optimized EKS GPU node (g4dn.2xlarge) with 1 GPU; its configuration is as follows:

[screenshot]

In order to have my workloads/pods scheduled on it, I created the DaemonSet via Helm:
helm upgrade -i nvdp nvdp/nvidia-device-plugin --version=0.15.0 --namespace kube-system -f values.yaml

Output:

[screenshot]

Logs:

[screenshot]

My config file is as follows; I have updated it in the values.yaml file in order to get MPS sharing so that multiple workloads can be scheduled on the GPU node:

  map: 
    default: |-
      version: v1
      flags:
        migStrategy: none
      sharing:
        mps:
          resources:
            - name: nvidia.com/gpu
              replicas: 2

===========================================================================

Issue:

When I set replicas: 2 I get the following output from one of the pods scheduled on the GPU node:

[screenshot]

In the above output the multiprocessor count is 20; however, I need a multiprocessor count of 40 so that the workloads can perform efficiently, otherwise with 20 they get slow.

My expectation:

If I could set replicas: 1, then the multiprocessor count would become 40 and the workloads could do their processing efficiently.

I followed this doc and came to this expectation:

Ref: https://github.com/NVIDIA/k8s-device-plugin/tree/release-0.15

[screenshot]

@elezar (Member) commented May 21, 2024

If you set replicas = 1, this is the same as no sharing, since you will only expose a single slice that is the same as the entire GPU.

@PrakChandra

@elezar @klueska
Although things didn't work from the Helm configuration, I was able to figure out a solution: I tweaked the value of CUDA_MPS_ACTIVE_THREAD_PERCENTAGE to 100 so that my full GPU is accessible to all the pods, and it is working as expected.

Thanks
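
For readers hitting the same limitation, a minimal sketch of what such an override can look like on a workload pod spec; the container name and image are placeholders, and CUDA_MPS_ACTIVE_THREAD_PERCENTAGE is the standard CUDA MPS client environment variable:

containers:
  - name: cuda-workload                 # hypothetical workload container
    image: my-registry/cuda-app:latest  # placeholder image
    env:
      - name: CUDA_MPS_ACTIVE_THREAD_PERCENTAGE
        value: "100"                    # let this MPS client use the full SM count
    resources:
      limits:
        nvidia.com/gpu: 1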

@PrakChandra

@elezar I am stuck with another issue where I am not able to get the GPU metrics.

time="2024-05-22T05:10:27Z" level=info msg="Starting dcgm-exporter"
time="2024-05-22T05:10:27Z" level=info msg="DCGM successfully initialized!"
time="2024-05-22T05:10:27Z" level=info msg="Not collecting DCP metrics: Error getting supported metrics: Profiling is not supported for this group of GPUs or GPU"
time="2024-05-22T05:10:27Z" level=info msg="Pipeline starting"
time="2024-05-22T05:10:27Z" level=info msg="Starting webserver"

Am I missing something?

apiVersion: apps/v1
kind: DaemonSet
metadata:
  name: dcgm-exporter
  namespace: monitoring
spec:
  selector:
    matchLabels:
      app: dcgm-exporter
  template:
    metadata:
      labels:
        app: dcgm-exporter
    spec:
      nodeSelector:
        nvidia.com/accelerator: gpu
      #   nvidia.com/lama: gpu
      # nodeSelector:
      #   nvidia.com/accelerator: gpu
      #   nvidia.com/lama: gpu
      # affinity:
      #   nodeAffinity:
      #     requiredDuringSchedulingIgnoredDuringExecution:
      #       nodeSelectorTerms:
      #       - matchExpressions:
      #         # On discrete-GPU based systems NFD adds the following label where 10de is the NVIDIA PCI vendor ID
      #         - key: nvidia.com/accelerator
      #           operator: In
      #           values:
      #           - "gpu"
      #       - matchExpressions:
      #         # On some Tegra-based systems NFD detects the CPU vendor ID as NVIDIA
      #         - key: app
      #           operator: In
      #           values:
      #           - "AI-GPU"
      #           - "AI-GPU-LAMA"
      #       - matchExpressions:
      #         # We allow a GPU deployment to be forced by setting the following label to "true"
      #         - key: nvidia.com/lama
      #           operator: In
      #           values:
      #           - "gpu"        
      tolerations:
        - key: app
          value: AI-GPU
          effect: NoSchedule
          operator: Equal
        - key: nvidia/gpu
          operator: Exists
          effect: NoSchedule
        # - key: app
        #   value: AI-GPU-LAMA
        #   effect: NoSchedule
        #   operator: Equal          
          
        ## note: adjust the nodeSelector above to match your GPU node's labels
      containers:
      - name: dcgm-exporter
        image: nvcr.io/nvidia/k8s/dcgm-exporter:2.0.13-2.1.2-ubuntu18.04
        ports:
        - containerPort: 9400
        securityContext:
          capabilities:
            add:
              - SYS_ADMIN

@elezar (Member) commented May 22, 2024

@PrakChandra looking at your issues here, they are not related to the original post. Could you please open new issues instead of extending this thread?

@PrakChandra

Sure. Thanks @elezar
