nvidia-device-plugin-daemonset toolkit validation fails with containerd #143
Comments
kubelet logs: (output omitted)

kubectl get pods -A (output omitted)
@shysank Did you follow the documentation to install with the correct config for containerd (i.e. --set operator.defaultRuntime=containerd)? https://docs.nvidia.com/datacenter/cloud-native/gpu-operator/getting-started.html#install-the-gpu-operator Also, which version of the operator are you trying to install? Can you try with the latest 1.5.2 operator and confirm?
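For reference, the full install command (the chart name and the defaultRuntime flag are taken from the repro steps later in this issue) looks roughly like this; the repo URL and the --version pin to 1.5.2 are assumptions and may need adjusting:

```sh
# Assumed helm repo URL for the gpu-operator chart; the version pin to 1.5.2
# assumes the chart version tracks the operator release.
helm repo add nvidia https://nvidia.github.io/gpu-operator
helm repo update

helm install --wait --generate-name \
  nvidia/gpu-operator \
  --version 1.5.2 \
  --set operator.defaultRuntime=containerd
```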
Yes, I ran …
I tried with …
This is what I see in …
Adding @klueska for more input regarding the error.
Is containerd running?
Yes
Looking at your initial logs again, it seems that the … What do the container logs say?
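A sketch of how those logs can be pulled, assuming the default gpu-operator-resources namespace and that the init container is named toolkit-validation (both assumptions; check kubectl describe for the real names):

```sh
# Namespace, label and container name below are assumptions for illustration.
NS=gpu-operator-resources
POD=$(kubectl -n "$NS" get pods -l app=nvidia-device-plugin-daemonset -o name | head -n1)

kubectl -n "$NS" describe "$POD"                    # pod events (ImageInspectError shows up here)
kubectl -n "$NS" logs "$POD" -c toolkit-validation  # init container logs (assumed container name)
kubectl -n "$NS" logs "$POD" --all-containers=true  # logs from the remaining containers
```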
Ah, I see this error: … The driver version in ClusterPolicy is …
@shivamerla Am I missing something here? Or is there a workaround for this?
I hacked to remove …
I was looking into NVIDIA/k8s-device-plugin to see if there is any configuration for … /cc @klueska
@shysank somehow the libnvidia libraries are not getting injected into the device-plugin pod. With the nvidia-container-runtime set up, this should work if the drivers are successfully loaded. We didn't see this with internal testing with …
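One way to check whether the libraries are being injected at all, assuming you can exec into the device-plugin pod (pod name, namespace, and library path below are assumptions for illustration):

```sh
NS=gpu-operator-resources
POD=$(kubectl -n "$NS" get pods -o name | grep nvidia-device-plugin | head -n1)

# Look for the injected NVIDIA libraries inside the pod (path is typical for Ubuntu-based images).
kubectl -n "$NS" exec "$POD" -- sh -c 'ls -l /usr/lib/x86_64-linux-gnu/libnvidia-ml.so* 2>/dev/null || echo "libnvidia-ml not found"'

# If injection worked, nvidia-smi should also be usable from inside the pod.
kubectl -n "$NS" exec "$POD" -- nvidia-smi
```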
Sure, I'll try it on a different kind of VM to confirm.
Do you mean setting this …?
No, after the operator is uninstalled, the toolkit will reset it back to runc in config.toml. You can confirm this has happened and restart containerd just to be sure it takes effect.
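On the node itself this can be confirmed with something like the following (a sketch, assuming the default config path and a systemd-managed containerd):

```sh
# After uninstalling the operator, the default runtime should be back to runc.
grep -n 'default_runtime_name' /etc/containerd/config.toml

# Restart containerd so the reverted config actually takes effect.
sudo systemctl restart containerd
sudo systemctl status containerd --no-pager
```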
Got it, thanks!
The issue is because of a new property called … /cc @shivamerla @klueska
The operator supports both …
As some background, the …
For the …
For the …
In your particular situation, you seem to be running a version of …
In your PR you seem to add support for …
In general, the reason you are seeing the …
As such, I think the right fix is to add logic in the toolkit container to actually inspect the …
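A rough sketch of the kind of inspection being suggested here, assuming the config lives at /etc/containerd/config.toml (the key names are simplified and the exact entries the toolkit writes may differ):

```sh
# The CRI plugin keys differ between containerd config schema v1 and v2, so any
# runtime entry has to be written under the matching section.
CONFIG=/etc/containerd/config.toml

if grep -Eq '^\s*version\s*=\s*2' "$CONFIG"; then
  echo 'schema v2: runtimes live under [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.*]'
else
  echo 'schema v1 (no version field): runtimes live under [plugins.cri.containerd.runtimes.*]'
fi
```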
Thanks @klueska for the explanation!
I'll try and submit a PR for this.
@shivamerla is there an approximate timeline for when we'll get a new version of the toolkit?
@shivamerla I believe it is published now in v1.4.5 of the toolkit-container (and included in the new 1.6.0 version of the operator).
@klueska @shysank @shivamerla yes, these changes should be available in v1.4.5 of the toolkit container through v1.6.0 of the operator. @shysank please check this when you get a chance and close the ticket accordingly.
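One way to check which toolkit image is actually running (the daemonset name and namespace are assumptions based on a default install):

```sh
# Print the image used by the toolkit daemonset; it should report a v1.4.5 toolkit-container.
kubectl -n gpu-operator-resources get ds nvidia-container-toolkit-daemonset \
  -o jsonpath='{.spec.template.spec.containers[*].image}{"\n"}'
```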
I tested it with …
Fixed it with 1.6.1.
@shivamerla I just tested it with … I think …
This issue has been resolved: …
Then: …
1. Quick Debug Checklist
- Are the i2c_core and ipmi_msghandler kernel modules loaded on the nodes?
- Did you apply the CRD? (kubectl describe clusterpolicies --all-namespaces)
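A quick way to run through that checklist on a node (a sketch, assuming shell access to the node):

```sh
# Check that the kernel modules from the checklist are loaded on the node.
lsmod | grep -E 'i2c_core|ipmi_msghandler'

# Check that the ClusterPolicy CRD from the operator is applied.
kubectl describe clusterpolicies --all-namespaces
```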
1. Issue or feature description
I am trying to use the gpu-operator in a Kubernetes cluster created using cluster-api for Azure. After installing the operator, I ran into an issue where the nvidia-device-plugin-daemonset fails to come up and crashes in the init container, which tries to run a validation pod. On further inspection, I noticed that it was failing with ImageInspectError. The event log: …

PS: I'm using containerd for container management. The VM type is Azure NCv3 series.
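A sketch of how the failure can be inspected (namespace and label are assumptions based on a default install):

```sh
# Confirm the nodes really report containerd as the container runtime.
kubectl get nodes -o wide   # CONTAINER-RUNTIME column should show containerd://...

# Pull the events for the failing device-plugin pod, where the ImageInspectError appears.
kubectl -n gpu-operator-resources describe pods -l app=nvidia-device-plugin-daemonset
```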
2. Steps to reproduce the issue
helm install --wait --generate-name nvidia/gpu-operator --set operator.defaultRuntime=containerd
kubectl -n gpu-operator-resources get pods