I'm running your device plugin in OpenShift 3.11, which has Kubernetes under the hood. I realize you might not have done any testing with OCP, but I figured you might be able to help. Here is the setup:
Physical host (with Edge TPU USB device attached)
OCP cluster virtual machines
I am able to connect the TPU to the physical host and run the python demo code to show it works. I can even assign the USB device to a compute node VM and in the VM run the python demo code to show that the VM sees the device and can talk to it.
Before I do anything with the daemonset I see this on the physical host:
$ lsusb
Bus 002 Device 005: ID 1a6e:089a Global Unichip Corp.
I then use your yaml to deploy the daemonset. One of the pods in the daemonset shows:
oc logs -f edgetpu-device-plugin-52sjt
I0812 16:17:10.264373 1 plugin.go:98] Started gRPC service on plugin socket
I0812 16:17:10.264399 1 plugin.go:101] Started monitoring devices
I0812 16:17:10.264404 1 plugin.go:49] gRPC server started.
I0812 16:17:10.264607 1 plugin.go:118] Opened connection to kubelet socket
I0812 16:17:10.268002 1 server.go:56] Start watching devices
I0812 16:17:10.268025 1 server.go:66] Update a device list
I0812 16:17:10.268092 1 plugin.go:132] Registered device plugin
I0812 16:17:15.369094 1 server.go:150] Edge TPU became active.
I0812 16:17:15.369137 1 server.go:66] Update a device list
So far that all looks good. I then deploy the sample with your yaml file and it comes back with:
oc logs -f edgetpu-demo-9cb92
ERROR: Failed to retrieve TPU context.
ERROR: Node number 0 (edgetpu-custom-op) failed to prepare.
Failed in Tensor allocation, status_code: 1
And then if I go back to the physical host:
lsusb
Bus 002 Device 006: ID 18d1:9302 Google Inc.
It changed from 002:005 to 002:006. It is as if the physical host thinks the USB device was disconnected and reconnected. I had seen this before I started using your code: I would run a container on the VM and it would fail, I would see that the device had changed on the host, re-add the device to the VM, and run the container on the VM again, and then it would work.
Would you have any insight into why talking to the device or somehow assigning it to a container causes this name change? Thank you.
Update: As long as the device doesn't flip on me, your device plugin works great and I'm able to run your sample code. What is really strange is that the device doesn't just change its bus/device number, it also changes its vendor ID and name (1a6e:089a Global Unichip Corp. becomes 18d1:9302 Google Inc.). It's almost like the first time you talk to it, udev gets more info and sees it as a new device.
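For anyone trying to reproduce this, here is a minimal Python sketch that compares two `lsusb` snapshots and reports which fields changed for the device. It only parses text in the format shown above; the helper names (`parse_lsusb`, `watched_devices`, `diff_snapshots`) are made up for illustration, and the watched IDs are simply the two values observed in this issue, not an authoritative list.

```python
import re

# Matches lines such as "Bus 002 Device 005: ID 1a6e:089a Global Unichip Corp."
LSUSB_LINE = re.compile(
    r"Bus (?P<bus>\d{3}) Device (?P<device>\d{3}): "
    r"ID (?P<vendor>[0-9a-f]{4}):(?P<product>[0-9a-f]{4})\s*(?P<name>.*)"
)

# Assumption: the only IDs of interest are the two seen in this issue.
WATCHED_IDS = {("1a6e", "089a"), ("18d1", "9302")}


def parse_lsusb(text):
    """Turn raw lsusb output into a list of dicts, skipping unparseable lines."""
    entries = []
    for line in text.splitlines():
        match = LSUSB_LINE.match(line.strip())
        if match:
            entries.append(match.groupdict())
    return entries


def watched_devices(text, ids=WATCHED_IDS):
    """Keep only the entries whose vendor:product pair is in the watched set."""
    return [e for e in parse_lsusb(text) if (e["vendor"], e["product"]) in ids]


def diff_snapshots(before, after):
    """Return {field: (old, new)} for the first watched device in each snapshot."""
    old, new = watched_devices(before), watched_devices(after)
    if not old or not new:
        return {}
    return {k: (old[0][k], new[0][k]) for k in old[0] if old[0][k] != new[0][k]}


if __name__ == "__main__":
    before = "Bus 002 Device 005: ID 1a6e:089a Global Unichip Corp."
    after = "Bus 002 Device 006: ID 18d1:9302 Google Inc."
    for field, (old, new) in diff_snapshots(before, after).items():
        print(f"{field}: {old} -> {new}")
```

To use it on a live host, the snapshots could be taken with `subprocess.run(["lsusb"], capture_output=True, text=True).stdout` before and after deploying the demo pod.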