GPU support #968
I also need GPU worker node support. If you use the gpu-operator provided by NVIDIA, GPU support can be enabled without any manual driver installation on the worker node. |
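For anyone who wants to try this route, a minimal install sketch, assuming Helm 3 (chart options vary by version, so check the gpu-operator docs):

```sh
# Add NVIDIA's Helm repo and install the GPU Operator, which deploys the
# driver, container toolkit, and device plugin as DaemonSets on GPU nodes.
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update
helm install --wait gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator --create-namespace
```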
Thanks @moonek. It looks like the operator installs a Docker runtime? Hopefully it won't cause any conflict with the default containerd runtime? |
It looks like |
We did some experiments with the device plugin from GCE and got it working with the latest NVIDIA driver on Ubuntu 18.04 with containerd. If anyone is interested I can clean it up and publish it here. |
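For reference, the containerd side of such a setup usually comes down to registering the NVIDIA runtime wrapper; a rough sketch of the relevant `/etc/containerd/config.toml` fragment, assuming `nvidia-container-runtime` is already installed on the node at the path shown:

```toml
# Fragment of /etc/containerd/config.toml: make containerd launch containers
# through the NVIDIA runtime so they can see the GPUs. Restart containerd
# after editing.
[plugins."io.containerd.grpc.v1.cri".containerd]
  default_runtime_name = "nvidia"
  [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia]
    runtime_type = "io.containerd.runc.v2"
    [plugins."io.containerd.grpc.v1.cri".containerd.runtimes.nvidia.options]
      BinaryName = "/usr/bin/nvidia-container-runtime"
```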
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-testing, kubernetes/test-infra and/or fejta. |
Is there any progress here, or is help needed? I've seen that key propagation was merged in the latest release, but is there at least some design around the feature? I'm mostly interested in whether there are any ideas around VM scheduling, because all the VMware guides require manual setup of a particular PCI device for each VM (no scheduling). |
Issues go stale after 90d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-contributor-experience at kubernetes/community. |
First of all, let me describe our case: we run big clusters of stateless applications on top of vSphere with CAPV. Some of the servers have NVIDIA T4 GPUs that are used for inference only; no training runs on those cards. We have around 200 cards in total, so automating the provisioning of GPU-enabled VMs for Kubernetes nodes is a must. Given that the vast majority of GPUs adopted by enterprises today are NVIDIA's, we can skip everything related to installing drivers, device plugins, and discovering node capabilities in Kubernetes: all of that is handled by the NVIDIA GPU Operator, which practically everyone running NVIDIA GPUs in Kubernetes uses. The real question is how we can provision VMs with the desired number of GPUs using CAPV. I've put some time into researching the options here:
PCI passthrough is not really an option, because vSphere has no scheduling for passthrough PCI devices: you can't ask for a VM with a given number of cards and have vSphere find an appropriate hypervisor with GPUs available. There is no notion of a GPU in the first place; it is just a PCI device identified by its full ID. That leads to the second problem: to pass a device through, you have to specify the exact ID of each device, so you can't build a VM template and spawn multiple VMs from it. For CAPV to work with this passthrough approach, it would need an inventory of the exact PCI device IDs on every host and its own scheduling logic on top of vSphere.
I don't believe manually specifying each PCI device for every VM is a workable option. That leaves us with the vGPU option:
NVIDIA implements this as the vCS server (it requires an additional license), but once you have it, most things are handled by the vSphere plugin. You specify a GPU profile for the VM, i.e. the fraction of a GPU you want to allocate, split by GPU memory. The profiles are generic (not tied to a specific device) and the scheduling decision is made by vSphere. The resource-allocation policies are quite flexible, and you can tune how processing time is split between the different vGPUs on the same physical GPU. Moreover, you can create a VM template with a particular GPU profile and then spawn multiple nodes from it, so to get differently sized nodes in CAPV you just need several MachineDeployments built from different VM templates (a sketch follows below). The problem is that the current CAPV implementation strips the template of all devices except the network adapter and disk (as far as I remember). Maybe there is some way to fix this, at least for the vGPU case, in a cheap and fast way? |
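To illustrate the "several MachineDeployments from different VM templates" idea, a trimmed sketch: the vGPU-profiled template name, cluster name, and sizes are hypothetical placeholders, and the exact CAPI/CAPV API versions depend on your release.

```yaml
# One VSphereMachineTemplate per vGPU-profiled VM template, then one
# MachineDeployment per node size. Names below are hypothetical.
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
kind: VSphereMachineTemplate
metadata:
  name: gpu-workers-t4-16q
spec:
  template:
    spec:
      template: ubuntu-1804-vgpu-t4-16q  # VM template with a 16Q vGPU profile attached
      numCPUs: 8
      memoryMiB: 32768
      diskGiB: 100
---
apiVersion: cluster.x-k8s.io/v1alpha3
kind: MachineDeployment
metadata:
  name: gpu-workers-t4-16q
spec:
  clusterName: my-cluster
  replicas: 2
  selector:
    matchLabels: {}
  template:
    spec:
      clusterName: my-cluster
      version: v1.18.6
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1alpha3
          kind: KubeadmConfigTemplate
          name: gpu-workers
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1alpha3
        kind: VSphereMachineTemplate
        name: gpu-workers-t4-16q
```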
Stale issues rot after 30d of inactivity. If this issue is safe to close now please do so with /close. Send feedback to sig-contributor-experience at kubernetes/community. |
Rotten issues close after 30d of inactivity. Reopen the issue with /reopen. Send feedback to sig-contributor-experience at kubernetes/community. /close |
@fejta-bot: Closing this issue. |
/reopen |
@jayunit100: Reopened this issue. |
/remove-lifecycle rotten |
Is there any work on this? |
The Kubernetes project currently lacks enough contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle stale
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle stale |
The Kubernetes project currently lacks enough active contributors to adequately respond to all issues. This bot triages un-triaged issues according to the following rules:
- After 90d of inactivity, lifecycle/stale is applied
- After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
- After 30d of inactivity since lifecycle/rotten was applied, the issue is closed
You can:
- Mark this issue as fresh with /remove-lifecycle rotten
- Close this issue with /close
- Offer to help out with Issue Triage
Please send feedback to sig-contributor-experience at kubernetes/community. /lifecycle rotten |
For anyone still watching this: it looks like you can get pretty close to this with templates, according to https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/blob/main/docs/gpu-pci.md (a rough sketch of what that doc describes is below). |
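Roughly what the linked doc describes, as a sketch only (field names may differ between CAPV releases, so check the doc for your version): a VSphereMachineTemplate that asks vSphere to attach a passthrough PCI device by vendor/device ID.

```yaml
# Hypothetical names; 4318/7864 are the decimal PCI IDs for an NVIDIA
# Tesla T4 (0x10DE / 0x1EB8).
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: VSphereMachineTemplate
metadata:
  name: gpu-workers
spec:
  template:
    spec:
      template: ubuntu-2004-gpu   # hypothetical VM template name
      pciDevices:
        - vendorId: 4318          # 0x10DE = NVIDIA
          deviceId: 7864          # 0x1EB8 = Tesla T4
```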
@nicewrld What kind of templates do you mean? |
/kind feature
Describe the solution you'd like
As an operator, I would like to add GPU cards to the worker nodes so the data science team can utilize them (a consumption example is sketched below).
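For context on how this would be consumed: once the nodes expose GPUs through the NVIDIA device plugin, workloads request them via the standard `nvidia.com/gpu` extended resource, e.g.:

```yaml
# A GPU smoke-test pod: requests one GPU and prints the visible devices.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: OnFailure
  containers:
    - name: cuda
      image: nvidia/cuda:11.0-base   # any CUDA-enabled image works
      command: ["nvidia-smi"]        # lists the GPUs the container can see
      resources:
        limits:
          nvidia.com/gpu: 1          # schedules onto a node with a free GPU
```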
Anything else you would like to add:
There are multiple options for adding GPU support:
- DirectPath I/O (PCI passthrough) of the physical GPU
- NVIDIA GRID vGPU
Reference: https://blogs.vmware.com/apps/2018/07/using-gpus-with-virtual-machines-on-vsphere-part-1-overview.html
Environment:
- Kubernetes version (use `kubectl version`):
- OS (e.g. from `/etc/os-release`):