
GPU support #968

Open
jzhoucliqr opened this issue Jul 16, 2020 · 20 comments
Labels: kind/feature, lifecycle/rotten, priority/important-soon

Comments

@jzhoucliqr
Contributor

jzhoucliqr commented Jul 16, 2020

/kind feature

Describe the solution you'd like
As an operator, I would like to add GPU cards to the worker nodes so that the data science team can utilize them.

Anything else you would like to add:

There are multiple options to add GPU support:

  1. PCI passthrough: attach a dedicated GPU card from the host directly to the VM; the card cannot be attached to other VMs. This requires finding a host that has the GPU card available and cloning the VM to that specific host, instead of to a resource pool (a declarative sketch is shown below the reference link).
  2. vGPU: GPU cards can be shared across multiple VMs.

Reference: https://blogs.vmware.com/apps/2018/07/using-gpus-with-virtual-machines-on-vsphere-part-1-overview.html
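
For option 1, a declarative sketch of what this might look like on a worker machine spec is below. The field names are purely illustrative at this point (no such API exists in CAPV today), and the device/vendor IDs are example values for an NVIDIA Tesla T4.

```yaml
# Illustrative only: a passthrough GPU expressed as part of a machine spec.
# Field names are hypothetical; IDs are example values for an NVIDIA Tesla T4.
pciDevices:
  - deviceId: 0x1EB8   # PCI device ID of the GPU
    vendorId: 0x10DE   # PCI vendor ID (NVIDIA)
```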

Environment:

  • Cluster-api-provider-vsphere version:
  • Kubernetes version: (use kubectl version):
  • OS (e.g. from /etc/os-release):
@k8s-ci-robot added the kind/feature label Jul 16, 2020
@moonek
Contributor

moonek commented Jul 20, 2020

I also need GPU worker node support.
Relatedly, an OVA rebuild for GPU passthrough is also required (#954).

If you use the gpu-operator provided by NVIDIA, it can be installed without any additional driver work on the worker node.
https://github.com/NVIDIA/gpu-operator
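
For what it's worth, the operator is typically installed from NVIDIA's Helm chart. A minimal values override might look like the sketch below, assuming a chart version that exposes these keys and supports containerd (at the time of this thread that support was still pending, see the comments below); check the chart's default values for your version.

```yaml
# Hypothetical values.yaml for the NVIDIA gpu-operator Helm chart.
# Key names assume a chart version where they exist; verify against
# the chart defaults before using.
operator:
  defaultRuntime: containerd   # container runtime on the worker nodes ("docker" or "containerd")
driver:
  enabled: true                # let the operator install the NVIDIA driver on GPU nodes
toolkit:
  enabled: true                # install the NVIDIA container toolkit
```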

@jzhoucliqr
Contributor Author

Thanks @moonek. It looks like the operator will install the Docker runtime? Hopefully it won't cause any conflict with the default containerd runtime.

@yastij
Member

yastij commented Jul 23, 2020

cc @detiber @codenrhoden

@yastij added this to the v0.7.x milestone Jul 23, 2020
@yastij added the priority/important-soon label Jul 23, 2020
@detiber
Member

detiber commented Jul 23, 2020

It looks like gpu-operator has an open issue for containerd support, but it currently still requires Docker: NVIDIA/gpu-operator#7

@jzhoucliqr
Contributor Author

jzhoucliqr commented Jul 23, 2020

We did some experiments with the device plugin from GCE and got it working with the latest NVIDIA driver on Ubuntu 18.04 with containerd. If anyone is interested, I can clean it up and post it here.

https://github.com/GoogleCloudPlatform/container-engine-accelerators/blob/master/cmd/nvidia_gpu/README.md
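
For context, once a device plugin (GCE's or NVIDIA's) is running on a node, workloads consume the GPU through the extended resource the plugin advertises. A minimal smoke-test pod might look like the following; the pod name and image are just examples.

```yaml
# Minimal test pod requesting one GPU via the nvidia.com/gpu extended
# resource advertised by the device plugin. Name and image are examples.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda-vectoradd
      image: nvidia/samples:vectoradd-cuda11.2.1   # example CUDA sample image
      resources:
        limits:
          nvidia.com/gpu: 1    # schedules the pod onto a node with a free GPU
```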

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Oct 21, 2020
@yastij modified the milestones: v0.7.x, v0.8.0 Nov 16, 2020
@yastij removed the lifecycle/stale label Dec 2, 2020
@stormobile

Is there any progress here, or is help needed? I've seen that key propagation was merged in the latest release, but is there at least some design around the feature? I am mostly interested in whether there are any ideas around VM scheduling, because all the VMware guides require manual setup of a particular PCI device for each VM (no scheduling).

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Apr 22, 2021
@stormobile

First of all, let me describe our case: we've got big clusters running stateless applications on top of vSphere with CAPV. Some of the servers feature NVIDIA T4 GPUs that are used for inference only; no training runs on those cards. The total number of cards we have is around 200, so automating the provisioning of GPU-enabled VMs for Kubernetes nodes is really a must.

Taking into account that the absolute majority of GPUs widely adopted by enterprises are NVIDIA's, for now we can skip all the stuff related to installing drivers, device plugins, and discovering node capabilities in K8s; all of that is handled by the NVIDIA GPU Operator, which is used by practically everyone running NVIDIA GPUs in K8s. The end question here is how we can provision VMs with the desired number of GPUs using CAPV.

I've put some time into researching the options here:

  1. PCI Passthrough

PCI passthrough is actually not much of an option, because vSphere doesn't have any scheduling feature for PCI passthrough devices: you can't specify that you need a VM with a given number of cards and have vSphere find an appropriate hypervisor with GPUs available. There is no notion of a GPU in the first place; it is just a PCI device specified by its full ID.

Here comes the second problem: in order to pass a PCI device through, you need to specify the exact ID of each device, so you can't just make a VM template and then spawn multiple VMs from it. In order for CAPV to work with this passthrough approach, it would need to:

  1. Keep its own inventory of GPUs, PCI IDs, and hypervisors
  2. Track the allocation of GPUs per hypervisor/VM
  3. Make scheduling decisions instead of vSphere

I don't believe that manually specifying each PCI device for each VM is much of an option. This leaves us with the vGPU option:

  2. vGPU

For NVIDIA this is implemented as the vCS server (you need an additional license for it), but once you have it, most things are handled by the vSphere plugin. You specify a GPU profile for the VM, which is the fraction of a GPU you want to allocate, split by GPU memory. The profiles are generic (not tied to a specific device), and the scheduling decision is made by vSphere.

The resource allocation policies are rather flexible, and you can tune how processing time is split between different vGPUs on top of the same physical GPU.

Moreover, you can create a VM template with a particular GPU profile and then spawn multiple nodes from it. So in order to have differently sized nodes in CAPV, you just need several MachineDeployments based on different VM templates (see the sketch below).
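
To illustrate the last point, each vGPU profile would get its own VM template and a matching MachineDeployment, assuming CAPV preserved the GPU device from the source template (which is exactly the gap described below). A rough sketch, with hypothetical names and versions:

```yaml
# Hypothetical MachineDeployment for nodes cloned from a VM template that
# carries one specific vGPU profile; a second MachineDeployment would
# reference a template with a different profile.
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: workload-gpu-t4-16q            # hypothetical name
spec:
  clusterName: workload
  replicas: 2
  selector:
    matchLabels: {}
  template:
    spec:
      clusterName: workload
      version: v1.21.2                 # example Kubernetes version
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
          kind: KubeadmConfigTemplate
          name: workload-gpu-workers   # hypothetical bootstrap template
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: VSphereMachineTemplate
        name: ubuntu-1804-grid-t4-16q  # machine template based on the vGPU-profile VM template
```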

The problem is that the current CAPV implementation strips all devices from the template except the network adapter and disk (as far as I remember).

Maybe there is some cheap and fast way to fix this, at least for the vGPU case?

@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Jun 11, 2021
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jayunit100
Contributor

/reopen

@k8s-ci-robot
Contributor

@jayunit100: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot reopened this Feb 24, 2023
@chrischdi
Member

/remove-lifecycle rotten

@k8s-ci-robot removed the lifecycle/rotten label Aug 17, 2023
@ekarlso
Contributor

ekarlso commented Oct 6, 2023

Is there any work on this?

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Jan 29, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Feb 28, 2024
@nicewrld

For anyone still seeing this: it looks like you can get pretty close to this with templates, according to https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/blob/main/docs/gpu-pci.md
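
For reference, that doc declares the GPU as a pciDevices entry on the worker VSphereMachineTemplate, roughly like the sketch below. Names here are placeholders, and the device/vendor IDs shown are example values for an NVIDIA Tesla T4; substitute the IDs reported by your hardware.

```yaml
# Sketch based on the gpu-pci.md doc linked above: a worker machine template
# with a PCI passthrough GPU. Names are placeholders; IDs are for a Tesla T4.
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: VSphereMachineTemplate
metadata:
  name: gpu-workers               # placeholder name
spec:
  template:
    spec:
      template: ubuntu-2004-gpu   # source VM template to clone (placeholder)
      pciDevices:
        - deviceId: 0x1EB8        # PCI device ID of the GPU
          vendorId: 0x10DE        # PCI vendor ID (NVIDIA)
      # other required fields (network, datastore, server, etc.) omitted
```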

@ekarlso
Contributor

ekarlso commented May 31, 2024

@nicewrld What kind of templates do you mean?
