
GPU support #968

Open
jzhoucliqr opened this issue Jul 16, 2020 · 20 comments
Labels: kind/feature, lifecycle/rotten, priority/important-soon

Comments

@jzhoucliqr
Contributor

jzhoucliqr commented Jul 16, 2020

/kind feature

Describe the solution you'd like
As an operator, I would like to add GPU cards to the worker nodes so that the data science team can utilize them.

Anything else you would like to add:

There are multiple options to add GPU support:

  1. PCI passthrough: attach a dedicated GPU card from the host directly to the VM; the card cannot be attached to other VMs. This requires finding a host that has the GPU card available and cloning the VM to that specific host, instead of to a resource pool (a declarative sketch is shown below the reference link).
  2. vGPU: GPU cards can be shared across multiple VMs.

Reference: https://blogs.vmware.com/apps/2018/07/using-gpus-with-virtual-machines-on-vsphere-part-1-overview.html
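
For option 1, a declarative sketch of what this might look like on a worker machine spec is below. The field names are purely illustrative at this point (no such API exists in CAPV today), and the device/vendor IDs are example values for an NVIDIA Tesla T4.

```yaml
# Illustrative only: a passthrough GPU expressed as part of a machine spec.
# Field names are hypothetical; IDs are example values for an NVIDIA Tesla T4.
pciDevices:
  - deviceId: 0x1EB8   # PCI device ID of the GPU
    vendorId: 0x10DE   # PCI vendor ID (NVIDIA)
```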

Environment:

  • Cluster-api-provider-vsphere version:
  • Kubernetes version: (use kubectl version):
  • OS (e.g. from /etc/os-release):
@k8s-ci-robot added the kind/feature label Jul 16, 2020
@moonek
Contributor

moonek commented Jul 20, 2020

I also need GPU worker node support.
Relatedly, an OVA rebuild for GPU passthrough is also required (#954).

If you use the gpu-operator provided by NVIDIA, it can be installed without any additional driver work on the worker node.
https://github.com/NVIDIA/gpu-operator
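
For what it's worth, the operator is typically installed from NVIDIA's Helm chart. A minimal values override might look like the sketch below, assuming a chart version that exposes these keys and supports containerd (at the time of this thread that support was still pending, see the comments below); check the chart's default values for your version.

```yaml
# Hypothetical values.yaml for the NVIDIA gpu-operator Helm chart.
# Key names assume a chart version where they exist; verify against
# the chart defaults before using.
operator:
  defaultRuntime: containerd   # container runtime on the worker nodes ("docker" or "containerd")
driver:
  enabled: true                # let the operator install the NVIDIA driver on GPU nodes
toolkit:
  enabled: true                # install the NVIDIA container toolkit
```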

@jzhoucliqr
Contributor Author

Thanks @moonek. It looks like the operator will install the Docker runtime? Hopefully it won't cause any conflict with the default containerd runtime.

@yastij
Member

yastij commented Jul 23, 2020

cc @detiber @codenrhoden

@yastij added this to the v0.7.x milestone Jul 23, 2020
@yastij added the priority/important-soon label Jul 23, 2020
@detiber
Member

detiber commented Jul 23, 2020

It looks like gpu-operator has an open issue for containerd support, but it currently still requires Docker: NVIDIA/gpu-operator#7

@jzhoucliqr
Contributor Author

jzhoucliqr commented Jul 23, 2020

We did some experiments with the device plugin from GCE and got it working with the latest NVIDIA driver on Ubuntu 18.04 with containerd. If anyone is interested, I can clean it up and post it here.

https://github.com/GoogleCloudPlatform/container-engine-accelerators/blob/master/cmd/nvidia_gpu/README.md
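
For context, once a device plugin (GCE's or NVIDIA's) is running on a node, workloads consume the GPU through the extended resource the plugin advertises. A minimal smoke-test pod might look like the following; the pod name and image are just examples.

```yaml
# Minimal test pod requesting one GPU via the nvidia.com/gpu extended
# resource advertised by the device plugin. Name and image are examples.
apiVersion: v1
kind: Pod
metadata:
  name: gpu-smoke-test
spec:
  restartPolicy: Never
  containers:
    - name: cuda-vectoradd
      image: nvidia/samples:vectoradd-cuda11.2.1   # example CUDA sample image
      resources:
        limits:
          nvidia.com/gpu: 1    # schedules the pod onto a node with a free GPU
```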

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-testing, kubernetes/test-infra and/or fejta.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Oct 21, 2020
@yastij modified the milestones: v0.7.x, v0.8.0 Nov 16, 2020
@yastij removed the lifecycle/stale label Dec 2, 2020
@stormobile

Is there any progress here, or is help needed? I've seen that key propagation was merged in the latest release, but is there at least some design around the feature? I am mostly interested in whether there are any ideas around VM scheduling, because all the VMware guides require manual setup of a particular PCI device for each VM (no scheduling).

@fejta-bot

Issues go stale after 90d of inactivity.
Mark the issue as fresh with /remove-lifecycle stale.
Stale issues rot after an additional 30d of inactivity and eventually close.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Apr 22, 2021
@stormobile

First of all, let me describe our case: we've got big clusters running stateless applications on top of vSphere with CAPV. Some of the servers feature NVIDIA T4 GPUs that are used for inference only; no training runs on those cards. The total number of cards we have is around 200, so automating the provisioning of GPU-enabled VMs for Kubernetes nodes is really a must.

Taking into account that the absolute majority of GPUs widely adopted by enterprises are NVIDIA's, for now we can skip all the stuff related to installing drivers, device plugins, and discovering node capabilities in K8s; all of that is handled by the NVIDIA GPU Operator, which is used by practically everyone running NVIDIA GPUs in K8s. The end question here is how we can provision VMs with the desired number of GPUs using CAPV.

I've put some time into researching the options here:

  1. PCI Passthrough

PCI passthrough is actually not much of an option, because vSphere doesn't have any scheduling feature for PCI passthrough devices: you can't specify that you need a VM with a given number of cards and have vSphere find an appropriate hypervisor with GPUs available. There is no notion of a GPU in the first place; it is just a PCI device specified by its full ID.

Here comes the second problem: in order to pass a PCI device through, you need to specify the exact ID of each device, so you can't just make a VM template and then spawn multiple VMs from it. In order for CAPV to work with this passthrough approach, it would need to:

  1. Keep its own inventory of GPUs, PCI IDs, and hypervisors
  2. Track the allocation of GPUs per hypervisor/VM
  3. Make scheduling decisions instead of vSphere

I don't believe that manually specifying each PCI device for each VM is much of an option. This leaves us with the vGPU option:

  2. vGPU

For NVIDIA this is implemented as the vCS server (you need an additional license for it), but once you have it, most things are handled by the vSphere plugin. You specify a GPU profile for the VM, which is the fraction of a GPU you want to allocate, split by GPU memory. The profiles are generic (not tied to a specific device), and the scheduling decision is made by vSphere.

The resource allocation policies are rather flexible, and you can tune how processing time is split between different vGPUs on top of the same physical GPU.

Moreover, you can create a VM template with a particular GPU profile and then spawn multiple nodes from it. So in order to have differently sized nodes in CAPV, you just need several MachineDeployments based on different VM templates (see the sketch below).
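
To illustrate the last point, each vGPU profile would get its own VM template and a matching MachineDeployment, assuming CAPV preserved the GPU device from the source template (which is exactly the gap described below). A rough sketch, with hypothetical names and versions:

```yaml
# Hypothetical MachineDeployment for nodes cloned from a VM template that
# carries one specific vGPU profile; a second MachineDeployment would
# reference a template with a different profile.
apiVersion: cluster.x-k8s.io/v1beta1
kind: MachineDeployment
metadata:
  name: workload-gpu-t4-16q            # hypothetical name
spec:
  clusterName: workload
  replicas: 2
  selector:
    matchLabels: {}
  template:
    spec:
      clusterName: workload
      version: v1.21.2                 # example Kubernetes version
      bootstrap:
        configRef:
          apiVersion: bootstrap.cluster.x-k8s.io/v1beta1
          kind: KubeadmConfigTemplate
          name: workload-gpu-workers   # hypothetical bootstrap template
      infrastructureRef:
        apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
        kind: VSphereMachineTemplate
        name: ubuntu-1804-grid-t4-16q  # machine template based on the vGPU-profile VM template
```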

The problem is that the current CAPV implementation strips all devices from the template except the network adapter and disk (as far as I remember).

Maybe there is some cheap and fast way to fix this, at least for the vGPU case?

@fejta-bot

Stale issues rot after 30d of inactivity.
Mark the issue as fresh with /remove-lifecycle rotten.
Rotten issues close after an additional 30d of inactivity.

If this issue is safe to close now please do so with /close.

Send feedback to sig-contributor-experience at kubernetes/community.
/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Jun 11, 2021
@fejta-bot

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

@k8s-ci-robot
Contributor

@fejta-bot: Closing this issue.

In response to this:

Rotten issues close after 30d of inactivity.
Reopen the issue with /reopen.
Mark the issue as fresh with /remove-lifecycle rotten.

Send feedback to sig-contributor-experience at kubernetes/community.
/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@jayunit100
Contributor

/reopen

@k8s-ci-robot
Contributor

@jayunit100: Reopened this issue.

In response to this:

/reopen

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot reopened this Feb 24, 2023
@chrischdi
Member

/remove-lifecycle rotten

@k8s-ci-robot removed the lifecycle/rotten label Aug 17, 2023
@ekarlso
Contributor

ekarlso commented Oct 6, 2023

Is there any work on this?

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

@k8s-ci-robot added the lifecycle/stale label Jan 29, 2024
@k8s-triage-robot

The Kubernetes project currently lacks enough active contributors to adequately respond to all issues.

This bot triages un-triaged issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle rotten
  • Close this issue with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle rotten

@k8s-ci-robot added the lifecycle/rotten label and removed the lifecycle/stale label Feb 28, 2024
@nicewrld

For anyone still seeing this: it looks like you can get pretty close to this with templates, according to https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/blob/main/docs/gpu-pci.md
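
For reference, that doc declares the GPU as a pciDevices entry on the worker VSphereMachineTemplate, roughly like the sketch below. Names here are placeholders, and the device/vendor IDs shown are example values for an NVIDIA Tesla T4; substitute the IDs reported by your hardware.

```yaml
# Sketch based on the gpu-pci.md doc linked above: a worker machine template
# with a PCI passthrough GPU. Names are placeholders; IDs are for a Tesla T4.
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: VSphereMachineTemplate
metadata:
  name: gpu-workers               # placeholder name
spec:
  template:
    spec:
      template: ubuntu-2004-gpu   # source VM template to clone (placeholder)
      pciDevices:
        - deviceId: 0x1EB8        # PCI device ID of the GPU
          vendorId: 0x10DE        # PCI vendor ID (NVIDIA)
      # other required fields (network, datastore, server, etc.) omitted
```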

@ekarlso
Contributor

ekarlso commented May 31, 2024

@nicewrld What kind of templates do you mean?
