This commit adds a proposal which enables GPU support in CAPV. Signed-off-by: Geetika Batra <geetikab@vmware.com>
# GPU support CAPV

```text
---
title: GPU support CAPV
authors:
  - "@geetikabatra"
reviewers:
  - "@vijaykumar"
creation-date: 2021-08-23
last-updated: 2021-08-25
status: WIP
```
## Table of Contents

* [GPU support CAPV](#gpu-support-capv)
  * [Glossary](#glossary)
  * [Summary](#summary)
  * [Motivation](#motivation)
    * [Goals](#goals)
  * [Proposal](#proposal)
    * [User Stories](#user-stories)
      * [Story 1 - Using vGPU](#story-1---using-vgpu)
      * [Story 2 - GPU Direct implementation](#story-2---gpu-direct-implementation)
      * [Story 3 - PCI passthrough for single node customer](#story-3---pci-passthrough-for-single-node-customer)
    * [Requirements](#requirements)
      * [Functional Requirements](#functional-requirements)
      * [Non-Functional Requirements](#non-functional-requirements)
    * [Implementation Details/Notes/Constraints](#implementation-detailsnotesconstraints)
      * [Current State](#current-state)
      * [Proposed Changes](#proposed-changes)
        * [Controller Changes](#controller-changes)
        * [Clusterctl Changes](#clusterctl-changes)
    * [Security Model](#security-model)
      * [Roles](#roles)
      * [RBAC](#rbac)
        * [Write Permissions](#write-permissions)
      * [Namespace Restrictions](#namespace-restrictions)
      * [CAPV Controller Requirements](#capv-controller-requirements)
    * [Risks and Mitigations](#risks-and-mitigations)
      * [Caching](#caching)
  * [Alternatives](#alternatives)
    * [Using only secrets to specify vSphere accounts](#using-only-secrets-to-specify-vsphere-accounts)
      * [Benefits](#benefits)
      * [Mitigations for current proposal](#mitigations-for-current-proposal)
  * [Upgrade Strategy](#upgrade-strategy)
  * [Additional Details](#additional-details)
    * [Test Plan](#test-plan)
    * [Graduation Criteria](#graduation-criteria)
  * [Implementation History](#implementation-history)
## Glossary

* CAPV - An abbreviation of Cluster API Provider vSphere

## Summary

CAPV currently does not support GPU-based infrastructure.

This proposal outlines how to add GPU support to CAPV. The proposed changes maintain backwards compatibility and preserve the existing behaviour without any extra user configuration.

## Motivation

CAPV does not currently support GPU-accelerated workloads, which limits its total addressable market. Competitively, all of the hyperscale cloud providers (AWS, Azure, and GCP) offer GPU-accelerated virtualization and Kubernetes platforms, and many of them additionally offer robust vertically integrated AI/ML solutions built on open source technology. GPU support is a long-requested feature; with this capability, users can run GPU-accelerated workloads on CAPV-managed clusters.
### Goals

1. To enable GPU support on vSphere Kubernetes clusters.
2. To allow vGPU support for multi-node clusters.
3. To allow GPU Direct support for multi-node clusters.
4. To allow PCI passthrough support for single-node clusters.
## Proposal

### User Stories

#### Story 1 - Using vGPU

Stacy works in the IT department of an organization where AI/ML models are run every hour, and the entire staff wants to leverage GPU-backed nodes. vGPUs are able to provide close to bare-metal performance, and Stacy needs a large number of GPUs that can be shared across the team. With vGPU, nodes in a fully fledged cluster can each request a virtual GPU carved out of a shared physical GPU.
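To make Story 1 concrete, the fragment below sketches one possible shape for requesting a vGPU on a machine template. This is a sketch only: the `vgpuDevices` field and the `grid_v100-4q` profile name are illustrative assumptions, not the final API.

```yaml
# Hypothetical API sketch - field and profile names are assumptions,
# not a confirmed CAPV design.
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: VSphereMachineTemplate
metadata:
  name: workload-gpu
spec:
  template:
    spec:
      template: ubuntu-2004-kube-v1.22.0
      datacenter: dc0
      numCPUs: 8
      memoryMiB: 32768
      # Each machine cloned from this template would be attached one
      # NVIDIA GRID vGPU with the named profile.
      vgpuDevices:
        - profileName: grid_v100-4q
```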
#### Story 2 - GPU Direct implementation

Tony is an engineer at a big organization with multiple centers across the globe. Tony needs a big GPU cluster to address the needs of his organization, so they intend to use GPU Direct. GPU Direct is, in essence, a pool of GPUs linked peer-to-peer through network cards; it provides direct data paths between GPUs, giving the advantage of shared GPUs drawn from a pool. Tony wants to leverage this technology so that it is easier for him to manage resource allocation.
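A GPU Direct deployment pairs GPUs with RDMA-capable NICs on each machine. The fragment below is a sketch under the same assumed API shape as above; the `pciDevices` field and the specific vendor/device IDs are illustrative assumptions, not a confirmed design.

```yaml
# Hypothetical sketch - a datacenter GPU plus an RDMA-capable NIC passed
# through to each machine so GPU Direct RDMA can link GPUs across nodes.
# Field names and device IDs are examples, not the final API.
apiVersion: infrastructure.cluster.x-k8s.io/v1beta1
kind: VSphereMachineTemplate
metadata:
  name: workload-gpudirect
spec:
  template:
    spec:
      template: ubuntu-2004-kube-v1.22.0
      numCPUs: 16
      memoryMiB: 65536
      pciDevices:
        - vendorId: 0x10DE   # NVIDIA (example GPU device ID below)
          deviceId: 0x20B0
        - vendorId: 0x15B3   # Mellanox (example RDMA NIC device ID below)
          deviceId: 0x101B
```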
#### Story 3 - PCI passthrough for single node customer

Alex is an engineer at a retail organization that requires a single GPU node. They use one node with a GPU attached and want to keep things simple. Alex can simply add this GPU-connected machine to the cluster, and that should do the job. When scheduling workloads, Alex can use appropriate node labels to run his AI/ML workload on this particular node. PCI passthrough gives the node direct access to the GPU. One challenge Alex may face is that, with passthrough, the node cannot be live-migrated.
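Scheduling onto the passthrough node, as Story 3 describes, can be done with a plain node selector. The label key/value below is an assumption chosen for the example; the `nvidia.com/gpu` resource name follows the NVIDIA device plugin convention and requires that plugin to be installed on the node.

```yaml
# Illustrative workload pinned to the single passthrough GPU node.
apiVersion: v1
kind: Pod
metadata:
  name: train-job
spec:
  nodeSelector:
    gpu: "true"            # example label applied to the passthrough node
  containers:
    - name: trainer
      image: nvcr.io/nvidia/pytorch:21.07-py3
      resources:
        limits:
          nvidia.com/gpu: 1   # exposed by the NVIDIA device plugin
```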
### Requirements

#### Functional Requirements

* FR1: CAPV MUST support vGPU as first priority.
* FR2: CAPV MUST support GPU Direct as second priority.
* FR3: CAPV MUST support PCI passthrough as third priority.

#### Non-Functional Requirements

* NFR1: Unit tests MUST exist for all three GPU modes.
* NFR2: e2e tests MUST exist for all three GPU modes.
### Implementation Details/Notes/Constraints

#### Current State

#### Proposed Changes

#### Controller Changes

#### Clusterctl Changes

### Security Model

#### Roles

#### RBAC

##### Write Permissions

#### Namespace Restrictions

#### CAPV Controller Requirements

### Risks and Mitigations

#### Caching
## Alternatives

### Using only secrets to specify vSphere accounts

#### Benefits

* Re-using secrets ensures encryption by default and provides a clear UX signal to end users that the data is meant to be secure.
* Keeps clusterctl move straightforward with the 1:1 cluster -> credential relationship.

#### Mitigations for current proposal
## Upgrade Strategy

## Additional Details

### Test Plan

* Unit tests for the cluster controller to verify behaviour when a vGPU is requested.
* Unit tests for the cluster controller to verify that a vGPU works as expected.
* Unit tests for the cluster controller to verify that, when multiple vGPUs are requested, all of them behave as expected.
* Unit tests for the cluster controller to verify that GPUs requested via GPU Direct work as expected.
* e2e test with a GPU Direct cluster running in different availability zones.
* Unit tests for single node PCI passthrough.

### Graduation Criteria

Alpha

* Support vGPU
* Support GPU Direct
* Support PCI passthrough

Beta

* Full e2e coverage.

Stable

* Two releases since beta.

## Implementation History

* 08/23/2021: Initial Proposal