Proposes GPU support in CAPV

This commit adds a proposal which enables GPU support in CAPV.

Signed-off-by: Geetika Batra <geetikab@vmware.com>
geetikabatra committed Aug 26, 2021
1 parent 267f4e0 commit 040ad8f
1 changed file: docs/proposal/20210823-gpu-support.md (+200 lines)
# GPU support CAPV

```text
---
title: GPU support CAPV
authors:
- "@geetikabatra"
reviewers:
- "@vijaykumar"
creation-date: 2021-08-23
last-updated: 2021-08-25
status: WIP
```

## Table of Contents

* [GPU support CAPV](#gpu-support-capv)
* [Glossary](#glossary)
* [Summary](#summary)
* [Motivation](#motivation)
* [Goals](#goals)
* [Proposal](#proposal)
* [User Stories](#user-stories)
* [Story 1 - Using vGPU](#story-1---using-vgpu)
* [Story 2 - GPU Direct implementation](#story-2---gpu-direct-implementation)
* [Story 3 - PCI passthrough for single node customer](#story-3---pci-passthrough-for-single-node-customer)
* [Requirements](#requirements)
* [Functional Requirements](#functional-requirements)
* [Non-Functional Requirements](#non-functional-requirements)
* [Implementation Details/Notes/Constraints](#implementation-detailsnotesconstraints)
* [Current State](#current-state)
* [Proposed Changes](#proposed-changes)
* [Controller Changes](#controller-changes)
* [Clusterctl Changes](#clusterctl-changes)
* [Security Model](#security-model)
* [Roles](#roles)
* [RBAC](#rbac)
* [Write Permissions](#write-permissions)
* [Namespace Restrictions](#namespace-restrictions)
* [CAPV Controller Requirements](#capv-controller-requirements)
* [Risks and Mitigations](#risks-and-mitigations)
* [Caching](#caching)
* [Alternatives](#alternatives)
* [Using only secrets to specify vSphere accounts](#using-only-secrets-to-specify-vsphere-accounts)
* [Benefits](#benefits)
* [Mitigations for current proposal](#mitigations-for-current-proposal)
* [Upgrade Strategy](#upgrade-strategy)
* [Additional Details](#additional-details)
* [Test Plan](#test-plan)
* [Graduation Criteria](#graduation-criteria)
* [Implementation History](#implementation-history)

## Glossary

* CAPV - An abbreviation of Cluster API Provider vSphere

## Summary

CAPV does not currently support GPU-based infrastructure.

This proposal outlines ways to provision GPU support in CAPV. The proposed changes will maintain backward compatibility and preserve the existing behaviour without any extra user configuration.

## Motivation

CAPV does not currently support GPU-accelerated workloads, which has limited its total addressable market. Competitively, all of the hyperscale cloud providers (AWS, Azure, and GCP) offer GPU-accelerated virtualization and Kubernetes platforms, and many of them additionally offer robust, vertically integrated AI/ML solutions built on open source technology. GPU support is a long-requested feature; with this capability, users can run GPU-accelerated workloads on vSphere-backed clusters.

### Goals

1. To enable GPU support on vSphere-based Kubernetes clusters.
2. To allow vGPU support for multi-node clusters.
3. To allow GPUDirect support for multi-node clusters.
4. To allow PCI passthrough support for single-node clusters.

## Proposal

### User Stories

#### Story 1 - Using vGPU

Stacy works in the IT department of an organization where AI/ML models are run every hour, and the entire staff wants to leverage GPU-backed nodes. vGPUs can provide close to bare-metal performance, and Stacy needs many GPUs that can be shared across the team.
vGPU support provides a fully fledged cluster from which a virtual GPU can be requested.
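As a rough illustration, the API could surface this as a vGPU device request on the machine template. The sketch below is hypothetical: the `gpuDevices` and `profileName` fields are not part of the current CAPV API, and the template and profile names are placeholders.

```yaml
# Hypothetical sketch only: gpuDevices/profileName are illustrative
# field names, not part of the current CAPV v1alpha4 API.
apiVersion: infrastructure.cluster.x-k8s.io/v1alpha4
kind: VSphereMachineTemplate
metadata:
  name: workload-gpu
spec:
  template:
    spec:
      template: ubuntu-2004-kube-v1.21.1
      # Request an NVIDIA vGPU profile for each machine
      # cloned from this template.
      gpuDevices:
        - type: vGPU
          profileName: grid_v100-4q
```

A shape like this would let the machine controller attach the requested vGPU profile at clone time without any changes to non-GPU templates.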

#### Story 2 - GPU Direct implementation

Tony is an engineer at a large organization with multiple data centers across the globe. Tony needs a big GPU cluster to address the needs of his
organization, so they intend to use GPUDirect. GPUDirect is essentially a pool of GPUs connected by network cards in a peer-to-peer fashion. It is a recent technology that gives the advantage of shared GPUs from a pool and helps in linking multiple GPUs. Tony wants to leverage this technology so that it is easier for him to manage resource allocation.
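Once GPU-capable nodes join the cluster, workloads would request GPUs in the usual Kubernetes way. The example below is standard device-plugin usage, not a new CAPV API; it assumes the NVIDIA device plugin is already deployed on the GPU nodes, and the image name is illustrative.

```yaml
# Standard Kubernetes GPU request via the NVIDIA device plugin;
# assumes the plugin is already running on the GPU nodes.
apiVersion: v1
kind: Pod
metadata:
  name: cuda-training
spec:
  restartPolicy: Never
  containers:
    - name: trainer
      image: nvidia/cuda:11.4.0-base  # illustrative image tag
      command: ["nvidia-smi"]
      resources:
        limits:
          nvidia.com/gpu: 2  # two GPUs from the pool
```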


#### Story 3 - PCI passthrough for single node customer

Alex is an engineer at a retail organization that requires a single GPU node. They use one node with a GPU attached and want to keep things simple. Alex can simply add this GPU-connected machine to the cluster, and that should do the job. While scheduling workloads, Alex can use appropriate node labels to run AI/ML workloads on this particular node. PCI passthrough will provide direct GPU access. A challenge Alex can face is that, with passthrough, nodes cannot be migrated.
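The label-based scheduling Alex relies on could look like the following. The label key `gpu.example.com/passthrough` is hypothetical; any label applied to the passthrough node would work the same way.

```yaml
# Illustrative pod spec pinning an AI/ML workload to the single
# passthrough node; the label key is hypothetical.
apiVersion: v1
kind: Pod
metadata:
  name: inference
spec:
  nodeSelector:
    gpu.example.com/passthrough: "true"  # pin to the passthrough node
  containers:
    - name: inference
      image: nvidia/cuda:11.4.0-base  # illustrative image tag
      command: ["nvidia-smi"]
```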

### Requirements

#### Functional Requirements

* FR1: CAPV MUST support vGPU as the first priority.
* FR2: CAPV MUST support GPUDirect as the second priority.
* FR3: CAPV MUST support PCI passthrough as the third priority.

#### Non-Functional Requirements

* NFR1: Unit tests MUST exist for all three GPU modes (vGPU, GPUDirect, PCI passthrough).
* NFR2: e2e tests MUST exist for all three GPU modes.

### Implementation Details/Notes/Constraints

#### Current State



#### Proposed Changes



#### Controller Changes



#### Clusterctl Changes



### Security Model



#### Roles




#### RBAC


##### Write Permissions


#### Namespace Restrictions



#### CAPV Controller Requirements



### Risks and Mitigations

#### Caching



## Alternatives

### Using only secrets to specify vSphere accounts


#### Benefits

* Re-using secrets ensures encryption by default and provides a clear UX signal to end users that the data is meant to be secure
* Keeps clusterctl move straightforward with the 1:1 cluster -> credential relationship

#### Mitigations for current proposal



## Upgrade Strategy


## Additional Details

### Test Plan
* Unit tests for the cluster controller to test behaviour when a vGPU is requested.
* Unit tests for the cluster controller to verify that vGPU works as expected.
* Unit tests for the cluster controller to verify that, when multiple vGPUs are requested, all of them behave as expected.
* Unit tests for the cluster controller to verify that GPUs requested via GPUDirect work as expected.
* E2E test with a GPUDirect cluster running in different availability zones.
* Unit tests for single-node PCI passthrough.


### Graduation Criteria

Alpha

* Support vGPU
* Support GPUDirect
* Support PCI passthrough

Beta

* Full e2e coverage.

Stable

* Two releases since beta.

## Implementation History

* 08/23/2021: Initial Proposal
