Proposal for GPU and PCI passthrough support #1237

Merged
merged 1 commit into from
Mar 28, 2022

Conversation

geetikabatra
Contributor

This commit adds a proposal which enables GPU support
in CAPV using.

Signed-off-by: Geetika Batra <geetikab@vmware.com>

What this PR does / why we need it:

Which issue(s) this PR fixes (optional, in fixes #<issue number>(, fixes #<issue_number>, ...) format, will close the issue(s) when PR gets merged):
Fixes #

Special notes for your reviewer:

Please confirm that if this PR changes any image versions, then that's the sole change this PR makes.

Release note:


@k8s-ci-robot k8s-ci-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Aug 26, 2021
@k8s-ci-robot
Contributor

Hi @geetikabatra. Thanks for your PR.

I'm waiting for a kubernetes-sigs member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Aug 26, 2021
@geetikabatra geetikabatra changed the title Proposal for GPU support WIP: Proposal for GPU support Aug 26, 2021
@k8s-ci-robot k8s-ci-robot added cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Aug 26, 2021
@gab-satchi
Member

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Aug 26, 2021
docs/proposal/20210823-gpu-support.md (outdated)

#### Story 3 - PCI passthrough for single node Customer

Alex is an engineer at a retail organization that requires a single GPU node. They use one node with a GPU attached and want to keep things simple. Alex can simply add this GPU-connected machine to the cluster, and that should do the job. While selecting nodes, Alex can use appropriate labels to run his AI/ML workload on this particular node. PCI passthrough will provide direct GPU support. One challenge Alex may face is that, when using passthrough, he won't be able to migrate nodes.
Contributor

I suppose this also applies to non-GPU devices, right?

Contributor Author

@MaxRink I don't get your question here. Non GPU will be counted among regular nodes in my opinion.

Contributor

I mean passthrough for PCI-E devices that are not GPUs, as PCI-E passthrough shouldn't really care whether it's a GPU, a NIC, or even an HBA.

Contributor Author

Yeah, if we put it that way, yes. It also applies to non-GPU devices.

Member

Yes, precisely this. This proposal should be retitled to cover both GPU and PCI Passthrough support as that's what's being implemented.

Contributor

@yastij Does it make sense to split the PCI passthrough support out into a separate proposal? This could also enable the workflow of provisioning VMs with physical NICs attached to the VMs created by CAPV.

And if it is too much work to separate the proposal out, I +1 Naadir's comments on retitling the proposal since PCI passthrough support is not strictly GPU related.
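
To illustrate the point made in this thread, that PCI passthrough is agnostic to the kind of device, here is a minimal Go sketch. It assumes a device is identified only by its PCI vendor ID and device ID. The `PCIDeviceSpec` type, its field names, and `validatePCIDevices` are hypothetical names chosen for this example and are not taken from the proposal; the device IDs are example values.

```go
package main

import "fmt"

// PCIDeviceSpec is a hypothetical type used only for illustration.
// Nothing in it says whether the device is a GPU, a NIC, or an HBA.
type PCIDeviceSpec struct {
	VendorID int32 // PCI vendor ID, e.g. 0x10DE for NVIDIA or 0x8086 for Intel
	DeviceID int32 // PCI device ID (model specific; example values below)
}

// validatePCIDevices rejects any entry whose vendor or device ID is
// missing (zero or negative). The check is identical for every device kind.
func validatePCIDevices(devices []PCIDeviceSpec) error {
	for i, d := range devices {
		if d.VendorID <= 0 || d.DeviceID <= 0 {
			return fmt.Errorf("pciDevices[%d]: vendorID and deviceID must both be set", i)
		}
	}
	return nil
}

func main() {
	devices := []PCIDeviceSpec{
		{VendorID: 0x10DE, DeviceID: 0x1EB8}, // a GPU (example IDs)
		{VendorID: 0x8086, DeviceID: 0x1572}, // a NIC (example IDs)
	}
	if err := validatePCIDevices(devices); err != nil {
		fmt.Println("invalid:", err)
		return
	}
	fmt.Println("all devices valid")
}
```

Because the sketch never inspects the device class, a GPU and a NIC take the same path, which is the argument for a title that covers PCI passthrough in general.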

@geetikabatra geetikabatra marked this pull request as ready for review September 14, 2021 17:31
@geetikabatra geetikabatra changed the title WIP: Proposal for GPU support Proposal for GPU support Sep 14, 2021
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Sep 14, 2021
@srm09
Contributor

srm09 commented Feb 2, 2022

/unassign @gab-satchi
/assign

@k8s-ci-robot k8s-ci-robot assigned srm09 and unassigned gab-satchi Feb 2, 2022
Member

@yastij left a comment

Did an API review; there are a few comments to fix before merging. I also agree that splitting vGPU and PCI made things clearer.

Thanks !

Contributor

@srm09 left a comment

One of the missing things here is, changes to the VSphereVMStatus object to bubble the Device info to the user.
Another thing to take into consideration is if any new Conditions are needed to surface error scenarios or maybe just new reasons which can be set for existing conditions if any error happens during VM creation.

@k8s-ci-robot
Contributor

k8s-ci-robot commented Feb 23, 2022

@geetikabatra: The following test failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| pull-cluster-api-provider-vsphere-test | 0389031 | link | true | `/test pull-cluster-api-provider-vsphere-test` |

Full PR test history. Your PR dashboard. Please help us cut down on flakes by linking to an open issue when you hit one in your PR.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository. I understand the commands that are listed here.

@geetikabatra geetikabatra force-pushed the proposal branch 2 times, most recently from c52bcc7 to a49e0d7 Compare March 16, 2022 11:45
@geetikabatra
Contributor Author

@srm09 PTAL

docs/proposal/20210823-gpu-support.md (outdated)

```text
---
title: GPU support in CAPV
```
Contributor

Same as above

Contributor

@srm09 left a comment

It is missing the updates to the VSphereVMStatus to provide the Device info to the user which was called out in this comment.
I do not see the validation webhook changes in the final proposal, was this change overridden?

Could you please add the two things mentioned above and also take a look at the open conversations and resolve them if appropriate?
Thanks for all the work on the proposal, it is almost there! 👍🏾

@geetikabatra
Contributor Author

I had pushed an older version of the proposal, which is why many changes disappeared. Updating to the latest one.

@geetikabatra geetikabatra force-pushed the proposal branch 2 times, most recently from 1ce8a90 to 5894c2a Compare March 17, 2022 16:08
Contributor

@srm09 left a comment

One final set of changes. @yastij can you take a look at it too?

Signed-off-by: geetikab@vmware.com <geetikab@vmware.com>
Co-authored-by: Sagar Muchhal <muchhals@vmware.com>
@geetikabatra
Contributor Author

@yastij and @srm09 This can be merged now!

@geetikabatra
Contributor Author

One of the missing things here is, changes to the VSphereVMStatus object to bubble the Device info to the user. Another thing to take into consideration is if any new Conditions are needed to surface error scenarios or maybe just new reasons which can be set for existing conditions if any error happens during VM creation.

This is addressed in https://github.com/kubernetes-sigs/cluster-api-provider-vsphere/pull/1237/files#diff-da5c64bcd81497eebce388a7c1e0250d75fd759fb2131341546691e3d7475e41R147

@yastij
Member

yastij commented Mar 28, 2022

/approve
/lgtm

@k8s-ci-robot k8s-ci-robot added the lgtm "Looks good to me", indicates that a PR is ready to be merged. label Mar 28, 2022
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: yastij

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot k8s-ci-robot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Mar 28, 2022
@k8s-ci-robot k8s-ci-robot merged commit 32d025c into kubernetes-sigs:main Mar 28, 2022
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. lgtm "Looks good to me", indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.