Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug] I should be able to create Windows and Ubuntu nodegroups with GPU instances (revert #4243) #5256

Closed
doctorpangloss opened this issue May 13, 2022 · 17 comments · Fixed by #5627
Assignees
Labels
good first issue Good for newcomers kind/improvement priority/important-longterm Important over the long term, but may not be currently staffed and/or may require multiple releases

Comments

@doctorpangloss
Copy link

What were you trying to accomplish?

Create a Windows node group on an NVIDIA GPU instance

What happened?

I receive an error from the code that prevents me from doing that in eksctl. The error appears in https://github.com/cPu1/eksctl/blob/4bcff91151570a1a15a2a0d8d07ae33799500380/pkg/apis/eksctl.io/v1alpha5/validation.go#L286

How to reproduce it?

...
nodeGroups:
  - name: xxx
    instanceType: g5.xlarge
    amiFamily: WindowsServer20H2CoreContainer
eksctl create cluster -f config.yaml

Logs

Error: couldn't create cluster provider from options: NVIDIA GPU instance types are not supported for WindowsServer20H2CoreContainer

Anything else we need to know?

I provision the nodes the rest of the way to GPU support.

Versions

$ eksctl info
eksctl version: 0.97.0
kubectl version: v1.22.5
OS: windows
@Skarlso
Copy link
Contributor

Skarlso commented May 13, 2022

Hello.

Interesting. The code has explicit settings to disallow windows and GPU. Maybe they aren't allowed for EKS.

I can see that they are allowed for regular EC2 instances here https://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/install-nvidia-driver.html but EKS instances are a bit different.

It's worth checking out if that's the case. We do have a very specific validation against it, so there must be a reason. :)

@doctorpangloss
Copy link
Author

doctorpangloss commented May 13, 2022

I commented out this code in a fork to deploy Windows GPU instance nodegroups. Their EKS optimized AMIs do not come with GPU drivers, but the end user can still deal with that by installing drivers.

@Skarlso
Copy link
Contributor

Skarlso commented May 14, 2022

Gotcha, thanks for checking that out!

@doctorpangloss
Copy link
Author

doctorpangloss commented May 18, 2022

This bug is reporting a design flaw not an engineering flaw. It is my opinion you should not do this check. eksctl could give a warning instead. Maybe change it to that? There will be inertia around this code already being written.

@Skarlso
Copy link
Contributor

Skarlso commented May 19, 2022

That is certainly an option that we will be discussing!

@Himangini Himangini added the priority/backlog Not staffed at the moment. Help wanted. label May 24, 2022
@github-actions
Copy link
Contributor

This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days.

@github-actions github-actions bot added the stale label Jun 24, 2022
@github-actions
Copy link
Contributor

This issue was closed because it has been stalled for 5 days with no activity.

@cPu1 cPu1 removed the stale label Jun 30, 2022
@cPu1 cPu1 reopened this Jun 30, 2022
@Himangini Himangini added the kind/feature New feature or request label Jul 19, 2022
@Himangini
Copy link
Collaborator

@cPu1 to add comment on this

We need to spike on the issue to see if the feature can be done

Outcome : Document finding here for spike

Timebox: 1-2 days

@doctorpangloss
Copy link
Author

Here's a PR you can review: #5520

@cPu1
Copy link
Contributor

cPu1 commented Jul 19, 2022

@doctorpangloss, the original behaviour of allowing creation of Windows GPU nodegroups was an oversight. The validation was later added to avoid giving the illusion that Windows GPU nodegroups are supported out of the box, when in fact, EKS optimized Windows AMIs do not ship with the GPU drivers installed. So we're planning to address this change in behaviour by evaluating adding support for installing the GPU drivers for Windows as part of the node bootstrapping process, not by removing the validation.

@doctorpangloss
Copy link
Author

doctorpangloss commented Jul 19, 2022

So we're planning to address this change in behaviour by evaluating adding support for installing the GPU drivers for Windows as part of the node bootstrapping process, not by removing the validation.

Sounds like a different problem.

avoid giving the illusion that Windows GPU nodegroups are supported out of the box

They are supported as far as I can tell.

You should remove the validation for now. The validation was well meaning but wrong. I understand there will be inertia around this code already being written.

I can tell you more about installing the GPU drivers maybe in a different place, like a dedicated driver bootstrapping ticket. Even if you install the drivers, apps will not be able to use the GPU. EKS doesn't even support hostProcess containers yet. This is way beyond the scope of this ticket.

You guys can include documentation that says that Windows GPU nodes are not very useful. But why? There are 0 .yaml files - Kubernetes manifests and Helm Charts - indexed on all of public GitHub that use GPUs on Windows.

@cPu1
Copy link
Contributor

cPu1 commented Jul 19, 2022

They are supported out of the box. What do you mean? I'm running Windows GPU node groups happily right now.

Are you running them without installing the GPU drivers or running any additional steps after eksctl has created the nodegroup? eksctl supports GPU instances for Amazon Linux 2 and Bottlerocket because they provide GPU-specific AMIs that ship with the drivers installed and support running GPU-accelerated workloads without additional configuration after nodegroup creation. This is not the case for Windows.

Anyway, we have decided to remove the validation and add a warning message because Windows is not a priority right now.

@Himangini Himangini removed the kind/feature New feature or request label Jul 19, 2022
@Himangini
Copy link
Collaborator

@doctorpangloss As noted above by @cPu1 we decided that we will remove this validation and add a warning message instead to the user. Since you already have a PR open that removes the validation would you be able to add the warning message instead?

@doctorpangloss
Copy link
Author

add a warning message instead to the user.

I could but you guys don't release how poorly I know golang.

@Himangini
Copy link
Collaborator

add a warning message instead to the user.

I could but you guys don't release how poorly I know golang.

@doctorpangloss alright, we'll get the fix added. Can you please close your PR?

@doctorpangloss
Copy link
Author

done thank you!

@Himangini Himangini added the good first issue Good for newcomers label Aug 3, 2022
@Himangini Himangini added priority/important-longterm Important over the long term, but may not be currently staffed and/or may require multiple releases and removed priority/backlog Not staffed at the moment. Help wanted. labels Aug 12, 2022
@TiberiuGC TiberiuGC self-assigned this Aug 12, 2022
@cPu1 cPu1 changed the title [Bug] I should be able to create a Windows nodegroup with GPU instances (revert #4243) [Bug] I should be able to create Windows and Ubuntu nodegroups with GPU instances (revert #4243) Aug 16, 2022
@cPu1
Copy link
Contributor

cPu1 commented Aug 16, 2022

I have edited the title to relax the validation for Ubuntu nodegroups as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
good first issue Good for newcomers kind/improvement priority/important-longterm Important over the long term, but may not be currently staffed and/or may require multiple releases
Projects
None yet
5 participants