[Bug] I should be able to create Windows and Ubuntu nodegroups with GPU instances (revert #4243) #5256
Comments
Hello. Interesting. The code has explicit settings to disallow windows and GPU. Maybe they aren't allowed for EKS. I can see that they are allowed for regular EC2 instances here https://docs.aws.amazon.com/AWSEC2/latest/WindowsGuide/install-nvidia-driver.html but EKS instances are a bit different. It's worth checking out if that's the case. We do have a very specific validation against it, so there must be a reason. :) |
I commented out this code in a fork to deploy Windows GPU instance nodegroups. Their EKS optimized AMIs do not come with GPU drivers, but the end user can still deal with that by installing drivers. |
Gotcha, thanks for checking that out! |
This bug is reporting a design flaw not an engineering flaw. It is my opinion you should not do this check. |
That is certainly an option that we will be discussing! |
This issue is stale because it has been open 30 days with no activity. Remove stale label or comment or this will be closed in 5 days. |
This issue was closed because it has been stalled for 5 days with no activity. |
@cPu1 to add a comment on this. We need to spike on the issue to see whether the feature can be done. Outcome: document the findings here. Timebox: 1-2 days. |
Here's a PR you can review: #5520 |
@doctorpangloss, the original behaviour of allowing creation of Windows GPU nodegroups was an oversight. The validation was later added to avoid giving the illusion that Windows GPU nodegroups are supported out of the box, when in fact, EKS optimized Windows AMIs do not ship with the GPU drivers installed. So we're planning to address this change in behaviour by evaluating adding support for installing the GPU drivers for Windows as part of the node bootstrapping process, not by removing the validation. |
Sounds like a different problem.
They are supported as far as I can tell. You should remove the validation for now. The validation was well meaning but wrong. I understand there will be inertia around this code already being written. I can tell you more about installing the GPU drivers maybe in a different place, like a dedicated driver bootstrapping ticket. Even if you install the drivers, apps will not be able to use the GPU. EKS doesn't even support hostProcess containers yet. This is way beyond the scope of this ticket. You guys can include documentation that says that Windows GPU nodes are not very useful. But why? There are 0 .yaml files - Kubernetes manifests and Helm Charts - indexed on all of public GitHub that use GPUs on Windows. |
Are you running them without installing the GPU drivers or running any additional steps after eksctl has created the nodegroup? eksctl supports GPU instances for Amazon Linux 2 and Bottlerocket because they provide GPU-specific AMIs that ship with the drivers installed and support running GPU-accelerated workloads without additional configuration after nodegroup creation. This is not the case for Windows. Anyway, we have decided to remove the validation and add a warning message because Windows is not a priority right now. |
@doctorpangloss As noted above by @cPu1 we decided that we will remove this validation and add a warning message instead to the user. Since you already have a PR open that removes the validation would you be able to add the warning message instead? |
I could, but you guys don't realize how poorly I know golang. |
@doctorpangloss alright, we'll get the fix added. Can you please close your PR? |
done thank you! |
I have edited the title to relax the validation for Ubuntu nodegroups as well. |
What were you trying to accomplish?
Create a Windows node group on an NVIDIA GPU instance
What happened?
I receive an error from the code that prevents me from doing that in eksctl. The error appears in https://github.com/cPu1/eksctl/blob/4bcff91151570a1a15a2a0d8d07ae33799500380/pkg/apis/eksctl.io/v1alpha5/validation.go#L286
How to reproduce it?
Logs
Anything else we need to know?
I handle the rest of the node provisioning needed for GPU support myself.
Versions