Add support for Elastic Fabric Adapter (EFA) to Karpenter #3127
Comments
IIUC, this falls under the scope of custom resources, which should be handled with #2390. @jonathan-innis can you confirm this?
We could also look at adding early support for Dynamic Resource Allocation (see https://kubernetes.io/blog/2022/12/15/dynamic-resource-allocation/). EFAs are an interesting bit of (virtual) hardware because the OS-bypass networking can only happen within the same subnet. The scheduler may therefore need to be aware of that limitation in order to place Pods appropriately. There are other considerations too, such as which security group to use for the EFAs. I could imagine a large cluster having more than one kind of EFA, with each kind associated with a different security group.
As a first step, it might be okay to leave the single-subnet setup to the user. They could configure a single provisioner with EFA support. Placement groups would also need to be set up, so it may be convenient to configure both at the provisioner level and then target that provisioner. Security groups would also be set up at the provisioner level as normal.
Until this feature is implemented, a temporary workaround is documented here:
This feature is really important for making distributed training on EKS viable, and its absence is currently a bottleneck for us when training large models. It's especially hard to get the launch template working well in combination with mounting NVMe disks.
Problem Statement:
Some workloads (distributed training, simulations, HPC applications) require the high-performance networking that AWS provides on instances enabled with Elastic Fabric Adapter (EFA). The specific instance types are documented here. Currently, Karpenter does not recognize EFA resource requests or limits specified in Kubernetes manifests as described below.
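For illustration, a Pod that requests one EFA interface might look like the sketch below. This is not the manifest from the original issue: the Pod name, container name, image, and command are placeholders chosen for the example, and only the resource name vpc.amazonaws.com/efa comes from this thread. Because it is an extended resource, the request and limit are set to the same value.

```yaml
# Illustrative sketch only; not the manifest from the original issue.
apiVersion: v1
kind: Pod
metadata:
  name: efa-example              # hypothetical name
spec:
  restartPolicy: Never
  containers:
    - name: worker               # hypothetical container name
      image: busybox             # placeholder image; a real workload would use an EFA-capable image
      command: ["sleep", "3600"]
      resources:
        requests:
          vpc.amazonaws.com/efa: 1   # extended resource named in this issue
        limits:
          vpc.amazonaws.com/efa: 1   # must equal the request for extended resources
```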
When such a manifest is applied to a Karpenter-enabled cluster, the Karpenter controller produces an error like the following:
Feature Request:
Add capability in Karpenter to recognize the resource vpc.amazonaws.com/efa, then identify and provision a suitable EC2 instance type with EFA enabled.