Full EFA attachment for non-public IPs #2271

Merged: 3 commits, Feb 10, 2025
4 changes: 2 additions & 2 deletions docs/docs/concepts/fleets.md
@@ -47,8 +47,8 @@ This ensures all instances are provisioned in the same backend and region with o
??? info "AWS"
`dstack` automatically enables the Elastic Fabric Adapter for all
[EFA-capable instance types :material-arrow-top-right-thin:{ .external }](https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa.html#efa-instance-types){:target="_blank"}.
- Currently, only one EFA interface is enabled per instance, regardless of its maximum capacity.
- This will change once [this issue :material-arrow-top-right-thin:{ .external }](https://github.com/dstackai/dstack/issues/1804){:target="_blank"} is resolved.
+ If the `aws` backend config has `public_ips: false` set, `dstack` enables the maximum number of interfaces supported by the instance.
+ Otherwise, if instances have public IPs, only one EFA interface is enabled per instance due to AWS limitations.

> The `cluster` placement is supported only for `aws`, `azure`, `gcp`, `oci`, and `vultr`
> backends.
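
The rule stated in the updated docs text can be summarized in a few lines. This is an illustrative sketch only: `efa_interfaces_to_request` and its parameters are hypothetical names and are not part of `dstack`.

```python
def efa_interfaces_to_request(max_supported: int, public_ips: bool) -> int:
    """Return how many EFA interfaces to request for one instance.

    max_supported: maximum number of EFA interfaces the instance type
        supports (0 if the type is not EFA-capable).
    public_ips: whether instances are given public IPs.
    """
    if max_supported <= 0:
        # Not an EFA-capable instance type.
        return 0
    if public_ips:
        # With a public IP, only a single network interface can be requested.
        return 1
    # No public IP: enable every interface the instance type supports.
    return max_supported
```

With `public_ips: false`, a type that supports 32 interfaces gets all 32; with public IPs it gets one.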
1 change: 1 addition & 0 deletions src/dstack/_internal/core/backends/aws/compute.py
@@ -240,6 +240,7 @@ def create_instance(
allocate_public_ip=allocate_public_ip,
placement_group_name=instance_config.placement_group_name,
enable_efa=enable_efa,
+ max_efa_interfaces=max_efa_interfaces,
reservation_id=instance_config.reservation,
is_capacity_block=is_capacity_block,
)
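
This hunk passes `max_efa_interfaces` through but does not show how the value is computed. One plausible source, and this is an assumption rather than what the PR does, is the `NetworkInfo` section of an EC2 `DescribeInstanceTypes` entry, which reports `EfaSupported` and `EfaInfo.MaximumEfaInterfaces`. The helper name below is hypothetical.

```python
from typing import Any, Dict


def max_efa_interfaces_from_info(instance_type_info: Dict[str, Any]) -> int:
    """Extract the maximum EFA interface count from one DescribeInstanceTypes entry.

    `instance_type_info` is expected to look like one element of the
    "InstanceTypes" list returned by, for example,
    boto3.client("ec2").describe_instance_types(InstanceTypes=[...]).
    """
    network_info = instance_type_info.get("NetworkInfo", {})
    if not network_info.get("EfaSupported", False):
        return 0
    # Assumption: fall back to 1 if EFA is supported but no count is reported.
    return network_info.get("EfaInfo", {}).get("MaximumEfaInterfaces", 1)
```

The function works on a plain dictionary, so it can be exercised without AWS credentials.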
24 changes: 22 additions & 2 deletions src/dstack/_internal/core/backends/aws/resources.py
@@ -140,6 +140,7 @@ def create_instances_struct(
allocate_public_ip: bool = True,
placement_group_name: Optional[str] = None,
enable_efa: bool = False,
+ max_efa_interfaces: int = 0,
reservation_id: Optional[str] = None,
is_capacity_block: bool = False,
) -> Dict[str, Any]:
@@ -183,7 +184,7 @@ def create_instances_struct(
     # AWS allows specifying either NetworkInterfaces for specific subnet_id
     # or instance-level SecurityGroupIds in case of no specific subnet_id, not both.
     if subnet_id is not None:
-        # Even if the instance type supports multiple cards, we always request only one interface
+        # If the instance type supports multiple cards, we request multiple interfaces only if not allocate_public_ip
         # due to the limitation: "AssociatePublicIpAddress [...] You cannot specify more than one
         # network interface in the request".
         # Error message: "(InvalidParameterCombination) when calling the RunInstances operation:
@@ -199,9 +200,28 @@ def create_instances_struct(
                 "DeviceIndex": 0,
                 "SubnetId": subnet_id,
                 "Groups": [security_group_id],
-                "InterfaceType": "efa" if enable_efa else "interface",
+                "InterfaceType": "efa" if max_efa_interfaces > 0 else "interface",
             },
         ]
+
+        if max_efa_interfaces > 1 and allocate_public_ip is False:
+            for i in range(1, max_efa_interfaces):
+                # Set to efa-only to use interfaces exclusively for GPU-to-GPU communication
+                interface_type = "efa-only"
+                if instance_type == "p5.48xlarge":
+                    # EFA configuration for P5 instances:
+                    # https://docs.aws.amazon.com/AWSEC2/latest/UserGuide/efa-acc-inst-types.html#efa-for-p5
+                    interface_type = "efa" if i % 4 == 0 else "efa-only"
+                struct["NetworkInterfaces"].append(
+                    {
+                        "AssociatePublicIpAddress": allocate_public_ip,
+                        "NetworkCardIndex": i,
+                        "DeviceIndex": 1,
+                        "SubnetId": subnet_id,
+                        "Groups": [security_group_id],
+                        "InterfaceType": interface_type,
+                    }
+                )
     else:
         struct["SecurityGroupIds"] = [security_group_id]
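
To see what the new branch produces, the interface-building logic from this hunk can be lifted into a standalone function. This is a minimal sketch and not the code in `resources.py`: `build_network_interfaces` is a hypothetical name, and the `AssociatePublicIpAddress` and `NetworkCardIndex` fields of the first interface are assumptions, since those lines fall outside the visible hunk.

```python
from typing import Any, Dict, List


def build_network_interfaces(
    instance_type: str,
    subnet_id: str,
    security_group_id: str,
    allocate_public_ip: bool,
    max_efa_interfaces: int,
) -> List[Dict[str, Any]]:
    """Build a NetworkInterfaces list following the logic shown in the diff."""
    interfaces: List[Dict[str, Any]] = [
        {
            "AssociatePublicIpAddress": allocate_public_ip,  # assumed, not in the hunk
            "NetworkCardIndex": 0,  # assumed, not in the hunk
            "DeviceIndex": 0,
            "SubnetId": subnet_id,
            "Groups": [security_group_id],
            "InterfaceType": "efa" if max_efa_interfaces > 0 else "interface",
        }
    ]
    # Extra interfaces are requested only when no public IP is allocated.
    if max_efa_interfaces > 1 and allocate_public_ip is False:
        for i in range(1, max_efa_interfaces):
            interface_type = "efa-only"
            if instance_type == "p5.48xlarge":
                # On P5, every fourth network card gets a full "efa" interface.
                interface_type = "efa" if i % 4 == 0 else "efa-only"
            interfaces.append(
                {
                    "AssociatePublicIpAddress": allocate_public_ip,
                    "NetworkCardIndex": i,
                    "DeviceIndex": 1,
                    "SubnetId": subnet_id,
                    "Groups": [security_group_id],
                    "InterfaceType": interface_type,
                }
            )
    return interfaces
```

For `p5.48xlarge` with `max_efa_interfaces=32` and no public IP, this yields 32 interfaces, with `efa` on network cards 0, 4, 8, ..., 28 and `efa-only` on the rest. With a public IP, only the first interface is returned.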
