
Avoid detaching ENIs on nodes being drained #1223

Closed · mogren opened this issue Sep 18, 2020 · 25 comments

@mogren (Contributor) commented Sep 18, 2020

What would you like to be added:
We should prevent ipamd from trying to free ENIs when a node is about to be terminated.

For spot instances, we could do something similar to the aws-node-termination-handler and check the instance metadata endpoints.

For the case where a node is cordoned off before being terminated, meaning it is marked as "unschedulable", we should be able to check the node's taints before trying to attach or detach any ENIs.
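The cordon check could be sketched roughly as follows. This is illustrative only: ipamd is actually written in Go, and the function name and taint keys below are hypothetical examples, not the project's real API.

```python
# Illustrative sketch, not ipamd's actual implementation (which is Go).
# The taint keys listed are examples of termination-related taints.

TERMINATION_TAINTS = {
    "ToBeDeletedByClusterAutoscaler",         # cluster-autoscaler scale-down
    "aws-node-termination-handler/spot-itn",  # spot interruption (hypothetical key)
}

def should_skip_eni_reconcile(unschedulable: bool, taints: list) -> bool:
    """Return True if ipamd should leave ENI attachments alone.

    A cordoned node (spec.unschedulable) or one tainted for termination is
    likely going away, so detaching its ENIs now risks leaking them.
    """
    return unschedulable or any(t in TERMINATION_TAINTS for t in taints)
```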

Why is this needed:
There is no EC2 API call to directly "delete" an attached ENI; it must first be detached, which takes a few seconds, and then deleted. If the instance is terminated after the ENI has been detached but before it has been deleted, the ENI is leaked. Leaked ENIs can prevent Security Groups and VPCs from being deleted and require manual cleanup.
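The race can be illustrated with a toy in-memory model (the names here are made up; the real calls are EC2's DetachNetworkInterface followed by DeleteNetworkInterface):

```python
# Toy model of the detach-then-delete window. If the instance dies between
# the two calls, the ENI is left in the "available" state forever: leaked.

class FakeEC2:
    def __init__(self):
        self.enis = {"eni-1": "in-use"}

    def detach(self, eni_id):
        self.enis[eni_id] = "available"  # detached, but the ENI still exists

    def delete(self, eni_id):
        del self.enis[eni_id]

def free_eni(ec2, eni_id, instance_alive):
    ec2.detach(eni_id)
    if not instance_alive():   # instance terminated mid-cleanup
        return False           # delete never happens; the ENI leaks
    ec2.delete(eni_id)
    return True

ec2 = FakeEC2()
free_eni(ec2, "eni-1", instance_alive=lambda: False)
# ec2.enis now shows "eni-1" as "available": detached but never deleted,
# which is exactly the leak this issue is about.
```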

Related issues: #608, #69, #690

@jayanthvn (Contributor)

There are two parts to this: 1) there is an internal tracking ticket with the EC2 team to see whether this can be handled on their side, and 2) IPAMD can check whether the node is marked as unschedulable and, if so, prevent ENI deletion. The cleaner approach is the first.

@jayanthvn (Contributor)

A POC of the IPAMD changes is pending. We are also following up with the EC2 team on whether this can be handled internally.

@hurriyetgiray-ping

Do we have an ETA on this fix, please? This is causing us a lot of issues and requires substantial code to work around.

@hurriyetgiray-ping

Can this be tagged as a bug, please? It is more than a feature request.

@jayanthvn jayanthvn added the bug label Jun 28, 2021
@jayanthvn (Contributor)

Hi @hurriyetgiray-ping,

What CNI version are you using, and do you have short-lived clusters in the account? We currently have a background thread (https://github.com/aws/amazon-vpc-cni-k8s/blob/master/pkg/awsutils/awsutils.go#L382) that cleans up leaked ENIs in the account, although I agree this requires at least one aws-node pod still running in the account.
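The shape of that background cleanup can be sketched as follows (a hedged Python sketch with injected callables; the real implementation is Go, in pkg/awsutils, and its names differ):

```python
import time

def reap_leaked_enis(list_leaked, delete, interval_s=600, stop=lambda: False):
    """Periodically delete ENIs that are detached but look CNI-created.

    list_leaked() returns ids of ENIs whose Status is "available" and whose
    Description marks them as created by the CNI; delete(eni_id) removes one.
    Both are injected so this sketch stays independent of any AWS SDK.
    """
    while not stop():
        for eni_id in list_leaked():
            delete(eni_id)
        time.sleep(interval_s)
```

As noted above, a loop like this only helps while at least one aws-node pod is still running in the account.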

@hurriyetgiray-ping

Hello @jayanthvn. Thank you for your response, and for tagging this issue as a bug. The CNI version is v1.7.5, as shown by the kubectl describe output below.

bash-4.2$ kubectl describe daemonset aws-node --namespace kube-system | grep Image | cut -d "/" -f 2
amazon-k8s-cni-init:v1.7.5
amazon-k8s-cni:v1.7.5 

Would the background thread you mention help us? How frequently does it run?

Not sure what qualifies a cluster's life span as "short", but this particular use case could qualify. We have an EKS cluster, but we manage the nodes ourselves. As part of our end-to-end test suite, we create nodes, perform tests, and then delete them via CloudFormation. Here are the related CloudFormation stack events and timelines showing the delete error.


2021-06-23 16:51:27 UTC+0100 eks-stateful-node-xxx DELETE_FAILED The following resource(s) failed to delete: [WorkerNodeSecurityGroup].
2021-06-23 16:51:26 UTC+0100 WorkerNodeSecurityGroup DELETE_FAILED resource sg-xxx has a dependent object (Service: AmazonEC2; Status Code: 400; Error Code: DependencyViolation; Request ID: xxx; Proxy: null)

2021-06-23 16:07:10 UTC+0100 WorkerNodeSecurityGroup CREATE_COMPLETE  (creates sg-xxx)

2021-06-23 16:06:59 UTC+0100 eks-stateful-node-xxx CREATE_IN_PROGRESS	Transformation succeeded
2021-06-23 16:06:52 UTC+0100 eks-stateful-node-xxx CREATE_IN_PROGRESS

@johngmyers

kOps is also having problems with ENIs being leaked by the amazon-vpc-cni upon cluster deletion.

I suspect the window between creation of an ENI and attaching it with DeleteOnTermination set is also a factor.
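That window can be made concrete: DeleteOnTermination is an attribute of the *attachment*, not of the ENI itself, so setting it takes extra API round-trips after the ENI is created. A toy sketch (fake object, all names illustrative):

```python
# Three separate EC2 calls are involved, leaving two windows in which a
# crash or termination strands the ENI. Fake object instead of a real client.

class FakeENI:
    def __init__(self):                  # 1. CreateNetworkInterface
        self.attached = False
        self.delete_on_termination = False

eni = FakeENI()
# -- window A: the ENI exists but is attached to nothing; if the creating
#    node dies here, nothing will ever clean the ENI up --
eni.attached = True                      # 2. AttachNetworkInterface
# -- window B: attached, but terminating the instance would only detach it --
eni.delete_on_termination = True         # 3. ModifyNetworkInterfaceAttribute
```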

@github-actions

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

@github-actions github-actions bot added the stale Issue or PR is stale label Apr 13, 2022
@Nuru commented Apr 13, 2022

People have been complaining about this since 2019. Can we get this taken care of now please?

@jayanthvn (Contributor)

@Nuru - #1927 should mitigate this issue to some extent, but we are actively working with the EC2 team on the implementation of the detach and delete calls.

@bryantbiggs (Member)

@jayanthvn anything preventing #1927 from being merged?

@jayanthvn (Contributor)

@bryantbiggs - It is pending code review. We are tracking it for 1.11.1 release. I will provide an ETA soon.

@github-actions

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

@github-actions github-actions bot added the stale Issue or PR is stale label Jun 13, 2022
@jayanthvn (Contributor)

/not stale

@github-actions github-actions bot removed the stale Issue or PR is stale label Jun 14, 2022
@EladGabay

Hi, any update here?
Thanks

@jayanthvn (Contributor)

#1927 mitigates the issue to a certain extent, but we are actively working with the EC2/VPC team on fixing the backend calls.

@github-actions

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

@github-actions github-actions bot added the stale Issue or PR is stale label Sep 21, 2022
@Nuru commented Sep 23, 2022

/not stale

@Nuru commented Sep 23, 2022

@jayanthvn how do we get this marked as "not stale"?

@github-actions

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

@github-actions github-actions bot added the stale Issue or PR is stale label Nov 24, 2022
@jayanthvn (Contributor)

/not stale

@github-actions github-actions bot removed the stale Issue or PR is stale label Nov 25, 2022
@kgyovai commented Jan 18, 2023

@Nuru - I had a similar use case to the one you described in #608 and encountered the same error. When provisioning a managed node group with Terraform (the aws_eks_node_group resource type) and then performing a subsequent destroy operation, the security group(s) assigned to the node group instance(s) could not be deleted, because one or more ENIs provisioned by AWS through the managed node group were still associated with the security group(s). (Screenshots omitted.)

Versions

EKS: 1.24
Terraform: 1.3.6
Terraform AWS Provider: 4.48.0
aws-vpc-cni: v1.12.0-eksbuild.2

My Solution

I ended up creating a null_resource with a local-exec destroy-time provisioner to delete any ENIs that were created by AWS through the managed node group and had already been detached. This is not a perfect solution if you are provisioning and destroying more than one set of managed node groups in the same account at the same time, but it works for my use case of standing up a single cluster with node groups and then tearing it all down.

Here is an example of the solution in action during a destroy operation using Terraform. (Screenshot omitted.)

Code

/* This resource is used to ensure that any "dangling" ENIs are deleted before the destroy operation is performed on
 * the security group assigned to the managed node group. This is often, but not always, necessary because managed node
 * groups result in the creation of AWS-managed ENIs that cannot be tracked through Terraform state. The
 * AWS-managed ENIs often detach during the destroy operation but are not always deleted. The root cause is
 * that the EC2 API requires the "detach" and "delete" invocations to be performed separately.
 * See https://github.com/aws/amazon-vpc-cni-k8s/issues/608
 */
resource "null_resource" "dangling_eni_cleanup" {
  provisioner "local-exec" {
    command     = "./delete-detached-enis.sh"
    when        = destroy
    working_dir = path.module
  }
  depends_on = [
    aws_security_group.this
  ]
}
delete-detached-enis.sh:

#!/bin/sh

# This script was created to provide a solution to "dangling" ENIs that occur during the deletion of a managed EKS node
# group. https://github.com/aws/amazon-vpc-cni-k8s/issues/608#issuecomment-694654359

set -e

query_dangling_enis() {
  aws ec2 describe-network-interfaces \
    --output json \
    --query 'NetworkInterfaces[?Status==`available`&&contains(Description, `aws-K8S`) == `true`].NetworkInterfaceId'
}

delete_eni() {
  echo "Deleting detached ENI: $1"
  aws ec2 delete-network-interface --network-interface-id "$1"
  echo "The interface was deleted."
}

echo "Initiating dangling ENI cleanup."
query_dangling_enis | jq -rc '.[]' | while read -r eni; do
  delete_eni "$eni"
done
echo "All dangling ENIs have been removed."
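The JMESPath filter in the script above can be sanity-checked by re-expressing it in plain Python against sample DescribeNetworkInterfaces output (the sample data below is made up):

```python
def dangling_eni_ids(network_interfaces):
    # Same predicate as the --query expression: Status == "available" and
    # the Description contains "aws-K8S" (the CNI's ENI description marker).
    return [ni["NetworkInterfaceId"] for ni in network_interfaces
            if ni["Status"] == "available" and "aws-K8S" in ni["Description"]]

sample = [
    {"NetworkInterfaceId": "eni-aaa", "Status": "available",
     "Description": "aws-K8S-i-0123"},          # detached CNI ENI: dangling
    {"NetworkInterfaceId": "eni-bbb", "Status": "in-use",
     "Description": "aws-K8S-i-0456"},          # still attached: keep
    {"NetworkInterfaceId": "eni-ccc", "Status": "available",
     "Description": "primary"},                 # not CNI-created: keep
]
assert dangling_eni_ids(sample) == ["eni-aaa"]
```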

@github-actions

This issue is stale because it has been open 60 days with no activity. Remove stale label or comment or this will be closed in 14 days

@github-actions github-actions bot added the stale Issue or PR is stale label Mar 20, 2023
@kuzaxak commented Mar 21, 2023

/not stale

@github-actions

⚠️COMMENT VISIBILITY WARNING⚠️

Comments on closed issues are hard for our team to see.
If you need more assistance, please open a new issue that references this one.
If you wish to keep having a conversation with other community members under this issue feel free to do so.
