Fix/aws asg unsafe decommission 5829 #6911
Conversation
Hi @ravisinha0506. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Hi, can you please keep all of the changes in a single PR? It's very difficult in GH to compare changes and responses to feedback (or even remember what the feedback was) if you open a new PR for every change.
This PR may require API review. If so, when the changes are ready, complete the pre-review checklist and request an API review. Status of requested reviews is tracked in the API Review project.
The PR you mentioned (#6818) was authored by a different person, which is why I have opened a new pull request for this change.
I don't understand. Multiple people can commit things to a single PR, and this change is clearly based on the previous one. It's going to be very difficult to review if a new PR gets opened every single time a change is made (right now I don't even know which one is the correct one to review, since they are both still open). I can add some comments here if this is going to be the PR to reference in the future, but in that case can we please close the other one? Also please make sure we have a CLA signed and fix the issues with the linter in test-and-verify.
Hi @drmorr0, Sorry for the multiple submissions. We have now closed the old PR and will use this PR for the review going forward. Please review and provide feedback. Thanks.
placeHolderInstancesCount, commonAsg.Name)

// Retrieve the most recent scaling activity to determine its success state.
isRecentScalingActivitySuccess, err = m.getMostRecentScalingActivity(commonAsg)
In the meeting, we talked about not getting the scaling activity at all, just getting the list of instances in the ASG and seeing if the number of placeholders was incorrect. I think we can just get rid of this function entirely.
Sometimes scaling activities can take longer to complete, so I wanted to ensure that scaling has indeed failed. But since we are looking at the total count of active EC2 instances, I think it should be okay to remove this logic.
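To illustrate the simpler approach being discussed, below is a standalone sketch, not the PR's actual code: it calls the AWS SDK directly rather than going through the autoscaler's awsService wrapper, and the function name unfulfilledCapacity and the ASG name are invented for the example. The idea is just to describe the ASG and compare the number of real instances against the desired capacity; the shortfall is the placeholder capacity that never launched.

```go
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

// unfulfilledCapacity returns how many of the ASG's desired instances have not
// actually launched, i.e. the number of "placeholders" from the autoscaler's
// point of view. (Hypothetical helper for illustration only.)
func unfulfilledCapacity(svc *autoscaling.AutoScaling, asgName string) (int, error) {
	out, err := svc.DescribeAutoScalingGroups(&autoscaling.DescribeAutoScalingGroupsInput{
		AutoScalingGroupNames: []*string{aws.String(asgName)},
	})
	if err != nil {
		return 0, err
	}
	if len(out.AutoScalingGroups) == 0 {
		return 0, fmt.Errorf("ASG %q not found", asgName)
	}
	group := out.AutoScalingGroups[0]
	real := len(group.Instances)
	desired := int(aws.Int64Value(group.DesiredCapacity))
	if desired > real {
		return desired - real, nil
	}
	return 0, nil
}

func main() {
	sess := session.Must(session.NewSession())
	svc := autoscaling.New(sess)
	n, err := unfulfilledCapacity(svc, "my-asg") // hypothetical ASG name
	if err != nil {
		panic(err)
	}
	fmt.Printf("placeholders that never launched: %d\n", n)
}
```

A count like this is what would bound how far the desired capacity can safely be lowered when placeholders are "deleted", without ever touching real instances.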
placeHolderInstancesCount, commonAsg.Name)
return nil
} else {
asgDetail, err := m.getDescribeAutoScalingGroupResults(commonAsg)
Please call m.awsService.getAutoscalingGroupsByNames here.
Done
@@ -352,6 +406,33 @@ func (m *asgCache) DeleteInstances(instances []*AwsInstanceRef) error {
return nil
}

func (m *asgCache) getDescribeAutoScalingGroupResults(commonAsg *asg) (*autoscaling.Group, error) {
Please remove, we don't need this function.
Done. Replaced with the existing getAutoscalingGroupsByNames API.
@@ -624,3 +705,55 @@ func (m *asgCache) buildInstanceRefFromAWS(instance *autoscaling.Instance) AwsInstanceRef
func (m *asgCache) Cleanup() {
close(m.interrupt)
}

func (m *asgCache) getMostRecentScalingActivity(asg *asg) (bool, error) {
Same as above.
Done
/ok-to-test
/assign @drmorr0
/remove-area vertical-pod-autoscaler
I am trying to think through the implications of this again. In the current version of code, the ASG size got decremented each time a placeholder was removed, which led to (potentially) some real instances getting terminated. This was problematic for the reasons we've discussed, but one "positive" outcome is that the ASG size ended up being the "correct" size from cluster autoscaler's perspective.
With this change, we will no longer delete real instances, but if cluster autoscaler asks the cloud provider to delete 10 instances, there is a possibility that fewer of them will be deleted than requested. As best as I can tell, cluster autoscaler doesn't actually care. It looks as though the AWS provider schedules a refresh of the ASG cache on the next main loop iteration, so at that point it should be up to date, and as far as I can tell nothing else really depends on "all the nodes" being deleted before the start of the next main loop.
But, my point is I'd like a second opinion on this change -- @gjtempleton are you available to take a look?
@@ -603,7 +602,7 @@ func TestDeleteNodesWithPlaceholder(t *testing.T) {
err = asgs[0].DeleteNodes([]*apiv1.Node{node})
assert.NoError(t, err)
a.AssertNumberOfCalls(t, "SetDesiredCapacity", 1)
- a.AssertNumberOfCalls(t, "DescribeAutoScalingGroupsPages", 1)
+ a.AssertNumberOfCalls(t, "DescribeAutoScalingGroupsPages", 2)
I think we need another test here that verifies the behaviour we're expecting -- specifically, create an ASG with (say) 3 instances and a desired capacity of (say) 10, and then call DeleteNodes with 7 placeholders and 2 real nodes. Then we should see that the target size of the ASG goes down to 4.
For example, say the ASG initially has nodes [i-0000, i-0001, i-0002] and we set the desired capacity from 3 to 10. Cluster Autoscaler will update its asgCache to include [i-0000, i-0001, i-0002, placeholder-3, placeholder-4, placeholder-5, placeholder-6, placeholder-7, placeholder-8, placeholder-9]. AWS is able to satisfy three of the requests to get a "real" state of [i-0000, i-0001, i-0002, i-0003, i-0004, i-0005, placeholder-6, placeholder-7, placeholder-8, placeholder-9]. However, CA hasn't updated its cache. Then it calls DeleteInstances with the following list of instances: [i-0000, i-0001, placeholder-3, placeholder-4, placeholder-5, placeholder-6, placeholder-7, placeholder-8, placeholder-9]. The result of this DeleteInstances call should be that the ASG has nodes [i-0002, i-0003, i-0004, i-0005] and a desired capacity of 4.
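For reference, here is a toy, self-contained sketch of the arithmetic in this scenario. fakeASG and deleteNodes are invented stand-ins for illustration, not the real asgCache or the AWS mocks used by the test:

```go
package main

import (
	"fmt"
	"strings"
)

const placeholderPrefix = "placeholder-"

type fakeASG struct {
	desired   int
	instances []string // instance IDs that actually exist in AWS
}

// deleteNodes mimics the intended DeleteInstances behaviour: real instances
// are terminated (each one decrementing desired capacity), while placeholders
// only lower the desired capacity when AWS never launched a matching instance.
func deleteNodes(asg *fakeASG, requested []string) {
	placeholders := 0
	for _, id := range requested {
		if strings.HasPrefix(id, placeholderPrefix) {
			placeholders++
			continue
		}
		// Terminate the real instance and decrement desired capacity.
		for i, existing := range asg.instances {
			if existing == id {
				asg.instances = append(asg.instances[:i], asg.instances[i+1:]...)
				asg.desired--
				break
			}
		}
	}
	// Only the placeholders that AWS never fulfilled reduce desired capacity.
	unfulfilled := asg.desired - len(asg.instances)
	if unfulfilled > placeholders {
		unfulfilled = placeholders
	}
	if unfulfilled > 0 {
		asg.desired -= unfulfilled
	}
}

func main() {
	asg := &fakeASG{
		desired:   10,
		instances: []string{"i-0000", "i-0001", "i-0002", "i-0003", "i-0004", "i-0005"},
	}
	request := []string{
		"i-0000", "i-0001",
		"placeholder-3", "placeholder-4", "placeholder-5",
		"placeholder-6", "placeholder-7", "placeholder-8", "placeholder-9",
	}
	deleteNodes(asg, request)
	fmt.Println(asg.desired, asg.instances) // expected: 4 [i-0002 i-0003 i-0004 i-0005]
}
```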
Done. I have added context on the unit test to avoid any confusion.
/test all
@kmsarabu: No jobs can be run with /test all.
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/lgtm
/assign @gjtempleton
I think this addresses my concerns and is a better state than what we have now, but I'd still like a second set of eyes on it before we merge.
One small nit, but I'm generally happy enough: the trade-off mentioned by @drmorr0 will only potentially result in instances hanging around for one extra iteration, so a small increase in cost.
Happy to merge once the log line's been updated.
Thanks!
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: gjtempleton, ravisinha0506. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
This merge resolves an issue in the Kubernetes Cluster Autoscaler where actual instances within AWS Auto Scaling Groups (ASGs) were incorrectly decommissioned instead of placeholders. The updates ensure that placeholders are exclusively targeted for scaling down under conditions where recent scaling activities have failed. This prevents the accidental termination of active nodes and enhances the reliability of the autoscaler in AWS environments.
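As a rough illustration of the two code paths described above, here is a minimal sketch against the public AWS SDK (not the PR's internal code; the instance ID and ASG name are placeholders): real instances are terminated individually, while placeholders only ever translate into a lower desired capacity.

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

func main() {
	svc := autoscaling.New(session.Must(session.NewSession()))

	// Real instances are terminated one by one, letting AWS decrement the
	// desired capacity for each, so only the requested nodes go away.
	_, err := svc.TerminateInstanceInAutoScalingGroup(&autoscaling.TerminateInstanceInAutoScalingGroupInput{
		InstanceId:                     aws.String("i-0000"), // hypothetical instance ID
		ShouldDecrementDesiredCapacity: aws.Bool(true),
	})
	if err != nil {
		log.Fatal(err)
	}

	// Placeholders have no EC2 instance behind them, so the only safe action is
	// to lower the desired capacity; nothing running is ever terminated.
	_, err = svc.SetDesiredCapacity(&autoscaling.SetDesiredCapacityInput{
		AutoScalingGroupName: aws.String("my-asg"), // hypothetical ASG name
		DesiredCapacity:      aws.Int64(4),
		HonorCooldown:        aws.Bool(false),
	})
	if err != nil {
		log.Fatal(err)
	}
}
```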
What type of PR is this?
/kind bug
What this PR does / why we need it:
This PR prevents the Kubernetes Cluster Autoscaler from erroneously decommissioning actual nodes during scale-down operations in AWS environments, which could lead to unintended service disruptions.
Which issue(s) this PR fixes:
Fixes #5829
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Fix an issue in the Kubernetes Cluster Autoscaler where actual AWS instances could be incorrectly scaled down instead of placeholders.