Fix/aws asg unsafe decommission 5829 #6911
Conversation
Hi @ravisinha0506. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label. I understand the commands that are listed here. Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
Hi, can you please keep all of the changes in a single PR? It's very difficult in GH to compare changes and responses to feedback (or even remember what the feedback was) if you open a new PR for every change.
This PR may require API review. If so, when the changes are ready, complete the pre-review checklist and request an API review. Status of requested reviews is tracked in the API Review project.
The PR you mentioned (#6818) was authored by a different person, which is why I have opened a new pull request for this change.
I don't understand. Multiple people can commit things to a single PR, and this change is clearly based on the previous one. It's going to be very difficult to review if a new PR gets opened every single time a change is made (right now I don't even know which one is the correct one to review, since they are both still open). I can add some comments here if this is going to be the PR to reference in the future, but in that case can we please close the other one? Also please make sure we have a CLA signed and fix the issues with the linter in test-and-verify.
Hi @drmorr0, Sorry for the multiple submissions. We have now closed the old PR and will use this PR for the review going forward. Please review and provide feedback. Thanks.
placeHolderInstancesCount, commonAsg.Name)

// Retrieve the most recent scaling activity to determine its success state.
isRecentScalingActivitySuccess, err = m.getMostRecentScalingActivity(commonAsg)
In the meeting, we talked about not getting the scaling activity at all, just getting the list of instances in the ASG and seeing if the number of placeholders was incorrect. I think we can just get rid of this function entirely.
Sometimes scaling activities can take longer to complete, so I wanted to ensure that scaling has indeed failed. But since we are looking at the total count of active EC2 instances, I think it should be okay to remove this logic.
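To illustrate the simpler approach being discussed, below is a standalone sketch, not the PR's actual code: it calls the AWS SDK directly rather than going through the autoscaler's awsService wrapper, and the function name unfulfilledCapacity and the ASG name are invented for the example. The idea is just to describe the ASG and compare the number of real instances against the desired capacity; the shortfall is the placeholder capacity that never launched.

```go
package main

import (
	"fmt"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

// unfulfilledCapacity returns how many of the ASG's desired instances have not
// actually launched, i.e. the number of "placeholders" from the autoscaler's
// point of view. (Hypothetical helper for illustration only.)
func unfulfilledCapacity(svc *autoscaling.AutoScaling, asgName string) (int, error) {
	out, err := svc.DescribeAutoScalingGroups(&autoscaling.DescribeAutoScalingGroupsInput{
		AutoScalingGroupNames: []*string{aws.String(asgName)},
	})
	if err != nil {
		return 0, err
	}
	if len(out.AutoScalingGroups) == 0 {
		return 0, fmt.Errorf("ASG %q not found", asgName)
	}
	group := out.AutoScalingGroups[0]
	real := len(group.Instances)
	desired := int(aws.Int64Value(group.DesiredCapacity))
	if desired > real {
		return desired - real, nil
	}
	return 0, nil
}

func main() {
	sess := session.Must(session.NewSession())
	svc := autoscaling.New(sess)
	n, err := unfulfilledCapacity(svc, "my-asg") // hypothetical ASG name
	if err != nil {
		panic(err)
	}
	fmt.Printf("placeholders that never launched: %d\n", n)
}
```

A count like this is what would bound how far the desired capacity can safely be lowered when placeholders are "deleted", without ever touching real instances.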
placeHolderInstancesCount, commonAsg.Name)
return nil
} else {
asgDetail, err := m.getDescribeAutoScalingGroupResults(commonAsg)
Please call m.awsService.getAutoscalingGroupsByNames here.
Done
@@ -352,6 +406,33 @@ func (m *asgCache) DeleteInstances(instances []*AwsInstanceRef) error {
return nil
}

func (m *asgCache) getDescribeAutoScalingGroupResults(commonAsg *asg) (*autoscaling.Group, error) {
Please remove, we don't need this function.
Done. Replaced with the existing getAutoscalingGroupsByNames API.
@@ -624,3 +705,55 @@ func (m *asgCache) buildInstanceRefFromAWS(instance *autoscaling.Instance) AwsInstanceRef
func (m *asgCache) Cleanup() {
close(m.interrupt)
}

func (m *asgCache) getMostRecentScalingActivity(asg *asg) (bool, error) {
Same as above.
Done
/ok-to-test
/assign @drmorr0
/remove-area vertical-pod-autoscaler
I am trying to think through the implications of this again. In the current version of code, the ASG size got decremented each time a placeholder was removed, which led to (potentially) some real instances getting terminated. This was problematic for the reasons we've discussed, but one "positive" outcome is that the ASG size ended up being the "correct" size from cluster autoscaler's perspective.
With this change, we will no longer delete real instances, but if cluster autoscaler asks the cloud provider to delete 10 instances, there is a possibility that fewer of them will be deleted than requested. As best as I can tell, cluster autoscaler doesn't actually care. It looks as though the AWS provider schedules a refresh of the ASG cache on the next main loop iteration, so at that point it should be up to date, and as far as I can tell nothing else really depends on "all the nodes" being deleted before the start of the next main loop.
But, my point is I'd like a second opinion on this change -- @gjtempleton are you available to take a look?
@@ -603,7 +602,7 @@ func TestDeleteNodesWithPlaceholder(t *testing.T) {
err = asgs[0].DeleteNodes([]*apiv1.Node{node})
assert.NoError(t, err)
a.AssertNumberOfCalls(t, "SetDesiredCapacity", 1)
- a.AssertNumberOfCalls(t, "DescribeAutoScalingGroupsPages", 1)
+ a.AssertNumberOfCalls(t, "DescribeAutoScalingGroupsPages", 2)
I think we need another test here that verifies the behaviour we're expecting -- specifically, create an ASG with (say) 3 instances and a desired capacity of (say) 10, and then call DeleteNodes with 7 placeholders and 2 real nodes. Then we should see that the target size of the ASG goes down to 4.
For example, say the ASG initially has nodes [i-0000, i-0001, i-0002] and we set the desired capacity from 3 to 10. Cluster Autoscaler will update its asgCache to include [i-0000, i-0001, i-0002, placeholder-3, placeholder-4, placeholder-5, placeholder-6, placeholder-7, placeholder-8, placeholder-9]. AWS is able to satisfy three of the requests to get a "real" state of [i-0000, i-0001, i-0002, i-0003, i-0004, i-0005, placeholder-6, placeholder-7, placeholder-8, placeholder-9]. However, CA hasn't updated its cache. Then it calls DeleteInstances with the following list of instances: [i-0000, i-0001, placeholder-3, placeholder-4, placeholder-5, placeholder-6, placeholder-7, placeholder-8, placeholder-9]. The result of this DeleteInstances call should be that the ASG has nodes [i-0002, i-0003, i-0004, i-0005] and a desired capacity of 4.
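For reference, here is a toy, self-contained sketch of the arithmetic in this scenario. fakeASG and deleteNodes are invented stand-ins for illustration, not the real asgCache or the AWS mocks used by the test:

```go
package main

import (
	"fmt"
	"strings"
)

const placeholderPrefix = "placeholder-"

type fakeASG struct {
	desired   int
	instances []string // instance IDs that actually exist in AWS
}

// deleteNodes mimics the intended DeleteInstances behaviour: real instances
// are terminated (each one decrementing desired capacity), while placeholders
// only lower the desired capacity when AWS never launched a matching instance.
func deleteNodes(asg *fakeASG, requested []string) {
	placeholders := 0
	for _, id := range requested {
		if strings.HasPrefix(id, placeholderPrefix) {
			placeholders++
			continue
		}
		// Terminate the real instance and decrement desired capacity.
		for i, existing := range asg.instances {
			if existing == id {
				asg.instances = append(asg.instances[:i], asg.instances[i+1:]...)
				asg.desired--
				break
			}
		}
	}
	// Only the placeholders that AWS never fulfilled reduce desired capacity.
	unfulfilled := asg.desired - len(asg.instances)
	if unfulfilled > placeholders {
		unfulfilled = placeholders
	}
	if unfulfilled > 0 {
		asg.desired -= unfulfilled
	}
}

func main() {
	asg := &fakeASG{
		desired:   10,
		instances: []string{"i-0000", "i-0001", "i-0002", "i-0003", "i-0004", "i-0005"},
	}
	request := []string{
		"i-0000", "i-0001",
		"placeholder-3", "placeholder-4", "placeholder-5",
		"placeholder-6", "placeholder-7", "placeholder-8", "placeholder-9",
	}
	deleteNodes(asg, request)
	fmt.Println(asg.desired, asg.instances) // expected: 4 [i-0002 i-0003 i-0004 i-0005]
}
```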
Done. I have added context on the unit test to avoid any confusion.
/test all
@kmsarabu: No jobs can be run with /test all.
In response to this:
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.
/lgtm
/assign @gjtempleton
I think this addresses my concerns and is a better state than what we have now, but I'd still like a second set of eyes on it before we merge.
One small nit, but I'm generally happy enough: the trade-off mentioned by @drmorr0 will only potentially result in instances hanging around for one extra iteration, so a small increase in cost.
Happy to merge once the log line's been updated.
Thanks!
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: gjtempleton, ravisinha0506. The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
This merge resolves an issue in the Kubernetes Cluster Autoscaler where actual instances within AWS Auto Scaling Groups (ASGs) were incorrectly decommissioned instead of placeholders. The updates ensure that placeholders are exclusively targeted for scaling down under conditions where recent scaling activities have failed. This prevents the accidental termination of active nodes and enhances the reliability of the autoscaler in AWS environments.
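As a rough illustration of the two code paths described above, here is a minimal sketch against the public AWS SDK (not the PR's internal code; the instance ID and ASG name are placeholders): real instances are terminated individually, while placeholders only ever translate into a lower desired capacity.

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/autoscaling"
)

func main() {
	svc := autoscaling.New(session.Must(session.NewSession()))

	// Real instances are terminated one by one, letting AWS decrement the
	// desired capacity for each, so only the requested nodes go away.
	_, err := svc.TerminateInstanceInAutoScalingGroup(&autoscaling.TerminateInstanceInAutoScalingGroupInput{
		InstanceId:                     aws.String("i-0000"), // hypothetical instance ID
		ShouldDecrementDesiredCapacity: aws.Bool(true),
	})
	if err != nil {
		log.Fatal(err)
	}

	// Placeholders have no EC2 instance behind them, so the only safe action is
	// to lower the desired capacity; nothing running is ever terminated.
	_, err = svc.SetDesiredCapacity(&autoscaling.SetDesiredCapacityInput{
		AutoScalingGroupName: aws.String("my-asg"), // hypothetical ASG name
		DesiredCapacity:      aws.Int64(4),
		HonorCooldown:        aws.Bool(false),
	})
	if err != nil {
		log.Fatal(err)
	}
}
```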
What type of PR is this?
/kind bug
What this PR does / why we need it:
This PR prevents the Kubernetes Cluster Autoscaler from erroneously decommissioning actual nodes during scale-down operations in AWS environments, which could lead to unintended service disruptions.
Which issue(s) this PR fixes:
Fixes #5829
Special notes for your reviewer:
Does this PR introduce a user-facing change?
Fix an issue in the Kubernetes Cluster Autoscaler where actual AWS instances could be incorrectly scaled down instead of placeholders.