
Faster handling of failed scale ups #7087

Merged: 1 commit merged into kubernetes:master on Aug 13, 2024

Conversation

@bskiba (Member) commented Jul 24, 2024

What type of PR is this?

/kind feature

What this PR does / why we need it:

Instead of breaking the loop after deleting failed scale up nodes, clean up the state and continue. Most importantly, update the target for the affected node group, so that the deleted nodes are not considered upcoming.

This PR speeds up handling of failed scale ups, useful especially with multiple quota or stockout errors across the cluster.
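
A rough, self-contained sketch of the idea (toy types only; nothing here is the real ClusterStateRegistry API, and the numbers are made up). The point it illustrates is that shrinking the node group's target and recalculating right away stops the deleted nodes from being counted as upcoming:

package main

import "fmt"

// Toy model only: a node group's "upcoming" count is derived from its target
// size minus the nodes that actually registered.
type nodeGroupState struct {
	targetSize int
	registered int
	upcoming   int
}

func (s *nodeGroupState) recalculate() {
	s.upcoming = s.targetSize - s.registered
	if s.upcoming < 0 {
		s.upcoming = 0
	}
}

func main() {
	s := &nodeGroupState{targetSize: 10, registered: 6} // 4 nodes failed to come up (e.g. stockout)
	s.recalculate()
	fmt.Println("upcoming before cleanup:", s.upcoming) // 4

	// Delete the 4 failed nodes, shrink the target to match, and recalculate
	// immediately instead of breaking the loop and waiting for the next iteration.
	s.targetSize -= 4
	s.recalculate()
	fmt.Println("upcoming after cleanup:", s.upcoming) // 0
}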

Special notes for your reviewer:

The PR consists of 3 commits. The actual change is in commit #3; commits #1 and #2 are refactors needed for it.

Does this PR introduce a user-facing change?

Faster handling of failed scale ups, useful especially with multiple quota or stockout errors across the cluster.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot added the kind/feature label (Categorizes issue or PR as related to a new feature.) on Jul 24, 2024
@k8s-ci-robot added the cncf-cla: yes label (Indicates the PR's author has signed the CNCF CLA.) on Jul 24, 2024
@bskiba (Member, Author) commented Jul 24, 2024

/assign @MaciekPytel

@k8s-ci-robot added the size/L label (Denotes a PR that changes 100-499 lines, ignoring generated files.) on Jul 24, 2024
@MaciekPytel (Contributor) commented:

It seems like you may have forgotten to include part of the changes to static_autoscaler in the commit? You're doing cleanup in clusterstate (which I need to look at more closely to understand), but you're still exiting the loop after the cleanup. I think you probably need to stop returning a bool from deleteCreatedNodesWithErrors and remove the conditional return after the call?

@bskiba force-pushed the stockout-handling branch 2 times, most recently from c766715 to 2e31839 on July 26, 2024 12:08
klog.V(4).Infof("Updating state after failed scale up for %s nodeGroup", nodeGroup.Id())

csr.InvalidateNodeInstancesCacheEntry(nodeGroup)
targetSize, err := nodeGroup.TargetSize()
Contributor commented:

You're effectively doing the exact same thing that csr.Recalculate() is doing, just for a single nodeGroup. I can't say I like that:

  • I think having all those specialized methods adds to complexity. It would be much easier to reason about just a single recalculate path.
  • It's obviously brittle to have two different implementations here - I can easily see someone updating one of those and not the other.

With the above in mind, I'd suggest one of the following refactors:

  • Change this to just call (either via a wrapper in csr or directly in static_autoscaler)
    csr.InvalidateNodeInstancesCacheEntry(nodeGroup)
    csr.Recalculate() // maybe just once if deletedAny
    
  • Change Recalculate() to accept a nodeGroup (or list of nodeGroups?) as a parameter if you're worried about the extra cost of calling TargetSize() for every NG. I strongly suspect this may be a premature optimization and may not be worth the added complexity:
    • TargetSize() really should be cached at cloudprovider side and actual API calls should only happen for NGs where something was deleted.
    • We're already doing full Recalculate() in scale-up and scale-down paths, both of which should happen more often than the path you're changing - and we seem to think it's fine.
    • I guess if you go this way you may also change other Recalculate() calls to be more selective? Funnily enough we have an old TODO for it https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/core/scaleup/orchestrator/orchestrator.go#L548. I doubt @losipiuk will come back to CA to pick it up (you're always welcome though, we'd love to have you back!).

Member Author replied:

I went with using Recalculate once if deletedAny. I think you're right that doing this on a subset of nodeGroups is a premature optimisation.
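
For concreteness, a minimal sketch of the adopted call pattern (the registry type below is a stand-in, not the real clusterstate package, and the node group names are hypothetical; only the pattern itself, invalidate the cache per node group and call Recalculate() once if anything was deleted, mirrors the change discussed above):

package main

import "fmt"

// fakeRegistry stands in for clusterstate.ClusterStateRegistry; the method
// names follow the discussion above, but the bodies are illustrative only.
type fakeRegistry struct{ invalidated []string }

func (r *fakeRegistry) InvalidateNodeInstancesCacheEntry(nodeGroup string) {
	r.invalidated = append(r.invalidated, nodeGroup)
}

func (r *fakeRegistry) Recalculate() {
	fmt.Println("recalculated once, after invalidating:", r.invalidated)
}

func main() {
	csr := &fakeRegistry{}
	groupsWithDeletedNodes := []string{"ng-quota", "ng-stockout"} // hypothetical node group names

	deletedAny := false
	for _, ng := range groupsWithDeletedNodes {
		// Per node group: drop the cached instance list so the recalculation
		// sees the post-deletion state.
		csr.InvalidateNodeInstancesCacheEntry(ng)
		deletedAny = true
	}
	// Recalculate once at the end; making Recalculate selective per node group
	// was judged a premature optimisation in the review.
	if deletedAny {
		csr.Recalculate()
	}
}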

err = a.clusterStateRegistry.HandleDeletingFailedScaleUpNodes(nodeGroup)
if err != nil {
returnErr = fmt.Errorf("Failed to clean up state after deleting failed nodes from %s node group: %v", nodeGroup.Id(), err)
klog.Warningf(returnErr.Error())
Contributor commented:

I'd log this on Error() level - I'd argue any error that causes us to break the loop should be flagged as a serious problem

Member Author replied:

Actually, Recalculate() doesn't return that error, so we're no longer breaking the loop after recent changes. Do you think that will cause issues?

@MaciekPytel (Contributor) commented:

I think this would work (at least as far as making upcoming nodes calculation account for deleted nodes) and the fact that we seem to already do exactly this type of update on CSR on scale-up/scale-down gives me some confidence.

Still, it is pretty scary to do a partial csr recalculate here - even if we do it in scale-up/down, that happens much later in the loop, and I don't think we really use csr all that much at this point. I would be tempted to do a full UpdateNodes(), but that seems like it could create as many problems as it solves:

  • pro: no tricky partial recalculations; doing a full recalculation seems much safer from perspective of having csr in an internally consistent state
  • con: what if we get more create errors in the meantime? I think it's safe to proceed without cleaning them up, but I also think it's safe to proceed with Recalculate() - and I'm not sure which of those I'm more confident in.
    • At least with GCE provider, it is very likely that we would get new instances in error state as MIGs tend to return those errors gradually.

@drmorr0 @gjtempleton Do you see any risk of continuing the loop here causing problems with AWS placeholders? Looking at #6911, I think it should be fine to continue the loop after deleting the placeholders -> the TargetSize() of the ASG seems to be updated when the instances are deleted, which I think is the only thing we need from the cloudprovider implementation in order to safely proceed.

@bskiba (Member, Author) commented Jul 30, 2024

I've updated this PR to use Recalculate() as suggested by @MaciekPytel. If anyone sees issues with this approach, do let me know.

@bskiba (Member, Author) commented Aug 1, 2024

/assign @gjtempleton
@gjtempleton can you take a look and verify whether this approach is OK per Maciek's comment?

@@ -176,6 +176,73 @@ func TestHasNodeGroupStartedScaleUp(t *testing.T) {
}
}

func TestDeletingFailedScaleUpNodes(t *testing.T) {
Contributor commented:

It's nitpicking, but I kinda don't like the name of this test. The test itself checks that UpcomingNodes update correctly after a nodeGroup resize + csr.Recalculate(), which is an important assumption for the logic in core/ related to deleting failed scale-up nodes, but it doesn't really have anything to do with the logic for scale-up failures in CSR. Just looking at csr, it feels impossible to understand what's going on here.

Admittedly, I don't know if renaming it to TestUpcomingNodesAfterResizeAndRecalculate would help - it's a mouthful and probably not any clearer. Maybe what we're missing is a comment that explains what is being tested and why?

Member Author replied:

Fair point. I've renamed the test and added a comment.
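
For readers skimming the thread, here is a toy sketch of the kind of property such a test asserts. It is not the actual test from this PR (whose final name isn't shown above), it reuses the simplified model from the earlier sketch, and it would need to live in a _test.go file; the leading comment plays the role of the explanation the reviewer asked for:

package main

import "testing"

// toyState models a node group's target size and registered node count; the
// upcoming count is recomputed from them. The property under test (loosely
// mirroring the csr test discussed above): after the target shrinks because
// failed nodes were deleted and state is recalculated, those nodes are no
// longer reported as upcoming.
type toyState struct {
	targetSize int
	registered int
	upcoming   int
}

func (s *toyState) recalculate() {
	s.upcoming = s.targetSize - s.registered
	if s.upcoming < 0 {
		s.upcoming = 0
	}
}

func TestUpcomingAfterResizeAndRecalculate(t *testing.T) {
	s := &toyState{targetSize: 5, registered: 2}
	s.recalculate()
	if s.upcoming != 3 {
		t.Fatalf("expected 3 upcoming nodes, got %d", s.upcoming)
	}

	// Simulate deleting the 3 failed nodes: the node group target shrinks,
	// so the recalculated upcoming count should drop to zero.
	s.targetSize = 2
	s.recalculate()
	if s.upcoming != 0 {
		t.Fatalf("expected 0 upcoming nodes after resize, got %d", s.upcoming)
	}
}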

@MaciekPytel (Contributor) commented:

/lgtm
/approve
/hold

Left a single nitpick, but overall this looks good. Really happy to finally see that issue fixed :)
Leaving a hold for @gjtempleton or @drmorr0 to have a chance to look, if they want. I'll bring it up at today's SIG meeting. I don't really expect this version to cause any problems, and I'm thinking about invoking lazy consensus and merging this later this week unless anyone objects.

@k8s-ci-robot added the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command.) and the lgtm label ("Looks good to me", indicates that a PR is ready to be merged.) on Aug 5, 2024
@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bskiba, MaciekPytel

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Aug 5, 2024
@k8s-ci-robot removed the lgtm label ("Looks good to me", indicates that a PR is ready to be merged.) on Aug 12, 2024
@bskiba (Member, Author) commented Aug 12, 2024

@MaciekPytel I addressed the last comment, thanks a lot for your review. Are we OK to merge this now?

Clean up cluster state after removing failed scale up nodes,
so that the loop can continue. Most importantly, update the
target for the affected node group, so that the deleted nodes
are not considered upcoming.
@MaciekPytel (Contributor) commented:

/lgtm
/hold cancel

@k8s-ci-robot added the lgtm label ("Looks good to me", indicates that a PR is ready to be merged.) and removed the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command.) on Aug 13, 2024
@k8s-ci-robot merged commit aec7b75 into kubernetes:master on Aug 13, 2024
6 checks passed