
Faster handling of failed scale ups #7087

Merged: 1 commit merged into kubernetes:master on Aug 13, 2024

Conversation

@bskiba (Member) commented Jul 24, 2024

What type of PR is this?

/kind feature

What this PR does / why we need it:

Instead of breaking the loop after deleting failed scale up nodes, clean up the state and continue. Most importantly, update the target for the affected node group, so that the deleted nodes are not considered upcoming.

This PR speeds up handling of failed scale ups, useful especially with multiple quota or stockout errors across the cluster.
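
A rough, self-contained sketch of the idea (toy types only; nothing here is the real ClusterStateRegistry API, and the numbers are made up). The point it illustrates is that shrinking the node group's target and recalculating right away stops the deleted nodes from being counted as upcoming:

package main

import "fmt"

// Toy model only: a node group's "upcoming" count is derived from its target
// size minus the nodes that actually registered.
type nodeGroupState struct {
	targetSize int
	registered int
	upcoming   int
}

func (s *nodeGroupState) recalculate() {
	s.upcoming = s.targetSize - s.registered
	if s.upcoming < 0 {
		s.upcoming = 0
	}
}

func main() {
	s := &nodeGroupState{targetSize: 10, registered: 6} // 4 nodes failed to come up (e.g. stockout)
	s.recalculate()
	fmt.Println("upcoming before cleanup:", s.upcoming) // 4

	// Delete the 4 failed nodes, shrink the target to match, and recalculate
	// immediately instead of breaking the loop and waiting for the next iteration.
	s.targetSize -= 4
	s.recalculate()
	fmt.Println("upcoming after cleanup:", s.upcoming) // 0
}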

Special notes for your reviewer:

The PR consists of 3 commits. The actual change is in commit #3; commits #1 and #2 are refactors needed for it.

Does this PR introduce a user-facing change?

Faster handling of failed scale ups, useful especially with multiple quota or stockout errors across the cluster.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:


@k8s-ci-robot added the kind/feature label (Categorizes issue or PR as related to a new feature.) on Jul 24, 2024
@k8s-ci-robot added the cncf-cla: yes label (Indicates the PR's author has signed the CNCF CLA.) on Jul 24, 2024
@bskiba (Member, Author) commented Jul 24, 2024

/assign @MaciekPytel

@k8s-ci-robot added the size/L label (Denotes a PR that changes 100-499 lines, ignoring generated files.) on Jul 24, 2024
@MaciekPytel (Contributor) commented:

It seems like you may have forgotten to include part of the changes to static_autoscaler in the commit? You're doing cleanup in clusterstate (which I need to look at more closely to understand), but you're still exiting the loop after the cleanup. I think you probably need to stop returning a bool from deleteCreatedNodesWithErrors and remove the conditional return after the call?

@bskiba force-pushed the stockout-handling branch 2 times, most recently from c766715 to 2e31839 on July 26, 2024 12:08
klog.V(4).Infof("Updating state after failed scale up for %s nodeGroup", nodeGroup.Id())

csr.InvalidateNodeInstancesCacheEntry(nodeGroup)
targetSize, err := nodeGroup.TargetSize()
Contributor commented:

You're effectively doing the exact same thing that csr.Recalculate() is doing, just for a single nodeGroup. I can't say I like that:

  • I think having all those specialized methods adds to complexity. It would be much easier to reason about just a single recalculate path.
  • It's obviously brittle to have two different implementations here - I can easily see someone updating one of those and not the other.

With the above in mind, I'd suggest one of the following refactors:

  • Change this to just call (either via a wrapper in csr or directly in static_autoscaler)
    csr.InvalidateNodeInstancesCacheEntry(nodeGroup)
    csr.Recalculate() // maybe just once if deletedAny
    
  • Change Recalculate() to accept a nodeGroup (or list of nodeGroups?) as a parameter if you're worried about the extra cost of calling TargetSize() for every NG. I strongly suspect this may be a premature optimization and may not be worth the added complexity:
    • TargetSize() really should be cached at cloudprovider side and actual API calls should only happen for NGs where something was deleted.
    • We're already doing full Recalculate() in scale-up and scale-down paths, both of which should happen more often than the path you're changing - and we seem to think it's fine.
    • I guess if you go this way you may also change other Recalculate() calls to be more selective? Funnily enough we have an old TODO for it https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/core/scaleup/orchestrator/orchestrator.go#L548. I doubt @losipiuk will come back to CA to pick it up (you're always welcome though, we'd love to have you back!).

Member Author replied:

I went with using Recalculate once if deletedAny. I think you're right that doing this on a subset of nodeGroups is a premature optimisation.
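
For concreteness, a minimal sketch of the adopted call pattern (the registry type below is a stand-in, not the real clusterstate package, and the node group names are hypothetical; only the pattern itself, invalidate the cache per node group and call Recalculate() once if anything was deleted, mirrors the change discussed above):

package main

import "fmt"

// fakeRegistry stands in for clusterstate.ClusterStateRegistry; the method
// names follow the discussion above, but the bodies are illustrative only.
type fakeRegistry struct{ invalidated []string }

func (r *fakeRegistry) InvalidateNodeInstancesCacheEntry(nodeGroup string) {
	r.invalidated = append(r.invalidated, nodeGroup)
}

func (r *fakeRegistry) Recalculate() {
	fmt.Println("recalculated once, after invalidating:", r.invalidated)
}

func main() {
	csr := &fakeRegistry{}
	groupsWithDeletedNodes := []string{"ng-quota", "ng-stockout"} // hypothetical node group names

	deletedAny := false
	for _, ng := range groupsWithDeletedNodes {
		// Per node group: drop the cached instance list so the recalculation
		// sees the post-deletion state.
		csr.InvalidateNodeInstancesCacheEntry(ng)
		deletedAny = true
	}
	// Recalculate once at the end; making Recalculate selective per node group
	// was judged a premature optimisation in the review.
	if deletedAny {
		csr.Recalculate()
	}
}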

err = a.clusterStateRegistry.HandleDeletingFailedScaleUpNodes(nodeGroup)
if err != nil {
returnErr = fmt.Errorf("Failed to clean up state after deleting failed nodes from %s node group: %v", nodeGroup.Id(), err)
klog.Warningf(returnErr.Error())
Contributor commented:

I'd log this on Error() level - I'd argue any error that causes us to break the loop should be flagged as a serious problem

Member Author replied:

Actually, Recalculate() doesn't return that error, so we're no longer breaking the loop after recent changes. Do you think that will cause issues?

@MaciekPytel (Contributor) commented:

I think this would work (at least as far as making upcoming nodes calculation account for deleted nodes) and the fact that we seem to already do exactly this type of update on CSR on scale-up/scale-down gives me some confidence.

Still, it is pretty scary to do a partial csr recalculate here - even if we do it in scale-up/down, that happens much later in the loop, and I don't think we really use csr all that much at this point. I would be tempted to do a full UpdateNodes(), but that seems like it could create as many problems as it solves:

  • pro: no tricky partial recalculations; doing a full recalculation seems much safer from perspective of having csr in an internally consistent state
  • con: what if we get more create errors in the meantime? I think it's safe to proceed without cleaning them up, but I also think it's safe to proceed with Recalculate() - and I'm not sure which of those I'm more confident in.
    • At least with GCE provider, it is very likely that we would get new instances in error state as MIGs tend to return those errors gradually.

@drmorr0 @gjtempleton Do you see any risk of continuing the loop here causing problems with AWS placeholders? Looking at #6911, I think it should be fine to continue the loop after deleting the placeholders -> the TargetSize() of the ASG seems to be updated when the instances are deleted, which I think is the only thing we need from the cloudprovider implementation in order to safely proceed.

@bskiba (Member, Author) commented Jul 30, 2024

I've updated this PR to use Recalculate() as suggested by @MaciekPytel. If anyone sees issues with this approach, do let me know.

@bskiba (Member, Author) commented Aug 1, 2024

/assign @gjtempleton
@gjtempleton can you take a look and verify whether this approach is OK per Maciek's comment?

@@ -176,6 +176,73 @@ func TestHasNodeGroupStartedScaleUp(t *testing.T) {
}
}

func TestDeletingFailedScaleUpNodes(t *testing.T) {
Contributor commented:

It's nitpicking, but I kinda don't like the name of this test. The test itself checks that UpcomingNodes update correctly after a nodeGroup resize + csr.Recalculate(), which is an important assumption for the logic in core/ related to deleting failed scale-up nodes, but it doesn't really have anything to do with the logic for scale-up failures in CSR. Just looking at csr, it feels impossible to understand what's going on here.

Admittedly, I don't know if renaming it to TestUpcomingNodesAfterResizeAndRecalculate would help - it's a mouthful and probably not any clearer. Maybe what we're missing is a comment that explains what is being tested and why?

Member Author replied:

Fair point. I've renamed the test and added a comment.
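
For readers skimming the thread, here is a toy sketch of the kind of property such a test asserts. It is not the actual test from this PR (whose final name isn't shown above), it reuses the simplified model from the earlier sketch, and it would need to live in a _test.go file; the leading comment plays the role of the explanation the reviewer asked for:

package main

import "testing"

// toyState models a node group's target size and registered node count; the
// upcoming count is recomputed from them. The property under test (loosely
// mirroring the csr test discussed above): after the target shrinks because
// failed nodes were deleted and state is recalculated, those nodes are no
// longer reported as upcoming.
type toyState struct {
	targetSize int
	registered int
	upcoming   int
}

func (s *toyState) recalculate() {
	s.upcoming = s.targetSize - s.registered
	if s.upcoming < 0 {
		s.upcoming = 0
	}
}

func TestUpcomingAfterResizeAndRecalculate(t *testing.T) {
	s := &toyState{targetSize: 5, registered: 2}
	s.recalculate()
	if s.upcoming != 3 {
		t.Fatalf("expected 3 upcoming nodes, got %d", s.upcoming)
	}

	// Simulate deleting the 3 failed nodes: the node group target shrinks,
	// so the recalculated upcoming count should drop to zero.
	s.targetSize = 2
	s.recalculate()
	if s.upcoming != 0 {
		t.Fatalf("expected 0 upcoming nodes after resize, got %d", s.upcoming)
	}
}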

@MaciekPytel (Contributor) commented:

/lgtm
/approve
/hold

Left a single nitpick, but overall this looks good. Really happy to finally see that issue fixed :)
Leaving a hold for @gjtempleton or @drmorr0 to have a chance to look, if they want. I'll bring it up at today's SIG meeting. I don't really expect this version to cause any problems, and I'm thinking about invoking lazy consensus and merging this later this week unless anyone objects.

@k8s-ci-robot added the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command.) and the lgtm label ("Looks good to me", indicates that a PR is ready to be merged.) on Aug 5, 2024
@k8s-ci-robot (Contributor) commented:

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: bskiba, MaciekPytel

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label (Indicates a PR has been approved by an approver from all required OWNERS files.) on Aug 5, 2024
@k8s-ci-robot removed the lgtm label ("Looks good to me", indicates that a PR is ready to be merged.) on Aug 12, 2024
@bskiba (Member, Author) commented Aug 12, 2024

@MaciekPytel I addressed the last comment, thanks a lot for your review. Are we OK to merge this now?

Clean up cluster state after removing failed scale up nodes,
so that the loop can continue. Most importantly, update the
target for the affected node group, so that the deleted nodes
are not considered upcoming.
@MaciekPytel (Contributor) commented:

/lgtm
/hold cancel

@k8s-ci-robot added the lgtm label ("Looks good to me", indicates that a PR is ready to be merged.) and removed the do-not-merge/hold label (Indicates that a PR should not merge because someone has issued a /hold command.) on Aug 13, 2024
@k8s-ci-robot merged commit aec7b75 into kubernetes:master on Aug 13, 2024
6 checks passed