Faster handling of failed scale ups #7087
Conversation
/assign @MaciekPytel
It seems like you've maybe forgotten to include part of the changes to static_autoscaler in this commit? You're doing cleanup in clusterstate (which I need to look at more to understand it), but you're still exiting the loop after cleanup. I think you probably need to stop returning a bool from deleteCreatedNodesWithErrors and remove the conditional return after the call?
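As a sketch, the suggested change around the call site might look roughly like this; the type and method names below are simplified stand-ins, not the actual static_autoscaler code:

```go
package sketch

// Simplified stand-in for the real StaticAutoscaler; illustrative only.
type staticAutoscaler struct{}

// Cleanup (including the cluster-state update) now happens inside the helper,
// so it no longer needs to report whether anything was deleted.
func (a *staticAutoscaler) deleteCreatedNodesWithErrors() {
	// delete instances that failed to come up, then refresh cluster state
}

func (a *staticAutoscaler) runOnceStep() {
	// Before the change this was roughly:
	//   if deletedNodes := a.deleteCreatedNodesWithErrors(); deletedNodes {
	//       return // wait for the next loop iteration
	//   }
	// After the change there is no conditional return and the iteration
	// simply continues with the cleaned-up state.
	a.deleteCreatedNodesWithErrors()
	// ... continue with scale-up / scale-down for this iteration ...
}
```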
Force-pushed from c766715 to 2e31839
klog.V(4).Infof("Updating state after failed scale up for %s nodeGroup", nodeGroup.Id())

csr.InvalidateNodeInstancesCacheEntry(nodeGroup)
targetSize, err := nodeGroup.TargetSize()
You're effectively doing the exact same thing that csr.Recalculate() is doing, just for a single nodeGroup. I can't say I like that:
- I think having all those specialized methods adds to complexity. It would be much easier to reason about just a single recalculate path.
- It's obviously brittle to have two different implementations here - I can easily see someone updating one of those and not the other.
With the above in mind, I'd suggest one of the following refactors:
- Change this to just call (either via a wrapper in csr or directly in static_autoscaler):
  csr.InvalidateNodeInstancesCacheEntry(nodeGroup)
  csr.Recalculate() // maybe just once if deletedAny
- Change Recalculate() to accept a nodeGroup (or a list of nodeGroups?) as a parameter if you're worried about the extra cost of calling TargetSize() for every NG. I strongly suspect this may be a premature optimization and may not be worth the added complexity:
  - TargetSize() really should be cached on the cloudprovider side, and actual API calls should only happen for NGs where something was deleted.
  - We're already doing a full Recalculate() in the scale-up and scale-down paths, both of which should happen more often than the path you're changing - and we seem to think that's fine.
  - I guess if you go this way you may also want to change the other Recalculate() calls to be more selective? Funnily enough we have an old TODO for it: https://github.com/kubernetes/autoscaler/blob/master/cluster-autoscaler/core/scaleup/orchestrator/orchestrator.go#L548. I doubt @losipiuk will come back to CA to pick it up (you're always welcome though, we'd love to have you back!).
I went with using Recalculate once if deletedAny. I think you're right that doing this on a subset of nodeGroups is a premature optimisation.
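As a rough sketch, the adopted approach might look along these lines; the interfaces below are simplified stand-ins, not the real cloudprovider.NodeGroup or clusterstate.ClusterStateRegistry API:

```go
package sketch

import "k8s.io/klog/v2"

// Simplified stand-ins for the real node group and cluster state registry.
type nodeGroup interface {
	Id() string
	DeleteNodes(names []string) error
}

type clusterState interface {
	InvalidateNodeInstancesCacheEntry(ng nodeGroup)
	Recalculate()
}

// deleteCreatedNodesWithErrors removes nodes that failed to come up, then
// refreshes cluster state once so the deleted nodes stop counting as upcoming.
func deleteCreatedNodesWithErrors(csr clusterState, failed map[nodeGroup][]string) {
	deletedAny := false
	for ng, nodes := range failed {
		if err := ng.DeleteNodes(nodes); err != nil {
			klog.Errorf("Failed to remove nodes that failed to register from %s: %v", ng.Id(), err)
			continue
		}
		deletedAny = true
		// Drop the cached instance list so Recalculate() sees fresh data.
		csr.InvalidateNodeInstancesCacheEntry(ng)
	}
	if deletedAny {
		// A single full Recalculate() instead of a per-node-group partial update.
		csr.Recalculate()
	}
}
```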
err = a.clusterStateRegistry.HandleDeletingFailedScaleUpNodes(nodeGroup)
if err != nil {
	returnErr = fmt.Errorf("Failed to clean up state after deleting failed nodes from %s node group: %v", nodeGroup.Id(), err)
	klog.Warningf(returnErr.Error())
I'd log this on Error() level - I'd argue any error that causes us to break the loop should be flagged as a serious problem
Actually, Recalculate() doesn't return that error, so we're no longer breaking the loop after recent changes. Do you think that will cause issues?
I think this would work (at least as far as making the upcoming nodes calculation account for deleted nodes), and the fact that we already seem to do exactly this type of update on the CSR on scale-up/scale-down gives me some confidence. Still, it is pretty scary to do a partial csr recalculate here - even if we do it in scale-up/down, that happens much later in the loop, and I don't think we really use csr all that much at this point. I would be tempted to do a full UpdateNodes(), but that seems like it could create as many problems as it solves:
@drmorr0 @gjtempleton Do you see any risk of continuing the loop here causing problems with AWS placeholders? Looking at #6911 I think it should be fine to continue the loop after deleting the placeholders -> the TargetSize() of the ASG seems to be updated when the instances are deleted, which I think is the only thing we need from the cloudprovider implementation in order to safely proceed.
Force-pushed from 2e31839 to c569ac5
I've updated this PR to use Recalculate() as suggested, @MaciekPytel. If anyone sees issues with this approach, do let me know.
/assign @gjtempleton
@@ -176,6 +176,73 @@ func TestHasNodeGroupStartedScaleUp(t *testing.T) {
	}
}

func TestDeletingFailedScaleUpNodes(t *testing.T) { |
It's nitpicking, but I kinda don't like the name of this test. The test itself is checking that UpcomingNodes update correctly after a nodeGroup resize + csr.Recalculate(), which is an important assumption for the logic in core/ related to deleting failed scale-up nodes, but it doesn't really have anything to do with the logic related to scale-up failures in CSR. Just looking at csr, it feels impossible to understand what's going on here.
Admittedly, I don't know if renaming it to TestUpcomingNodesAfterResizeAndRecalculate would help - it's a mouthful and probably not any clearer. Maybe what we're missing is a comment that explains what is being tested and why?
Fair point. I've renamed the test and added a comment.
/lgtm
Left a single nitpick, but overall looks good. Really happy to finally see that issue fixed :)
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: bskiba, MaciekPytel
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Force-pushed from c569ac5 to dc53cb9
@MaciekPytel I addressed the last comment, thanks a lot for your review. Are we OK to merge this now?
Clean up cluster state after removing failed scale up nodes, so that the loop can continue. Most importantly, update the target for the affected node group, so that the deleted nodes are not considered upcoming.
Force-pushed from dc53cb9 to 939123c
/lgtm
What type of PR is this?
/kind feature
What this PR does / why we need it:
Instead of breaking the loop after deleting failed scale-up nodes, clean up the state and continue. Most importantly, update the target for the affected node group, so that the deleted nodes are not considered upcoming.
This PR speeds up handling of failed scale-ups, which is especially useful with multiple quota or stockout errors across the cluster.
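To illustrate why updating the target matters, here is a toy model of the upcoming-nodes arithmetic. The real ClusterStateRegistry does more bookkeeping (node readiness, long-unregistered nodes, etc.); this simplification only shows the effect of the target size:

```go
package main

import "fmt"

// upcoming is a toy model: upcoming nodes are roughly the node group's target
// size minus the nodes the autoscaler already knows about.
func upcoming(targetSize, knownNodes int) int {
	if n := targetSize - knownNodes; n > 0 {
		return n
	}
	return 0
}

func main() {
	// A scale-up to 10 failed with only 4 nodes registered: 6 nodes look
	// upcoming, so without cleanup the autoscaler keeps waiting for them.
	fmt.Println(upcoming(10, 4)) // 6

	// After deleting the 6 failed nodes the target is updated back to 4,
	// so nothing is reported as upcoming and the loop can act immediately.
	fmt.Println(upcoming(4, 4)) // 0
}
```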
Special notes for your reviewer:
The PR consists of 3 commits. The actual change is in commit #3, while commits #1 and #2 are refactors necessary for the actual change.
Does this PR introduce a user-facing change?
Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: