Add atomic scale down option for node groups #5695
Conversation
CC @x13n
deletionStartTime := time.Now()
defer func() { metrics.UpdateDuration(metrics.ScaleDownNodeDeletion, time.Now().Sub(deletionStartTime)) }()

results, ts := a.nodeDeletionTracker.DeletionResults()
scaleDownStatus := &status.ScaleDownStatus{NodeDeleteResults: results, NodeDeleteResultsAsOf: ts}

emptyToDelete, drainToDelete := a.cropNodesToBudgets(empty, drain)
if len(emptyToDelete) == 0 && len(drainToDelete) == 0 {
emptyIndividualToDelete, drainIndividualToDelete, emptyAtomicToDelete, drainAtomicToDelete := a.cropNodesToBudgets(emptyIndividual, drainIndividual, emptyAtomic, drainAtomic)
@x13n
My idea of getting rid of the increasing number of pools here would be to create a dedicated struct for a bucket of nodes to scale down, for example:
type nodeBucket struct {
	nodeGroup cloudprovider.NodeGroup
	nodes     []*apiv1.Node
	atomic    bool
	drain     bool
}
This could be populated by the Planner and processed by each component in the appropriate order (e.g. cropNodesToBudgets would go through atomic first, delete would go through empty first). I think we could also move cropNodesToBudgets() out of actuator.go if we want to keep its size limited; the rest seems more tightly coupled with the actuation logic. Please let me know WDYT.
I'm actually starting to think that cropping to budget shouldn't really be done in actuator. The actuator should do what it is told to do: drain and delete a bunch of nodes. If we move this logic out of actuator (perhaps to a dedicated scale down set processor), a lot of the code here will become simpler. The only remaining issue will be the batching logic. One idea to deal with it would be to have batching criteria adjusted per node group. For most node groups it would be "wait for N nodes or T time, whichever comes first". Atomic node groups would set N equal to number of nodes (so it would have to be dynamic) and set T to +inf. WDYT?
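For illustration, a rough sketch of that idea (hypothetical names, not this PR's API; assumes the time, math and cloudprovider imports) where the batching criteria are resolved per node group:

// Hypothetical per-node-group batching criteria: flush once maxNodes are queued
// or once maxWait has passed, whichever comes first.
type batchingCriteria struct {
	maxNodes int
	maxWait  time.Duration
}

func criteriaFor(ng cloudprovider.NodeGroup, atomic bool, defaults batchingCriteria) (batchingCriteria, error) {
	if !atomic {
		return defaults, nil
	}
	size, err := ng.TargetSize()
	if err != nil {
		return batchingCriteria{}, err
	}
	// Atomic groups: N equals the (dynamic) group size, T is effectively infinite.
	return batchingCriteria{maxNodes: size, maxWait: time.Duration(math.MaxInt64)}, nil
}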
I moved it out of the main actuator object; however, I wanted to limit the amount of changes this PR introduces, so I still left it in the actuator directory. Please LMK if you're OK with that.
Regarding the batching logic, after our offline discussion I ended up with a wrapper over the batcher that queues nodes for deletion within one pass of the scale-down loop, but rolls them back if any other node fails.
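Not the exact code from this PR, but the rollback idea could be sketched roughly like this (hypothetical type; assumes the scheduler reports a per-node error and exposes a way to undo a queued deletion):

// atomicPassBatch collects the nodes queued for deletion during one scale-down
// pass; if any node in the group fails, the already-queued nodes are rolled back.
type atomicPassBatch struct {
	queued []*apiv1.Node
	failed bool
}

func (b *atomicPassBatch) Queue(node *apiv1.Node, err error) {
	if err != nil {
		b.failed = true
		return
	}
	b.queued = append(b.queued, node)
}

func (b *atomicPassBatch) Finish(rollback func(*apiv1.Node)) {
	if !b.failed {
		return // all nodes queued successfully; let the batcher proceed with deletion
	}
	for _, node := range b.queued {
		rollback(node) // e.g. remove the deletion taint and abort the deletion
	}
}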
Force-pushed from 29d36ab to f14d91a
/assign
@@ -154,17 +157,63 @@ func (p *Planner) NodesToDelete(_ time.Time) (empty, needDrain []*apiv1.Node) {
// downs already in progress. If we pass the empty nodes first, they will be first
// to get deleted, thus we decrease chances of hitting the limit on non-empty scale down.
append(emptyRemovable, needDrainRemovable...),
p.context.AutoscalingOptions.MaxScaleDownParallelism)
// No need to limit the number of nodes, since it will happen later, in the actuation stage.
Instead of passing math.MaxInt, this param should just be removed. It is effectively going to be unused anyway.
It's still needed for sequential scale down; for now I didn't want to change the API too much.
Force-pushed from 12fc952 to aa01b73
// ScaleDownBudgetProcessor is responsible for keeping the number of nodes deleted in parallel within defined limits.
type ScaleDownBudgetProcessor struct {
	ctx *context.AutoscalingContext
	nodeDeletionTracker *deletiontracker.NodeDeletionTracker
NodeDeletionTracker should really be an implementation detail of the actuator. All you need here is an interface that can give you DeletionsInProgress.
Done - used the ActuationStatus interface.
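The dependency the budget processor actually needs is roughly this narrow (a sketch only; the real scaledown.ActuationStatus interface carries more methods):

// Minimal view of the actuation status needed for budget calculations.
type deletionsInProgressProvider interface {
	// DeletionsInProgress lists the nodes currently being deleted, split into
	// empty and drained ones.
	DeletionsInProgress() (deletingEmpty, deletingDrained []string)
}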
drainBudget := bp.ctx.MaxDrainParallelism - len(drainInProgress)

emptyToDelete = []*nodeBucket{}
for _, bucket := range emptyAtomic {
The following two loops are essentially identical; please factor them out into a function.
Done - I agree I should've done it right away, I just got a little bit overwhelmed by other changes (the ones I actually ended up reverting).
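For illustration, the factored-out helper could look something like this (hypothetical name and cropping policy; the PR's actual rules for atomic groups may differ):

// cropBucketsToBudget keeps whole buckets, in order, until the budget is exhausted.
// An atomic bucket is kept only if all of its nodes still fit into the budget.
func cropBucketsToBudget(buckets []*nodeBucket, budget int) (kept []*nodeBucket, remaining int) {
	for _, bucket := range buckets {
		if len(bucket.nodes) > budget {
			break
		}
		kept = append(kept, bucket)
		budget -= len(bucket.nodes)
	}
	return kept, budget
}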
cluster-autoscaler/core/scaledown/actuation/group_deletion_scheduler.go (outdated comment, resolved)
Force-pushed from 0316e4f to 4a98cf6
// ScaleDownBudgetProcessor is responsible for keeping the number of nodes deleted in parallel within defined limits.
type ScaleDownBudgetProcessor struct {
	ctx *context.AutoscalingContext
	actuationStatus scaledown.ActuationStatus
Why is the status a part of the processor? The status will change over time, so it should be passed to CropNodes as a parameter.
I think the naming is misleading here... the actual class used there is named "NodeDeletionTracker", which clearly specifies that it should have up-to-date status, but the interface defined for its users is called "ActuationStatus", which suggests something that stays the same over time. I'm not sure what the intention behind that was.
Regarding this particular comment: sure, I can pass the actuation status to "CropNodes", although my preference would be to treat "ActuationStatus" as an active tracker of ongoing deletions (which it is), not as a representation of a single point-in-time status.
//drainIndividual, drainAtomic := bp.groupByNodeGroup(drain)

emptyInProgress, drainInProgress := bp.actuationStatus.DeletionsInProgress()
parallelismBudget := bp.ctx.MaxScaleDownParallelism - len(emptyInProgress) - len(drainInProgress)
nit: I think you could encapsulate the budget updates in a separate object, to avoid arithmetic here and to make the budget calculations unit testable.
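As a sketch of that suggestion (hypothetical type, not part of this PR), the budget bookkeeping could live in a small object with its own unit tests:

// scaleDownBudget tracks how many more nodes may be deleted (and drained) in parallel.
type scaleDownBudget struct {
	parallel int // overall scale-down parallelism left
	drain    int // drain parallelism left
}

// claim reserves budget for a bucket of nodes; drained says whether they need draining.
func (b *scaleDownBudget) claim(nodes int, drained bool) bool {
	if nodes > b.parallel || (drained && nodes > b.drain) {
		return false
	}
	b.parallel -= nodes
	if drained {
		b.drain -= nodes
	}
	return true
}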
I'm not really convinced about that... I was able to mock the budgets by setting AutoscalingOptions in the autoscaling context. I also don't think splitting this calculation out helps readability: if someone is interested in the budget calculation logic, it's easier to read it directly in this ~20-line function than to navigate to a different file.
Previous "CropNodes" function of ScaleDownBudgetProcessor had an assumption that atomically-scaled node groups should be classified as "empty" or "drain" as a whole, however Cluster Autoscaler may classify some of the nodes from a single group as "empty" and other as "drain".
Thanks for all the changes and apologies for taking so long to review this! /lgtm
[APPROVALNOTIFIER] This PR is APPROVED
This pull-request has been approved by: kawych, x13n
The full list of commands accepted by this bot can be found here. The pull request process is described here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing /approve in a comment.
@kawych is there any issue/doc explaining why we need this feature? I came across this PR when I was trying to resolve merge conflicts in #5672 (also changes …)
@vadasambar Sorry, I don't have any more detailed references to provide. But scale up/down speed is not the goal here - instead, this change allows all-or-nothing provisioning of the nodes within a node group: a scale-up happens in one batch, scaling directly to the maximum node group size. It's useful if you want to avoid partial scale-ups (where subsequent scale-ups could fail if there is not enough capacity), so that you don't have to pay for infrastructure until you have enough capacity to actually run your workloads. Note that the cloud provider also has to ensure that a scale-up will not end up in a partial state.
@kawych thank you for the reply.
Sorry if this is a silly question. As a user, wouldn't you expect CA to scale up (even if partially) so that there's space for Pending pods, instead of trying to do it all at once? I am not sure I understand the problem we are trying to solve here. I am guessing this is for workloads like jobs which need to run all at once (but I don't understand why partial scaling wouldn't work there). I guess I am confused as to what benefit waiting to scale until enough capacity is available has.
@vadasambar Your example is correct: this is applicable mostly for batch workloads that work collectively and should be scheduled at once. Additionally, the infrastructure might allow for customized placement of the VMs, but that benefits from knowing the exact number of nodes upfront.
Thank you for the reply and explaining the use-case @kawych.
autoscalingOptions, err := nodeGroup.GetOptions(ctx.NodeGroupDefaults)
if err != nil {
	klog.Errorf("Failed to get autoscaling options for node group %s: %v", nodeGroup.Id(), err)
	continue
}
Cloud providers were previously free to return nil, cloudprovider.ErrNotImplemented from NodeGroup.GetOptions(). All 5 cloudprovider NodeGroups that I spot-checked return this, including Hetzner, which I am using.
With the current master branch I get an error from these lines when cluster-autoscaler tries to scale down my node group, causing cluster-autoscaler to never scale down:
E0815 10:24:08.433379 1 post_filtering_processor.go:59] Failed to get autoscaling options for node group pool1: Not implemented
I think we can add an explicit check for cloudprovider.ErrNotImplemented here and consider it for the standard scale down.
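A minimal sketch of that check (reusing the identifiers from the snippet above; not necessarily the final fix):

autoscalingOptions, err := nodeGroup.GetOptions(ctx.NodeGroupDefaults)
if err != nil && err != cloudprovider.ErrNotImplemented {
	klog.Errorf("Failed to get autoscaling options for node group %s: %v", nodeGroup.Id(), err)
	continue
}
// With ErrNotImplemented, autoscalingOptions stays nil and the node group falls
// back to the standard (non-atomic) scale down path.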
Sure, makes sense. Apologies for introducing the bug; I'll try to follow up with a fix and a test case to detect issues like that in the future.
What type of PR is this?
/kind feature
What this PR does / why we need it:
This PR adds a node group AtomicScaleDown option that allows for all-or-nothing scale down of the node group.
Which issue(s) this PR fixes:
N/A
Special notes for your reviewer:
Does this PR introduce a user-facing change?