
Subtract toBeDeleted nodes from number of upcoming nodes; cleanup toBeDeleted taints from all nodes, not only ready ones #4211

Conversation

alfredkrohmer
Contributor

Fixes #3949

…eDeleted taints from all nodes, not only ready ones
k8s-ci-robot added the cncf-cla: yes and size/S labels on Jul 21, 2021
Member

@x13n x13n left a comment


Can you add some unit tests that would fail without this change?

k8s-ci-robot added the size/M label and removed the size/S label on Sep 14, 2021
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: alfredkrohmer
To complete the pull request process, please assign feiskyer after the PR has been reviewed.
You can assign the PR to them by writing /assign @feiskyer in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@alfredkrohmer
Contributor Author

@x13n Added a unit test in TestUpcomingNodes. Without the fix, the test is failing with:

--- FAIL: TestUpcomingNodes (0.00s)
    clusterstate_test.go:518: 
                Error Trace:    clusterstate_test.go:518
                Error:          "map[ng1:%!s(int=6) ng2:%!s(int=1) ng3:%!s(int=2) ng5:%!s(int=1)]" should not contain "ng5"
                Test:           TestUpcomingNodes
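
For anyone skimming, here is a highly simplified, self-contained sketch of the kind of check the new test case makes. This is not the actual clusterstate_test.go code; the helper below only mimics the upcoming-nodes arithmetic, and the ng5 scenario assumes a node group of size 1 whose only node carries the ToBeDeleted taint:

package clusterstate

import (
    "testing"

    "github.com/stretchr/testify/assert"
)

// readiness mirrors the per-node-group counters kept by the cluster state
// registry (field names are illustrative).
type readiness struct {
    Target, Ready, Unready, LongUnregistered, Deleted int
}

// upcomingNodes returns, per node group, how many nodes CA believes are still
// being provisioned. subtractDeleted toggles the behavior proposed in this PR.
func upcomingNodes(groups map[string]readiness, subtractDeleted bool) map[string]int {
    result := map[string]int{}
    for id, r := range groups {
        upcoming := r.Target - r.Ready - r.Unready - r.LongUnregistered
        if subtractDeleted {
            upcoming -= r.Deleted
        }
        if upcoming > 0 {
            result[id] = upcoming
        }
    }
    return result
}

func TestUpcomingNodesSketch(t *testing.T) {
    groups := map[string]readiness{
        // ng5: target size 1, its single node is tainted ToBeDeleted and is
        // therefore counted as Deleted instead of Ready.
        "ng5": {Target: 1, Deleted: 1},
    }

    // Without the fix the deleted node is not subtracted, so CA thinks a node
    // is still coming up in ng5.
    assert.Contains(t, upcomingNodes(groups, false), "ng5")

    // With the fix ng5 has no upcoming nodes - the same property asserted by
    // the failing check above ("... should not contain \"ng5\"").
    assert.NotContains(t, upcomingNodes(groups, true), "ng5")
}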

alfredkrohmer force-pushed the bugfix/include-tobedeletednodes-in-calculation-of-upcoming-nodes branch from 76b3d5f to a1c801f on September 14, 2021
@x13n
Member

x13n commented Sep 17, 2021

I read the original issue and while I understand the need to clean up taints on all nodes instead of just ready ones, I don't understand why we're reducing the number of nodes that should come up by subtracting the nodes that are being deleted.

Suppose:

  1. there are 10 nodes in the node group
  2. one of them starts being scaled down, target size is now 9
  3. before all pods are successfully evicted, new unschedulable pods appear in the cluster, increasing the target size back to 10
  4. new node won't get scheduled now, because we're going to subtract the deleted node from the target size
  5. pods will stay pending until the old node is deleted - only after that happens, the new node will get provisioned

@alfredkrohmer
Contributor Author

I don't understand why we're reducing the number of nodes that should come up by subtracting the nodes that are being deleted

That's not the case here: it's not being subtracted from the number of nodes that should come up, it's subtracted from the number of nodes that are supposedly coming up right now (i.e. the nodes that cluster-autoscaler thinks are coming up right now).

@x13n
Member

x13n commented Oct 25, 2021

That's not the case here, this is not being subtracted from the nodes that should come up, it's subtracted from the number of nodes that are supposedly coming up right now (that cluster-autoscaler thinks are coming up right now).

The number of nodes that are supposedly coming up right now will later impact the number of nodes that should come up: CA will think that there's no need to create more nodes, since they are coming up already.

@alfredkrohmer
Contributor Author

The number of nodes that are supposedly coming up right now will later impact the number of nodes that should come up

Exactly

CA will think that there's no need to create more nodes, since they are coming up already.

My PR reduces the number of nodes that CA thinks are coming up by subtracting the number of nodes being deleted - otherwise CA thinks these nodes that are being deleted are actually coming up right now, which is not true. This actually increases (not decreases) the number of nodes that should come up.
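
To put numbers on it (values chosen for illustration only): with a target size of 10, 9 Ready nodes and 1 node tainted ToBeDeleted (and therefore counted as Deleted), the current code computes 10 - 9 = 1 upcoming node, so CA believes a replacement is already on its way and holds off scale-up. With the Deleted nodes subtracted it computes 10 - 9 - 1 = 0, and CA is free to scale up for the pending pods.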

@x13n
Member

x13n commented Nov 9, 2021

Hm, you're right, this will increase, not decrease the number of upcoming nodes. That brings about another concern though: if some nodes are tainted for deletion, but don't actually end up deleted, we may exceed the max size. Nodes that are being deleted can end up staying around e.g. if CA is restarted in the meantime.

@alfredkrohmer
Contributor Author

Nodes that are being deleted can end up staying around e.g. if CA is restarted in the meantime.

That's the other part of the PR where deletion taints are properly removed on CA startup.

@x13n
Member

x13n commented Nov 10, 2021

That's the other part of the PR where deletion taints are properly removed on CA startup.

Proper removal of deletion taints is not my concern here. We may still exceed the max number of nodes in the following scenario:

  1. Node X is tainted for deletion
  2. Another node Y starts getting created as a part of scale up
  3. CA crashes before X is actually deleted
  4. New CA process clears deletion taint from node X

Now we have both X and Y up and running even though they might be exceeding the limit.

@alfredkrohmer
Contributor Author

alfredkrohmer commented Nov 11, 2021

I don't see how this behavior could be triggered by my PR. My PR doesn't touch the scale-up logic; again, it only changes the number of nodes CA thinks are coming up right now, it doesn't change the number of nodes it thinks are already there. I don't see how this would cause the max number of nodes in a NG to be exceeded.

@alfredkrohmer
Contributor Author

@x13n I think I found what you mean. Here:

if context.MaxNodesTotal > 0 && len(nodes)+newNodes+len(upcomingNodes) > context.MaxNodesTotal {
    klog.V(1).Infof("Capping size to max cluster total size (%d)", context.MaxNodesTotal)
    newNodes = context.MaxNodesTotal - len(nodes) - len(upcomingNodes)

the number of upcoming nodes is considered when checking whether the maximum number of nodes in the whole cluster would be exceeded. If you want, I can add/subtract the deleted nodes here as well; then it's the same behavior as before. However, I'm not sure what the best course of action would actually be here (allow the max to be exceeded while nodes are being deleted vs. risk that scale-up is not possible when the max would be exceeded).
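
For illustration, a small self-contained sketch of what accounting for deleted nodes in that cap could look like; the deletedNodes parameter is hypothetical and nothing equivalent is currently passed into ScaleUp:

// cappedNewNodes is a sketch, not autoscaler code: it mirrors the cap from the
// snippet above and additionally counts nodes that are tainted for deletion
// but still present in the cluster.
func cappedNewNodes(maxNodesTotal, currentNodes, upcomingNodes, deletedNodes, newNodes int) int {
    if maxNodesTotal > 0 && currentNodes+upcomingNodes+deletedNodes+newNodes > maxNodesTotal {
        newNodes = maxNodesTotal - currentNodes - upcomingNodes - deletedNodes
    }
    if newNodes < 0 {
        newNodes = 0
    }
    return newNodes
}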

As for the per-node group max number of nodes, this shouldn't be affected by my change, as this check considers only the actual number of nodes in the node group (independent of the state of the node in k8s):

totalCapacity := 0
for _, ng := range groups {
    currentSize, err := ng.TargetSize()
    if err != nil {
        return []ScaleUpInfo{}, errors.NewAutoscalerError(
            errors.CloudProviderError,
            "failed to get node group size: %v", err)
    }
    maxSize := ng.MaxSize()
    if currentSize == maxSize {
        // group already maxed, ignore it
        continue
    }
    info := ScaleUpInfo{
        Group:       ng,
        CurrentSize: currentSize,
        NewSize:     currentSize,
        MaxSize:     maxSize}
    scaleUpInfos = append(scaleUpInfos, info)
    if maxSize-currentSize > 0 {
        totalCapacity += maxSize - currentSize
    }
}
if totalCapacity < newNodes {
    klog.V(2).Infof("Requested scale-up (%v) exceeds node group set capacity, capping to %v", newNodes, totalCapacity)
    newNodes = totalCapacity
}

@x13n
Member

x13n commented Nov 23, 2021

Apologies for the late response. Yes, I meant the logic in

if context.MaxNodesTotal > 0 && len(nodes)+newNodes+len(upcomingNodes) > context.MaxNodesTotal {
    klog.V(1).Infof("Capping size to max cluster total size (%d)", context.MaxNodesTotal)
    newNodes = context.MaxNodesTotal - len(nodes) - len(upcomingNodes)

I think deleted nodes should also be considered here. In edge cases where deletion takes an arbitrarily long time, we could easily exceed the maximum if deleted nodes are not accounted for in the scale-up logic.

@alfredkrohmer
Contributor Author

alfredkrohmer commented Nov 26, 2021

I tried to get the number of deleted nodes at this point in the code so I could add/subtract them, and while going through the call graph up to where ScaleUp is called, I suspect I found more errors in the surrounding implementation.

One thing upfront that I found: len(nodes) (here in the ScaleUp function) does contain the deleted nodes - but only if they are not cordoned; hence:

  • for non-cordoned "deleted" nodes: the current calculation is actually broken, it's off by n (the number of deleted nodes) in the other direction, i.e. it already prevents scale-up at max-n, because these nodes are counted twice in the sum, in len(nodes) and in len(upcomingNodes); my change would fix this.
  • for cordoned "deleted" nodes: the current calculation is correct and my change would break it by allowing scale-up to max+n; see the following points for a proposed solution.

In the RunOnce function here:

allNodes, readyNodes, typedErr := a.obtainNodeLists(a.CloudProvider)

allNodes and readyNodes are retrieved (allNodes includes unready and cordoned nodes, readyNodes doesn't).

In the same function, readyNodes is used three times in a relevant way.

First time to derive node group infos from the ready nodes:

nodeInfosForGroups, autoscalerError := a.processors.TemplateNodeInfoProvider.Process(autoscalingContext, readyNodes, daemonsets, a.ignoredTaints)

Second time for another check that the max number of nodes in the cluster is not yet reached:

} else if a.MaxNodesTotal > 0 && len(readyNodes) >= a.MaxNodesTotal {

Third time to finally call the ScaleUp function:

scaleUpStatus, typedErr = ScaleUp(autoscalingContext, a.processors, a.clusterStateRegistry, unschedulablePodsToHelp, readyNodes, daemonsets, nodeInfosForGroups, a.ignoredTaints)

I believe it would be more correct to pass allNodes instead in these three uses of readyNodes.

  1. In the first use, I think it makes sense to build node group info from unready nodes as well (e.g. if an unready node is the only one in its node group).
  2. In the second use, as per your argument, cordoned and unready nodes should count into the max number of nodes.
  3. In the third use (the actual problem I was trying to solve), this would make len(nodes) (here) contain the cordoned nodes and hence they would also be considered correctly for the max number of nodes in the cluster - so this would also make my previous change correct for cordoned "deleted" nodes.

Any thoughts?

@x13n
Member

x13n commented Dec 21, 2021

Apologies again for the super slow response. I had to take some time to go through the existing code. I think the proposal makes sense; I'm just a bit worried that I didn't think of some edge case - historically, clusterstate changes have been a source of "interesting" bugs - and would prefer to keep this change behind a feature flag because of this. It can be on by default and can be removed in a few versions if there are no issues, but it might be safer to have a quick rollback mechanism. WDYT?

@k8s-triage-robot

The Kubernetes project currently lacks enough contributors to adequately respond to all issues and PRs.

This bot triages issues and PRs according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue or PR as fresh with /remove-lifecycle stale
  • Mark this issue or PR as rotten with /lifecycle rotten
  • Close this issue or PR with /close
  • Offer to help out with Issue Triage

Please send feedback to sig-contributor-experience at kubernetes/community.

/lifecycle stale

k8s-ci-robot added the lifecycle/stale label on Mar 21, 2022
@x13n
Member

x13n commented Mar 22, 2022

/remove-lifecycle stale

Maybe a feature flag is an overkill, but would appreciate another pair of eyes on this. @MaciekPytel - can you take a look?

k8s-ci-robot removed the lifecycle/stale label on Mar 22, 2022
fookenc added a commit to fookenc/autoscaler that referenced this pull request Apr 7, 2022
…des from being considered from available capacity to schedule Pending Pods to.
@MaciekPytel
Contributor

Sorry for missing this for so long. I understand the problem being solved. I'm worried that this PR may fix it, but at the same time introduce an 'opposite' issue.

  1. Today we count a node that is intended to be deleted as "Deleted". Because Deleted is mutually exclusive with Ready/Unready, and because of how the upcoming nodes calculation works, CA overestimates the number of upcoming nodes and doesn't trigger scale-up.
  2. It can take time for the node controller to remove the Kubernetes node object after the underlying VM has already been deleted. During this time the node also shows up in clusterstate as Deleted. If we subtracted Deleted from TargetSize as proposed in this PR, we would underestimate the number of upcoming nodes (we'd be subtracting a node that no longer counts towards TargetSize, as the VM no longer exists on the cloudprovider side). Underestimating the number of upcoming nodes would most likely lead to too much scale-up.
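
As a quick illustration of case 2 (numbers assumed for illustration): suppose a node group is at target size 11 after a scale-up, with 9 Ready nodes, 2 VMs actually booting, and 1 stale node object whose VM was already removed (counted as Deleted). The current code computes 11 - 9 = 2 upcoming nodes, which is correct; subtracting Deleted would give 11 - 9 - 1 = 1, so CA would think only one node is on its way and could add an extra, unneeded node.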

The description of Deleted in the Readiness struct makes it pretty clear to me that the intent of a 'Deleted' node in Readiness is case 2 above. And I think the current implementation addresses this problem correctly. If Deleted nodes are defined as the nodes that no longer exist on the provider side, then we shouldn't subtract them from TargetSize.

I think the bug is that we're incorrectly identifying nodes as Deleted, and the way to fix both problems 1 and 2 would be to fix that. Unfortunately that would probably require a slightly bigger code change: I think we should check if a particular instance exists on the cloudprovider side as part of the readiness calculation. This is very similar to how we identify unregistered nodes, just doing the check in the other direction (check if there is a matching instance for a given node).

WDYT @alfredkrohmer @x13n ?

@fookenc
Contributor

fookenc commented Apr 27, 2022

Hi @MaciekPytel, thank you for taking a look! I wouldn't want to introduce new issues into the autoscaler, but I'm concerned that the readiness calculation isn't aware of nodes being drained & deleted.

To be exact, the readiness calculation doesn't consider taints that may exist on the nodes. If a node has the ToBeDeleted taint, it has an effect of NoSchedule. Pods can't be scheduled onto the node, but the autoscaler believes that enough resources exist to handle the pending pods. Unfortunately this blocks scale-ups indefinitely while the node is being drained and deleted. This can be observed in the logs attached to bug #4456.

I1105 11:07:58.593642 1 filter_out_schedulable.go:157] Pod default.app-1-cf767bfc7-m5672 marked as unschedulable can be scheduled on node template-node-for-node-group-2-az2-6211936831103862438-0. Ignoring in scale up.
I1105 11:07:58.593798 1 filter_out_schedulable.go:157] Pod default.app-1-cf767bfc7-jtcc4 marked as unschedulable can be scheduled on node template-node-for-node-group-2-az2-6211936831103862438-0. Ignoring in scale up.
I1105 11:07:58.593818 1 filter_out_schedulable.go:170] 0 pods were kept as unschedulable based on caching
I1105 11:07:58.593825 1 filter_out_schedulable.go:171] 2 pods marked as unschedulable can be scheduled.

CA seems to think that the above node can be used to schedule the unschedulable pods, but it is actually being deleted due to under-utilisation. Due to a max termination grace period of 3 hours (and because the pods already running on the node failed to gracefully terminate before that), CA had to wait 3 hours for all pods to be removed from the affected node. During this interval, no scale-up was triggered by CA even though there were pods that needed to be scheduled.

I have found that scaling up the pods enough to exceed the capacity/resources available on the tainted node will result in a new node being added. Given that the Deleted field is the count of nodes with the ToBeDeleted taint, I assumed that this value would typically only be zero or 1.

One alternative that I've been exploring is to have a separate controller monitor for the ToBeDeleted taint and trigger over-provisioning of a new node with pause pods, to prevent scale-ups from being blocked. This moves the burden of overprovisioning outside of the autoscaler instead of adjusting the readiness check to exclude upcoming nodes that shouldn't be considered viable.

Please let me know your thoughts! Thanks!

@MaciekPytel
Contributor

Given that the Deleted field is the count of nodes with the ToBeDeleted taint

I think this is the problem. It's not intended to be the number of nodes with the ToBeDeleted taint. It's intended to be the number of nodes that don't have a matching VM on the cloudprovider side, because the VM has already been deleted.

My guess is that looking at a taint seemed like an easy proxy for nodes that have already been deleted by CA [1]. As we can see, it isn't. I think the right way to fix this issue without introducing new problems is to stop identifying nodes that are being deleted as already deleted. This should fix your scenario, as a node that is being deleted would not be counted as Deleted, meaning that it would be counted as either Ready or Unready, both of which are subtracted from upcoming nodes.

It also means that nodes that have actually been deleted (in the sense that the VM no longer exists and it's just a Kubernetes object that hasn't been cleaned up yet) would still not be included in the upcoming nodes calculation. This part is correct - those nodes no longer count towards NodeGroup.TargetSize() and so should not be subtracted from this number.

We already get the list of cloudprovider instances in UpdateNodes() before the updateReadinessStats() call. I think we could use this data in updateReadinessStats() to only count a node as Deleted if there is no corresponding instance.

The potentially tricky part would be understanding if there is any negative interaction with node_instances_cache (do we need to call InvalidateNodeInstancesCacheEntry() on scale-down? I'm guessing yes, but I haven't really dived into the code deeply enough to know).
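A minimal sketch of that idea, assuming the instance data is already available at that point (the helper and its parameters are hypothetical, not existing autoscaler code):

import (
    apiv1 "k8s.io/api/core/v1"
)

// countAsDeleted sketches the classification discussed above: a node is only
// treated as Deleted when no cloudprovider instance matches it anymore.
// instanceExists is a hypothetical lookup built from the instance list CA
// already fetches in UpdateNodes().
func countAsDeleted(node *apiv1.Node, instanceExists func(providerID string) bool) bool {
    // A node whose VM still exists (e.g. one that is merely tainted
    // ToBeDeleted and being drained) stays Ready/Unready, so it keeps being
    // subtracted in the upcoming-nodes calculation.
    return !instanceExists(node.Spec.ProviderID)
}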

Footnotes

  1. This makes more sense if you consider when this logic was implemented. Back then node deletion was synchronous (i.e. CA would completely block for the entire process), so from CA's perspective tainting and deleting the node was effectively atomic.

@x13n
Member

x13n commented Apr 28, 2022

That part makes sense to me. I'd still clear node taints from all nodes on startup though - I don't see a reason to keep a (possibly intermittently) unready node tainted when CA restarts. It wouldn't automatically resume deletion anyway.

@fookenc
Contributor

fookenc commented Apr 29, 2022

I made a commit based on the suggestions provided, but I'm not certain that I got it right. Within updateReadinessStats, I switched to using the Cluster State Registry's latest list of nodes unregistered with the cloud provider.

https://github.com/fookenc/autoscaler/blob/049e8ad0fc79bf2d6b9ef18b933efbb3d7d8415f/cluster-autoscaler/clusterstate/clusterstate.go#L541-L548

I also made a change to the GetReadinessState function so that any node with the ToBeDeleted taint key is marked as Unready. I don't know if this is necessary, but it felt like it made sense to mark it that way.

https://github.com/fookenc/autoscaler/blob/049e8ad0fc79bf2d6b9ef18b933efbb3d7d8415f/cluster-autoscaler/utils/kubernetes/ready.go#L72-L77

After building and deploying the changes from this commit [1] to a test cluster, scale-ups were no longer blocked when a node was tainted with ToBeDeleted. However, I'm not sure that these code changes exactly match what was described. Additionally, I didn't make any attempt to address the scale-down path that may need to call InvalidateNodeInstancesCacheEntry.
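
For context, a rough sketch of what such a taint check looks like (the taint key constant is quoted from memory of utils/deletetaint and may differ from the linked commit):

import (
    apiv1 "k8s.io/api/core/v1"
)

// ToBeDeletedTaint is the taint key cluster-autoscaler puts on nodes it is
// draining (key name quoted from memory; see utils/deletetaint for the
// authoritative definition).
const ToBeDeletedTaint = "ToBeDeletedByClusterAutoscaler"

// isTaintedForDeletion sketches the extra readiness check: a node carrying the
// ToBeDeleted taint is reported as not ready, so CA stops treating it as
// schedulable capacity for pending pods.
func isTaintedForDeletion(node *apiv1.Node) bool {
    for _, taint := range node.Spec.Taints {
        if taint.Key == ToBeDeletedTaint {
            return true
        }
    }
    return false
}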

Footnotes

  1. https://github.com/fookenc/autoscaler/commit/049e8ad0fc79bf2d6b9ef18b933efbb3d7d8415f#

@MaciekPytel
Contributor

I'd still clear node taints from all nodes on startup though

+1. No reason to keep taints. I don't think we ever made a conscious decision to skip any nodes when un-tainting; we just implicitly assumed readyNodes ~= allNodes, as normally nodes removed by CA don't become unready until they are deleted on the cloudprovider side. I'm guessing this is an issue when using --cordon-node-before-terminating (which would explain why we haven't noticed it before - we don't set this in any of our testing).

@MaciekPytel
Contributor

https://github.com/fookenc/autoscaler/blob/049e8ad0fc79bf2d6b9ef18b933efbb3d7d8415f/cluster-autoscaler/clusterstate/clusterstate.go#L541-L548

I don't think that implements what I described. An "unregistered node" is a VM that has no matching Node object in k8s. I think a "deleted node" should be a Node object in k8s that has no VM. I don't think we can infer this information from unregistered nodes; we need something similar to getNotRegisteredNodes() working "in the other direction" (i.e. for each node, see if there is a matching instance).

I also made a change to the GetReadinessState function to include the ToBeDeleted taint key to mark any node as Unready. I don't know if this is necessary, but felt that it made sense to mark it that way.

Hmm, good point. It certainly seems logical, but I'm not sure whether this has some undesired side effects (e.g. marking a nodepool as unhealthy during a large scale-down?).

@fookenc
Contributor

fookenc commented May 19, 2022

Hi @MaciekPytel & @x13n,

I've created a new pull request that incorporates some of the changes (tests for UpcomingNodes, clearing taints from ALL nodes on startup) and new changes to identify nodes that no longer have a corresponding VM on the cloud provider. Please let me know your feedback on the proposed changes. If possible, I'd like to have a discussion regarding InvalidateNodeInstancesCacheEntry during scale-down.

Thanks!

k8s-ci-robot added the needs-rebase label on Jul 4, 2022
k8s-ci-robot removed the needs-rebase label on Jul 13, 2022
@x13n
Member

x13n commented Aug 11, 2022

This has been obsoleted by #5054 now.

/close

@k8s-ci-robot
Contributor

@x13n: Closed this PR.

In response to this:

This has been obsoleted by #5054 now.

/close

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@alfredkrohmer
Contributor Author

@x13n Correct me if I'm wrong, but I believe #5054 doesn't actually address the main problem described in #3949. Our problem is not nodes that were deleted in the cloud provider but still have a k8s node object, but rather that nodes being drained by cluster-autoscaler for binpacking are counted as "upcoming" nodes. Or is the PR that you linked fixing the problem by rewiring a bunch of logic so that it gets fixed "on the way" too?

@x13n
Member

x13n commented Oct 6, 2022

The "right way" to address #3949 was discussed above in this PR. In order to get rid of the bug, CA needs to only identify deleted (as in - VM removed) nodes as Deleted. That requires CA to be able to determine whether a given node has a matching instance in a given cloud provider (which is what #5054 is about).

Once proper counting of deleted instances is implemented, the case you described in #3949 will stop happening: an existing instance will be considered by CA as Unready rather than Deleted, so it will be subtracted when counting upcoming nodes.

Does that make sense?

Labels: area/cluster-autoscaler, cncf-cla: yes, size/M
Successfully merging this pull request may close these issues: Node groups get "stuck" during node deletion.