MCM doesn't remove the machine which CA wants #159

himanshu-kun · 2023-01-05T12:14:21Z

What happened:

We have seen a case where CAS wants to remove machine A due to low utilization but machine B is removed by MCM when CAS scales down the machinedeployment.
This has happened because machine B was finalized to be removed by CAS some time back and so ToBeDeleted taint was placed on the node. But later due to unexpected circumstances (autoscaler restart after priority 1 is set, in which case it clears all ToBeDeleted taints, or machineDeployment couldn't be scaled-down etc) autoscaler reverts its decision and removes the ToBeDeleted taint.
This leaves the machine B with priority 1 , and later when machine A is asked to be removed machine B is picked up by MCM

What you expected to happen:
Scale-down of right machine is expected.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know:
One solution is to enhance MCM to reset priority to 3 for the machine obj if the ToBeDeleted CAS taint on the corresponding node object is removed.

High demand , see live issue # 2423

Environment:
g/autoscaler v 1.25.0 and below.

cc @unmarshall

The text was updated successfully, but these errors were encountered:

gardener-robot · 2023-01-05T12:14:59Z

@himanshu-kun You have mentioned internal references in the public. Please check.

himanshu-kun · 2023-01-06T08:32:51Z

Proposal

in CAS:

CAS now places another annotation requester.machine.sapcloud.io whose value would be autoscaler when it calls DeleteMachines function.
This extra annotation is needed to not have tight coupling of MCM to CAS for deletion operation.

Edge based logic: (1st level protection)

reset prio to 3 and remove requester anno, if node update event is there and the old node has taint but new node doesn't

Level based logic: (2nd level protection)

in MCM:

when we call manageReplicas where we call getMachinesToDelete then we will sort machines in the following order:
- M(prio 1, requestor=autoscaler ,node taint present) ,
- M(prio 1 , requestor=autoscaler, node taint absent) -> we select this at this prio because CAS drains node when it selects a node for removal , so possibility of less workload scheduled on the node is high.
- M(prio 1)
- and the same order as we have now....
in the regular machine reconcile we'll check the machine and if it is of the type M(prio 1 , requestor=autoscaler, node taint absent) then we reset the prio1 to prio3 and remove the reqeustor anno.

Documentation update required here

Why not use the CAS flag `cordon-node-before-terminating`?

If we enable this flag , we could have nodes which remain cordoned and ignored by CAS forever . See upstream issue

Other proposals

We thought of reseting the priorities of machines in the DeleteMachine function of CAS only , but in case when reset of priority fails then the entire DeleteMachine operation is termed as failed which leads to removal of the taint and so the node would again wait scale-down-unneeded-time until its removed, which would lead to additional costs.

Further optimization

in CAS , we make DeleteMachines atomic, by adding machine names to be deleted as annotation in machineDeployment and reducing the replicas. However this needs further discussion.

vlerenc · 2023-01-11T11:45:52Z

As discussed out-of-band, our priorities or even/also an annotation (mentioned in "further optimization") all bear the "sync problem", i.e. syncing the state of CA with MCM, because we use an "independent" marker (and not CA's marker) which nodes must be terminated.

Maybe it's easier to accept CA's way of getting rid of machines and when CA wants to finally get rid of the machine after the drain, we go ahead and decrease the replica count and let the machine set controller check whether it sees any machine whose node resource has the "magic" taint and prefer terminating it first. This way, there is nothing we have to keep in sync - at the expense of tying us to the CA, but we are already extremely tightly coupled.

elankath · 2023-01-25T06:32:56Z

Will commence this after autoscaler rebase.

gardener-robot · 2023-02-28T03:32:47Z

@himanshu-kun You have mentioned internal references in the public. Please check.

himanshu-kun · 2023-03-01T07:05:43Z

Post Grooming

Be in sync with Autoscaler and delete prio 1 machine with node having ToBeDeleted taint from CA
- To avoid situations where there are machine obj with prio 1 but node doesn't have the taint, in DeleteNodes() method we will first clear the prio annotation from such machine objects and then only go ahead with what we do already.
The DeleteNode() logic is expected to be atomic from autoscaler's perspective but in our case its not. We need to refactor and design to make it atomic in the long run.

himanshu-kun · 2023-04-18T06:19:22Z

/assign

gardener-robot · 2023-07-21T05:35:46Z

@ashwani2k You have mentioned internal references in the public. Please check.

himanshu-kun added the kind/bug Bug label Jan 5, 2023

himanshu-kun assigned elankath Jan 10, 2023

himanshu-kun assigned himanshu-kun and unassigned elankath Jan 25, 2023

himanshu-kun removed their assignment Mar 1, 2023

himanshu-kun added area/robustness Robustness, reliability, resilience related priority/1 Priority (lower number equals higher priority) needs/planning Needs (more) planning with other MCM maintainers labels Mar 1, 2023

himanshu-kun added priority/critical Needs to be resolved soon, because it impacts users negatively and removed priority/1 Priority (lower number equals higher priority) labels Mar 31, 2023

gardener-robot assigned himanshu-kun Apr 18, 2023

himanshu-kun assigned rishabh-11 and unassigned himanshu-kun Apr 20, 2023

rishabh-11 mentioned this issue May 8, 2023

Fix targeted scale-in logic #215

Merged

himanshu-kun self-assigned this Jun 6, 2023

himanshu-kun added this to the 2023-Q3 milestone Jul 13, 2023

ashwani2k changed the title ~~MCM doesn't remove the machine which CAS wants~~ MCM doesn't remove the machine which CA wants Jul 21, 2023

himanshu-kun assigned unmarshall and unassigned rishabh-11 and himanshu-kun Aug 9, 2023

himanshu-kun closed this as completed in #215 Aug 17, 2023

gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Aug 17, 2023

rishabh-11 assigned rishabh-11 and unassigned unmarshall Aug 24, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

MCM doesn't remove the machine which CA wants #159

MCM doesn't remove the machine which CA wants #159

himanshu-kun commented Jan 5, 2023 •

edited

Loading

gardener-robot commented Jan 5, 2023

himanshu-kun commented Jan 6, 2023

vlerenc commented Jan 11, 2023

elankath commented Jan 25, 2023 •

edited

Loading

gardener-robot commented Feb 28, 2023

himanshu-kun commented Mar 1, 2023

himanshu-kun commented Apr 18, 2023

gardener-robot commented Jul 21, 2023

MCM doesn't remove the machine which CA wants #159

MCM doesn't remove the machine which CA wants #159

Comments

himanshu-kun commented Jan 5, 2023 • edited Loading

gardener-robot commented Jan 5, 2023

himanshu-kun commented Jan 6, 2023

Proposal

Edge based logic: (1st level protection)

Level based logic: (2nd level protection)

Why not use the CAS flag cordon-node-before-terminating?

Other proposals

Further optimization

vlerenc commented Jan 11, 2023

elankath commented Jan 25, 2023 • edited Loading

gardener-robot commented Feb 28, 2023

himanshu-kun commented Mar 1, 2023

Post Grooming

himanshu-kun commented Apr 18, 2023

gardener-robot commented Jul 21, 2023

himanshu-kun commented Jan 5, 2023 •

edited

Loading

Why not use the CAS flag `cordon-node-before-terminating`?

elankath commented Jan 25, 2023 •

edited

Loading