Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MCM doesn't remove the machine which CA wants #159

Closed
himanshu-kun opened this issue Jan 5, 2023 · 8 comments · Fixed by #215
Closed

MCM doesn't remove the machine which CA wants #159

himanshu-kun opened this issue Jan 5, 2023 · 8 comments · Fixed by #215
Assignees
Labels
area/robustness Robustness, reliability, resilience related kind/bug Bug needs/planning Needs (more) planning with other MCM maintainers priority/critical Needs to be resolved soon, because it impacts users negatively status/closed Issue is closed (either delivered or triaged)
Milestone

Comments

@himanshu-kun
Copy link

himanshu-kun commented Jan 5, 2023

What happened:

We have seen a case where CAS wants to remove machine A due to low utilization but machine B is removed by MCM when CAS scales down the machinedeployment.
This has happened because machine B was finalized to be removed by CAS some time back and so ToBeDeleted taint was placed on the node. But later due to unexpected circumstances (autoscaler restart after priority 1 is set, in which case it clears all ToBeDeleted taints, or machineDeployment couldn't be scaled-down etc) autoscaler reverts its decision and removes the ToBeDeleted taint.
This leaves the machine B with priority 1 , and later when machine A is asked to be removed machine B is picked up by MCM

What you expected to happen:
Scale-down of right machine is expected.

How to reproduce it (as minimally and precisely as possible):

Anything else we need to know:
One solution is to enhance MCM to reset priority to 3 for the machine obj if the ToBeDeleted CAS taint on the corresponding node object is removed.

High demand , see live issue # 2423

Environment:
g/autoscaler v 1.25.0 and below.

cc @unmarshall

@himanshu-kun himanshu-kun added the kind/bug Bug label Jan 5, 2023
@gardener-robot
Copy link

@himanshu-kun You have mentioned internal references in the public. Please check.

@himanshu-kun
Copy link
Author

Proposal

in CAS:

  • CAS now places another annotation requester.machine.sapcloud.io whose value would be autoscaler when it calls DeleteMachines function.
  • This extra annotation is needed to not have tight coupling of MCM to CAS for deletion operation.

Edge based logic: (1st level protection)

  • reset prio to 3 and remove requester anno, if node update event is there and the old node has taint but new node doesn't

Level based logic: (2nd level protection)

in MCM:

  • when we call manageReplicas where we call getMachinesToDelete then we will sort machines in the following order:
    • M(prio 1, requestor=autoscaler ,node taint present) ,
    • M(prio 1 , requestor=autoscaler, node taint absent) -> we select this at this prio because CAS drains node when it selects a node for removal , so possibility of less workload scheduled on the node is high.
    • M(prio 1)
    • and the same order as we have now....
  • in the regular machine reconcile we'll check the machine and if it is of the type M(prio 1 , requestor=autoscaler, node taint absent) then we reset the prio1 to prio3 and remove the reqeustor anno.

Documentation update required here

Why not use the CAS flag cordon-node-before-terminating?

If we enable this flag , we could have nodes which remain cordoned and ignored by CAS forever . See upstream issue

Other proposals

We thought of reseting the priorities of machines in the DeleteMachine function of CAS only , but in case when reset of priority fails then the entire DeleteMachine operation is termed as failed which leads to removal of the taint and so the node would again wait scale-down-unneeded-time until its removed, which would lead to additional costs.

Further optimization

  • in CAS , we make DeleteMachines atomic, by adding machine names to be deleted as annotation in machineDeployment and reducing the replicas. However this needs further discussion.

@vlerenc
Copy link
Member

vlerenc commented Jan 11, 2023

As discussed out-of-band, our priorities or even/also an annotation (mentioned in "further optimization") all bear the "sync problem", i.e. syncing the state of CA with MCM, because we use an "independent" marker (and not CA's marker) which nodes must be terminated.

Maybe it's easier to accept CA's way of getting rid of machines and when CA wants to finally get rid of the machine after the drain, we go ahead and decrease the replica count and let the machine set controller check whether it sees any machine whose node resource has the "magic" taint and prefer terminating it first. This way, there is nothing we have to keep in sync - at the expense of tying us to the CA, but we are already extremely tightly coupled.

@elankath
Copy link

elankath commented Jan 25, 2023

Will commence this after autoscaler rebase.

@gardener-robot
Copy link

@himanshu-kun You have mentioned internal references in the public. Please check.

@himanshu-kun
Copy link
Author

Post Grooming

  • Be in sync with Autoscaler and delete prio 1 machine with node having ToBeDeleted taint from CA
    • To avoid situations where there are machine obj with prio 1 but node doesn't have the taint, in DeleteNodes() method we will first clear the prio annotation from such machine objects and then only go ahead with what we do already.
  • The DeleteNode() logic is expected to be atomic from autoscaler's perspective but in our case its not. We need to refactor and design to make it atomic in the long run.

@himanshu-kun himanshu-kun removed their assignment Mar 1, 2023
@himanshu-kun himanshu-kun added area/robustness Robustness, reliability, resilience related priority/1 Priority (lower number equals higher priority) needs/planning Needs (more) planning with other MCM maintainers labels Mar 1, 2023
@himanshu-kun himanshu-kun added priority/critical Needs to be resolved soon, because it impacts users negatively and removed priority/1 Priority (lower number equals higher priority) labels Mar 31, 2023
@himanshu-kun
Copy link
Author

/assign

@himanshu-kun himanshu-kun self-assigned this Jun 6, 2023
@himanshu-kun himanshu-kun added this to the 2023-Q3 milestone Jul 13, 2023
@ashwani2k ashwani2k changed the title MCM doesn't remove the machine which CAS wants MCM doesn't remove the machine which CA wants Jul 21, 2023
@gardener-robot
Copy link

@ashwani2k You have mentioned internal references in the public. Please check.

@gardener-robot gardener-robot added the status/closed Issue is closed (either delivered or triaged) label Aug 17, 2023
@rishabh-11 rishabh-11 assigned rishabh-11 and unassigned unmarshall Aug 24, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/robustness Robustness, reliability, resilience related kind/bug Bug needs/planning Needs (more) planning with other MCM maintainers priority/critical Needs to be resolved soon, because it impacts users negatively status/closed Issue is closed (either delivered or triaged)
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants