MCM doesn't remove the machine which CA wants #159
Comments
@himanshu-kun You have mentioned internal references in the public. Please check. |
Proposal
in CAS:
Edge based logic: (1st level protection)
Level based logic: (2nd level protection)
in MCM:
Documentation update required here
Why not use the CAS flag
|
As discussed out-of-band, our priorities, or an annotation (mentioned in "further optimization"), all suffer from the "sync problem", i.e. keeping the state of CA in sync with MCM, because we use an "independent" marker (and not CA's marker) to record which nodes must be terminated. Maybe it's easier to accept CA's way of getting rid of machines: when CA finally wants to get rid of the machine after the drain, we go ahead and decrease the replica count and let the machine set controller check whether it sees any machine whose node resource has the "magic" taint, and prefer terminating it first. This way, there is nothing we have to keep in sync, at the expense of tying us to the CA, but we are already extremely tightly coupled. |
Will commence this after autoscaler rebase. |
Post Grooming
|
/assign |
@ashwani2k You have mentioned internal references in the public. Please check. |
What happened:
We have seen a case where CAS wants to remove machine A due to low utilization, but machine B is removed by MCM when CAS scales down the machinedeployment.
This happened because CAS had decided to remove machine B some time back, and so the ToBeDeleted taint was placed on the node. But later, due to unexpected circumstances (an autoscaler restart after priority 1 is set, in which case it clears all ToBeDeleted taints, or the machineDeployment couldn't be scaled down, etc.), the autoscaler reverts its decision and removes the ToBeDeleted taint.
This leaves machine B with priority 1, and later, when machine A is asked to be removed, machine B is picked up by MCM.
What you expected to happen:
The machine that CAS selected (machine A) should be the one that is scaled down.
How to reproduce it (as minimally and precisely as possible):
Anything else we need to know:
One solution is to enhance MCM to reset the priority to 3 for the machine object if the ToBeDeleted CAS taint on the corresponding node object is removed.
High demand, see live issue # 2423
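The proposed fix can be sketched roughly as follows. This is an illustration under stated assumptions, not MCM's implementation: the `Machine` type and the `resetStalePriority` helper are hypothetical, the taint key is the one Cluster Autoscaler is understood to use, and the values 1 and 3 are taken from the description above.

```go
package main

import "fmt"

const (
	// Assumed taint key that Cluster Autoscaler places on nodes it is removing.
	toBeDeletedTaint = "ToBeDeletedByClusterAutoscaler"
	// Priority values as described in this issue.
	scaleDownPriority = 1
	defaultPriority   = 3
)

// Machine is a hypothetical, simplified stand-in for the real Machine object.
type Machine struct {
	Name     string
	Priority int
}

// nodeHasTaint reports whether key is among the node's taint keys.
func nodeHasTaint(taintKeys []string, key string) bool {
	for _, k := range taintKeys {
		if k == key {
			return true
		}
	}
	return false
}

// resetStalePriority restores the default priority when the machine is still
// marked for scale-down but its node no longer carries the CA taint.
// It returns true if the machine was changed.
func resetStalePriority(m *Machine, nodeTaintKeys []string) bool {
	if m.Priority != scaleDownPriority {
		return false // nothing to undo
	}
	if nodeHasTaint(nodeTaintKeys, toBeDeletedTaint) {
		return false // CA still wants this machine gone
	}
	m.Priority = defaultPriority
	return true
}

func main() {
	// machine-b was marked earlier, but CA has since removed the taint.
	b := Machine{Name: "machine-b", Priority: scaleDownPriority}
	changed := resetStalePriority(&b, nil)
	fmt.Println(b.Name, b.Priority, changed)
}
```

One caveat with this sketch: it cannot tell a priority of 1 set by CAS from one set manually by an operator, so a real implementation would probably need a way to distinguish the two.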
Environment:
g/autoscaler v1.25.0 and below.
cc @unmarshall