Ensure that there is a single actor which reduces the machine deployment replicas #181

Open
unmarshall opened this issue Feb 28, 2023 · 0 comments
Labels
kind/enhancement Enhancement, improvement, extension lifecycle/rotten Nobody worked on this for 12 months (final aging stage) needs/planning Needs (more) planning with other MCM maintainers priority/3 Priority (lower number equals higher priority)

Comments

@unmarshall
What would you like to be added:

Context:
Issue #118 shows that even a small timing difference between CA and MCM can lead CA's MCM provider to reduce the replicas of a MachineDeployment to 0, deleting newly launched VMs in the process. In that issue, the Machine controller transitioned a Machine to the Failed state because it could not start successfully within the 20-minute timeout, and then began launching a new VM. Meanwhile, CA observed the same failure (1-2 seconds earlier than MCM) and marked the machine as a candidate for deletion. The deletion is handled by the MCM provider (https://github.com/gardener/autoscaler/blob/machine-controller-manager-provider/cluster-autoscaler/cloudprovider/mcm/mcm_manager.go#L435), which adds a priority annotation and reduces the replicas of the MachineDeployment. The original replica count was 1, so CA reduced it to 0. MCM, which was in the middle of launching the replacement VM, then saw that the replicas were set to 0 and stopped all machines.

This happens because the single-responsibility principle is violated with respect to managing the replicas of a MachineDeployment: both CA and MCM act on the same failure independently.
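The sequence above can be reproduced with a small, deterministic model. This is only a sketch under simplifying assumptions: the `Machine` and `MachineDeployment` types and the `mcmReconcile`, `caScaleIn`, and `simulate` functions are hypothetical stand-ins written for illustration, not the actual MCM or CA code, and the race is modelled as CA acting on a snapshot taken before MCM reconciles.

```go
package main

import "fmt"

// Machine and MachineDeployment are hypothetical, stripped-down stand-ins
// for the real MCM objects; only the fields needed for the race are kept.
type Machine struct {
	Name  string
	Phase string // "Pending", "Running" or "Failed"
}

type MachineDeployment struct {
	Replicas int
}

// mcmReconcile models MCM: it drops Failed machines, launches replacements
// until the machine count matches Replicas, and removes surplus machines.
func mcmReconcile(md *MachineDeployment, machines []Machine, nextID *int) []Machine {
	kept := []Machine{}
	for _, m := range machines {
		if m.Phase != "Failed" {
			kept = append(kept, m)
		}
	}
	for len(kept) < md.Replicas {
		*nextID++
		kept = append(kept, Machine{Name: fmt.Sprintf("machine-%d", *nextID), Phase: "Pending"})
	}
	if len(kept) > md.Replicas {
		kept = kept[:md.Replicas]
	}
	return kept
}

// caScaleIn models the second actor: it lowers Replicas directly for every
// Failed machine in its own (possibly stale) view of the machines.
func caScaleIn(md *MachineDeployment, observed []Machine) {
	for _, m := range observed {
		if m.Phase == "Failed" && md.Replicas > 0 {
			md.Replicas--
		}
	}
}

// simulate replays the interleaving described in the issue and returns the
// final replica count and the number of machines left.
func simulate() (replicas int, machines int) {
	md := &MachineDeployment{Replicas: 1}
	id := 1
	current := []Machine{{Name: "machine-1", Phase: "Failed"}}

	// CA takes its snapshot while machine-1 is still listed as Failed.
	caView := append([]Machine(nil), current...)

	current = mcmReconcile(md, current, &id) // MCM launches the replacement
	caScaleIn(md, caView)                    // CA lowers replicas from 1 to 0
	current = mcmReconcile(md, current, &id) // MCM removes the replacement

	return md.Replicas, len(current)
}

func main() {
	r, m := simulate()
	fmt.Printf("replicas=%d machines=%d\n", r, m)
}
```

In this model the deployment ends with zero replicas and zero machines even though one machine was desired throughout, because two actors wrote to `Replicas` based on the same failure.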

Why is this needed:

The responsibilities of CA and MCM need clear boundaries so that CA does not override decisions that belong to MCM.

CA's responsibility:

  1. Scale out (within [min, max]) in case there are unscheduled pods.
  2. Scale in (within [min, max]) when there are underutilised nodes. In doing so, CA should not drain the node, as draining is solely MCM's responsibility. We have seen that CA's node-drain implementation does not properly evict pods with PVs.

MCM's responsibility:

  1. Continuously reconcile MachineDeployment, MachineSet and Machine objects towards the desired state. If a machine does not become healthy within 20 minutes (the current timeout), it should be MCM's job alone to launch another machine and stop/delete the old Failed machine.
  2. React to CA's requests to scale MachineDeployments up and down.

Each actor has other responsibilities as well; only those where the two overlap are listed here.
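One possible shape for this split is sketched below: CA only submits scale requests, and a single MCM-side owner is the only code path that writes the replica count. Everything here is hypothetical and for illustration only. `ScaleRequest`, `ReplicaOwner`, `Apply`, and `ErrReplacementInFlight` are made-up names, and the policy of rejecting scale-in while a failed machine is being replaced is an assumption, not something decided in this issue.

```go
package main

import (
	"errors"
	"fmt"
)

// ScaleRequest is a hypothetical message CA would send instead of writing
// the replica count itself. Positive Delta scales out, negative scales in.
type ScaleRequest struct {
	Delta int
}

// ErrReplacementInFlight is returned when a scale-in is rejected because a
// failed machine is still being replaced (an assumed policy).
var ErrReplacementInFlight = errors.New("scale-in rejected: machine replacement in flight")

// ReplicaOwner is a hypothetical MCM-side component and the only writer of
// the replica count.
type ReplicaOwner struct {
	Min, Max  int
	replicas  int
	replacing int // number of failed machines currently being replaced
}

// Apply validates a request from CA and updates the replica count, keeping
// the result within [Min, Max].
func (o *ReplicaOwner) Apply(req ScaleRequest) (int, error) {
	if req.Delta < 0 && o.replacing > 0 {
		return o.replicas, ErrReplacementInFlight
	}
	target := o.replicas + req.Delta
	if target < o.Min {
		target = o.Min
	}
	if target > o.Max {
		target = o.Max
	}
	o.replicas = target
	return o.replicas, nil
}

func main() {
	owner := &ReplicaOwner{Min: 0, Max: 3, replicas: 1, replacing: 1}

	// CA asks to scale in while MCM is still replacing a failed machine.
	if _, err := owner.Apply(ScaleRequest{Delta: -1}); err != nil {
		fmt.Println("request refused:", err)
	}

	// Once the replacement has finished, the same request goes through.
	owner.replacing = 0
	n, _ := owner.Apply(ScaleRequest{Delta: -1})
	fmt.Println("replicas after scale-in:", n)
}
```

With a single owner, the scenario from the context section cannot reduce the deployment to 0 while the replacement VM is still being launched, because the scale-in request is evaluated against MCM's own view of in-flight replacements.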

@unmarshall unmarshall added the kind/enhancement Enhancement, improvement, extension label Feb 28, 2023
@himanshu-kun himanshu-kun added priority/3 Priority (lower number equals higher priority) needs/planning Needs (more) planning with other MCM maintainers labels Mar 1, 2023
@gardener-robot gardener-robot added the lifecycle/stale Nobody worked on this for 6 months (will further age) label Nov 8, 2023
@gardener-robot gardener-robot added lifecycle/rotten Nobody worked on this for 12 months (final aging stage) and removed lifecycle/stale Nobody worked on this for 6 months (will further age) labels Jul 17, 2024