WorkerGroup update behavior in ReplicaSet pattern #9952

Open
epDHowwD opened this issue Jun 10, 2024 · 3 comments
Labels
  • area/auto-scaling: Auto-scaling (CA/HPA/VPA/HVPA, predominantly control plane, but also otherwise) related
  • area/control-plane: Control plane related
  • area/scalability: Scalability related
  • area/usability: Usability related
  • lifecycle/stale: Denotes an issue or PR has remained open with no activity and has become stale.

Comments

@epDHowwD

epDHowwD commented Jun 10, 2024

/area control-plane
/area auto-scaling
/area scalability
/area usability

What would you like to be added:

We would like to suggest an optional mode for workerGroup updates in which updates are not applied immediately (rolling update, like a Deployment). Instead, the new configuration would be stored next to the current generation and initially scaled to 0, while the older generation stays available at its current scale (ReplicaSet update behavior).

With an optional taint it should be possible to prevent new workload from being scheduled on older nodes; otherwise, existing nodes could simply continue to be used. However, every new node should only use the newest configuration generation, and node groups offering older configurations might be excluded from scale-up by the cluster-autoscaler.
It should also be possible to leave the older nodes untainted and uncordoned, so that workload can float freely across the nodes as managed by the services layer.

From the application side, the older nodes need to be removed from the system eventually. This could be handled by application infrastructure and explicitly left open from the Gardener side. We think that would be perfectly fine, since the user disables the rolling behavior explicitly and can enable it again at any time. The user can remove nodes from the system with kubectl drain, for example, or simply turn the rolling update behavior back on.
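
To make the request more concrete, here is a purely illustrative sketch of how such an opt-in could surface in the Shoot worker pool spec. The updateStrategy field is hypothetical and does not exist in the Gardener API; machine type and versions are only example values:

```yaml
# Hypothetical sketch only: "updateStrategy: ReplicaSet" is NOT an existing Gardener field.
spec:
  provider:
    workers:
    - name: pool-a
      machine:
        type: m5.xlarge                # example machine type
        image:
          name: gardenlinux
          version: "1443.3.0"          # example OS version
      kubernetes:
        version: "1.29.4"              # example per-pool kubelet version
      minimum: 3
      maximum: 30
      updateStrategy: ReplicaSet       # hypothetical: on update, keep the old generation at
                                       # its current scale and create the new one scaled to 0
```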

Why is this needed:

We have to manage and regularly update several configuration dimensions: kubelet version, OS version, and worker configuration at the IaaS level (like IMDSv2). Every update is applied immediately. That means that when the configuration is changed outside of a maintenance window, the system might encounter unwanted, customer-visible unavailability.

To manage updates today, we have to create a new workerGroup for every update and host it manually next to the current configuration (blue/green). This either limits us to 2 versions, or we have to create as many workerGroups as versions we want to keep in the system. This is then multiplied by the rather static configuration dimensions of the workerGroups.
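
For illustration, the workaround roughly looks like this in the Shoot spec today (existing worker pool fields; pool names and versions are example values):

```yaml
spec:
  provider:
    workers:
    - name: pool-blue                  # old generation, kept until drained
      machine:
        type: m5.xlarge
        image: {name: gardenlinux, version: "1312.3.0"}   # example old OS version
      kubernetes:
        version: "1.28.9"              # example old kubelet version
      minimum: 0
      maximum: 30
    - name: pool-green                 # new generation, hosted manually next to the old one
      machine:
        type: m5.xlarge
        image: {name: gardenlinux, version: "1443.3.0"}   # example new OS version
      kubernetes:
        version: "1.29.4"
      minimum: 0
      maximum: 30
```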

For the static configuration dimensions (worker type, worker size, etc.) we maintain 30+ variants, as we cannot forecast which combinations will actually be used at runtime. All those variants would be offered as node groups to the cluster-autoscaler, most of them scaled to 0 and only available as fallback measures or for corner cases. Multiplied by the number of versions (e.g. 30+ variants kept across 3 generations already means 90+ node groups), this would easily overwhelm the cluster-autoscaler.

The system would scale better and would be easier to use if the ReplicaSet behavior described above were applied. We would just maintain the 30+ static configurations, and every version update of kubelet or the OS would create a new generation. Existing workers keep their respective configuration generation, so the customer would not see downtime, and we could apply the change at any time. The majority of static workerGroup configurations are fallbacks and might be at scale 0 at the moment of the update; those older configurations can be removed from the system right away, which limits the number of options the cluster-autoscaler has to consider.

With an optional taint on the older workerGroup configurations, we can enforce the current blue/green pattern and move any restarted workload directly to the new configuration to push the update forward. This mode would not be active all the time, as it can create holes in the utilization.
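
Such a taint on the older pool could, for example, use the existing per-pool taints field (key and value are just example choices):

```yaml
    - name: pool-blue
      taints:
      - key: pool.example.org/outdated   # example taint key
        value: "true"
        effect: NoSchedule               # new pods are no longer scheduled onto old-generation nodes
```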

To tackle security updates for only specific versions, e.g. the oldest one or one in the middle that is affected by a problem, the application layer could have its own logic to drain the affected nodes from the system.

Figure 1 "Rolling Update Deployment for WorkerGroups"

```mermaid
sequenceDiagram
    actor deploymentUpdater
    deploymentUpdater ->> Shoot: update worker.kubernetes.version
    activate Shoot
    participant MachineSet gen1
    participant node gen1
    create participant MachineSet gen2
    Shoot ->> MachineSet gen2: create
    create participant node gen2
    MachineSet gen2 ->> node gen2 : create
    destroy node gen1
    MachineSet gen1 ->> node gen1 : delete
    destroy MachineSet gen1
    MachineSet gen1 -->> Shoot : finished update
    Shoot -->> deploymentUpdater : shoot healthy
    deactivate Shoot
```

Figure 2 "ReplicaSet update for WorkerGroups"

```mermaid
sequenceDiagram
    actor deploymentUpdater
    deploymentUpdater ->> Shoot: update worker.kubernetes.version
    activate Shoot
    participant MachineSet gen1
    participant node gen1
    create participant MachineSet gen2
    Shoot ->> MachineSet gen2: create
    create participant node gen2
    MachineSet gen2 ->> node gen2 : create
    MachineSet gen1 -->> Shoot : finished update
    Shoot -->> deploymentUpdater : shoot healthy
    deactivate Shoot

    actor updateService
    destroy node gen1
    updateService ->> node gen1 : drain
    destroy MachineSet gen1
    Shoot ->> MachineSet gen1 : delete empty outdated
```
@gardener-prow gardener-prow bot added the area/control-plane, area/auto-scaling, area/scalability, and area/usability labels on Jun 10, 2024
@vlerenc
Member

vlerenc commented Jun 10, 2024

Prerequisite: Ability to specify many more worker pools than is possible today, reflected by #9545 -> #9722, and notably #9545 (comment).

We could use the time to implement the following:

  • With the prerequisite hitting the landscapes and assuming no other showstopper is found, you will then be able to specify more/all worker pools you need, keep the old undesired ones and have the new desired ones (with the latest K8s version, OS type or version).
  • Now, you don’t want the old undesired worker pools to grow, so we would provide you with a way to disable CA scale-out per worker pool (while scale-in could still happen). This way, you could taint the old undesired worker pools and let them shrink down over time. They would not be eligible for scale-out, so CA would be able to scale them in whether or not there is unscheduled workload (see the sketch after this list).
  • As mentioned in the call, if you go with that many worker pools, you may have to pay a lot for the machines (more fragmentation is ultimately one side effect) or not get them (out of capacity), but you are always free to proactively migrate workload if you see the need to.
  • Once a worker pool has no nodes anymore, you can drop it from the spec, whenever you want, e.g. in between your maintenance windows or when you do the next maintenance.
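
A rough sketch of what such a per-pool scale-out switch could look like; the clusterAutoscaler setting shown here is purely hypothetical and does not exist at the time of writing, it only illustrates the idea:

```yaml
    - name: pool-blue
      minimum: 0
      maximum: 30
      clusterAutoscaler:
        scaleOutDisabled: true   # hypothetical setting: the pool remains eligible for scale-in,
                                 # but is never offered to the cluster-autoscaler for scale-out
```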

Would that help you?

@epDHowwD
Author

We discussed different options internally and agreed on a way forward: we will implement means to generate and manage the workerGroups in our deployment infrastructure in a way that fits your proposal 👍.

> Now, you don’t want the old undesired worker pools to grow, so we would provide you with a way to disable CA scale-out per worker pool (while scale-in could still happen). This way, you could taint the old undesired worker pools and let them shrink down over time. They would not be eligible for scale-out, so CA would be able to scale them in whether or not there is unscheduled workload.

Agreed. Could you share an issue link for this that we can reference from our development activities, please? It is certainly a requirement for the solution.

@gardener-ci-robot
Contributor

The Gardener project currently lacks enough active contributors to adequately respond to all issues.
This bot triages issues according to the following rules:

  • After 90d of inactivity, lifecycle/stale is applied
  • After 30d of inactivity since lifecycle/stale was applied, lifecycle/rotten is applied
  • After 30d of inactivity since lifecycle/rotten was applied, the issue is closed

You can:

  • Mark this issue as fresh with /remove-lifecycle stale
  • Mark this issue as rotten with /lifecycle rotten
  • Close this issue with /close

/lifecycle stale

@gardener-prow gardener-prow bot added the lifecycle/stale label on Sep 10, 2024