
Machine pool nodes are not rolled during update #2217

Closed · Tracked by #2737

calvix opened this issue Jan 18, 2023 · 18 comments
Labels: area/kaas (Mission: Cloud Native Platform - Self-driving Kubernetes as a Service), goal/capa-internal-ga, kind/bug, provider/cluster-api-aws (Cluster API based running on AWS), team/phoenix (Team Phoenix), topic/capi

Comments

calvix commented Jan 18, 2023

Issue

Machine pool nodes are not rolled when the KubeadmConfig is changed; the nodes then need to be rolled manually.

Details

I noticed this during the thunder update. There were many changes to both cluster-aws and other components; the nodes appeared to roll, but the proxy changes were not applied to them. Once I deleted the nodes manually, they came up with the proper configuration.

We should first confirm whether the scenario is reproducible; if it is, we need to find a solution or at least document the issue.

How to try to reproduce:

  • trigger a cluster update that includes a roll of worker nodes
  • once the update is in progress, make a configuration change that would trigger yet another roll of the worker nodes (see the sketch after this list)
  • observe whether the second change is applied to the workers or not
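
For reference, a minimal sketch of that second step, assuming access to the management cluster via controller-runtime; the namespace and KubeadmConfig name are placeholders, and any other spec change that ends up in the rendered user data would work just as well:

```go
// Hedged sketch: patch the machine pool's KubeadmConfig while the first
// rollout is still in progress, so that a second roll should be triggered.
package main

import (
	"context"

	"k8s.io/apimachinery/pkg/runtime"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	bootstrapv1 "sigs.k8s.io/cluster-api/bootstrap/kubeadm/api/v1beta1"
)

func main() {
	scheme := runtime.NewScheme()
	if err := bootstrapv1.AddToScheme(scheme); err != nil {
		panic(err)
	}

	// Client against the management cluster (kubeconfig from the usual locations).
	c, err := client.New(ctrl.GetConfigOrDie(), client.Options{Scheme: scheme})
	if err != nil {
		panic(err)
	}

	ctx := context.Background()
	var kc bootstrapv1.KubeadmConfig
	// "org-demo/demo-worker-pool" is a made-up example reference.
	if err := c.Get(ctx, client.ObjectKey{Namespace: "org-demo", Name: "demo-worker-pool"}, &kc); err != nil {
		panic(err)
	}

	// A harmless extra preKubeadmCommand is the easiest change to spot on the nodes.
	kc.Spec.PreKubeadmCommands = append(kc.Spec.PreKubeadmCommands, "echo second-change")

	if err := c.Update(ctx, &kc); err != nil {
		panic(err)
	}
}
```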

fiunchinho (Member) commented:

I think it's this bug: kubernetes-sigs/cluster-api-provider-aws#4071

alex-dabija changed the title from "pontentional unconfirmed bug - if the kubeadmconfig of machinepool changes during ongoing update it wont roll nodes again and you might need to manually roll nodes later" to "if the kubeadmconfig of machinepool changes during ongoing update it wont roll nodes again and you might need to manually roll nodes later" Mar 7, 2023
alex-dabija changed the title from "if the kubeadmconfig of machinepool changes during ongoing update it wont roll nodes again and you might need to manually roll nodes later" to "Machine pool nodes not rolled during update" Mar 27, 2023
alex-dabija changed the title from "Machine pool nodes not rolled during update" to "Machine pool nodes are not rolled during update" Mar 27, 2023
alex-dabija commented:

I've encountered a similar situation when other machine pool settings were updated. I ended up manually rolling the nodes from the AWS console.

alex-dabija transferred this issue from another repository Mar 27, 2023
alex-dabija added the area/kaas, team/hydra, topic/capi, provider/cluster-api-aws and kind/bug labels Mar 27, 2023
primeroz commented:

FYI, CAPZ has a very similar issue with MachinePool rolling updates (https://github.com/giantswarm/giantswarm/issues/25188), which is why we reverted to MachineDeployments.

AndiDog self-assigned this Apr 13, 2023
AndiDog commented Apr 13, 2023

Here's why this happens: upstream comment. I'll try to figure out implementation ideas for the fix.

AndiDog commented May 2, 2023

In kubernetes-sigs/cluster-api-provider-aws#4071 (comment), I concluded that cluster-api changes are also needed to speed up the time until nodes get refreshed.

Therefore I started making our fork https://github.com/giantswarm/cluster-api ready to work with. First change to go upstream: kubernetes-sigs/cluster-api#8586, so that we can build only the production components (the CAPI controller images) and avoid building components we don't need (clusterctl, test tooling). Once that is merged, our fork should not differ from upstream apart from our CircleCI config.

AndiDog commented May 4, 2023

First part of the fix in CAPA: kubernetes-sigs/cluster-api-provider-aws#4245

alex-dabija added the team/phoenix label and removed the team/hydra label May 15, 2023
T-Kukawka (Contributor) commented:

@AndiDog, do you know if there was any decision made?

AndiDog commented Jun 14, 2023

The next step is to discuss the issue in the CAPI office hours, since in the CAPA meeting we noted that the problem is generic across providers and is best addressed by adapting CAPI.

AndiDog commented Jun 15, 2023

Continuing the discussion with the CAPI maintainers in the new issue kubernetes-sigs/cluster-api#8858 and in a Slack thread.

AndiDog commented Aug 10, 2023

I'm writing down a proposed solution which requires a contract between CAPI and infra providers (e.g. CAPA) so they can get a checksum of the user data without the bootstrap token – in other words, a checksum of KubeadmConfig.spec, since we want to roll nodes on changes to that. That should then trigger further discussion with the upstream maintainers, since it's more actionable. The proposal will go into kubernetes-sigs/cluster-api#8858 very soon, so I won't repeat it here.

In the meantime, I want to try a workaround: CAPA always rolls out a new launch template if AWSMachinePool.spec.additionalTags changes (see if needsUpdate || tagsChanged || ... in the code). If we put a checksum of KubeadmConfig.spec into such a tag, we would therefore force the nodes to roll. The timing issue in CAPI still remains: the user data (i.e. the bootstrap data secret) only gets updated every 5-15 minutes, because CAPI only updates it when the bootstrap token TTL is about to expire. That is, however, much easier to fix and needs no further discussion, since we've already agreed in the upstream meetings that it's a clear bug. I have a half-finished change that was already working for me.
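
For reference, a minimal sketch of that checksum-as-tag idea, assuming the hash is computed by our own tooling and written into AWSMachinePool.spec.additionalTags; the tag key and function name below are made up for illustration and are not an upstream convention:

```go
// Hedged sketch: compute a stable checksum of KubeadmConfig.spec and store it
// as a tag value so that CAPA's tagsChanged check forces a launch template
// rollout whenever the bootstrap configuration changes.
package main

import (
	"crypto/sha256"
	"encoding/json"
	"fmt"
)

// kubeadmConfigSpecHash returns a short, deterministic hash of any
// serializable bootstrap config spec. In a controller, the real
// KubeadmConfig.Spec struct would be passed in; here it is treated opaquely.
func kubeadmConfigSpecHash(spec any) (string, error) {
	raw, err := json.Marshal(spec)
	if err != nil {
		return "", err
	}
	sum := sha256.Sum256(raw)
	return fmt.Sprintf("%x", sum[:8]), nil // truncated so it stays readable as a tag value
}

func main() {
	// Stand-in for KubeadmConfig.spec; in practice this would be fetched from
	// the management cluster.
	spec := map[string]any{
		"preKubeadmCommands": []string{"echo hello"},
	}

	hash, err := kubeadmConfigSpecHash(spec)
	if err != nil {
		panic(err)
	}

	// Hypothetical tag key; the value changing is what matters, since CAPA
	// rolls out a new launch template on any additionalTags change.
	additionalTags := map[string]string{
		"giantswarm.io/kubeadmconfig-spec-hash": hash,
	}
	fmt.Println(additionalTags)
}
```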

AndiDog commented Aug 14, 2023

The proposed workaround (adding a hash of the machine pool's KubeadmConfig.spec as a tag value in AWSMachinePool.spec.additionalTags) forces a node rollout just fine. However, because CAPI does not immediately update the bootstrap secret (kubernetes-sigs/cluster-api#8858), the new nodes start with an old version of the launch template. Only the next rollout, if it happens 10 minutes later, would use the new version, since CAPI has updated the secret by then. So my workaround won't help until that is fixed, and I'll concentrate on that first. Even then, I suspect a race condition could occur: what if CAPA's AWSMachinePool controller triggers a node rollout first and CAPI takes a few seconds longer to update the bootstrap secret? Well, we'll give it a try.

AndiDog commented Aug 21, 2023

Proposed solution in the CAPI issue: kubernetes-sigs/cluster-api#8858 (comment)

AndiDog commented Sep 27, 2023

My first proposal would lead to quite a few contract changes between CAPI and bootstrap providers, so I'm not sure it's feasible or desired.

Therefore, to get a quicker turnaround, I tried the "swap the MachinePool -> KubeadmConfig object reference" workaround. Unfortunately, it doesn't work and creates miserable problems. But it's a use case that can seemingly be fixed more easily, so there's a chance that I can move forward with the maintainers to patch CAPI and CAPA. Details in kubernetes-sigs/cluster-api#8858 (comment).
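
For readers unfamiliar with that workaround, a hedged guess at what the reference swap could look like: clone the KubeadmConfig under a new name and re-point the MachinePool's bootstrap configRef at the clone so that CAPI regenerates the bootstrap secret. The object names are placeholders and this only illustrates the general idea, not the exact change that was tried:

```go
// Hedged illustration of "swapping the MachinePool -> KubeadmConfig object reference".
package main

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/runtime"
	ctrl "sigs.k8s.io/controller-runtime"
	"sigs.k8s.io/controller-runtime/pkg/client"

	bootstrapv1 "sigs.k8s.io/cluster-api/bootstrap/kubeadm/api/v1beta1"
	expv1 "sigs.k8s.io/cluster-api/exp/api/v1beta1"
)

func main() {
	scheme := runtime.NewScheme()
	_ = bootstrapv1.AddToScheme(scheme)
	_ = expv1.AddToScheme(scheme)

	c, err := client.New(ctrl.GetConfigOrDie(), client.Options{Scheme: scheme})
	if err != nil {
		panic(err)
	}
	ctx := context.Background()

	// Fetch the machine pool and its current bootstrap config (placeholder names).
	var mp expv1.MachinePool
	if err := c.Get(ctx, client.ObjectKey{Namespace: "org-demo", Name: "demo-pool"}, &mp); err != nil {
		panic(err)
	}
	var kc bootstrapv1.KubeadmConfig
	if err := c.Get(ctx, client.ObjectKey{Namespace: "org-demo", Name: "demo-pool-1"}, &kc); err != nil {
		panic(err)
	}

	// Clone the bootstrap config under a new name (fresh metadata so Create works).
	clone := kc.DeepCopy()
	clone.ObjectMeta = metav1.ObjectMeta{Namespace: kc.Namespace, Name: "demo-pool-2"}
	if err := c.Create(ctx, clone); err != nil {
		panic(err)
	}

	// Re-point the MachinePool at the clone.
	if mp.Spec.Template.Spec.Bootstrap.ConfigRef == nil {
		panic("machine pool has no bootstrap configRef")
	}
	mp.Spec.Template.Spec.Bootstrap.ConfigRef.Name = clone.Name
	if err := c.Update(ctx, &mp); err != nil {
		panic(err)
	}
}
```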

JosephSalisbury (Contributor) commented:

From the WG CAPA migration sync: this is critical for the adidas PoC, as we need to do manual work to upgrade the cluster.

AndiDog commented Oct 11, 2023

I have a meeting with upstream this Friday to see how we can proceed. None of the workarounds has worked so far because of CAPI/CAPA bugs or shortcomings.

AndiDog commented Nov 7, 2023

Working and tested solution in kubernetes-sigs/cluster-api-provider-aws#4619 (combined with a newer CAPI version), so I'm moving this to blocked until we make progress on the upstream PRs.

AndiDog commented Jan 2, 2024

We're almost done here. The CAPI/CAPA forks and cluster-aws have the feature. We're waiting for the upstream PR (targeting CAPA v2.4.0, because it's a new feature), kubernetes-sigs/cluster-api-provider-aws#4619, to be merged before closing this issue. There's also still a small docs PR open: giantswarm/docs#2028.

AndiDog commented Feb 5, 2024

The upstream PR is merged now.

AndiDog closed this as completed Feb 5, 2024