RKE2 upgrade fails #193
Here are some observations we made while upgrading a cluster with 3 control-plane replicas from v1.25.14 to v1.26.9. During the upgrade, we observed at some point that there were only 2 machines remaining (at 07:02:24), which is already lower than what we would expect:
Whereas we could still see 4 nodes through the workload cluster's API:
As the provider seems to rely only on the node count to decide whether the control plane should be scaled up or down here, the RKE2 control-plane provider deleted the last outdated VM, as we saw in the controller log:
And we indeed observed the machine being deleted right after that (at 07:02:30):
At this stage there was only a single remaining node, which broke the etcd quorum (and the cluster). If we look at the kubeadm control-plane provider, we can see that it relies on the machine count instead of the nodes exposed by the API here. Was it intentional to use the node count instead of the machine count to manage the number of control-plane machines?
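For illustration, here is a minimal sketch of the difference between the two approaches; the `controlPlaneState` type and both helper functions are hypothetical simplifications for this example, not the provider's actual code (only `Machines.Len()` and `rcp.Spec.Replicas` come from the real snippet shown later in this thread):

```go
package main

import "fmt"

// controlPlaneState is a simplified, hypothetical stand-in for the
// provider's internal control-plane view; only the counts matter here.
type controlPlaneState struct {
	Nodes    int32 // nodes currently registered in the workload cluster's API
	Machines int32 // CAPI Machine objects, which exist before their node joins
}

// needsScaleUpByNodes mirrors the problematic pattern: machines whose nodes
// have not joined yet are invisible, so the controller believes the control
// plane is sufficiently provisioned and keeps deleting outdated machines.
func needsScaleUpByNodes(cp controlPlaneState, replicas int32) bool {
	return cp.Nodes <= replicas
}

// needsScaleUpByMachines mirrors the suggested fix: Machine objects also
// count replicas that are still provisioning.
func needsScaleUpByMachines(cp controlPlaneState, replicas int32) bool {
	return cp.Machines <= replicas
}

func main() {
	// Mid-rollout state from the report: 2 machines left, 4 nodes still
	// visible, desired replicas 3.
	cp := controlPlaneState{Nodes: 4, Machines: 2}
	fmt.Println("scale up (node count):", needsScaleUpByNodes(cp, 3))       // false -> deletes the last outdated VM
	fmt.Println("scale up (machine count):", needsScaleUpByMachines(cp, 3)) // true  -> provisions a replacement first
}
```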
I have re-created this issue with CAPD when upgrading v1.25.11+rke2r1 to v1.26.4+rke2r1.
I also tried @zioc's suggestion of using the machine count instead of the node count, and that looks a lot better. I will run some more tests to make sure. I don't think there was a specific reason it used nodes.
Tested this multiple times with:

```go
if int32(controlPlane.Machines.Len()) <= *rcp.Spec.Replicas {
	// scaleUp ensures that we don't continue scaling up while waiting for Machines to have NodeRefs
	return r.scaleUpControlPlane(ctx, cluster, rcp, controlPlane)
}
```

and I don't run into the same issue, so thanks for this suggestion @zioc. Looking at the code for the maxsurge=0 change (#188), we are already using machines there, so this will be fixed by that change. I won't do a separate PR for this then.
What happened:
Testing an RKE2 (on CAPO infrastructure) workload cluster upgrade (starting from a cluster of 3 CP + 1 worker nodes on v1.25.14+rke2r1, targeting 3 CP + 2 worker nodes on v1.26.9+rke2r1) results in a node rolling update where a new-version CP node refers to the endpoint of an old CP node that was already deleted by the CAPI rolling update; that endpoint is no longer reachable, which breaks the etcd leader election.
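Not part of the original report, but as a rough diagnostic sketch: one way to confirm the broken membership after such a failed rollout is to list the members the surviving etcd still expects, using the upstream etcd Go client. The endpoint below is a placeholder, and a real RKE2 cluster additionally requires the etcd client certificates to be set in the config's TLS field.

```go
package main

import (
	"context"
	"fmt"
	"log"
	"time"

	clientv3 "go.etcd.io/etcd/client/v3"
)

func main() {
	// Placeholder endpoint: point this at the surviving control-plane node's
	// etcd client port. A real RKE2 setup also needs TLS client certificates
	// (clientv3.Config has a TLS field for that).
	cli, err := clientv3.New(clientv3.Config{
		Endpoints:   []string{"https://<surviving-cp-node>:2379"},
		DialTimeout: 5 * time.Second,
	})
	if err != nil {
		log.Fatal(err)
	}
	defer cli.Close()

	ctx, cancel := context.WithTimeout(context.Background(), 5*time.Second)
	defer cancel()

	// MemberList shows which peers this etcd still expects; members whose
	// machines were deleted stay in the list but are unreachable, which is
	// what breaks leader election and quorum.
	resp, err := cli.MemberList(ctx)
	if err != nil {
		log.Fatal(err)
	}
	for _, m := range resp.Members {
		fmt.Printf("member %s peers=%v clients=%v\n", m.Name, m.PeerURLs, m.ClientURLs)
	}
}
```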
What did you expect to happen:
The upgrade is expected to complete without failure.
How to reproduce it:
How to reproduce on CAPO
and running:
Anything else you would like to add:
This is the original issue from the Sylva project; see it for additional information.
Environment:
OS (e.g. from /etc/os-release):