RKE2 upgrade fails #193

Closed
richardcase opened this issue Nov 13, 2023 · 4 comments · Fixed by #188
Comments

@richardcase
Contributor

What happened:

Testing an RKE2 workload cluster upgrade (on CAPO infrastructure), starting from a cluster of 3 control-plane + 1 worker nodes at v1.25.14+rke2r1 and targeting 3 control-plane + 2 worker nodes at v1.26.9+rke2r1, results in a rolling node update in which the new control-plane node refers to the endpoint of an old control-plane node that the CAPI rolling update has already deleted. That endpoint is no longer reachable, which breaks the etcd leader election.

What did you expect to happen:

It is expected that the upgrade completes without failure.

How to reproduce it:

How to reproduce on CAPO

# environment-values/my-environment-values/values.yaml
---
units:
  workload-cluster:
    enabled: true
    helmrelease_spec:
      values:
        cluster:
          # k8s_version: "v1.25.14+rke2r1"       # before apply.sh
          k8s_version: "v1.26.9+rke2r1"       # after apply.sh
          capo:
            # image_name: "ubuntu-jammy-plain-rke2-1.25.14-0.1.0"       # before apply.sh
            image_name: "ubuntu-jammy-plain-rke2-1.26.9-0.0.12"       # after apply.sh
          control_plane_replicas: 3

          machine_deployments:
            md0:
              # replicas: 1       # used for v1.25.14, before apply.sh
              replicas: 2       # used for v1.26.9, after apply.sh
              capo:
                failure_domain: dev-az

and running:

root@~/sylva-core # ./apply.sh environment-values/my-environment-values
:
 ✓ Kustomization/default/flux-webui - Resource is ready: Flux Web UI can be reached at https://flux.sylva (flux.sylva must resolve to 192.168.128.193)
⢎⡰ HelmRelease/workload-cluster/calico - GetLastReleaseFailed - failed to get last release revision
⢎⡰ HelmRelease/workload-cluster/calico-crd - GetLastReleaseFailed - failed to get last release revision
⢎⡰ HelmRelease/workload-cluster/cinder-csi - GetLastReleaseFailed - failed to get last release revision
⢎⡰ HelmRelease/workload-cluster/ingress-nginx - GetLastReleaseFailed - failed to get last release revision
⢎⡰ HelmRelease/workload-cluster/metallb - GetLastReleaseFailed - failed to get last release revision
⢎⡰ HelmRelease/workload-cluster/monitoring - GetLastReleaseFailed - failed to get last release revision
⢎⡰ HelmRelease/workload-cluster/monitoring-crd - GetLastReleaseFailed - failed to get last release revision
⢎⡰ Kustomization/default/workload-cluster - Progressing - Reconciliation in progress
⢎⡰ Kustomization/workload-cluster/namespace-defs - ReconciliationFailed - Namespace/cattle-monitoring-system dry-run failed: failed to get API group resources: unable to retrieve the complete list of server APIs: v1: Get "https://192.168.129.147:6443/api/v1?timeout=30s": dial tcp 192.168.129.147:6443: connect: no route to host
 ✗ Command timeout exceeded
root@~/sylva-core#

Anything else you would like to add:

This is the original issue for the Sylva project; see that issue for additional information.

Environment:

  • rke provider version:
  • OS (e.g. from /etc/os-release):
@richardcase richardcase added the kind/bug, area/controlplane, and priority/critical-urgent labels and removed the needs-priority and needs-triage labels Nov 13, 2023
@richardcase richardcase moved this to CAPI Backlog in CAPI / Turtles Nov 13, 2023
@richardcase richardcase added this to the v0.2.1 milestone Nov 13, 2023
@zioc

zioc commented Nov 14, 2023

Here are some observations we've made while upgrading a cluster with 3 control-plane replicas from v1.25.14 to v1.26.9:

During the upgrade, we observed at some point (at 07:02:24) that there were only 2 machines remaining, which is already fewer than we would expect:

NAMESPACE     NAME                                     CLUSTER              NODENAME                                 PROVIDERID                                          PHASE     AGE     VERSION
cluster-one   cluster-one-control-plane-mnqgv          cluster-one          cluster-one-cp-13ad5aefdb-zq554          openstack:///2766ca44-7af4-4ca3-9ac7-e558f7266396   Running   4m32s   v1.26.9
cluster-one   cluster-one-control-plane-nqn9b          cluster-one          cluster-one-cp-13ad5aefdb-hklvf          openstack:///f73fb268-dae7-47a5-b5e7-7622b49e9a41   Running   7h52m   v1.25.14+rke2r1

Whereas 4 nodes were still visible through the cluster API:

NAME                              STATUS                        ROLES                       AGE     VERSION
cluster-one-cp-13ad5aefdb-2lftq   Ready,SchedulingDisabled      control-plane,etcd,master   7h53m   v1.25.14+rke2r1
cluster-one-cp-13ad5aefdb-hklvf   Ready,SchedulingDisabled      control-plane,etcd,master   7h50m   v1.25.14+rke2r1
cluster-one-cp-13ad5aefdb-w4frp   NotReady,SchedulingDisabled   control-plane,etcd,master   8h      v1.25.14+rke2r1
cluster-one-cp-13ad5aefdb-zq554   Ready                         control-plane,etcd,master   2m26s   v1.26.9+rke2r1

As it seems to rely only on the node count to decide whether the control plane should be scaled up or down (here), the RKE2 control plane provider deleted the last outdated VM, as we saw in the controller log:

I1114 07:02:18.181203       1 scale.go:192]  "msg"="Waiting for machines to be deleted" "Machines"="cluster-one-control-plane-nqn9b" "RKE2ControlPlane"={"name":"cluster-one-control-plane","namespace":"cluster-one"} "controller"="rke2controlplane" "controllerGroup"="controlplane.cluster.x-k8s.io" "controllerKind"="RKE2ControlPlane" "name"="cluster-one-control-plane" "namespace"="cluster-one" "reconcileID"="f9dd0a9c-a30d-40e5-8021-4aad8bf931b0"

And we indeed observed the machine being deleted right after that (at 07:02:30):

NAMESPACE     NAME                                     CLUSTER              NODENAME                                 PROVIDERID                                          PHASE      AGE     VERSION
cluster-one   cluster-one-control-plane-mnqgv          cluster-one          cluster-one-cp-13ad5aefdb-zq554          openstack:///2766ca44-7af4-4ca3-9ac7-e558f7266396   Running    4m38s   v1.26.9
cluster-one   cluster-one-control-plane-nqn9b          cluster-one          cluster-one-cp-13ad5aefdb-hklvf          openstack:///f73fb268-dae7-47a5-b5e7-7622b49e9a41   Deleting   7h52m   v1.25.14+rke2r1

At this stage, there was only a single remaining node, which broke the etcd quorum (and the cluster).

If we look at the kubeadm control plane provider, we can see that it relies on the machine count rather than on the nodes exposed by the API (here). Was it on purpose that node count was used instead of machines to manage the number of control-plane machines?
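
For illustration, here is a minimal sketch of a machine-count-based check, assuming the reconciler fields and helper quoted later in this thread (controlPlane.Machines, rcp.Spec.Replicas, r.scaleUpControlPlane); numMachines is a hypothetical local, and this is not the actual provider code:

// Sketch only: base the scale decision on CAPI Machines owned by the control
// plane rather than on Nodes reported by the workload cluster API.
numMachines := int32(controlPlane.Machines.Len()) // counts Machines even before they have a NodeRef
if numMachines <= *rcp.Spec.Replicas {
	// Keep creating replacement Machines until the desired replica count is
	// exceeded; only then consider deleting an outdated one. A Node-based count
	// can undercount here (a new Machine may not have registered its Node yet)
	// or overcount (a deleted Machine's Node can linger as NotReady), which is
	// what led to the premature scale-down described above.
	return r.scaleUpControlPlane(ctx, cluster, rcp, controlPlane)
}
// Otherwise it is safe to scale down an outdated Machine.

Basing the decision on Machines mirrors what the kubeadm control plane provider does and avoids acting on a Node list that is stale mid-rollout.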

@richardcase
Contributor Author

I have reproduced this issue with CAPD when upgrading from v1.25.11+rke2r1 to v1.26.4+rke2r1.

@richardcase
Contributor Author

I also tried @zioc's suggestion of using the machine count instead of the node count, and that looks a lot better. I will run some more tests to make sure.

I don't think there was a specific reason it used nodes.

@richardcase richardcase self-assigned this Nov 15, 2023
@richardcase
Contributor Author

Tested this multiple times:

if int32(controlPlane.Machines.Len()) <= *rcp.Spec.Replicas {
	// scaleUp ensures that we don't continue scaling up while waiting for Machines to have NodeRefs
	return r.scaleUpControlPlane(ctx, cluster, rcp, controlPlane)
}

And I don't run into the same issue, so thanks for this suggestion, @zioc. Looking at the code for the maxSurge=0 change (#188), we are using machines there, so this will be fixed with that. I won't do a separate PR for this, then.
