RKE2 upgrade fails #193

Closed
richardcase opened this issue Nov 13, 2023 · 4 comments · Fixed by #188
Comments

@richardcase
Contributor

What happened:

Testing an RKE2 workload cluster upgrade (on CAPO infrastructure), starting from a cluster of 3 control-plane + 1 worker nodes at v1.25.14+rke2r1 and targeting 3 control-plane + 2 worker nodes at v1.26.9+rke2r1, results in a rolling node update in which the new control-plane node refers to the endpoint of an old control-plane node that the CAPI rolling update has already deleted. That endpoint is no longer reachable, which breaks the etcd leader election.

What did you expect to happen:

It is expected that the upgrade completes without failure.

How to reproduce it:

How to reproduce on CAPO

# environment-values/my-environment-values/values.yaml
---
units:
  workload-cluster:
    enabled: true
    helmrelease_spec:
      values:
        cluster:
          # k8s_version: "v1.25.14+rke2r1"       # before apply.sh
          k8s_version: "v1.26.9+rke2r1"       # after apply.sh
          capo:
            # image_name: "ubuntu-jammy-plain-rke2-1.25.14-0.1.0"       # before apply.sh
            image_name: "ubuntu-jammy-plain-rke2-1.26.9-0.0.12"       # after apply.sh
          control_plane_replicas: 3

          machine_deployments:
            md0:
              # replicas: 1       # used for v1.25.14, before apply.sh
              replicas: 2       # used for v1.26.9, after apply.sh
              capo:
                failure_domain: dev-az

and running:

root@~/sylva-core # ./apply.sh environment-values/my-environment-values
:
 ✓ Kustomization/default/flux-webui - Resource is ready: Flux Web UI can be reached at https://flux.sylva (flux.sylva must resolve to 192.168.128.193)
⢎⡰ HelmRelease/workload-cluster/calico - GetLastReleaseFailed - failed to get last release revision
⢎⡰ HelmRelease/workload-cluster/calico-crd - GetLastReleaseFailed - failed to get last release revision
⢎⡰ HelmRelease/workload-cluster/cinder-csi - GetLastReleaseFailed - failed to get last release revision
⢎⡰ HelmRelease/workload-cluster/ingress-nginx - GetLastReleaseFailed - failed to get last release revision
⢎⡰ HelmRelease/workload-cluster/metallb - GetLastReleaseFailed - failed to get last release revision
⢎⡰ HelmRelease/workload-cluster/monitoring - GetLastReleaseFailed - failed to get last release revision
⢎⡰ HelmRelease/workload-cluster/monitoring-crd - GetLastReleaseFailed - failed to get last release revision
⢎⡰ Kustomization/default/workload-cluster - Progressing - Reconciliation in progress
⢎⡰ Kustomization/workload-cluster/namespace-defs - ReconciliationFailed - Namespace/cattle-monitoring-system dry-run failed: failed to get API group resources: unable to retrieve the complete list of server APIs: v1: Get "https://192.168.129.147:6443/api/v1?timeout=30s": dial tcp 192.168.129.147:6443: connect: no route to host
 ✗ Command timeout exceeded
root@~/sylva-core#

Anything else you would like to add:

This is the original issue for the Sylva project; see that issue for additional information.

Environment:

  • rke provider version:
  • OS (e.g. from /etc/os-release):
@richardcase richardcase added the kind/bug, area/controlplane, and priority/critical-urgent labels and removed the needs-priority and needs-triage labels Nov 13, 2023
@richardcase richardcase moved this to CAPI Backlog in CAPI / Turtles Nov 13, 2023
@richardcase richardcase added this to the v0.2.1 milestone Nov 13, 2023
@zioc

zioc commented Nov 14, 2023

Here are some observations we've made while upgrading a cluster with 3 control-plane replicas from v1.25.14 to v1.26.9:

During the upgrade, we observed at some point (at 07:02:24) that there were only 2 machines remaining, which is already fewer than we would expect:

NAMESPACE     NAME                                     CLUSTER              NODENAME                                 PROVIDERID                                          PHASE     AGE     VERSION
cluster-one   cluster-one-control-plane-mnqgv          cluster-one          cluster-one-cp-13ad5aefdb-zq554          openstack:///2766ca44-7af4-4ca3-9ac7-e558f7266396   Running   4m32s   v1.26.9
cluster-one   cluster-one-control-plane-nqn9b          cluster-one          cluster-one-cp-13ad5aefdb-hklvf          openstack:///f73fb268-dae7-47a5-b5e7-7622b49e9a41   Running   7h52m   v1.25.14+rke2r1

Whereas 4 nodes were still visible through the cluster API:

NAME                              STATUS                        ROLES                       AGE     VERSION
cluster-one-cp-13ad5aefdb-2lftq   Ready,SchedulingDisabled      control-plane,etcd,master   7h53m   v1.25.14+rke2r1
cluster-one-cp-13ad5aefdb-hklvf   Ready,SchedulingDisabled      control-plane,etcd,master   7h50m   v1.25.14+rke2r1
cluster-one-cp-13ad5aefdb-w4frp   NotReady,SchedulingDisabled   control-plane,etcd,master   8h      v1.25.14+rke2r1
cluster-one-cp-13ad5aefdb-zq554   Ready                         control-plane,etcd,master   2m26s   v1.26.9+rke2r1

As it seems to rely only on the node count to decide whether the control plane should be scaled up or down (here), the RKE2 control plane provider deleted the last outdated VM, as we saw in the controller log:

I1114 07:02:18.181203       1 scale.go:192]  "msg"="Waiting for machines to be deleted" "Machines"="cluster-one-control-plane-nqn9b" "RKE2ControlPlane"={"name":"cluster-one-control-plane","namespace":"cluster-one"} "controller"="rke2controlplane" "controllerGroup"="controlplane.cluster.x-k8s.io" "controllerKind"="RKE2ControlPlane" "name"="cluster-one-control-plane" "namespace"="cluster-one" "reconcileID"="f9dd0a9c-a30d-40e5-8021-4aad8bf931b0"

And we indeed observed the machine being deleted right after that (at 07:02:30):

NAMESPACE     NAME                                     CLUSTER              NODENAME                                 PROVIDERID                                          PHASE      AGE     VERSION
cluster-one   cluster-one-control-plane-mnqgv          cluster-one          cluster-one-cp-13ad5aefdb-zq554          openstack:///2766ca44-7af4-4ca3-9ac7-e558f7266396   Running    4m38s   v1.26.9
cluster-one   cluster-one-control-plane-nqn9b          cluster-one          cluster-one-cp-13ad5aefdb-hklvf          openstack:///f73fb268-dae7-47a5-b5e7-7622b49e9a41   Deleting   7h52m   v1.25.14+rke2r1

At this stage, there was only a single remaining node, which broke the etcd quorum (and the cluster).

If we look at the kubeadm control plane provider, we can see that it relies on the machine count rather than on the nodes exposed by the API (here). Was it on purpose that node count was used instead of machines to manage the number of control-plane machines?
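
For illustration, here is a minimal sketch of a machine-count-based check, assuming the reconciler fields and helper quoted later in this thread (controlPlane.Machines, rcp.Spec.Replicas, r.scaleUpControlPlane); numMachines is a hypothetical local, and this is not the actual provider code:

// Sketch only: base the scale decision on CAPI Machines owned by the control
// plane rather than on Nodes reported by the workload cluster API.
numMachines := int32(controlPlane.Machines.Len()) // counts Machines even before they have a NodeRef
if numMachines <= *rcp.Spec.Replicas {
	// Keep creating replacement Machines until the desired replica count is
	// exceeded; only then consider deleting an outdated one. A Node-based count
	// can undercount here (a new Machine may not have registered its Node yet)
	// or overcount (a deleted Machine's Node can linger as NotReady), which is
	// what led to the premature scale-down described above.
	return r.scaleUpControlPlane(ctx, cluster, rcp, controlPlane)
}
// Otherwise it is safe to scale down an outdated Machine.

Basing the decision on Machines mirrors what the kubeadm control plane provider does and avoids acting on a Node list that is stale mid-rollout.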

@richardcase
Contributor Author

I have reproduced this issue with CAPD when upgrading from v1.25.11+rke2r1 to v1.26.4+rke2r1.

@richardcase
Contributor Author

I also tried @zioc's suggestion of using the machine count instead of the node count, and that looks a lot better. I will run some more tests to make sure.

I don't think there was a specific reason it used nodes.

@richardcase richardcase self-assigned this Nov 15, 2023
@richardcase
Contributor Author

Tested this multiple times:

if int32(controlPlane.Machines.Len()) <= *rcp.Spec.Replicas {
	// scaleUp ensures that we don't continue scaling up while waiting for Machines to have NodeRefs
	return r.scaleUpControlPlane(ctx, cluster, rcp, controlPlane)
}

And I don't run into the same issue, so thanks for this suggestion, @zioc. Looking at the code for the maxSurge=0 change (#188), we are using machines there, so this will be fixed with that. I won't do a separate PR for this, then.
