Cluster member nodes are removed after reboot #52

Open · thomasherrle opened this issue May 3, 2023 · 0 comments

thomasherrle commented May 3, 2023

vSphere Server Info

  • vSphere version: 7.0.3.00700

Rancher Server Setup

  • Rancher version: 2.7.3
  • Installation option (Docker install/Helm Chart): RKE2 1.25.9
  • Proxy/Cert Details: Valid Wildcard SSL Cert issued by Certum Certification Authority

Information about the Cluster

  • Kubernetes version: 1.25.9
  • Cluster Type (Local/Downstream): Local

Custom self hosted cluster running on Ubuntu 22.04.2 LTS nodes

Describe the bug
As soon as I install vSphere CPI (102.0.0+up1.4.2) on a new cluster and reboot any worker or control-plane node, the node is deleted and is unable to rejoin the cluster.
Corresponding rke2-server log:
May 03 15:03:54 testserver01 rke2[857]: time="2023-05-03T15:03:54Z" level=error msg="error syncing 'testagent03': handler node: Operation cannot be fulfilled on nodes \"testagent03\": StorageError: invalid object, Code: 4, Key: /registry/minions/testagent03, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: efd660aa-50e2-4f9a-8483-64f43cd2204e, UID in object meta: , requeuing"
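
The UID mismatch in this error (UID set in the precondition, empty UID in object meta) means the Node object was deleted from the API server while rke2 was still trying to update it, which is consistent with the cloud-controller-manager's node lifecycle controller removing nodes whose backing VM it cannot resolve. A minimal way to confirm this from a server node (a sketch; the grep pattern for the CPI pod is an assumption and may differ per deployment):

# kubectl get node testagent03 -o jsonpath='{.metadata.uid}{"\n"}{.spec.providerID}{"\n"}'
(prints the node's current UID and providerID; a NotFound error means the object is already gone)
# kubectl -n kube-system get pods | grep -i cloud-controller
(locate the vSphere cloud-controller-manager pod deployed by the chart)
# kubectl -n kube-system logs <cloud-controller-manager-pod> | grep -i testagent03
(look for deletion events referencing the rebooted node)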

After uninstalling vSphere CPI (102.0.0+up1.4.2), the node rejoins without issues.
Log:

May 03 15:07:28 testserver02 rke2[857]: time="2023-05-03T15:07:28Z" level=info msg="certificate CN=testagent03 signed by CN=rke2-server-ca@1683122526: notBefore=2023-05-03 14:02:06 +0000 UTC notAfter=2024-05-02 15:07:28 +0000 UTC"
May 03 15:07:28 testserver02 rke2[857]: time="2023-05-03T15:07:28Z" level=info msg="certificate CN=system:node:testagent03,O=system:nodes signed by CN=rke2-client-ca@1683122526: notBefore=2023-05-03 14:02:06 +0000 UTC notAfter=2024-05-02 15:07:28 +0000 UTC"
May 03 15:07:31 testserver02 rke2[857]: time="2023-05-03T15:07:31Z" level=info msg="Handling backend connection request [testagent03]"
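
Once the chart is removed, the rejoin can be confirmed with (trivial, but for completeness):

# kubectl get node testagent03
(the node reappears and reaches STATUS Ready once the agent reconnects)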

To Reproduce
Provision new rke2 cluster

# kubectl get nodes
NAME           STATUS   ROLES                       AGE   VERSION
testagent01    Ready    <none>                      72m   v1.25.9+rke2r1
testagent02    Ready    <none>                      72m   v1.25.9+rke2r1
testagent03    Ready    <none>                      10m   v1.25.9+rke2r1
testserver01   Ready    control-plane,etcd,master   74m   v1.25.9+rke2r1
testserver02   Ready    control-plane,etcd,master   72m   v1.25.9+rke2r1
testserver03   Ready    control-plane,etcd,master   73m   v1.25.9+rke2r1 

Install vSphere CPI (102.0.0+up1.4.2) via the Rancher UI with the "Define vSphere Tags" option (see settings in the screenshot below).
Reboot any cluster node.
Check the rke2-server logs. The commands below show how to observe the node being removed.
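
To watch the failure as it happens (run from any server node; a sketch, nothing chart-specific):

# kubectl get nodes -w
(the rebooted node drops out of the node list shortly after the reboot)
# journalctl -u rke2-server -f | grep testagent03
(follows the rke2-server journal for the sync errors quoted above)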

Result
The rebooted node is deleted from the cluster and cannot rejoin until vSphere CPI is uninstalled.

Expected Result
Nodes should not be removed when rebooting.

Screenshots
(screenshot: vSphere CPI chart settings in the Rancher UI)

Additional context
All nodes are provisioned with the vSphere parameter "disk.enableUUID=TRUE" set before installing rke2.
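
In case it helps others reproduce this, the parameter can be checked and set with govc (a sketch; it assumes govc is configured via GOVC_URL/GOVC_USERNAME/GOVC_PASSWORD and that VM names match the node names):

# govc vm.info -e testagent03 | grep -i disk.enableUUID
(-e includes the VM's ExtraConfig, where disk.enableUUID is stored)
# govc vm.change -vm testagent03 -e disk.enableUUID=TRUE
(sets the parameter; it typically only takes effect after the VM is powered off and back on)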
