Cluster member nodes are removed after reboot #52

Open · thomasherrle opened this issue May 3, 2023 · 0 comments

thomasherrle commented May 3, 2023

vSphere Server Info

  • vSphere version: 7.0.3.00700

Rancher Server Setup

  • Rancher version: 2.7.3
  • Installation option (Docker install/Helm Chart): RKE2 1.25.9
  • Proxy/Cert Details: Valid Wildcard SSL Cert issued by Certum Certification Authority

Information about the Cluster

  • Kubernetes version: 1.25.9
  • Cluster Type (Local/Downstream): Local

Custom self hosted cluster running on Ubuntu 22.04.2 LTS nodes

Describe the bug
As soon as I install vSphere CPI (102.0.0+up1.4.2) on a new cluster and reboot any worker or control-plane node, the node is deleted and is unable to rejoin the cluster.
Corresponding rke2-server log:
May 03 15:03:54 testserver01 rke2[857]: time="2023-05-03T15:03:54Z" level=error msg="error syncing 'testagent03': handler node: Operation cannot be fulfilled on nodes \"testagent03\": StorageError: invalid object, Code: 4, Key: /registry/minions/testagent03, ResourceVersion: 0, AdditionalErrorMsg: Precondition failed: UID in precondition: efd660aa-50e2-4f9a-8483-64f43cd2204e, UID in object meta: , requeuing"
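
The UID mismatch in this error (UID set in the precondition, empty UID in object meta) means the Node object was deleted from the API server while rke2 was still trying to update it, which is consistent with the cloud-controller-manager's node lifecycle controller removing nodes whose backing VM it cannot resolve. A minimal way to confirm this from a server node (a sketch; the grep pattern for the CPI pod is an assumption and may differ per deployment):

# kubectl get node testagent03 -o jsonpath='{.metadata.uid}{"\n"}{.spec.providerID}{"\n"}'
(prints the node's current UID and providerID; a NotFound error means the object is already gone)
# kubectl -n kube-system get pods | grep -i cloud-controller
(locate the vSphere cloud-controller-manager pod deployed by the chart)
# kubectl -n kube-system logs <cloud-controller-manager-pod> | grep -i testagent03
(look for deletion events referencing the rebooted node)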

After uninstalling vSphere CPI (102.0.0+up1.4.2), the node rejoins without issues.
Log:

May 03 15:07:28 testserver02 rke2[857]: time="2023-05-03T15:07:28Z" level=info msg="certificate CN=testagent03 signed by CN=rke2-server-ca@1683122526: notBefore=2023-05-03 14:02:06 +0000 UTC notAfter=2024-05-02 15:07:28 +0000 UTC"
May 03 15:07:28 testserver02 rke2[857]: time="2023-05-03T15:07:28Z" level=info msg="certificate CN=system:node:testagent03,O=system:nodes signed by CN=rke2-client-ca@1683122526: notBefore=2023-05-03 14:02:06 +0000 UTC notAfter=2024-05-02 15:07:28 +0000 UTC"
May 03 15:07:31 testserver02 rke2[857]: time="2023-05-03T15:07:31Z" level=info msg="Handling backend connection request [testagent03]"
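
Once the chart is removed, the rejoin can be confirmed with (trivial, but for completeness):

# kubectl get node testagent03
(the node reappears and reaches STATUS Ready once the agent reconnects)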

To Reproduce
Provision new rke2 cluster

# kubectl get nodes
NAME           STATUS   ROLES                       AGE   VERSION
testagent01    Ready    <none>                      72m   v1.25.9+rke2r1
testagent02    Ready    <none>                      72m   v1.25.9+rke2r1
testagent03    Ready    <none>                      10m   v1.25.9+rke2r1
testserver01   Ready    control-plane,etcd,master   74m   v1.25.9+rke2r1
testserver02   Ready    control-plane,etcd,master   72m   v1.25.9+rke2r1
testserver03   Ready    control-plane,etcd,master   73m   v1.25.9+rke2r1 

Install vSphere CPI (102.0.0+up1.4.2) via the Rancher UI with the "Define vSphere Tags" option (see settings in the screenshot below).
Reboot any cluster node.
Check the rke2-server logs. The commands below show how to observe the node being removed.
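
To watch the failure as it happens (run from any server node; a sketch, nothing chart-specific):

# kubectl get nodes -w
(the rebooted node drops out of the node list shortly after the reboot)
# journalctl -u rke2-server -f | grep testagent03
(follows the rke2-server journal for the sync errors quoted above)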

Result
The rebooted node is deleted from the cluster and cannot rejoin until vSphere CPI is uninstalled.

Expected Result
Nodes should not be removed when rebooting.

Screenshots
(screenshot: vSphere CPI chart settings in the Rancher UI)

Additional context
All nodes are provisioned with the vSphere parameter "disk.enableUUID=TRUE" set before installing rke2.
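
In case it helps others reproduce this, the parameter can be checked and set with govc (a sketch; it assumes govc is configured via GOVC_URL/GOVC_USERNAME/GOVC_PASSWORD and that VM names match the node names):

# govc vm.info -e testagent03 | grep -i disk.enableUUID
(-e includes the VM's ExtraConfig, where disk.enableUUID is stored)
# govc vm.change -vm testagent03 -e disk.enableUUID=TRUE
(sets the parameter; it typically only takes effect after the VM is powered off and back on)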
