
[BUG] Multiple server nodes pre-drains in an RKE2 upgrade #39167

Closed
bk201 opened this issue Sep 29, 2022 · 16 comments
Assignees
Labels
area/harvester · area/provisioning-v2 (Provisioning issues that are specific to the provisioningv2 generating framework) · internal · kind/bug (Issues that are defects reported by users or that we know have reached a real release) · release-note (Note this issue in the milestone's release notes) · team/hostbusters (The team that is responsible for provisioning/managing downstream clusters + K8s version support)
Milestone

Comments

@bk201
Member

bk201 commented Sep 29, 2022

Rancher Server Setup

  • Rancher version: v2.6.9-rc2
  • Installation option (Docker install/Helm Chart):
    • If Helm Chart, Kubernetes Cluster and version (RKE1, RKE2, k3s, EKS, etc): v1.22.12+rke2r1 (Upgrade to v1.23.12+rke2r1)
  • Proxy/Cert Details:

Information about the Cluster

  • Kubernetes version: v1.22.12+rke2r1 (Upgrade to v1.23.12+rke2r1)
  • Cluster Type (Local/Downstream): local
    • If downstream, what type of cluster? (Custom/Imported or specify provider for Hosted/Infrastructure Provider):

User Information

  • What is the role of the user logged in? (Admin/Cluster Owner/Cluster Member/Project Owner/Project Member/Custom)
    • If custom, define the set of permissions:

Describe the bug

To Reproduce

  1. We trigger an RKE2 upgrade in Harvester (with pre-drain/post-drain hooks) on a 4-node cluster (3 servers, 1 worker):
$ kubectl edit clusters.provisioning.cattle.io local -n fleet-local

and edit the local cluster with:

spec:
  kubernetesVersion: v1.23.12+rke2r1
  localClusterAuthEndpoint: {}
  rkeConfig:
    chartValues: null
    machineGlobalConfig: null
    provisionGeneration: 1
    upgradeStrategy:
      controlPlaneConcurrency: "1"
      controlPlaneDrainOptions:
        deleteEmptyDirData: true
        enabled: true
        force: true
        ignoreDaemonSets: true
        postDrainHooks:
        - annotation: harvesterhci.io/post-hook
        preDrainHooks:
        - annotation: harvesterhci.io/pre-hook
        timeout: 0
      workerConcurrency: "1"
      workerDrainOptions:
        deleteEmptyDirData: true
        enabled: true
        force: true
        ignoreDaemonSets: true
        postDrainHooks:
        - annotation: harvesterhci.io/post-hook
        preDrainHooks:
        - annotation: harvesterhci.io/pre-hook
        timeout: 0

Result

We observe that after the first node is upgraded, there is a high chance that scheduling is disabled on both of the remaining server nodes at the same time. We also see that Rancher added the pre-drain hook annotation to their plan secrets, which indicates a pre-drain signal.

$ kubectl get nodes
NAME    STATUS                     ROLES                       AGE   VERSION
node1   Ready                      control-plane,etcd,master   21d   v1.23.12+rke2r1 <-- upgraded
node2   Ready,SchedulingDisabled   control-plane,etcd,master   21d   v1.23.12+rke2r1  <--
node3   Ready                      <none>                      21d   v1.22.12+rke2r1
node4   Ready,SchedulingDisabled   control-plane,etcd,master   21d   v1.22.12+rke2r1 <--

Expected Result

Only a single server node should have scheduling disabled at a time.

Screenshots

Additional context

Some observations:

  • Node2 and node4's machine plan secrets have rke.cattle.io/pre-drain annotation set.
$ kubectl get machine -A
NAMESPACE     NAME                  CLUSTER   NODENAME   PROVIDERID     PHASE     AGE   VERSION
fleet-local   custom-24d57cc6f506   local     node1      rke2://node1   Running   21d
fleet-local   custom-3865d0441591   local     node2      rke2://node2   Running   21d
fleet-local   custom-3994bff0f3f3   local     node3      rke2://node3   Running   21d
fleet-local   custom-fda201f64657   local     node4      rke2://node4   Running   21d

$ kubectl get secret custom-3865d0441591-machine-plan -n fleet-local -o json | jq '.metadata.annotations."rke.cattle.io/pre-drain"'
"{\"IgnoreErrors\":false,\"deleteEmptyDirData\":true,\"disableEviction\":false,\"enabled\":true,\"force\":true,\"gracePeriod\":0,\"ignoreDaemonSets\":true,\"postDrainHooks\":[{\"annotation\":\"harvesterhci.io/post-hook\"}],\"preDrainHooks\":[{\"annotation\":\"harvesterhci.io/pre-hook\"}],\"skipWaitForDeleteTimeoutSeconds\":0,\"timeout\":0}"

$ kubectl get secret custom-fda201f64657-machine-plan -n fleet-local -o json | jq '.metadata.annotations."rke.cattle.io/pre-drain"'
"{\"IgnoreErrors\":false,\"deleteEmptyDirData\":true,\"disableEviction\":false,\"enabled\":true,\"force\":true,\"gracePeriod\":0,\"ignoreDaemonSets\":true,\"postDrainHooks\":[{\"annotation\":\"harvesterhci.io/post-hook\"}],\"preDrainHooks\":[{\"annotation\":\"harvesterhci.io/pre-hook\"}],\"skipWaitForDeleteTimeoutSeconds\":0,\"timeout\":0}"

SURE-6031

@bk201 bk201 added kind/bug Issues that are defects reported by users or that we know have reached a real release area/harvester area/provisioning-v2 Provisioning issues that are specific to the provisioningv2 generating framework labels Sep 29, 2022
@Jono-SUSE-Rancher Jono-SUSE-Rancher added this to the v2.6.9 milestone Sep 29, 2022
@Jono-SUSE-Rancher Jono-SUSE-Rancher added the team/hostbusters The team that is responsible for provisioning/managing downstream clusters + K8s version support label Sep 29, 2022
@Oats87
Contributor

Oats87 commented Sep 30, 2022

@bk201 I'm struggling to reproduce this issue.

Do you think you would be able to provide an environment where this happens?

@bk201
Member Author

bk201 commented Oct 3, 2022

@Oats87 I'll try to create one.

@Oats87
Contributor

Oats87 commented Oct 6, 2022

Seems there is a weird bug here that can occasionally cause this. Unfortunately, it is not easy to reproduce, and I have not been able to reproduce it.

@deniseschannon

Since this isn't reproducible and has been occurring in previous versions, the release blocker label has been removed.

@Sahota1225 Sahota1225 modified the milestones: v2.6.9, v2.7.1 Oct 12, 2022
@Sahota1225 Sahota1225 added the release-note Note this issue in the milestone's release notes label Oct 12, 2022
@w13915984028

w13915984028 commented Oct 14, 2022

When upgrading, Harvester sets the following:

	toUpdate.Spec.RKEConfig.ProvisionGeneration += 1
	toUpdate.Spec.RKEConfig.UpgradeStrategy.ControlPlaneConcurrency = "1"
	toUpdate.Spec.RKEConfig.UpgradeStrategy.WorkerConcurrency = "1"
	toUpdate.Spec.RKEConfig.UpgradeStrategy.ControlPlaneDrainOptions.DeleteEmptyDirData = rke2DrainNodes
	toUpdate.Spec.RKEConfig.UpgradeStrategy.ControlPlaneDrainOptions.Enabled = rke2DrainNodes
	toUpdate.Spec.RKEConfig.UpgradeStrategy.ControlPlaneDrainOptions.Force = rke2DrainNodes
	toUpdate.Spec.RKEConfig.UpgradeStrategy.ControlPlaneDrainOptions.IgnoreDaemonSets = &rke2DrainNodes
	toUpdate.Spec.RKEConfig.UpgradeStrategy.WorkerDrainOptions.DeleteEmptyDirData = rke2DrainNodes
	toUpdate.Spec.RKEConfig.UpgradeStrategy.WorkerDrainOptions.Enabled = rke2DrainNodes
	toUpdate.Spec.RKEConfig.UpgradeStrategy.WorkerDrainOptions.Force = rke2DrainNodes
	toUpdate.Spec.RKEConfig.UpgradeStrategy.WorkerDrainOptions.IgnoreDaemonSets = &rke2DrainNodes

According to this upgrade setting, at most one control-plane node should be upgraded at a time, and the same applies to workers.

But the node status shows two control-plane nodes being upgraded at the same time, which means Rancher's control of the upgrade sequence is broken.

harvester/harvester#2907

node-0:~ # k get no
NAME     STATUS                     ROLES                       AGE     VERSION
node-0   Ready                      control-plane,etcd,master   5d11h   v1.24.6+rke2r1
node-1   Ready,SchedulingDisabled   control-plane,etcd,master   5d10h   v1.24.6+rke2r1
node-2   Ready,SchedulingDisabled   control-plane,etcd,master   5d10h   v1.22.12+rke2r1
node-3   Ready                      <none>                      5d10h   v1.22.12+rke2r1

The current Harvester fix works as a workaround, but it may raise another question:

Since Rancher starts upgrading the second control-plane node earlier than expected and Harvester suspends it, Rancher may eventually report a timeout for that node. @Oats87 is that possible? Thanks.

cc @bk201 @starbops

@w13915984028

From the support bundle attached in harvester/harvester#2907 (comment),

in logs/cattle-system/rancher-59cd8bb8f7-hmmbq/rancher.log we can see that multiple nodes are draining at the same time.

The first planner log line is:
2022-10-11T11:23:09.471682431Z 2022/10/11 11:23:09 [INFO] [planner] rkecluster fleet-local/local: waiting: draining etcd node(s) custom-929d403d1670,custom-c05d0d11190c
The two machines, custom-929d403d1670 and custom-c05d0d11190c, are both control-plane nodes, and they are draining at the same time.

2022-10-11T11:23:09.471682431Z 2022/10/11 11:23:09 [INFO] [planner] rkecluster fleet-local/local: waiting: draining etcd node(s) custom-929d403d1670,custom-c05d0d11190c
2022-10-11T11:23:09.474096731Z 2022/10/11 11:23:09 [INFO] [planner] rkecluster fleet-local/local: waiting: draining etcd node(s) custom-929d403d1670,custom-c05d0d11190c
2022-10-11T11:23:10.762326566Z 2022/10/11 11:23:10 [ERROR] Failed to read API for groups map[autoscaling/v2:the server could not find the requested resource flowcontrol.apiserver.k8s.io/v1beta2:the server could not find the requested resource]
2022-10-11T11:23:14.327719194Z 2022/10/11 11:23:14 [INFO] [planner] rkecluster fleet-local/local: waiting: draining etcd node(s) custom-929d403d1670,custom-c05d0d11190c
2022-10-11T11:23:14.387075520Z 2022/10/11 11:23:14 [INFO] Watching metadata for autoscaling/v2beta1, Kind=HorizontalPodAutoscaler
2022-10-11T11:23:14.387115743Z 2022/10/11 11:23:14 [INFO] Stopping metadata watch on autoscaling/v1, Kind=HorizontalPodAutoscaler
2022-10-11T11:23:15.086488451Z 2022/10/11 11:23:15 [INFO] [planner] rkecluster fleet-local/local: waiting: draining etcd node(s) custom-929d403d1670,custom-c05d0d11190c
2022-10-11T11:23:15.086524671Z 2022/10/11 11:23:15 [INFO] [planner] rkecluster fleet-local/local: waiting: draining etcd node(s) custom-929d403d1670,custom-c05d0d11190c
2022-10-11T11:23:16.658870231Z 2022/10/11 11:23:16 [ERROR] Failed to handle tunnel request from remote address 10.52.1.49:53496: response 401: failed authentication
2022-10-11T11:23:19.447217540Z 2022/10/11 11:23:19 [INFO] [planner] rkecluster fleet-local/local: waiting: draining etcd node(s) custom-929d403d1670,custom-c05d0d11190c
2022-10-11T11:23:19.545281217Z 2022/10/11 11:23:19 [INFO] [planner] rkecluster fleet-local/local: waiting: draining etcd node(s) custom-929d403d1670,custom-c05d0d11190c
2022-10-11T11:23:21.681837805Z 2022/10/11 11:23:21 [ERROR] Failed to handle tunnel request from remote address 10.52.1.49:46494: response 401: failed authentication
2022-10-11T11:23:26.686803004Z 2022/10/11 11:23:26 [ERROR] Failed to handle tunnel request from remote address 10.52.1.49:46506: response 401: failed authentication
2022-10-11T11:23:29.436589166Z 2022/10/11 11:23:29 [INFO] [planner] rkecluster fleet-local/local: waiting: draining etcd node(s) custom-929d403d1670,custom-c05d0d11190c

node-0:~ # k -n fleet-local get machines
NAME                  CLUSTER   NODENAME   PROVIDERID      PHASE     AGE     VERSION
custom-2d94d5d682dc   local     node-3     rke2://node-3   Running   5d13h   // worker
custom-7c1afab6e79d   local     node-0     rke2://node-0   Running   5d14h
custom-929d403d1670   local     node-1     rke2://node-1   Running   5d14h  // control-plane
custom-c05d0d11190c   local     node-2     rke2://node-2   Running   5d13h  // control-plane
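A quick way to see how long the planner kept reporting both machines as draining (a sketch, assuming the rancher.log path from the support bundle quoted above):

$ grep -cF 'draining etcd node(s) custom-929d403d1670,custom-c05d0d11190c' \
    logs/cattle-system/rancher-59cd8bb8f7-hmmbq/rancher.log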

@starbops
Member

starbops commented Oct 20, 2022

Today, I did another round of the Harvester upgrade on a 4-node cluster and tried my best to collect all the Rancher pods' logs with a simple script while upgrading:

rancher-pod-logs.tar.gz
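The script itself is not attached; a minimal sketch of what such collection could look like (assuming the Rancher pods carry the app=rancher label in cattle-system, and that the loop is re-run after the Rancher deployment is upgraded and new pods appear):

for pod in $(kubectl -n cattle-system get pods -l app=rancher -o name); do
  # follow each pod's log in the background and write it to <pod-name>.txt
  kubectl -n cattle-system logs -f "$pod" > "$(basename "$pod").txt" &
done
wait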

Before rancher upgrade:

  • rancher-7fd549bcc4-5twfg.txt
  • rancher-7fd549bcc4-j8m7r.txt
  • rancher-7fd549bcc4-dzdw8.txt

After rancher upgrade:

  • rancher-65f8899dfb-ksnkx.txt
  • rancher-65f8899dfb-58k7p.txt
  • rancher-65f8899dfb-87k5t.txt
  • rancher-65f8899dfb-26b6s.txt
  • rancher-65f8899dfb-nf7p4.txt
  • rancher-65f8899dfb-zrvsk.txt
  • rancher-65f8899dfb-qkvd6.txt
  • rancher-65f8899dfb-4mxd9.txt

In the middle of the upgrade, there was indeed a multi-node SchedulingDisabled situation (node-1 & node-2) after the first node (node-0) was upgraded and rebooted. But we had a workaround code snippet deployed in the upgrade controller, so the whole upgrade did not get stuck forever; it eventually went through to the end.

Here is some information you can reference alongside the logs:

$ k -n fleet-local get machines
NAME                  CLUSTER   NODENAME   PROVIDERID      PHASE     AGE   VERSION
custom-1b287700d314   local     node-2     rke2://node-2   Running   25h  // control-plane
custom-57aefc97a78e   local     node-3     rke2://node-3   Running   25h  // worker
custom-ad79796f3d2a   local     node-1     rke2://node-1   Running   25h  // control-plane
custom-cd36cfbeabf7   local     node-0     rke2://node-0   Running   25h  // control-plane (bootstrap node)
  • Rancher upgrade from v2.6.4 to v2.6.9-rc5
  • RKE2 upgrade from v1.22.12+rke2r1 to v1.24.7+rke2r1

cc @w13915984028

@w13915984028

According to the source code

https://github.com/rancher/rancher/blob/release/v2.7/pkg/provisioningv2/rke2/planner/planner.go#L352

err = p.reconcile(controlPlane, clusterSecretTokens, plan, true, etcdTier, isEtcd, isInitNodeOrDeleting,
	"1", joinServer,
	controlPlane.Spec.UpgradeStrategy.ControlPlaneDrainOptions)
...
err = p.reconcile(controlPlane, clusterSecretTokens, plan, true, controlPlaneTier, isControlPlane, isInitNodeOrDeleting,
	controlPlane.Spec.UpgradeStrategy.ControlPlaneConcurrency, joinServer,
	controlPlane.Spec.UpgradeStrategy.ControlPlaneDrainOptions)

The etcdTier and controlPlaneTier reconciles each select at most one node per tier,

but the two tiers may share the same nodes (e.g. 3 management nodes), which breaks the concurrency policy ControlPlaneConcurrency = "1".

It could be that after the init node is upgraded, another two nodes are upgraded in parallel; sometimes this succeeds, sometimes not.
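The overlap is easy to see on this cluster, since every management node carries both the etcd and the control-plane role and can therefore be picked by both reconcile passes. A quick check (a sketch, assuming the standard node role labels behind the ROLES column shown above):

$ kubectl get nodes -l 'node-role.kubernetes.io/etcd=true,node-role.kubernetes.io/control-plane=true'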

@starbops Your last test log shows that.

@bk201
Member Author

bk201 commented Nov 1, 2022

@Oats87 Are the above comments helpful, or do you still need a live environment reproducing this issue? Thanks!

@Oats87
Contributor

Oats87 commented Nov 23, 2022

@bk201 I've been working to try and reproduce this but I have not been able to do so. Have you folks found an accurate reproducer for this?

@bk201
Member Author

bk201 commented Dec 2, 2022

@Oats87 We'll try to create one and get back to you. Thanks!

@starbops
Member

starbops commented Dec 9, 2022

Hi @Oats87, I successfully reproduced the issue on a 3-node Harvester cluster in our environment, though it's not always reproducible. I left the environment intact in case you are interested in looking into it.

Screen Shot 2022-12-09 at 14 12 42

For simplicity, and to avoid the lengthy upgrade process, I didn't trigger the normal upgrade flow of Harvester. Instead, I did the following (only upgrading RKE2):

  1. Prepare a v1.0.3 Harvester cluster (RKE2 version is v1.22.12+rke2r1, Rancher version is v2.6.4-harvester3)
  2. Upgrade Rancher to v2.6.9 with the following script
#!/usr/bin/env sh

set -ex

trap cleanup EXIT

cleanup() {
  if [ -n "$TEMP_DIR" ]; then
    \rm -vrf "$TEMP_DIR"
  fi
}

RANCHER_VERSION=${1:-v2.6.9}
TEMP_DIR=$(mktemp -d -p /tmp)

wharfie rancher/system-agent-installer-rancher:"$RANCHER_VERSION" "$TEMP_DIR"

pushd "$TEMP_DIR"
helm upgrade rancher ./rancher-"${RANCHER_VERSION#v}".tgz --reuse-values --set rancherImageTag="$RANCHER_VERSION"  --namespace cattle-system --wait
popd

kubectl -n cattle-system rollout status deploy rancher
  3. Simulate the Harvester upgrade by patching clusters.provisioning.cattle.io with the command kubectl -n fleet-local patch clusters.provisioning.cattle.io local --type merge --patch-file ./upgrade-patch.yaml. The patch looks like the following:
spec:
  kubernetesVersion: v1.24.7+rke2r1
  rkeConfig:
    provisionGeneration: 1
    upgradeStrategy:
      controlPlaneConcurrency: "1"
      workerConcurrency: "1"
      controlPlaneDrainOptions:
        deleteEmptyDirData: true
        enabled: true
        force: true
        ignoreDaemonSets: true
        postDrainHooks:
        - annotation: "harvesterhci.io/post-hook"
        preDrainHooks:
        - annotation: "harvesterhci.io/pre-hook"
      workerDrainOptions:
        deleteEmptyDirData: true
        enabled: true
        force: true
        ignoreDaemonSets: true
        postDrainHooks:
        - annotation: "harvesterhci.io/post-hook"
        preDrainHooks:
        - annotation: "harvesterhci.io/pre-hook"
  4. The upgrade starts on the first node, harvester-node-0. The machine plan secret custom-7bb31dfaa3bb-machine-plan has rke.cattle.io/pre-drain annotated.
  5. Manually annotate custom-7bb31dfaa3bb-machine-plan with harvesterhci.io/pre-hook, just like the normal upgrade flow of Harvester does (see the annotate sketch after this list)
  6. The first node, harvester-node-0, starts to drain the pods
  7. After the drain is done, custom-7bb31dfaa3bb-machine-plan is annotated with rke.cattle.io/post-drain
  8. Manually annotate custom-7bb31dfaa3bb-machine-plan with harvesterhci.io/post-hook, just like the normal upgrade flow of Harvester does
  9. The first node upgrade is done
  10. The upgrade starts on the second node, harvester-node-2. The machine plan secret custom-bb2ddb6fb772-machine-plan has rke.cattle.io/pre-drain annotated.
  11. Manually annotate custom-bb2ddb6fb772-machine-plan with harvesterhci.io/pre-hook
  12. The second node, harvester-node-2, starts to drain the pods
  13. The second node drain is done, and custom-bb2ddb6fb772-machine-plan is annotated with rke.cattle.io/post-drain
  14. Somehow, the third node, harvester-node-1, is also cordoned and has rke.cattle.io/pre-drain annotated
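A sketch of the manual hook acknowledgement used in steps 5, 8, and 11 above, assuming (as the Harvester upgrade controller does) that the hook annotation has to carry the same value as the corresponding rke.cattle.io/pre-drain (or rke.cattle.io/post-drain) annotation:

$ SECRET=custom-7bb31dfaa3bb-machine-plan
$ VALUE=$(kubectl -n fleet-local get secret "$SECRET" -o json | jq -r '.metadata.annotations."rke.cattle.io/pre-drain"')
$ kubectl -n fleet-local annotate secret "$SECRET" --overwrite "harvesterhci.io/pre-hook=$VALUE"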

The support bundle is here:

supportbundle_12c2d5c7-956a-4e26-bebb-dd4ec43dc5d8_2022-12-09T06-14-20Z.zip

P.S. I had tried this procedure several times without the issue occurring, until now. It happens more frequently when executing a regular Harvester upgrade.

@starbops
Member

With trace logs enabled on Rancher, I reproduced the issue using the same method in the same environment. Here's the support bundle:
supportbundle_12c2d5c7-956a-4e26-bebb-dd4ec43dc5d8_2022-12-20T03-21-52Z.zip

Hope that helps!

@Oats87
Contributor

Oats87 commented Apr 21, 2023

I believe I have identified why this is occurring. Huge shout out to @starbops for helping me debug this and gathering the corresponding logs.

c6b6afd is a commit that introduces logic that attempts to continue determining draining status/updating a plan if a plan has been applied but probes are failing. This seems to introduce an edge case where a valid but "old" plan may start having its probes fail (which can easily happen when the init node is restarted, for example), causing the planner to attempt to drain that node.

I'll need to think about how to prevent this edge case while still accommodating the original desired business logic defined in the PR/commit.

@Oats87
Contributor

Oats87 commented May 15, 2023

#41459 reverts the addition of the planAppliedButWaitingForProbes short-circuiting.

@bk201
Member Author

bk201 commented Jun 16, 2023

We can confirm the issue no longer happens after bumping to the Rancher 2.7.5-rc releases; thanks!

@bk201 bk201 closed this as completed Jun 16, 2023
@zube zube bot removed the [zube]: Done label Sep 14, 2023