[BUG] Multiple server nodes pre-drains in an RKE2 upgrade #39167
Comments
@bk201 I'm struggling to reproduce this issue. Do you think you would be able to provide an environment where this happens? |
@Oats87 I'll try to create one. |
Seems there is a weird bug here that can occasionally cause this. Unfortunately, it is not easy to reproduce, and I have not been able to reproduce it. |
Since this isn't reproducible and has been occurring in previous versions, the release blocker label has been removed. |
What
According to the upgrade settings, the control plane has at most 1 node upgrading at a time, and the same applies to workers. But from the node status, 2 control-plane nodes are upgrading at the same time, which means Rancher's control of the upgrade sequence is broken.
Rancher starts the upgrade of the second control-plane node earlier than expected, but Harvester suspends it, which may finally cause Rancher to report a timeout for this node. @Oats87 is it possible? thanks. |
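For reference, the upgrade concurrency discussed here is constrained in the provisioning cluster's spec. A hedged sketch (field names per the Rancher v2 provisioning API as I understand it; values illustrative, not taken from this cluster):

```yaml
apiVersion: provisioning.cattle.io/v1
kind: Cluster
metadata:
  name: local
  namespace: fleet-local
spec:
  rkeConfig:
    upgradeStrategy:
      # At most one server (control-plane) node should upgrade at a time
      controlPlaneConcurrency: "1"
      # Likewise for worker nodes
      workerConcurrency: "1"
```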
Today, I did another round of Harvester upgrade on a 4-node cluster and tried my best to collect all the rancher pods' logs with a simple script while upgrading: Before rancher upgrade:
After rancher upgrade:
In the middle of the upgrade, there was indeed a multi-node SchedulingDisabled situation (node-1 & node-2) after the first node (node-0) was upgraded and rebooted. But we had a workaround code snippet deployed in the upgrade controller, so the whole upgrade did not get stuck forever; it eventually went through to the end. Here is some information you can reference against the logs:
|
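The "simple script" used to collect the rancher pods' logs during the upgrade is not shown above. A minimal sketch of that kind of collector, as a guess at the approach (the `cattle-system` namespace and `app=rancher` label are assumptions for a default Rancher install, not confirmed by the report):

```python
import subprocess

# Sketch of a log collector to run alongside an upgrade: periodically
# append each rancher pod's recent logs to a per-pod file.

def logs_command(namespace: str, pod: str, since: str = "30s") -> list:
    """Build the kubectl invocation used to fetch a pod's recent logs."""
    return ["kubectl", "-n", namespace, "logs", pod, f"--since={since}"]

def collect_once(namespace: str = "cattle-system") -> None:
    """Dump recent logs of every rancher pod once; call this in a loop."""
    pods = subprocess.run(
        ["kubectl", "-n", namespace, "get", "pods",
         "-l", "app=rancher", "-o", "name"],
        capture_output=True, text=True, check=True,
    ).stdout.split()
    for pod in pods:
        out = subprocess.run(logs_command(namespace, pod),
                             capture_output=True, text=True).stdout
        with open(pod.replace("pod/", "") + ".log", "a") as f:
            f.write(out)
```

Calling `collect_once()` every 30 seconds while the upgrade runs yields per-pod log files covering the whole window.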
According to the source code
but they may share the same nodes (e.g. 3 management nodes), which breaks the control policy. It could be that after the init node is upgraded, another 2 nodes are upgraded in parallel; sometimes it succeeds, sometimes not. @starbops Your last test log shows that. |
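A toy model of the failure mode described above (my own sketch, not Rancher's actual planner code): if each role's concurrency limit is checked independently, a node carrying both roles can be admitted through the worker "slot" even when the control-plane slot is full, so two nodes end up upgrading in parallel.

```python
# Toy model (NOT Rancher code) of why independent per-role concurrency
# checks can break down when the same nodes carry both roles.
def upgradable_now(nodes, cp_limit=1, worker_limit=1):
    """Pick nodes to start upgrading, checking each role's limit separately."""
    cp_busy = sum(1 for n in nodes
                  if n["upgrading"] and "control-plane" in n["roles"])
    worker_busy = sum(1 for n in nodes
                      if n["upgrading"] and "worker" in n["roles"])
    picked = []
    for n in nodes:
        if n["upgrading"]:
            continue
        if "control-plane" in n["roles"] and cp_busy < cp_limit:
            cp_busy += 1
            picked.append(n["name"])
        elif "worker" in n["roles"] and worker_busy < worker_limit:
            # Bug illustrated: a dual-role node slips through the
            # worker slot even though the control-plane slot is full.
            worker_busy += 1
            picked.append(n["name"])
    return picked
```

With 3 management nodes (all holding both roles), after the init node finishes, both remaining nodes get picked at once, matching the "another 2 in parallel" observation.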
@Oats87 Are the above comments helpful, or do you still need a live environment reproducing this issue? Thanks! |
@bk201 I've been working to try and reproduce this but I have not been able to do so. Have you folks found an accurate reproducer for this? |
@Oats87 We'll try to create one and get back to you. Thanks! |
Hi @Oats87, I successfully reproduced the issue on a 3-node Harvester cluster in our environment, though it's not always reproducible. I left the environment intact in case you are interested in looking into it. For simplicity, and to avoid the lengthy upgrade process, I didn't trigger the normal Harvester upgrade flow. Instead, I did the following (only upgrading RKE2):
The support bundle is here: supportbundle_12c2d5c7-956a-4e26-bebb-dd4ec43dc5d8_2022-12-09T06-14-20Z.zip P.S. I had tried this procedure several times without hitting the issue, until now. It happens more frequently when executing a regular Harvester upgrade. |
With trace logs enabled on rancher, I reproduced the issue with the same methods in the same environment. Here's the support bundle: Hope that helps! |
I believe I have identified why this is occurring. Huge shout-out to @starbops for helping me debug this and gathering the corresponding logs. c6b6afd is a commit that introduces logic that attempts to continue determining draining status/updating a plan if a plan has been applied but probes are failing. This seems to introduce an edge case where a valid but "old" plan may start having its probes fail (which is very possible when the init node is restarted, for example), causing the planner to attempt to drain that node. I'll need to think about how to prevent this edge case while also accommodating the original desired business logic defined in the PR/commit. |
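A toy sketch of this edge case as described (my own reading of the comment above, not the actual planner code): treating "plan applied but probes failing" as grounds to proceed toward a drain, without checking whether the failing plan is still the desired one, lets a stale plan trigger a pre-drain.

```python
# Toy sketch (NOT the actual Rancher planner) of the edge case:
# a node whose *old* applied plan starts failing its probes (e.g.
# because the init node restarted) gets marked for drain even though
# no new plan has been issued for it.

def should_pre_drain(plan_applied, probes_healthy, plan_is_current):
    """Buggy behavior: ignores whether the failing plan is current."""
    if not plan_applied:
        return False
    if probes_healthy:
        return False
    # Stale plans can trigger a drain here.
    return True

def should_pre_drain_fixed(plan_applied, probes_healthy, plan_is_current):
    """Only act on probe failures for the currently desired plan."""
    return plan_applied and not probes_healthy and plan_is_current
```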
#41459 reverts the addition of the |
We can confirm the issue no longer happens after bumping to the Rancher 2.7.5-rc releases; thanks! |
Rancher Server Setup
Information about the Cluster
User Information
Describe the bug
To Reproduce
And edit local cluster with:
Result
We observe that after the first node is upgraded, there is a high chance that scheduling is disabled on both of the remaining server nodes. And we see Rancher added the pre-drain hook annotation on the plan secrets, which indicates a pre-drain signal.
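One way to check how many plan secrets carry that signal (a sketch; the `rke.cattle.io/pre-drain` annotation key is taken from this report, while the idea of feeding it `kubectl get secrets -n fleet-local -o json` output is my assumption for a local cluster):

```python
import json

# Sketch: list the machine-plan secrets carrying the pre-drain
# annotation, given a kubectl JSON secret list. More than one hit
# among server nodes indicates the multi-node pre-drain reported here.
PRE_DRAIN = "rke.cattle.io/pre-drain"

def secrets_with_pre_drain(secret_list_json: str) -> list:
    secrets = json.loads(secret_list_json)["items"]
    return [
        s["metadata"]["name"]
        for s in secrets
        if PRE_DRAIN in s["metadata"].get("annotations", {})
    ]
```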
Expected Result
Only a single server should be disabled.
Screenshots
Additional context
Some observations:
Multiple server nodes have the rke.cattle.io/pre-drain annotation set.
SURE-6031