k3s-killall.sh during encryption key rotation causes last node to get stuck on rotate instead of moving to reencrypt_finished #6155
Comments
I don't believe you're following the process properly. According to your steps:
If I follow your process from the top down, it sounds like you are running the first server all the way through. Can you confirm the exact order of the steps you're following, sequentially, across all the nodes in the cluster?
What specifically have you narrowed down to the script? Are you using the script or not? Does the process work properly if you use systemctl to stop and restart the service instead of using the script?
@brandond Updated the indentation level so it's more obvious that I'm not doing that (in fact, I promise it doesn't get past prepare if you don't restart the followers; I found that out the hard way very early on). I can assure you that we are following the spec, and I'll provide whatever logs are necessary (currently running through tests now). Also, https://rancher.com/docs/k3s/latest/en/security/secrets_encryption/#high-availability-encryption-key-rotation doesn't specifically mention how to kill k3s, though the rke2 docs do: https://docs.rke2.io/security/secrets_encryption/#multi-server-encryption-key-rotation. Although they don't make any mention of using the killall script to stop the respective servers, I imagine it isn't specifically called out not to do so because it's a pretty reasonable way to stop the servers and not preventable. Also, it appears to work without the killall script, in that the node no longer gets stuck and encryption key rotation finishes, but we want the functionality of the killall script as part of the encryption key rotation process. Let me know if there is anything else I can clarify while I wait for these tests to finish.
Perhaps we need to make it more clear that
Closing this issue as Rancher has migrated away from using the killall script.
Environmental Info:
K3s Version:
Node(s) CPU architecture, OS, and Version:
Cluster Configuration:
3 Node all roles
Describe the bug:
When using the killall script during encryption key rotation, the last node to be restarted post reencrypt will still have the encryption key rotation stage rotate.
Steps To Reproduce:
(prepare, rotate, reencrypt)
k3s secrets-encrypt <stage>
k3s secrets-encrypt status
systemctl is-active k3s
k3s-killall.sh
systemctl restart k3s
k3s secrets-encrypt status
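The status checks in the steps above can be scripted. This is a minimal sketch, assuming `k3s secrets-encrypt status` prints a line of the form `Current Rotation Stage: <stage>`. The helper names `rotation_stage` and `assert_stage` are hypothetical and not part of k3s.

```shell
#!/bin/sh
# Hypothetical helpers for illustration; not part of k3s.

# Read `k3s secrets-encrypt status` output on stdin and print the stage.
# Assumes a line of the form "Current Rotation Stage: <stage>".
rotation_stage() {
    sed -n 's/^Current Rotation Stage:[[:space:]]*//p' | head -n 1
}

# Succeed only if the status text on stdin reports the expected stage.
assert_stage() {
    expected=$1
    actual=$(rotation_stage)
    if [ "$actual" != "$expected" ]; then
        echo "stage is '$actual', expected '$expected'" >&2
        return 1
    fi
}

# Usage on a server node (commented out so the sketch has no side effects):
#   k3s secrets-encrypt status | assert_stage reencrypt_finished
```

Running the check on each server after its restart would show which node is still reporting rotate.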
Expected behavior:
After restarting the last node, all nodes now have an encryption key rotation stage of reencrypt_finished.
Actual behavior:
The node has the incorrect encryption key rotation stage. This is also interesting because the leader successfully reconciled the bootstrap data and the second controlplane node was able to move to reencrypt_finished.
Additional context / logs:
So far, this has been narrowed down to the k3s-killall.sh script. However, we want the functionality of the killall script during our encryption key rotation process: it is a sensitive operation, and we want to minimize cluster activity while it runs. This issue also likely affects rke2; v1.24.4+k3s1 is just the version I found it on.
Additionally, this was found as a part of scale testing in rancher. I tested with 10k secrets, although it's likely the issue is timing related and happens before then.
The behavior in this issue seems similar to rancher/rke2#3006
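If the problem is timing related, polling the stage with a timeout is one way to tell a slow transition from a node that is stuck. This is a sketch only, assuming `k3s secrets-encrypt status` prints a line of the form `Current Rotation Stage: <stage>`. The helper name `wait_for_stage` is hypothetical and not part of k3s.

```shell
#!/bin/sh
# Hypothetical helper for illustration; not part of k3s.

# wait_for_stage EXPECTED TIMEOUT_SECONDS STATUS_COMMAND...
# Re-runs STATUS_COMMAND about once per second until its output contains
# "Current Rotation Stage: EXPECTED" or TIMEOUT_SECONDS have elapsed.
wait_for_stage() {
    expected=$1
    timeout=$2
    shift 2
    elapsed=0
    while :; do
        stage=$("$@" 2>/dev/null \
            | sed -n 's/^Current Rotation Stage:[[:space:]]*//p' \
            | head -n 1)
        if [ "$stage" = "$expected" ]; then
            echo "reached $expected after ${elapsed}s"
            return 0
        fi
        if [ "$elapsed" -ge "$timeout" ]; then
            echo "still at '$stage' after ${elapsed}s" >&2
            return 1
        fi
        sleep 1
        elapsed=$((elapsed + 1))
    done
}

# Usage on the last restarted node (commented out; the 600s timeout is arbitrary):
#   wait_for_stage reencrypt_finished 600 k3s secrets-encrypt status
```

A node that never leaves rotate would make the helper return non-zero once the timeout elapses, while a slow node would eventually report the expected stage.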