
k3s-killall.sh during encryption key rotation causes last node to get stuck on rotate instead of moving to reencrypt_finished #6155

Closed
jakefhyde opened this issue Sep 19, 2022 · 4 comments


@jakefhyde
Contributor

jakefhyde commented Sep 19, 2022

Environmental Info:
K3s Version:

Node(s) CPU architecture, OS, and Version:

Cluster Configuration:

3 nodes, all roles

Describe the bug:

When using the killall script during encryption key rotation, the last node to be restarted after the reencrypt step remains stuck at the rotate stage instead of moving to reencrypt_finished.

Steps To Reproduce:

  • Installed K3s: Rancher-provisioned v1.24.4
  • Create 10k secrets
  • Begin the encryption key rotation sequence
  • Elect a leader
  • For each of the encryption key rotation stages (prepare, rotate, reencrypt):
    • Run the following commands on the leader for the current stage:
      • k3s secrets-encrypt <stage>
      • k3s secrets-encrypt status
        • Run the above command until the expected encryption key rotation stage is returned
    • For each node, run the following commands:
      • systemctl is-active k3s
      • Verify the output of the above command; if not active, wait 10 seconds and try again
      • k3s-killall.sh
      • systemctl restart k3s
      • k3s secrets-encrypt status
      • Run the above command until it succeeds
      • Verify that the other nodes report the same encryption key rotation status as the leader
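
For concreteness, a minimal bash sketch of the loop above, assuming three servers reachable over SSH as node1..node3 (the hostnames and secret names are placeholders; the k3s subcommands and scripts are the real ones):

```bash
#!/usr/bin/env bash
# Sketch of the reproduction loop above. Hostnames (node1..node3) and
# secret names are placeholders; the k3s commands are the real ones.
set -euo pipefail

LEADER=node1
NODES=(node1 node2 node3)

# Create 10k secrets to stretch out the reencrypt phase.
for i in $(seq 1 10000); do
  kubectl create secret generic "scale-secret-$i" \
    --from-literal=key="value-$i" >/dev/null
done

# Poll `secrets-encrypt status` on a node until it mentions the stage.
# (For reencrypt, the final stage reported is reencrypt_finished.)
wait_for_stage() {
  local node=$1 stage=$2
  until ssh "$node" k3s secrets-encrypt status 2>/dev/null | grep -q "$stage"; do
    sleep 10
  done
}

for stage in prepare rotate reencrypt; do
  ssh "$LEADER" k3s secrets-encrypt "$stage"
  wait_for_stage "$LEADER" "$stage"

  for node in "${NODES[@]}"; do
    # Wait for the unit to report active before restarting it.
    until ssh "$node" systemctl is-active --quiet k3s; do sleep 10; done
    ssh "$node" k3s-killall.sh
    ssh "$node" systemctl restart k3s
    # Retry status until the node answers again, then its stage can be
    # compared against the leader's.
    until ssh "$node" k3s secrets-encrypt status >/dev/null 2>&1; do sleep 10; done
  done
done
```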

Expected behavior:

After restarting the last node, all nodes now have an encryption key rotation stage of reencrypt_finished.

Actual behavior:

The last node retains the incorrect encryption key rotation stage (rotate). This is notable because the leader successfully reconciled the bootstrap data and the second control-plane node was able to move to reencrypt_finished.

Additional context / logs:

So far, this has been narrowed down to the k3s-killall.sh script. However, we want to keep using the killall script during our encryption key rotation process: rotation is a sensitive operation, and killing everything minimizes cluster activity while it runs.

This issue likely affects rke2 as well; v1.24.4+k3s1 is simply the version I found it on.

Additionally, this was found as part of scale testing in Rancher. I tested with 10k secrets, although the issue is likely timing-related and reproducible with fewer.

The behavior in this issue seems similar to rancher/rke2#3006

@brandond
Member

brandond commented Sep 19, 2022

I don't believe you're following the process properly. According to your steps:

  • Run the following commands on the leader for each of the encryption key rotation stages (prepare, rotate, reencrypt)
    • k3s secrets-encrypt <stage>
    • k3s secrets-encrypt status
      • run the above command until the expected encryption key rotation stage is returned
  • For each node, run the following commands
    • systemctl is-active k3s
      • Verify output of above command, if not true, wait 10 seconds then try again
    • k3s-killall.sh
    • systemctl restart k3s
    • k3s secrets-encrypt status
      • Run above command until it succeeds
    • Verify that the other nodes have the same encryption key rotation status as the leader.

If I follow your process from the top down, it sounds like you are running the first server all the way through prepare, rotate, and reencrypt before using k3s-killall.sh to restart k3s on the other servers. According to the steps at https://rancher.com/docs/k3s/latest/en/security/secrets_encryption/#high-availability-encryption-key-rotation, k3s-killall.sh should be run on the secondary servers after each of the prepare and rotate steps on the first server, not just once after the first server has been run all the way through all three steps.
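
In other words, the documented ordering interleaves the secondary restarts with the stages rather than batching them at the end. A rough sketch of that reading (server hostnames are placeholders, server1 drives the rotation):

```bash
# Documented HA rotation order, roughly; hostnames are placeholders.
ssh server1 k3s secrets-encrypt prepare
for s in server2 server3; do
  ssh "$s" k3s-killall.sh
  ssh "$s" systemctl restart k3s
done

ssh server1 k3s secrets-encrypt rotate
for s in server2 server3; do
  ssh "$s" k3s-killall.sh
  ssh "$s" systemctl restart k3s
done

ssh server1 k3s secrets-encrypt reencrypt
# Wait for `k3s secrets-encrypt status` on server1 to report
# reencrypt_finished before restarting the remaining servers.
```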

Can you confirm the exact order of the steps you're following, sequentially, across all the nodes in the cluster?

So far, this has been narrowed down to the k3s-killall.sh script. However, we want to keep using the killall script during our encryption key rotation process: rotation is a sensitive operation, and killing everything minimizes cluster activity while it runs.

What specifically have you narrowed down to the script? Are you using the script, or are you not? Does the process work properly if you use systemctl to stop and restart the service instead of using the script?
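
For comparison, the graceful path on each secondary would be just the unit-level stop/start, a sketch of which follows (unlike the killall script, this leaves running pods alone):

```bash
# Restart only the k3s unit; running pods and child processes are left
# in place, and the service gets a chance to shut down cleanly.
systemctl stop k3s
systemctl start k3s    # or simply: systemctl restart k3s
```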

@jakefhyde
Contributor Author

@brandond Updated the indentation level so it's more obvious that I'm not doing that (in fact, I promise it doesn't get past prepare if you don't restart the followers; I found that out the hard way very early on). I can assure you we are following the spec, and I'll provide whatever logs are necessary (currently running through tests now).

Also, according to https://rancher.com/docs/k3s/latest/en/security/secrets_encryption/#high-availability-encryption-key-rotation, it isn't specifically mentioned how to kill k3s, though the rke2 docs do cover it: https://docs.rke2.io/security/secrets_encryption/#multi-server-encryption-key-rotation. Although they make no mention of using the killall script to stop the respective servers, I imagine it isn't specifically called out as forbidden because it's a pretty reasonable way to stop the servers and not preventable.

It does appear to work without the killall script, in that the node no longer gets stuck and encryption key rotation finishes, but we want the killall script's behavior as part of the encryption key rotation process. Let me know if there is anything else I can clarify while I wait for these tests to finish.

@brandond
Member

brandond commented Sep 19, 2022

Perhaps we need to make it clearer that k3s-killall.sh should not be used to stop k3s unless you are OK with the potential for data loss, and it definitely shouldn't be used to stop k3s when performing encryption key rotation. The killall script does use systemctl to stop the service, but then immediately proceeds to kill -9 k3s and all of its child processes and pods, which may result in the datastore not shutting down cleanly.
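
Roughly, the distinction being drawn is the following (a sketch, not the literal contents of k3s-killall.sh):

```bash
# A clean stop lets k3s and its embedded datastore flush and exit on
# their own terms.
systemctl stop k3s

# The killall script goes further: after stopping the unit it
# force-kills whatever remains of the k3s process tree, roughly:
pkill -9 -f 'k3s server'    # illustrative; the real script kills more
```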

@jakefhyde
Contributor Author

Closing this issue, as Rancher has migrated away from using the killall script.
