
k3s-killall.sh during encryption key rotation causes last node to get stuck on rotate instead of moving to reencrypt_finished #6155

Closed
jakefhyde opened this issue Sep 19, 2022 · 4 comments


@jakefhyde
Contributor

jakefhyde commented Sep 19, 2022

Environmental Info:
K3s Version:

Node(s) CPU architecture, OS, and Version:

Cluster Configuration:

3 nodes, all roles

Describe the bug:

When using the killall script during encryption key rotation, the last node to be restarted after the reencrypt step remains stuck at the rotate stage instead of moving to reencrypt_finished.

Steps To Reproduce:

  • Installed K3s: Rancher-provisioned v1.24.4
  • Create 10k secrets
  • Begin the encryption key rotation sequence
  • Elect a leader
  • For each of the encryption key rotation stages (prepare, rotate, reencrypt):
    • Run the following commands on the leader for the current stage:
      • k3s secrets-encrypt <stage>
      • k3s secrets-encrypt status
        • Run the above command until the expected encryption key rotation stage is returned
    • For each node, run the following commands:
      • systemctl is-active k3s
      • Verify the output of the above command; if not active, wait 10 seconds and try again
      • k3s-killall.sh
      • systemctl restart k3s
      • k3s secrets-encrypt status
      • Run the above command until it succeeds
      • Verify that the other nodes report the same encryption key rotation status as the leader
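
For concreteness, a minimal bash sketch of the loop above, assuming three servers reachable over SSH as node1..node3 (the hostnames and secret names are placeholders; the k3s subcommands and scripts are the real ones):

```bash
#!/usr/bin/env bash
# Sketch of the reproduction loop above. Hostnames (node1..node3) and
# secret names are placeholders; the k3s commands are the real ones.
set -euo pipefail

LEADER=node1
NODES=(node1 node2 node3)

# Create 10k secrets to stretch out the reencrypt phase.
for i in $(seq 1 10000); do
  kubectl create secret generic "scale-secret-$i" \
    --from-literal=key="value-$i" >/dev/null
done

# Poll `secrets-encrypt status` on a node until it mentions the stage.
# (For reencrypt, the final stage reported is reencrypt_finished.)
wait_for_stage() {
  local node=$1 stage=$2
  until ssh "$node" k3s secrets-encrypt status 2>/dev/null | grep -q "$stage"; do
    sleep 10
  done
}

for stage in prepare rotate reencrypt; do
  ssh "$LEADER" k3s secrets-encrypt "$stage"
  wait_for_stage "$LEADER" "$stage"

  for node in "${NODES[@]}"; do
    # Wait for the unit to report active before restarting it.
    until ssh "$node" systemctl is-active --quiet k3s; do sleep 10; done
    ssh "$node" k3s-killall.sh
    ssh "$node" systemctl restart k3s
    # Retry status until the node answers again, then its stage can be
    # compared against the leader's.
    until ssh "$node" k3s secrets-encrypt status >/dev/null 2>&1; do sleep 10; done
  done
done
```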

Expected behavior:

After restarting the last node, all nodes now have an encryption key rotation stage of reencrypt_finished.

Actual behavior:

The last node retains the incorrect encryption key rotation stage (rotate). This is notable because the leader successfully reconciled the bootstrap data and the second control-plane node was able to move to reencrypt_finished.

Additional context / logs:

So far, this has been narrowed down to the k3s-killall.sh script. However, we want to keep using the killall script during our encryption key rotation process: rotation is a sensitive operation, and killing everything minimizes cluster activity while it runs.

This issue likely affects rke2 as well; v1.24.4+k3s1 is simply the version I found it on.

Additionally, this was found as part of scale testing in Rancher. I tested with 10k secrets, although the issue is likely timing-related and reproducible with fewer.

The behavior in this issue seems similar to rancher/rke2#3006

@brandond
Member

brandond commented Sep 19, 2022

I don't believe you're following the process properly. According to your steps:

  • Run the following commands on the leader for each of the encryption key rotation stages (prepare, rotate, reencrypt)
    • k3s secrets-encrypt <stage>
    • k3s secrets-encrypt status
      • run the above command until the expected encryption key rotation stage is returned
  • For each node, run the following commands
    • systemctl is-active k3s
      • Verify output of above command, if not true, wait 10 seconds then try again
    • k3s-killall.sh
    • systemctl restart k3s
    • k3s secrets-encrypt status
      • Run above command until it succeeds
    • Verify that the other nodes have the same encryption key rotation status as the leader.

If I follow your process from the top down, it sounds like you are running the first server all the way through prepare, rotate, and reencrypt before using k3s-killall.sh to restart k3s on the other servers. According to the steps at https://rancher.com/docs/k3s/latest/en/security/secrets_encryption/#high-availability-encryption-key-rotation, k3s-killall.sh should be run on the secondary servers after each of the prepare and rotate steps on the first server, not just once after the first server has been run all the way through all three steps.
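
In other words, the documented ordering interleaves the secondary restarts with the stages rather than batching them at the end. A rough sketch of that reading (server hostnames are placeholders, server1 drives the rotation):

```bash
# Documented HA rotation order, roughly; hostnames are placeholders.
ssh server1 k3s secrets-encrypt prepare
for s in server2 server3; do
  ssh "$s" k3s-killall.sh
  ssh "$s" systemctl restart k3s
done

ssh server1 k3s secrets-encrypt rotate
for s in server2 server3; do
  ssh "$s" k3s-killall.sh
  ssh "$s" systemctl restart k3s
done

ssh server1 k3s secrets-encrypt reencrypt
# Wait for `k3s secrets-encrypt status` on server1 to report
# reencrypt_finished before restarting the remaining servers.
```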

Can you confirm the exact order of the steps you're following, sequentially, across all the nodes in the cluster?

So far, this has been narrowed down to the k3s-killall.sh script. However, we want to keep using the killall script during our encryption key rotation process: rotation is a sensitive operation, and killing everything minimizes cluster activity while it runs.

What specifically have you narrowed down to the script? Are you using the script, or are you not? Does the process work properly if you use systemctl to stop and restart the service instead of using the script?
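
For comparison, the graceful path on each secondary would be just the unit-level stop/start, a sketch of which follows (unlike the killall script, this leaves running pods alone):

```bash
# Restart only the k3s unit; running pods and child processes are left
# in place, and the service gets a chance to shut down cleanly.
systemctl stop k3s
systemctl start k3s    # or simply: systemctl restart k3s
```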

@jakefhyde
Contributor Author

@brandond Updated the indentation level so it's more obvious that I'm not doing that (in fact, I promise it doesn't get past prepare if you don't restart the followers; I found that out the hard way very early on). I can assure you we are following the spec, and I'll provide whatever logs are necessary (currently running through tests now).

Also, according to https://rancher.com/docs/k3s/latest/en/security/secrets_encryption/#high-availability-encryption-key-rotation, it isn't specifically mentioned how to kill k3s, though the rke2 docs do cover it: https://docs.rke2.io/security/secrets_encryption/#multi-server-encryption-key-rotation. Although they make no mention of using the killall script to stop the respective servers, I imagine it isn't specifically called out as forbidden because it's a pretty reasonable way to stop the servers and not preventable.

It does appear to work without the killall script, in that the node no longer gets stuck and encryption key rotation finishes, but we want the killall script's behavior as part of the encryption key rotation process. Let me know if there is anything else I can clarify while I wait for these tests to finish.

@brandond
Member

brandond commented Sep 19, 2022

Perhaps we need to make it clearer that k3s-killall.sh should not be used to stop k3s unless you are OK with the potential for data loss, and it definitely shouldn't be used to stop k3s when performing encryption key rotation. The killall script does use systemctl to stop the service, but then immediately proceeds to kill -9 k3s and all of its child processes and pods, which may result in the datastore not shutting down cleanly.
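
Roughly, the distinction being drawn is the following (a sketch, not the literal contents of k3s-killall.sh):

```bash
# A clean stop lets k3s and its embedded datastore flush and exit on
# their own terms.
systemctl stop k3s

# The killall script goes further: after stopping the unit it
# force-kills whatever remains of the k3s process tree, roughly:
pkill -9 -f 'k3s server'    # illustrative; the real script kills more
```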

@jakefhyde
Contributor Author

Closing this issue, as Rancher has migrated away from using the killall script.
