etcd snapshot cleanup fails if node name changes #8182

Closed
aganesh-suse opened this issue Aug 14, 2023 · 1 comment

@aganesh-suse

Environmental Info:
K3s Version:
1.27.4

Node(s) CPU architecture, OS, and Version:
ubuntu 22.04

Cluster Configuration:
HA: 3 servers / 1 agent

Describe the bug:
Same as rancher/rke2#3714.
Since the fix is being ported into k3s, creating this issue to re-verify the same behavior on k3s as well.

Steps To Reproduce:
Etcd snapshot retention currently takes the node-name value into consideration when pruning. If the node name changes, previous snapshots for the same etcd member are not cleaned up correctly.
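
For illustration only, a minimal shell sketch of how pruning keyed on the current node-name prefix would miss snapshots from an earlier node name (assumed logic, not the actual k3s code):

SNAP_DIR=/var/lib/rancher/k3s/server/db/snapshots
NODE_NAME=server1-7707        # current (renamed) node name
RETENTION=2
# Only files matching the *current* node-name prefix are candidates for deletion,
# so snapshots named after the old node name ("server1") are never pruned.
ls -1t "$SNAP_DIR"/etcd-snapshot-"$NODE_NAME"-* 2>/dev/null | tail -n +$((RETENTION + 1)) | xargs -r rm --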

Expected behavior:
Cleanup of etcd-snapshots should happen irrespective of node name changes.

Other linked code fixes:
#8099
#8177

@aganesh-suse aganesh-suse self-assigned this Aug 14, 2023
@rancher-max rancher-max added this to the v1.27.5+k3s1 milestone Aug 14, 2023
@aganesh-suse
Author

Validated on master branch with commit id: e83b1ba

Environment Details

Infrastructure

  • Cloud
  • Hosted

Node(s) CPU architecture, OS, and Version:

cat /etc/os-release | grep PRETTY
PRETTY_NAME="Ubuntu 22.04.2 LTS"

Cluster Configuration:

Server config: 3 servers (etcd + control plane) / 1 agent

Config.yaml:

Main ETCD SERVER (+CONTROL PLANE) CONFIG:

token: blah
node-name: "server1"
etcd-snapshot-retention: 2
etcd-snapshot-schedule-cron: "* * * * *"
etcd-s3: true
etcd-s3-access-key: xxx
etcd-s3-secret-key: xxx
etcd-s3-bucket: s3-bucket-name
etcd-s3-folder: k3ssnap/commit-setup
etcd-s3-region: us-east-2
write-kubeconfig-mode: "0644"
cluster-init: true
node-label:
- k3s-upgrade=server

Sample secondary etcd + control plane server config.yaml:

token: blah
server: https://x.x.x.x:6443
node-name: "server3"
write-kubeconfig-mode: "0644"
node-label:
- k3s-upgrade=server

AGENT CONFIG:

token: blah
server: https://x.x.x.x:6443
node-name: "agent1"
node-label:
- k3s-upgrade=agent

Testing Steps

  1. Create the config dir and place the config.yaml file on the server/agent nodes:
$ sudo mkdir -p /etc/rancher/k3s && sudo cp config.yaml /etc/rancher/k3s

Note: First round node-names:
<version|commit>-server1
server2
server3
agent1
2. Install k3s.
To validate the fix, install from the commit:

$ curl -sfL https://get.k3s.io | sudo INSTALL_K3S_COMMIT='e83b1ba4aade82a740e5b199ca788436ec391bb2' sh -s - server

To reproduce the issue, install the released version:

$ curl -sfL https://get.k3s.io | sudo INSTALL_K3S_VERSION='v1.27.4+k3s1' sh -s - server
  3. Wait for 2 minutes.
    Note: A snapshot is created every 1 minute (etcd-snapshot-schedule-cron: "* * * * *"). Retention is set to 2 snapshots (etcd-snapshot-retention: 2).
    Reference for the cron job format: https://cloud.google.com/scheduler/docs/configuring/cron-job-schedules
    After 2 minutes, 2 snapshots exist with names like etcd-snapshot-server1-2-xxxx (if node-name: server1-2 in config.yaml).
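
Optionally, to watch snapshots appear every minute and confirm that only the 2 most recent are retained, something like the following can be left running on the main server (plain watch/ls, nothing k3s-specific):

 $ watch -n 30 'sudo ls -lrt /var/lib/rancher/k3s/server/db/snapshots'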
  4. Check outputs of:
sudo ls -lrt /var/lib/rancher/k3s/server/db/snapshots
sudo k3s etcd-snapshot list

4a. Also check the S3 bucket/folder in AWS to see the snapshots listed.
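
For example, with the AWS CLI installed and configured for the same account as etcd-s3-access-key/etcd-s3-secret-key, the bucket/folder from the config above can be listed as follows (aws CLI usage is an assumption, not part of k3s):

 $ aws s3 ls s3://s3-bucket-name/k3ssnap/commit-setup/ --region us-east-2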
5. Update the node-name in the config.yaml on each node (see the sed sketch after the list below):
Second round node-names:
<version|commit>-server1-<|suffix1>
server2-<|suffix1>
server3-<|suffix1>
agent1-<|suffix1>
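
A minimal sketch for appending a suffix to node-name in config.yaml (assumes node-name is quoted on a single line, as in the configs above; the exact suffix placement does not matter for reproducing the issue):

 $ SUFFIX=$RANDOM
 $ sudo sed -i "s/^node-name: \"\(.*\)\"/node-name: \"\1-$SUFFIX\"/" /etc/rancher/k3s/config.yaml
 $ sudo grep node-name /etc/rancher/k3s/config.yaml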
6. Restart the k3s service on all nodes.

sudo systemctl restart k3s   # on server nodes (use k3s-agent on the agent node)
  7. Wait for 2 more minutes and check the snapshot list:
sudo ls -lrt /var/lib/rancher/k3s/server/db/snapshots
sudo k3s etcd-snapshot list

7a. Also check the S3 bucket/folder in AWS to see the snapshots listed.

  8. Repeat steps 5 through 7 once more.
    Third round node names:
    <version|commit>-server1-<|suffix2>-<|suffix1>
    server2-<|suffix2>-<|suffix1>
    server3-<|suffix2>-<|suffix1>
    agent1-<|suffix2>-<|suffix1>

Replication Results:

  • k3s version used for replication:
 $ k3s -v 
k3s version v1.27.4+k3s1 (36645e73)
go version go1.20.6

After multiple node-name updates:
Current and previous node names of the main etcd server, in order since deployment:

version-setup-server1                 
version-setup-server1-7707 
version-setup-server1-32761-7707           

List of snapshots in the local directory:

 $ sudo ls -lrt /var/lib/rancher/k3s/server/db/snapshots 
total 25616
-rw------- 1 root root 4485152 Aug 15 22:45 etcd-snapshot-version-setup-server1-1692139500
-rw------- 1 root root 4710432 Aug 15 22:46 etcd-snapshot-version-setup-server1-1692139561
-rw------- 1 root root 4976672 Aug 15 22:47 etcd-snapshot-version-setup-server1-7707-1692139624
-rw------- 1 root root 5832736 Aug 15 22:48 etcd-snapshot-version-setup-server1-7707-1692139684
-rw------- 1 root root 6205472 Aug 15 22:49 etcd-snapshot-version-setup-server1-32761-7707-1692139741
 $ sudo k3s etcd-snapshot list 
time="2023-08-15T22:49:52Z" level=warning msg="Unknown flag --token found in config.yaml, skipping\n"
time="2023-08-15T22:49:52Z" level=warning msg="Unknown flag --etcd-snapshot-retention found in config.yaml, skipping\n"
time="2023-08-15T22:49:52Z" level=warning msg="Unknown flag --etcd-snapshot-schedule-cron found in config.yaml, skipping\n"
time="2023-08-15T22:49:52Z" level=warning msg="Unknown flag --cluster-init found in config.yaml, skipping\n"
time="2023-08-15T22:49:52Z" level=warning msg="Unknown flag --write-kubeconfig-mode found in config.yaml, skipping\n"
time="2023-08-15T22:49:52Z" level=warning msg="Unknown flag --node-label found in config.yaml, skipping\n"
time="2023-08-15T22:49:52Z" level=info msg="Checking if S3 bucket xxx exists"
time="2023-08-15T22:49:52Z" level=info msg="S3 bucket xxx exists"
Name                                                      Size    Created
etcd-snapshot-version-setup-server1-32761-7707-1692139741 6205472 2023-08-15T22:49:02Z
etcd-snapshot-version-setup-server1-7707-1692139624       4976672 2023-08-15T22:47:05Z
etcd-snapshot-version-setup-server1-7707-1692139684       5832736 2023-08-15T22:48:05Z
etcd-snapshot-version-setup-server1-1692139561            4710432 2023-08-15T22:46:02Z
etcd-snapshot-version-setup-server1-31052-1692135181      5222432 2023-08-15T21:33:02Z

As we can see above, previous snapshots created under the old node-names are still listed and never cleaned up.
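
A quick way to spot the stale prefixes is to group the local snapshot files by node-name prefix, stripping the trailing Unix timestamp (plain shell, for verification only):

 $ sudo ls /var/lib/rancher/k3s/server/db/snapshots | sed 's/-[0-9]\{10\}$//' | sort | uniq -c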

Validation Results:

  • k3s version used for validation:
 $ k3s -v 
k3s version v1.27.4+k3s-e83b1ba4 (e83b1ba4)
go version go1.20.6

Node-names used for the main etcd server in order since deployment:

commit-setup-server1               
commit-setup-server1-27711         
commit-setup-server1-31514-27711 

The snapshots listed are:

 $ sudo ls -lrt /var/lib/rancher/k3s/server/db/snapshots 
total 13816
-rw------- 1 root root 6762528 Aug 15 22:08 etcd-snapshot-commit-setup-server1-31514-27711-1692137283
-rw------- 1 root root 7376928 Aug 15 22:09 etcd-snapshot-commit-setup-server1-31514-27711-1692137341
 $ sudo k3s etcd-snapshot list 
time="2023-08-15T22:09:21Z" level=warning msg="Unknown flag --token found in config.yaml, skipping\n"
time="2023-08-15T22:09:21Z" level=warning msg="Unknown flag --etcd-snapshot-retention found in config.yaml, skipping\n"
time="2023-08-15T22:09:21Z" level=warning msg="Unknown flag --etcd-snapshot-schedule-cron found in config.yaml, skipping\n"
time="2023-08-15T22:09:21Z" level=warning msg="Unknown flag --cluster-init found in config.yaml, skipping\n"
time="2023-08-15T22:09:21Z" level=warning msg="Unknown flag --write-kubeconfig-mode found in config.yaml, skipping\n"
time="2023-08-15T22:09:21Z" level=warning msg="Unknown flag --node-label found in config.yaml, skipping\n"
time="2023-08-15T22:09:21Z" level=info msg="Checking if S3 bucket xxx exists"
time="2023-08-15T22:09:21Z" level=info msg="S3 bucket xxx exists"
Name                                                      Size    Created
etcd-snapshot-commit-setup-server1-31514-27711-1692137283 6762528 2023-08-15T22:08:04Z
etcd-snapshot-commit-setup-server1-31514-27711-1692137341 7376928 2023-08-15T22:09:02Z

As we can see, the previous snapshots with old node-names are no longer retained and get cleaned up.
