[Backport release-1.25] etcd snapshot cleanup fails if node name changes #8187

Closed · aganesh-suse opened this issue Aug 14, 2023 · 1 comment

aganesh-suse commented Aug 14, 2023

This is a backport issue for rancher/rke2#3714.

Node(s) CPU architecture, OS, and Version:
Ubuntu 22.04

Cluster Configuration:
HA: 3 servers / 1 agent

Describe the bug:
Same as rancher/rke2#3714. Since the fix is being pushed into k3s as well, this issue tracks re-verifying the same behavior on k3s.

Steps To Reproduce:
Etcd snapshot retention currently takes the node name into consideration, so if the node name changes, the previous snapshots for the same etcd member are not cleaned up correctly.

Expected behavior:
Cleanup of etcd snapshots should happen irrespective of node name changes.
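
For context, here is a minimal sketch of why node-name-scoped retention misses renamed nodes. The function and names below are illustrative only and are not the actual k3s implementation; the sketch simply assumes that pruning selects deletion candidates by a name prefix that embeds the current node name, which is the behavior described above.

package main

import (
	"fmt"
	"sort"
	"strings"
)

// pruneCandidates mimics retention that only considers snapshots whose name
// starts with the *current* node name. Snapshots taken under an old node name
// never match the prefix, so they are never selected for deletion.
func pruneCandidates(snapshots []string, nodeName string, retention int) []string {
	prefix := "etcd-snapshot-" + nodeName + "-"
	var mine []string
	for _, s := range snapshots {
		if strings.HasPrefix(s, prefix) {
			mine = append(mine, s)
		}
	}
	sort.Strings(mine) // the trailing Unix timestamp sorts oldest first
	if len(mine) <= retention {
		return nil
	}
	return mine[:len(mine)-retention] // candidates for deletion
}

func main() {
	snaps := []string{
		"etcd-snapshot-server1-1692131822",
		"etcd-snapshot-server1-1692131881",
		"etcd-snapshot-server1-8098-1692131941",
		"etcd-snapshot-server1-8098-1692132001",
	}
	// After the node is renamed to "server1-8098" with retention 2, the two
	// "server1" snapshots never match the prefix and accumulate forever.
	fmt.Println(pruneCandidates(snaps, "server1-8098", 2)) // prints []
}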

Other linked code fixes:
#8123
#8190

@aganesh-suse aganesh-suse added this to the v1.25.13+k3s1 milestone Aug 14, 2023
@aganesh-suse aganesh-suse self-assigned this Aug 14, 2023

aganesh-suse commented Aug 15, 2023

Validated on release-1.25 branch with commit id: ce85b98

Environment Details

Infrastructure

  • Cloud
  • Hosted

Node(s) CPU architecture, OS, and Version:

cat /etc/os-release | grep PRETTY
PRETTY_NAME="Ubuntu 22.04.2 LTS"

Cluster Configuration:

Server config: 3 etcd + control-plane servers / 1 agent

Config.yaml:

Main ETCD SERVER (+CONTROL PLANE) CONFIG:

token: blah
node-name: "server1"
etcd-snapshot-retention: 2
etcd-snapshot-schedule-cron: "* * * * *"
etcd-s3: true
etcd-s3-access-key: xxx
etcd-s3-secret-key: xxx
etcd-s3-bucket: s3-bucket-name
etcd-s3-folder: k3ssnap/commit-setup
etcd-s3-region: us-east-2
write-kubeconfig-mode: "0644"
cluster-init: true
node-label:
- k3s-upgrade=server

Sample Secondary Etcd, control plane config.yaml:

token: blah
server: https://x.x.x.x:6443
node-name: "server3"
write-kubeconfig-mode: "0644"
node-label:
- k3s-upgrade=server

AGENT CONFIG:

token: blah
server: https://x.x.x.x:6443
node-name: "agent1"
node-label:
- k3s-upgrade=agent

Testing Steps

1. Create the config dir and place the config.yaml file on the server/agent nodes:
$ sudo mkdir -p /etc/rancher/k3s && sudo cp config.yaml /etc/rancher/k3s

Note: First round node-names:
<version|commit>-server1
server2
server3
agent1

2. Install k3s.
To validate the fix, install from the commit:

$ curl -sfL https://get.k3s.io | sudo INSTALL_K3S_COMMIT='ce85b988584227478a5044ebb47611ccf0905e1c' sh -s - server

To reproduce the issue, install the released version:

$ curl -sfL https://get.k3s.io | sudo INSTALL_K3S_VERSION='v1.25.12+k3s1' sh -s - server

3. Wait for 2 minutes.
Note: A snapshot is created every minute (etcd-snapshot-schedule-cron: "* * * * *") and retention is 2 snapshots (etcd-snapshot-retention: 2), so after 2 minutes there are 2 snapshots, named etcd-snapshot-<node-name>-<timestamp> (e.g. etcd-snapshot-server1-2-xxxx if node-name: server1-2 is set in config.yaml).
Reference for the cron format: https://cloud.google.com/scheduler/docs/configuring/cron-job-schedules
A quick sanity check of this schedule is sketched right after this list.

4. Check the outputs of:
sudo ls -lrt /var/lib/rancher/k3s/server/db/snapshots
sudo k3s etcd-snapshot list

4a. Also check the s3 bucket/folder in AWS to see the snapshots listed.

5. Update the node-name in config.yaml on each node. Second round node-names:
<version|commit>-server1-<suffix1>
server2-<suffix1>
server3-<suffix1>
agent1-<suffix1>

6. Restart the k3s service on all nodes:

sudo systemctl restart k3s-server

7. Wait for 2 more minutes and check the snapshot list again:
sudo ls -lrt /var/lib/rancher/k3s/server/db/snapshots
sudo k3s etcd-snapshot list

7a. Also check the s3 bucket/folder in AWS to see the snapshots listed.

8. Repeat steps 5 through 7 once more. Third round node-names:
<version|commit>-server1-<suffix2>-<suffix1>
server2-<suffix2>-<suffix1>
server3-<suffix2>-<suffix1>
agent1-<suffix2>-<suffix1>
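
As mentioned in step 3, the "* * * * *" spec fires once per minute. A quick way to sanity-check a schedule spec, assuming a standard 5-field cron parser (github.com/robfig/cron/v3 is used here purely as an example):

package main

import (
	"fmt"
	"time"

	"github.com/robfig/cron/v3"
)

func main() {
	// Same spec as etcd-snapshot-schedule-cron in the config.yaml above.
	sched, err := cron.ParseStandard("* * * * *")
	if err != nil {
		panic(err)
	}
	t := time.Now()
	for i := 0; i < 3; i++ {
		t = sched.Next(t)
		fmt.Println(t) // one trigger per minute, so ~2 snapshots during the 2-minute wait
	}
}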

Replication Results:

  • k3s version used for replication:
 $ k3s -v 
k3s version v1.25.12+k3s1 (7515237f)
go version go1.20.6

After multiple node-name updates, the current and previous node names of the main etcd server, in order since deployment:

version-setup-server1             
version-setup-server1-8098        
version-setup-server1-8993-8098   

List of Snapshots in local directory:

 $ sudo ls -lrt /var/lib/rancher/k3s/server/db/snapshots 
total 31804
-rw------- 1 root root 4431904 Aug 15 20:37 etcd-snapshot-version-setup-server1-1692131822
-rw------- 1 root root 4636704 Aug 15 20:38 etcd-snapshot-version-setup-server1-1692131881
-rw------- 1 root root 4800544 Aug 15 20:39 etcd-snapshot-version-setup-server1-8098-1692131941
-rw------- 1 root root 5509152 Aug 15 20:40 etcd-snapshot-version-setup-server1-8098-1692132001
-rw------- 1 root root 6189088 Aug 15 20:41 etcd-snapshot-version-setup-server1-8993-8098-1692132066
-rw------- 1 root root 6975520 Aug 15 20:42 etcd-snapshot-version-setup-server1-8993-8098-1692132123
 $ sudo k3s etcd-snapshot list 
time="2023-08-15T20:42:07Z" level=warning msg="Unknown flag --token found in config.yaml, skipping\n"
time="2023-08-15T20:42:07Z" level=warning msg="Unknown flag --etcd-snapshot-retention found in config.yaml, skipping\n"
time="2023-08-15T20:42:07Z" level=warning msg="Unknown flag --etcd-snapshot-schedule-cron found in config.yaml, skipping\n"
time="2023-08-15T20:42:07Z" level=warning msg="Unknown flag --cluster-init found in config.yaml, skipping\n"
time="2023-08-15T20:42:07Z" level=warning msg="Unknown flag --write-kubeconfig-mode found in config.yaml, skipping\n"
time="2023-08-15T20:42:07Z" level=warning msg="Unknown flag --node-label found in config.yaml, skipping\n"
time="2023-08-15T20:42:07Z" level=info msg="Checking if S3 bucket xxx exists"
time="2023-08-15T20:42:08Z" level=info msg="S3 bucket xxx exists"
Name                                                     Size    Created
etcd-snapshot-version-setup-server1-8098-1692131941      4800544 2023-08-15T20:39:08Z
etcd-snapshot-version-setup-server1-8098-1692132001      5509152 2023-08-15T20:40:02Z
etcd-snapshot-version-setup-server1-8993-8098-1692132066 6189088 2023-08-15T20:41:07Z
etcd-snapshot-version-setup-server1-8993-8098-1692132123 6975520 2023-08-15T20:42:04Z
etcd-snapshot-version-setup-server1-1692131822           4431904 2023-08-15T20:37:03Z
etcd-snapshot-version-setup-server1-1692131881           4636704 2023-08-15T20:38:02Z

As we can see above, previous snapshots with different node-names are still listed and not cleaned up.
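
Side note: the numeric suffix on each snapshot name appears to be a Unix timestamp (1692131822 works out to 2023-08-15 20:37:02 UTC, which lines up with the creation times listed above). A quick check, under that assumption:

package main

import (
	"fmt"
	"time"
)

func main() {
	// Suffixes taken from the oldest and newest snapshot names in the listing above.
	for _, ts := range []int64{1692131822, 1692132123} {
		fmt.Println(ts, "->", time.Unix(ts, 0).UTC().Format(time.RFC3339))
	}
}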

Validation Results:

  • k3s version used for validation:
k3s version v1.25.12+k3s-ce85b988 (ce85b988)
go version go1.20.6

After multiple node-name updates, the current and previous node names of the main etcd server, in order since deployment:

commit-setup-server1                  
commit-setup-server1-30269
commit-setup-server1-26884-30269  

The snapshots listed are:

 $ sudo ls -lrt /var/lib/rancher/k3s/server/db/snapshots 
total 12296
-rw------- 1 root root 6197280 Aug 15 20:10 etcd-snapshot-commit-setup-server1-26884-30269-1692130204
-rw------- 1 root root 6385696 Aug 15 20:11 etcd-snapshot-commit-setup-server1-26884-30269-1692130261
 $ sudo k3s etcd-snapshot list 
time="2023-08-15T20:11:52Z" level=warning msg="Unknown flag --token found in config.yaml, skipping\n"
time="2023-08-15T20:11:52Z" level=warning msg="Unknown flag --etcd-snapshot-retention found in config.yaml, skipping\n"
time="2023-08-15T20:11:52Z" level=warning msg="Unknown flag --etcd-snapshot-schedule-cron found in config.yaml, skipping\n"
time="2023-08-15T20:11:52Z" level=warning msg="Unknown flag --cluster-init found in config.yaml, skipping\n"
time="2023-08-15T20:11:52Z" level=warning msg="Unknown flag --write-kubeconfig-mode found in config.yaml, skipping\n"
time="2023-08-15T20:11:52Z" level=warning msg="Unknown flag --node-label found in config.yaml, skipping\n"
time="2023-08-15T20:11:52Z" level=info msg="Checking if S3 bucket xxx exists"
time="2023-08-15T20:11:52Z" level=info msg="S3 bucket xxx exists"
Name                                                      Size    Created
etcd-snapshot-commit-setup-server1-26884-30269-1692130204 6197280 2023-08-15T20:10:05Z
etcd-snapshot-commit-setup-server1-26884-30269-1692130261 6385696 2023-08-15T20:11:02Z

As we can see, the previous snapshots with old node-names are no longer retained and get cleaned up.
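
The validated behavior matches pruning that keys only on the snapshot prefix and timestamp while ignoring the node name embedded in it. A minimal sketch under that assumption (again illustrative, not the actual k3s code; the two oldest input names below are hypothetical earlier snapshots, the two newest are from the listing above):

package main

import (
	"fmt"
	"sort"
	"strings"
)

// pruneAll keeps only the newest "retention" snapshots, regardless of which
// node name is embedded in them.
func pruneAll(snapshots []string, retention int) []string {
	var all []string
	for _, s := range snapshots {
		if strings.HasPrefix(s, "etcd-snapshot-") {
			all = append(all, s)
		}
	}
	// Sort by the trailing Unix timestamp so the oldest come first.
	sort.Slice(all, func(i, j int) bool {
		return all[i][strings.LastIndex(all[i], "-")+1:] < all[j][strings.LastIndex(all[j], "-")+1:]
	})
	if len(all) <= retention {
		return nil
	}
	return all[:len(all)-retention] // candidates for deletion
}

func main() {
	snaps := []string{
		"etcd-snapshot-commit-setup-server1-1692129000",       // hypothetical, first node name
		"etcd-snapshot-commit-setup-server1-30269-1692129600", // hypothetical, second node name
		"etcd-snapshot-commit-setup-server1-26884-30269-1692130204",
		"etcd-snapshot-commit-setup-server1-26884-30269-1692130261",
	}
	// With retention 2, the two oldest snapshots are pruned even though they
	// carry older node names, leaving only the two newest listed above.
	fmt.Println(pruneAll(snaps, 2))
}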
