[Backport release-1.25] etcd snapshot cleanup fails if node name changes #8187

Closed · aganesh-suse opened this issue Aug 14, 2023 · 1 comment

aganesh-suse commented Aug 14, 2023

This is a backport issue for rancher/rke2#3714.

Node(s) CPU architecture, OS, and Version:
Ubuntu 22.04

Cluster Configuration:
HA: 3 servers / 1 agent

Describe the bug:
Same as rancher/rke2#3714. Since the fix is being pushed into k3s as well, this issue tracks re-verifying the same behavior on k3s.

Steps To Reproduce:
Etcd snapshot retention currently takes the node name into consideration, so if the node name changes, the previous snapshots for the same etcd member are not cleaned up correctly.

Expected behavior:
Cleanup of etcd snapshots should happen irrespective of node name changes.
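
For context, here is a minimal sketch of why node-name-scoped retention misses renamed nodes. The function and names below are illustrative only and are not the actual k3s implementation; the sketch simply assumes that pruning selects deletion candidates by a name prefix that embeds the current node name, which is the behavior described above.

package main

import (
	"fmt"
	"sort"
	"strings"
)

// pruneCandidates mimics retention that only considers snapshots whose name
// starts with the *current* node name. Snapshots taken under an old node name
// never match the prefix, so they are never selected for deletion.
func pruneCandidates(snapshots []string, nodeName string, retention int) []string {
	prefix := "etcd-snapshot-" + nodeName + "-"
	var mine []string
	for _, s := range snapshots {
		if strings.HasPrefix(s, prefix) {
			mine = append(mine, s)
		}
	}
	sort.Strings(mine) // the trailing Unix timestamp sorts oldest first
	if len(mine) <= retention {
		return nil
	}
	return mine[:len(mine)-retention] // candidates for deletion
}

func main() {
	snaps := []string{
		"etcd-snapshot-server1-1692131822",
		"etcd-snapshot-server1-1692131881",
		"etcd-snapshot-server1-8098-1692131941",
		"etcd-snapshot-server1-8098-1692132001",
	}
	// After the node is renamed to "server1-8098" with retention 2, the two
	// "server1" snapshots never match the prefix and accumulate forever.
	fmt.Println(pruneCandidates(snaps, "server1-8098", 2)) // prints []
}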

Other linked code fixes:
#8123
#8190

@aganesh-suse aganesh-suse added this to the v1.25.13+k3s1 milestone Aug 14, 2023
@aganesh-suse aganesh-suse self-assigned this Aug 14, 2023

aganesh-suse commented Aug 15, 2023

Validated on release-1.25 branch with commit id: ce85b98

Environment Details

Infrastructure

  • Cloud
  • Hosted

Node(s) CPU architecture, OS, and Version:

cat /etc/os-release | grep PRETTY
PRETTY_NAME="Ubuntu 22.04.2 LTS"

Cluster Configuration:

Server config: 3 etcd + control-plane servers / 1 agent

Config.yaml:

Main ETCD SERVER (+CONTROL PLANE) CONFIG:

token: blah
node-name: "server1"
etcd-snapshot-retention: 2
etcd-snapshot-schedule-cron: "* * * * *"
etcd-s3: true
etcd-s3-access-key: xxx
etcd-s3-secret-key: xxx
etcd-s3-bucket: s3-bucket-name
etcd-s3-folder: k3ssnap/commit-setup
etcd-s3-region: us-east-2
write-kubeconfig-mode: "0644"
cluster-init: true
node-label:
- k3s-upgrade=server

Sample Secondary Etcd, control plane config.yaml:

token: blah
server: https://x.x.x.x:6443
node-name: "server3"
write-kubeconfig-mode: "0644"
node-label:
- k3s-upgrade=server

AGENT CONFIG:

token: blah
server: https://x.x.x.x:6443
node-name: "agent1"
node-label:
- k3s-upgrade=agent

Testing Steps

1. Create the config dir and place the config.yaml file on the server/agent nodes:
$ sudo mkdir -p /etc/rancher/k3s && sudo cp config.yaml /etc/rancher/k3s

Note: First round node-names:
<version|commit>-server1
server2
server3
agent1

2. Install k3s.
To validate the fix, install from the commit:

$ curl -sfL https://get.k3s.io | sudo INSTALL_K3S_COMMIT='ce85b988584227478a5044ebb47611ccf0905e1c' sh -s - server

To reproduce the issue, install the released version:

$ curl -sfL https://get.k3s.io | sudo INSTALL_K3S_VERSION='v1.25.12+k3s1' sh -s - server

3. Wait for 2 minutes.
Note: A snapshot is created every minute (etcd-snapshot-schedule-cron: "* * * * *") and retention is 2 snapshots (etcd-snapshot-retention: 2), so after 2 minutes there are 2 snapshots, named etcd-snapshot-<node-name>-<timestamp> (e.g. etcd-snapshot-server1-2-xxxx if node-name: server1-2 is set in config.yaml).
Reference for the cron format: https://cloud.google.com/scheduler/docs/configuring/cron-job-schedules
A quick sanity check of this schedule is sketched right after this list.

4. Check the outputs of:
sudo ls -lrt /var/lib/rancher/k3s/server/db/snapshots
sudo k3s etcd-snapshot list

4a. Also check the s3 bucket/folder in AWS to see the snapshots listed.

5. Update the node-name in config.yaml on each node. Second round node-names:
<version|commit>-server1-<suffix1>
server2-<suffix1>
server3-<suffix1>
agent1-<suffix1>

6. Restart the k3s service on all nodes:

sudo systemctl restart k3s-server

7. Wait for 2 more minutes and check the snapshot list again:
sudo ls -lrt /var/lib/rancher/k3s/server/db/snapshots
sudo k3s etcd-snapshot list

7a. Also check the s3 bucket/folder in AWS to see the snapshots listed.

8. Repeat steps 5 through 7 once more. Third round node-names:
<version|commit>-server1-<suffix2>-<suffix1>
server2-<suffix2>-<suffix1>
server3-<suffix2>-<suffix1>
agent1-<suffix2>-<suffix1>
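
As mentioned in step 3, the "* * * * *" spec fires once per minute. A quick way to sanity-check a schedule spec, assuming a standard 5-field cron parser (github.com/robfig/cron/v3 is used here purely as an example):

package main

import (
	"fmt"
	"time"

	"github.com/robfig/cron/v3"
)

func main() {
	// Same spec as etcd-snapshot-schedule-cron in the config.yaml above.
	sched, err := cron.ParseStandard("* * * * *")
	if err != nil {
		panic(err)
	}
	t := time.Now()
	for i := 0; i < 3; i++ {
		t = sched.Next(t)
		fmt.Println(t) // one trigger per minute, so ~2 snapshots during the 2-minute wait
	}
}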

Replication Results:

  • k3s version used for replication:
 $ k3s -v 
k3s version v1.25.12+k3s1 (7515237f)
go version go1.20.6

After multiple node-name updates, the current and previous node names of the main etcd server, in order since deployment:

version-setup-server1             
version-setup-server1-8098        
version-setup-server1-8993-8098   

List of Snapshots in local directory:

 $ sudo ls -lrt /var/lib/rancher/k3s/server/db/snapshots 
total 31804
-rw------- 1 root root 4431904 Aug 15 20:37 etcd-snapshot-version-setup-server1-1692131822
-rw------- 1 root root 4636704 Aug 15 20:38 etcd-snapshot-version-setup-server1-1692131881
-rw------- 1 root root 4800544 Aug 15 20:39 etcd-snapshot-version-setup-server1-8098-1692131941
-rw------- 1 root root 5509152 Aug 15 20:40 etcd-snapshot-version-setup-server1-8098-1692132001
-rw------- 1 root root 6189088 Aug 15 20:41 etcd-snapshot-version-setup-server1-8993-8098-1692132066
-rw------- 1 root root 6975520 Aug 15 20:42 etcd-snapshot-version-setup-server1-8993-8098-1692132123
 $ sudo k3s etcd-snapshot list 
time="2023-08-15T20:42:07Z" level=warning msg="Unknown flag --token found in config.yaml, skipping\n"
time="2023-08-15T20:42:07Z" level=warning msg="Unknown flag --etcd-snapshot-retention found in config.yaml, skipping\n"
time="2023-08-15T20:42:07Z" level=warning msg="Unknown flag --etcd-snapshot-schedule-cron found in config.yaml, skipping\n"
time="2023-08-15T20:42:07Z" level=warning msg="Unknown flag --cluster-init found in config.yaml, skipping\n"
time="2023-08-15T20:42:07Z" level=warning msg="Unknown flag --write-kubeconfig-mode found in config.yaml, skipping\n"
time="2023-08-15T20:42:07Z" level=warning msg="Unknown flag --node-label found in config.yaml, skipping\n"
time="2023-08-15T20:42:07Z" level=info msg="Checking if S3 bucket xxx exists"
time="2023-08-15T20:42:08Z" level=info msg="S3 bucket xxx exists"
Name                                                     Size    Created
etcd-snapshot-version-setup-server1-8098-1692131941      4800544 2023-08-15T20:39:08Z
etcd-snapshot-version-setup-server1-8098-1692132001      5509152 2023-08-15T20:40:02Z
etcd-snapshot-version-setup-server1-8993-8098-1692132066 6189088 2023-08-15T20:41:07Z
etcd-snapshot-version-setup-server1-8993-8098-1692132123 6975520 2023-08-15T20:42:04Z
etcd-snapshot-version-setup-server1-1692131822           4431904 2023-08-15T20:37:03Z
etcd-snapshot-version-setup-server1-1692131881           4636704 2023-08-15T20:38:02Z

As we can see above, previous snapshots with different node-names are still listed and not cleaned up.
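
Side note: the numeric suffix on each snapshot name appears to be a Unix timestamp (1692131822 works out to 2023-08-15 20:37:02 UTC, which lines up with the creation times listed above). A quick check, under that assumption:

package main

import (
	"fmt"
	"time"
)

func main() {
	// Suffixes taken from the oldest and newest snapshot names in the listing above.
	for _, ts := range []int64{1692131822, 1692132123} {
		fmt.Println(ts, "->", time.Unix(ts, 0).UTC().Format(time.RFC3339))
	}
}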

Validation Results:

  • k3s version used for validation:
k3s version v1.25.12+k3s-ce85b988 (ce85b988)
go version go1.20.6

After multiple node-name updates, the current and previous node names of the main etcd server, in order since deployment:

commit-setup-server1                  
commit-setup-server1-30269
commit-setup-server1-26884-30269  

The snapshots listed are:

 $ sudo ls -lrt /var/lib/rancher/k3s/server/db/snapshots 
total 12296
-rw------- 1 root root 6197280 Aug 15 20:10 etcd-snapshot-commit-setup-server1-26884-30269-1692130204
-rw------- 1 root root 6385696 Aug 15 20:11 etcd-snapshot-commit-setup-server1-26884-30269-1692130261
 $ sudo k3s etcd-snapshot list 
time="2023-08-15T20:11:52Z" level=warning msg="Unknown flag --token found in config.yaml, skipping\n"
time="2023-08-15T20:11:52Z" level=warning msg="Unknown flag --etcd-snapshot-retention found in config.yaml, skipping\n"
time="2023-08-15T20:11:52Z" level=warning msg="Unknown flag --etcd-snapshot-schedule-cron found in config.yaml, skipping\n"
time="2023-08-15T20:11:52Z" level=warning msg="Unknown flag --cluster-init found in config.yaml, skipping\n"
time="2023-08-15T20:11:52Z" level=warning msg="Unknown flag --write-kubeconfig-mode found in config.yaml, skipping\n"
time="2023-08-15T20:11:52Z" level=warning msg="Unknown flag --node-label found in config.yaml, skipping\n"
time="2023-08-15T20:11:52Z" level=info msg="Checking if S3 bucket xxx exists"
time="2023-08-15T20:11:52Z" level=info msg="S3 bucket xxx exists"
Name                                                      Size    Created
etcd-snapshot-commit-setup-server1-26884-30269-1692130204 6197280 2023-08-15T20:10:05Z
etcd-snapshot-commit-setup-server1-26884-30269-1692130261 6385696 2023-08-15T20:11:02Z

As we can see, the previous snapshots with old node-names are no longer retained and get cleaned up.
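
The validated behavior matches pruning that keys only on the snapshot prefix and timestamp while ignoring the node name embedded in it. A minimal sketch under that assumption (again illustrative, not the actual k3s code; the two oldest input names below are hypothetical earlier snapshots, the two newest are from the listing above):

package main

import (
	"fmt"
	"sort"
	"strings"
)

// pruneAll keeps only the newest "retention" snapshots, regardless of which
// node name is embedded in them.
func pruneAll(snapshots []string, retention int) []string {
	var all []string
	for _, s := range snapshots {
		if strings.HasPrefix(s, "etcd-snapshot-") {
			all = append(all, s)
		}
	}
	// Sort by the trailing Unix timestamp so the oldest come first.
	sort.Slice(all, func(i, j int) bool {
		return all[i][strings.LastIndex(all[i], "-")+1:] < all[j][strings.LastIndex(all[j], "-")+1:]
	})
	if len(all) <= retention {
		return nil
	}
	return all[:len(all)-retention] // candidates for deletion
}

func main() {
	snaps := []string{
		"etcd-snapshot-commit-setup-server1-1692129000",       // hypothetical, first node name
		"etcd-snapshot-commit-setup-server1-30269-1692129600", // hypothetical, second node name
		"etcd-snapshot-commit-setup-server1-26884-30269-1692130204",
		"etcd-snapshot-commit-setup-server1-26884-30269-1692130261",
	}
	// With retention 2, the two oldest snapshots are pruned even though they
	// carry older node names, leaving only the two newest listed above.
	fmt.Println(pruneAll(snaps, 2))
}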
