[Backport release-1.26] etcd snapshot cleanup fails if node name changes #8185

Closed
aganesh-suse opened this issue Aug 14, 2023 · 1 comment
aganesh-suse commented Aug 14, 2023

This is a backport issue for rancher/rke2#3714.

[Backport release-1.26]
Node(s) CPU architecture, OS, and Version:
Ubuntu 22.04

Cluster Configuration:
HA: 3 servers / 1 agent

Describe the bug:
Same as rancher/rke2#3714.
Since the change is also being pushed into k3s, this issue is being created to re-verify the same behavior on k3s.

Steps To Reproduce:
The etcd snapshot retention currently takes the node-name value into consideration, so if the node name changes, the previous snapshots taken by the same etcd node are not cleaned up correctly.

Expected behavior:
Cleanup of etcd-snapshots should happen irrespective of node name changes.
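
For illustration only, here is a minimal shell sketch (not the k3s implementation) of what node-name-agnostic retention means for the local snapshot directory: sort the snapshot files by modification time and treat everything beyond the newest N as prunable, regardless of which node name is embedded in the file name.

RETENTION=2   # matches the etcd-snapshot-retention: 2 used in the test config below
SNAP_DIR=/var/lib/rancher/k3s/server/db/snapshots
sudo ls -1t "$SNAP_DIR" | tail -n +$((RETENTION + 1)) | while read -r old; do
  echo "would prune: $old"   # dry run only; the real cleanup is performed by k3s
done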

Other linked code fixes:
#8122
#8189

@aganesh-suse aganesh-suse added this to the v1.26.8+k3s1 milestone Aug 14, 2023
@aganesh-suse aganesh-suse self-assigned this Aug 14, 2023

aganesh-suse commented Aug 15, 2023

Validated on release-1.26 branch with commit id: 15e0eac

Environment Details

Infrastructure

  • Cloud
  • Hosted

Node(s) CPU architecture, OS, and Version:

cat /etc/os-release | grep PRETTY
PRETTY_NAME="Ubuntu 22.04.2 LTS"

Cluster Configuration:

Server config: 3 etcd + control plane servers / 1 agent

Config.yaml:

Main ETCD SERVER (+CONTROL PLANE) CONFIG:

token: blah
node-name: "server1"
etcd-snapshot-retention: 2
etcd-snapshot-schedule-cron: "* * * * *"
etcd-s3: true
etcd-s3-access-key: xxx
etcd-s3-secret-key: xxx
etcd-s3-bucket: s3-bucket-name
etcd-s3-folder: k3ssnap/commit-setup
etcd-s3-region: us-east-2
write-kubeconfig-mode: "0644"
cluster-init: true
node-label:
- k3s-upgrade=server

Sample Secondary Etcd, control plane config.yaml:

token: blah
server: https://x.x.x.x:6443
node-name: "server3"
write-kubeconfig-mode: "0644"
node-label:
- k3s-upgrade=server

AGENT CONFIG:

token: blah
server: https://x.x.x.x:6443
node-name: "agent1"
node-label:
- k3s-upgrade=agent
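
For reference, the one-minute snapshot schedule and retention of 2 in the main server config above are deliberately aggressive so the test completes quickly. An illustrative (not prescriptive) setting of the same two options for normal use might look like:

etcd-snapshot-schedule-cron: "0 */12 * * *"   # take a snapshot every 12 hours
etcd-snapshot-retention: 5                    # keep only the 5 most recent snapshots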

Testing Steps

  1. Create config dir and place the config.yaml file in server/agent nodes:
$ sudo mkdir -p /etc/rancher/k3s && sudo cp config.yaml /etc/rancher/k3s

Note: First round node-names:
<version|commit>-server1
server2
server3
agent1
2. Install k3s:
Verify/Validate Issue Using Commit:

$ curl -sfL https://get.k3s.io | sudo INSTALL_K3S_COMMIT='15e0eac1682abf142f6d2d4e51c40f6a43194c11' sh -s - server

Reproduce Issue Using Version:

$ curl -sfL https://get.k3s.io | sudo INSTALL_K3S_VERSION='v1.26.7+k3s1' sh -s - server
  3. Wait for 2 minutes.
    Note: A snapshot is created every 1 minute (etcd-snapshot-schedule-cron: "* * * * *") and retention is 2 snapshots (etcd-snapshot-retention: 2).
    Reference for the cron job format: https://cloud.google.com/scheduler/docs/configuring/cron-job-schedules
    After 2 minutes, 2 snapshots are created with names like etcd-snapshot-server1-2-xxxx (if node-name: server1-2 in config.yaml).
  4. Check the outputs of:
sudo ls -lrt /var/lib/rancher/k3s/server/db/snapshots
sudo k3s etcd-snapshot list

4a. Also check the S3 bucket/folder in AWS to see the snapshots listed (an example aws s3 ls command is sketched after these steps).
5. Update the node-name in the config.yaml:
node-names:
<version|commit>-server1-<|suffix1>
server2-<|suffix1>
server3-<|suffix1>
agent1-<|suffix1>
6. Restart the k3s service on all nodes.

sudo systemctl restart k3s-server
  7. Wait for 2 more minutes and check the snapshot list:
sudo ls -lrt /var/lib/rancher/k3s/server/db/snapshots
sudo k3s etcd-snapshot list

7a. Also check the S3 bucket/folder in AWS to see the snapshots listed.

  8. Repeat steps 5 through 7 once more.
    node names:
    <version|commit>-server1-<|suffix2>-<|suffix1>
    server2-<|suffix2>-<|suffix1>
    server3-<|suffix2>-<|suffix1>
    agent1-<|suffix2>-<|suffix1>
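
For steps 4a and 7a, a minimal sketch of listing the uploaded snapshots with the AWS CLI, assuming the bucket, folder, and region from the server config above and that AWS credentials are already available to the shell:

 $ aws s3 ls s3://s3-bucket-name/k3ssnap/commit-setup/ --region us-east-2

Each snapshot uploaded by k3s should show up as an object under the configured etcd-s3-folder prefix.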

Replication Results:

  • k3s version used for replication:
 $ k3s -v
k3s version v1.26.7+k3s1 (e47cfc09)
go version go1.20.6

After multiple node name updates:
Node names of the main etcd server, in order since deployment:

version-setup-server1               
version-setup-server1-18877         
version-setup-server1-31631-18877  

List of Snapshots in local directory:

 $ sudo ls -lrt /var/lib/rancher/k3s/server/db/snapshots 
total 27156
-rw------- 1 root root 4460576 Aug 15 19:12 etcd-snapshot-version-setup-server1-1692126722
-rw------- 1 root root 4673568 Aug 15 19:13 etcd-snapshot-version-setup-server1-1692126783
-rw------- 1 root root 5328928 Aug 15 19:15 etcd-snapshot-version-setup-server1-18877-1692126901
-rw------- 1 root root 6230048 Aug 15 19:16 etcd-snapshot-version-setup-server1-31631-18877-1692126964
-rw------- 1 root root 7094304 Aug 15 19:17 etcd-snapshot-version-setup-server1-31631-18877-1692127025
 $ sudo k3s etcd-snapshot list 
time="2023-08-15T19:17:18Z" level=warning msg="Unknown flag --token found in config.yaml, skipping\n"
time="2023-08-15T19:17:18Z" level=warning msg="Unknown flag --etcd-snapshot-retention found in config.yaml, skipping\n"
time="2023-08-15T19:17:18Z" level=warning msg="Unknown flag --etcd-snapshot-schedule-cron found in config.yaml, skipping\n"
time="2023-08-15T19:17:18Z" level=warning msg="Unknown flag --cluster-init found in config.yaml, skipping\n"
time="2023-08-15T19:17:18Z" level=warning msg="Unknown flag --write-kubeconfig-mode found in config.yaml, skipping\n"
time="2023-08-15T19:17:18Z" level=warning msg="Unknown flag --node-label found in config.yaml, skipping\n"
time="2023-08-15T19:17:18Z" level=info msg="Checking if S3 bucket xxx exists"
time="2023-08-15T19:17:18Z" level=info msg="S3 bucket xxx exists"
Name                                                       Size    Created
etcd-snapshot-version-setup-server1-1692126722             4460576 2023-08-15T19:12:03Z
etcd-snapshot-version-setup-server1-1692126783             4673568 2023-08-15T19:13:04Z
etcd-snapshot-version-setup-server1-18877-1692126901       5328928 2023-08-15T19:15:02Z
etcd-snapshot-version-setup-server1-31631-18877-1692126964 6230048 2023-08-15T19:16:07Z
etcd-snapshot-version-setup-server1-31631-18877-1692127025 7094304 2023-08-15T19:17:06Z

As we can see above, previous snapshots with different node-names are still listed and not cleaned up.
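
A quick, hypothetical way to make the stale entries obvious from the local directory (not a k3s command): strip the trailing Unix timestamp from each file name and count the snapshots per node-name prefix. With etcd-snapshot-retention: 2, anything beyond 2 snapshots in total, or any entries left over for an old node name, means the cleanup missed them.

 $ sudo ls -1 /var/lib/rancher/k3s/server/db/snapshots | sed 's/-[0-9]*$//' | sort | uniq -c
      2 etcd-snapshot-version-setup-server1
      1 etcd-snapshot-version-setup-server1-18877
      2 etcd-snapshot-version-setup-server1-31631-18877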

Validation Results:

  • k3s version used for validation:
k3s version v1.26.7+k3s-15e0eac1 (15e0eac1)
go version go1.20.6

After multiple node name updates,
Node names of main etcd server in order since deployment:

commit-setup-server1              
commit-setup-server1-32484        
commit-setup-server1-5061-32484   

Note: The previous node names have intentionally not been deleted from the cluster, so that the snapshot names listed below are easier to understand.
The snapshots listed are:

 $ sudo ls -lrt /var/lib/rancher/k3s/server/db/snapshots 
total 13656
-rw------- 1 root root 6598688 Aug 15 18:51 etcd-snapshot-commit-setup-server1-5061-32484-1692125464
-rw------- 1 root root 7376928 Aug 15 18:52 etcd-snapshot-commit-setup-server1-5061-32484-1692125522
 $ sudo k3s etcd-snapshot list 
time="2023-08-15T18:52:07Z" level=warning msg="Unknown flag --token found in config.yaml, skipping\n"
time="2023-08-15T18:52:07Z" level=warning msg="Unknown flag --etcd-snapshot-retention found in config.yaml, skipping\n"
time="2023-08-15T18:52:07Z" level=warning msg="Unknown flag --etcd-snapshot-schedule-cron found in config.yaml, skipping\n"
time="2023-08-15T18:52:07Z" level=warning msg="Unknown flag --cluster-init found in config.yaml, skipping\n"
time="2023-08-15T18:52:07Z" level=warning msg="Unknown flag --write-kubeconfig-mode found in config.yaml, skipping\n"
time="2023-08-15T18:52:07Z" level=warning msg="Unknown flag --node-label found in config.yaml, skipping\n"
time="2023-08-15T18:52:07Z" level=info msg="Checking if S3 bucket xxx exists"
time="2023-08-15T18:52:07Z" level=info msg="S3 bucket xxx exists"
Name                                                     Size    Created
etcd-snapshot-commit-setup-server1-5061-32484-1692125522 7376928 2023-08-15T18:52:03Z
etcd-snapshot-commit-setup-server1-5061-32484-1692125464 6598688 2023-08-15T18:51:05Z

As we can see, the previous snapshots with old node-names are no longer retained and get cleaned up.
