Segmentation Fault on running rke2 etcd-snapshot save #4942

Closed
aganesh-suse opened this issue Oct 21, 2023 · 14 comments
@aganesh-suse

aganesh-suse commented Oct 21, 2023

Environmental Info:
RKE2 Version:

rke2 -v
rke2 version v1.26.10-rc2+rke2r1 (825e3188d273e7271a0b5ce924d42455b4d37a34)
go version go1.20.10 X:boringcrypto

Node(s) CPU architecture, OS, and Version:

Ubuntu 22.04

$ uname -a
Linux ip-172-31-27-121 5.19.0-1025-aws #26~22.04.1-Ubuntu SMP Mon Apr 24 01:58:15 UTC 2023 x86_64 x86_64 x86_64 GNU/Linux

Cluster Configuration:

1 server/1 agent

Describe the bug:

Hit a segmentation fault in one of the etcd-snapshot save operations:

$ sudo rke2 etcd-snapshot save 
time="2023-10-21T00:46:00Z" level=warning msg="Unknown flag --token found in config.yaml, skipping\n"
time="2023-10-21T00:46:00Z" level=warning msg="Unknown flag --etcd-snapshot-retention found in config.yaml, skipping\n"
time="2023-10-21T00:46:00Z" level=warning msg="Unknown flag --etcd-snapshot-schedule-cron found in config.yaml, skipping\n"
time="2023-10-21T00:46:00Z" level=warning msg="Unknown flag --write-kubeconfig-mode found in config.yaml, skipping\n"
time="2023-10-21T00:46:00Z" level=warning msg="Unknown flag --node-external-ip found in config.yaml, skipping\n"
time="2023-10-21T00:46:01Z" level=info msg="Saving etcd snapshot to /var/lib/rancher/rke2/server/db/snapshots/on-demand-ip-x-x-x-x-1697849161"
{"level":"info","ts":"2023-10-21T00:46:01.02135Z","caller":"snapshot/v3_snapshot.go:65","msg":"created temporary db file","path":"/var/lib/rancher/rke2/server/db/snapshots/on-demand-ip-x-x-x-x-1697849161.part"}
{"level":"info","ts":"2023-10-21T00:46:01.057649Z","logger":"client","caller":"v3@v3.5.9-k3s1/maintenance.go:212","msg":"opened snapshot stream; downloading"}
{"level":"info","ts":"2023-10-21T00:46:01.057715Z","caller":"snapshot/v3_snapshot.go:73","msg":"fetching snapshot","endpoint":"https://127.0.0.1:2379"}
{"level":"info","ts":"2023-10-21T00:46:02.329315Z","logger":"client","caller":"v3@v3.5.9-k3s1/maintenance.go:220","msg":"completed snapshot read; closing"}
{"level":"info","ts":"2023-10-21T00:46:02.410492Z","caller":"snapshot/v3_snapshot.go:88","msg":"fetched snapshot","endpoint":"https://127.0.0.1:2379","size":"19 MB","took":"1 second ago"}
{"level":"info","ts":"2023-10-21T00:46:02.410993Z","caller":"snapshot/v3_snapshot.go:97","msg":"saved","path":"/var/lib/rancher/rke2/server/db/snapshots/on-demand-ip-x-x-x-x-1697849161"}
time="2023-10-21T00:46:02Z" level=info msg="Saving snapshot metadata to /var/lib/rancher/rke2/server/db/.metadata/on-demand-ip-x-x-x-x-1697849161"
time="2023-10-21T00:46:02Z" level=info msg="Checking if S3 bucket xxxx exists"
time="2023-10-21T00:46:02Z" level=info msg="S3 bucket xxxx exists"
time="2023-10-21T00:46:02Z" level=info msg="Saving etcd snapshot on-demand-ip-x-x-x-x-1697849161 to S3"
time="2023-10-21T00:46:02Z" level=info msg="Uploading snapshot to s3://xxxx//var/lib/rancher/rke2/server/db/snapshots/on-demand-ip-x-x-x-x-1697849161"
time="2023-10-21T00:46:03Z" level=info msg="Uploaded snapshot metadata s3://xxxx//var/lib/rancher/rke2/server/db/.metadata/on-demand-ip-x-x-x-x-1697849161"
time="2023-10-21T00:46:03Z" level=info msg="S3 upload complete for on-demand-ip-x-x-x-x-1697849161"
time="2023-10-21T00:46:03Z" level=info msg="Reconciling ETCDSnapshotFile resources"
time="2023-10-21T00:46:03Z" level=info msg="Reconciliation of ETCDSnapshotFile resources complete"
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x28d4dcf]

goroutine 1 [running]:
github.com/k3s-io/k3s/pkg/etcd.(*ETCD).listLocalSnapshots.func1({0xc000f0eae0, 0x58}, {0x0, 0x0}, {0x3b4fa60, 0xc000f10ab0})
        /go/pkg/mod/github.com/k3s-io/k3s@v1.26.10-0.20231019001327-43d998604f2d/pkg/etcd/snapshot.go:437 +0x8f
path/filepath.walk({0xc000ff8570, 0x29}, {0x3b86828, 0xc0010d01a0}, 0xc0012cc460)
        /usr/local/go/src/path/filepath/path.go:508 +0x1f3
path/filepath.Walk({0xc000ff8570, 0x29}, 0xc0012cc460)
        /usr/local/go/src/path/filepath/path.go:579 +0x6c
github.com/k3s-io/k3s/pkg/etcd.(*ETCD).listLocalSnapshots(0xc0003cb630)
        /go/pkg/mod/github.com/k3s-io/k3s@v1.26.10-0.20231019001327-43d998604f2d/pkg/etcd/snapshot.go:436 +0xad
github.com/k3s-io/k3s/pkg/etcd.(*ETCD).ReconcileSnapshotData(0xc0003cb630, {0x3b7ca68, 0xc0009e0460})
        /go/pkg/mod/github.com/k3s-io/k3s@v1.26.10-0.20231019001327-43d998604f2d/pkg/etcd/snapshot.go:732 +0xe9
github.com/k3s-io/k3s/pkg/etcd.(*ETCD).Snapshot(0xc0003cb630, {0x3b7ca68, 0xc0009e0460})
        /go/pkg/mod/github.com/k3s-io/k3s@v1.26.10-0.20231019001327-43d998604f2d/pkg/etcd/snapshot.go:382 +0x1439
github.com/k3s-io/k3s/pkg/cli/etcdsnapshot.save(0xc0009c7ce0, 0xc0009bdbc8?)
        /go/pkg/mod/github.com/k3s-io/k3s@v1.26.10-0.20231019001327-43d998604f2d/pkg/cli/etcdsnapshot/etcd_snapshot.go:127 +0x92
github.com/k3s-io/k3s/pkg/cli/etcdsnapshot.Save(0xc0009c7ce0?)
        /go/pkg/mod/github.com/k3s-io/k3s@v1.26.10-0.20231019001327-43d998604f2d/pkg/cli/etcdsnapshot/etcd_snapshot.go:110 +0x45
github.com/urfave/cli.HandleAction({0x2fde0c0?, 0x37d9d00?}, 0x4?)
        /go/pkg/mod/github.com/urfave/cli@v1.22.14/app.go:524 +0x50
github.com/urfave/cli.Command.Run({{0x3643998, 0x4}, {0x0, 0x0}, {0x0, 0x0, 0x0}, {0x36a5cbb, 0x22}, {0x0, ...}, ...}, ...)
        /go/pkg/mod/github.com/urfave/cli@v1.22.14/command.go:175 +0x67b
github.com/urfave/cli.(*App).RunAsSubcommand(0xc000491880, 0xc0009c7a20)
        /go/pkg/mod/github.com/urfave/cli@v1.22.14/app.go:405 +0xe87
github.com/urfave/cli.Command.startApp({{0x365b53c, 0xd}, {0x0, 0x0}, {0x0, 0x0, 0x0}, {0x36a5cbb, 0x22}, {0x0, ...}, ...}, ...)
        /go/pkg/mod/github.com/urfave/cli@v1.22.14/command.go:380 +0xb7f
github.com/urfave/cli.Command.Run({{0x365b53c, 0xd}, {0x0, 0x0}, {0x0, 0x0, 0x0}, {0x36a5cbb, 0x22}, {0x0, ...}, ...}, ...)
        /go/pkg/mod/github.com/urfave/cli@v1.22.14/command.go:103 +0x845
github.com/urfave/cli.(*App).Run(0xc0004916c0, {0xc000839d40, 0x9, 0x9})
        /go/pkg/mod/github.com/urfave/cli@v1.22.14/app.go:277 +0xb87
main.main()
        /source/main.go:23 +0x97e

Steps To Reproduce:

  1. $ for (( I=0; I < 5; I++ )); do sudo rke2 etcd-snapshot save ; done
  2. sudo rke2 etcd-snapshot prune --snapshot-retention 3
  3. sudo rke2 etcd-snapshot delete
    Hit the seg fault at loop 4 of the save operation above.

Also, the cron schedule in config.yaml is set to take a snapshot every minute and retain only 2 snapshots, so a snapshot save/delete happens every minute.

@brandond
Contributor

brandond commented Oct 21, 2023

It looks like something is deleting files out from under the directory walk while it is iterating over the listing. I suspect this requires simultaneously taking and pruning snapshots to reproduce?
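A minimal Go sketch of the failure mode (not the actual k3s code; it assumes the walk callback touches the FileInfo before checking the error, which is consistent with the nil info value {0x0, 0x0} in the trace). When filepath.Walk cannot lstat an entry, for example because it was pruned out from under the walk or the root does not exist, it invokes the callback with a nil FileInfo and a non-nil error, so any method call on info dereferences a nil pointer:

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func listSnapshots(dir string) error {
	return filepath.Walk(dir, func(path string, info os.FileInfo, err error) error {
		// BUG pattern: calling info.IsDir() before checking err panics with a
		// nil pointer dereference whenever Walk passes a nil info.
		// if info.IsDir() { return nil }

		// Safe pattern: handle (or skip) the error before touching info.
		if err != nil {
			return fmt.Errorf("walking %s: %w", path, err)
		}
		if info.IsDir() {
			return nil
		}
		fmt.Println("found snapshot file:", path)
		return nil
	})
}

func main() {
	if err := listSnapshots("/var/lib/rancher/rke2/server/db/snapshots"); err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}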

@aganesh-suse
Author

aganesh-suse commented Oct 21, 2023

I think the prune was happening because the cron is set to take a snapshot every minute and, since we retain only 2 snapshots, a prune happens every 2 minutes.

etcd-snapshot-retention: 2
etcd-snapshot-schedule-cron: "* * * * *"

In the meantime, I ran 5 on-demand saves in a loop, and the 4th save failed with the seg fault above.

Yes, this looks like a timing issue from a simultaneous prune and snapshot save.

@brandond
Contributor

There is a lock that prevents multiple snapshots from running at the same time within a single process, but it doesn't prevent multiple CLI invocations from stepping on each other, or the CLI from stepping on the service. We should handle that better.
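One possible way to handle that, sketched below with a hypothetical lock-file path (this is not the rke2/k3s implementation): take an advisory flock before any save or prune, so a CLI invocation blocks instead of racing the running service or another CLI process.

package main

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

// withSnapshotLock runs fn while holding an exclusive advisory lock, so that
// concurrent processes serialize their snapshot operations.
func withSnapshotLock(lockPath string, fn func() error) error {
	f, err := os.OpenFile(lockPath, os.O_CREATE|os.O_RDWR, 0o600)
	if err != nil {
		return err
	}
	defer f.Close()

	// Blocks until no other process (service or CLI) holds the lock.
	if err := unix.Flock(int(f.Fd()), unix.LOCK_EX); err != nil {
		return fmt.Errorf("acquiring snapshot lock: %w", err)
	}
	defer unix.Flock(int(f.Fd()), unix.LOCK_UN)

	return fn()
}

func main() {
	// The lock file path here is hypothetical.
	err := withSnapshotLock("/var/lib/rancher/rke2/server/db/snapshots/.lock", func() error {
		fmt.Println("save or prune snapshots here")
		return nil
	})
	if err != nil {
		fmt.Fprintln(os.Stderr, err)
	}
}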

@hdeadman

We were getting a similar crash of the Rancher agent on v1.25.16+rke2r1 when the etcd-snapshot-dir folder doesn't exist. Two of the three controller servers had configured an NFS mount as the snapshot directory, and the mount wasn't working for some reason, so we kept seeing containerd crash, presumably because it is a child process of whatever was crashing due to the bad etcd-snapshot-dir. The Rancher agent was never able to come up completely. Our snapshot retention was 10 and the cron schedule was every 8 hours.

@hdeadman

Here is the full stack trace from our logs. I think this should be prioritized higher than medium.

Jan 26 15:23:21 redacted rke2: time="2024-01-26T15:23:21Z" level=info msg="rke2 is up and running"
Jan 26 15:23:21 redacted systemd: Started Rancher Kubernetes Engine v2 (server).
Jan 26 15:23:21 redacted rke2: time="2024-01-26T15:23:21Z" level=info msg="Failed to get existing traefik HelmChart" error="helmcharts.helm.cattle.io \"traefik\" not found"
Jan 26 15:23:21 redacted rke2: time="2024-01-26T15:23:21Z" level=info msg="Reconciling ETCDSnapshotFile resources"
Jan 26 15:23:21 redacted rke2: time="2024-01-26T15:23:21Z" level=info msg="Reconciliation of ETCDSnapshotFile resources complete"
Jan 26 15:23:21 redacted rke2: panic: runtime error: invalid memory address or nil pointer dereference
Jan 26 15:23:21 redacted rke2: [signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x28c028f]
Jan 26 15:23:21 redacted rke2: goroutine 262 [running]:
Jan 26 15:23:21 redacted rke2: github.com/k3s-io/k3s/pkg/etcd.(*ETCD).listLocalSnapshots.func1({0xc0005eb374, 0x37}, {0x0, 0x0}, {0x3ad1ac0, 0xc000790a20})
Jan 26 15:23:21 redacted rke2: /go/pkg/mod/github.com/k3s-io/k3s@v1.25.16-0.20231122010439-c8165989e934/pkg/etcd/snapshot.go:439 +0x8f
Jan 26 15:23:21 redacted rke2: path/filepath.Walk({0xc0005eb374, 0x37}, 0xc001dd3048)
Jan 26 15:23:21 redacted rke2: /usr/local/go/src/path/filepath/path.go:562 +0x50
Jan 26 15:23:21 redacted rke2: github.com/k3s-io/k3s/pkg/etcd.(*ETCD).listLocalSnapshots(0xc0008f8a50)
Jan 26 15:23:21 redacted rke2: /go/pkg/mod/github.com/k3s-io/k3s@v1.25.16-0.20231122010439-c8165989e934/pkg/etcd/snapshot.go:438 +0xad
Jan 26 15:23:21 redacted rke2: github.com/k3s-io/k3s/pkg/etcd.(*ETCD).ReconcileSnapshotData(0xc0008f8a50, {0x3b00f48, 0xc000ad95e0})
Jan 26 15:23:21 redacted rke2: /go/pkg/mod/github.com/k3s-io/k3s@v1.25.16-0.20231122010439-c8165989e934/pkg/etcd/snapshot.go:735 +0xe9
Jan 26 15:23:21 redacted rke2: github.com/k3s-io/k3s/pkg/cluster.(*Cluster).Start.func1()
Jan 26 15:23:21 redacted rke2: /go/pkg/mod/github.com/k3s-io/k3s@v1.25.16-0.20231122010439-c8165989e934/pkg/cluster/cluster.go:110 +0xa4
Jan 26 15:23:21 redacted rke2: created by github.com/k3s-io/k3s/pkg/cluster.(*Cluster).Start
Jan 26 15:23:21 redacted rke2: /go/pkg/mod/github.com/k3s-io/k3s@v1.25.16-0.20231122010439-c8165989e934/pkg/cluster/cluster.go:101 +0x6ca
Jan 26 15:23:21 redacted systemd: rke2-server.service: main process exited, code=exited, status=2/INVALIDARGUMENT
Jan 26 15:23:21 redacted systemd: Unit rke2-server.service entered failed state.
Jan 26 15:23:21 redacted systemd: rke2-server.service failed.

@brandond
Contributor

Two of the three controller servers had configured an nfs mount as the snapshot directory and the mount wasn't working for some reason.

I don't believe this is something we've tested, and it should be considered unsupported.

Is the exact same NFS path shared by all the nodes? I would ensure that they're not all sharing the same path; otherwise you'll get weirdness like duplicate snapshots in the list, nodes pruning other nodes' snapshots out from underneath each other, and so on. This is not intended to be a shared filesystem.

@hdeadman

This particular issue wasn't caused by three controllers writing to the same shared folder. Two of the three mounts weren't mounted, so the path on two of the servers was an invalid directory, and that was causing the stack trace and the main process exiting. The logs did say status=2/INVALIDARGUMENT as the exit code, but that isn't particularly helpful since it doesn't say which argument is invalid. If the process is going to exit because the directory is invalid, some logging that includes the invalid path or config parameter would make it clear what the problem is.

@brandond
Contributor

Two of the three mounts weren't mounted so the path on two of the servers was an invalid directory and that was causing the stack trace and the main process exiting.

I'm not sure what you mean by "was an invalid directory". Even if you intended to have an NFS export mounted there, the directory would need to exist. Did the target snapshot directory not exist at all?

I can confirm that I can reproduce a crash when setting the snapshot dir to something that does not exist:

rke2 server '--etcd-snapshot-dir=/does/not/exist' '--etcd-snapshot-schedule-cron=* * * * *'

panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x28c028f]

goroutine 259 [running]:
github.com/k3s-io/k3s/pkg/etcd.(*ETCD).listLocalSnapshots.func1({0xc000080044, 0xf}, {0x0, 0x0}, {0x3ad1ac0, 0xc001a7e6f0})
	/go/pkg/mod/github.com/k3s-io/k3s@v1.25.16-0.20231122010439-c8165989e934/pkg/etcd/snapshot.go:439 +0x8f
path/filepath.Walk({0xc000080044, 0xf}, 0xc000bab048)
	/usr/local/go/src/path/filepath/path.go:562 +0x50
github.com/k3s-io/k3s/pkg/etcd.(*ETCD).listLocalSnapshots(0xc0004ce140)
	/go/pkg/mod/github.com/k3s-io/k3s@v1.25.16-0.20231122010439-c8165989e934/pkg/etcd/snapshot.go:438 +0xad
github.com/k3s-io/k3s/pkg/etcd.(*ETCD).ReconcileSnapshotData(0xc0004ce140, {0x3b00f48, 0xc000459180})
	/go/pkg/mod/github.com/k3s-io/k3s@v1.25.16-0.20231122010439-c8165989e934/pkg/etcd/snapshot.go:735 +0xe9
github.com/k3s-io/k3s/pkg/cluster.(*Cluster).Start.func1()
	/go/pkg/mod/github.com/k3s-io/k3s@v1.25.16-0.20231122010439-c8165989e934/pkg/cluster/cluster.go:110 +0xa4
created by github.com/k3s-io/k3s/pkg/cluster.(*Cluster).Start
	/go/pkg/mod/github.com/k3s-io/k3s@v1.25.16-0.20231122010439-c8165989e934/pkg/cluster/cluster.go:101 +0x6ca

@brandond brandond self-assigned this Jan 29, 2024
@brandond brandond added this to the v1.29.2+rke2r1 milestone Jan 29, 2024
@hdeadman

hdeadman commented Jan 30, 2024 via email

@brandond
Contributor

brandond commented Jan 30, 2024

Correct, the folder did not exist at all (b/c the nfs mount didn't work).

I mean, I'm guessing the mount didn't work because the folder didn't exist. The target directory has to exist for you to mount something there. It doesn't get created by the mount.

But yes, we shouldn't crash if you ask to back up to a path that doesn't exist.
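A minimal sketch of up-front validation consistent with the behavior verified later in this issue ("failed to get etcd-snapshot-dir: stat ...: no such file or directory"); the function name and wiring here are hypothetical, not the k3s API:

package main

import (
	"fmt"
	"os"
)

// snapshotDir returns a descriptive error instead of letting a later
// filepath.Walk panic on a nil FileInfo when the configured directory is missing.
func snapshotDir(path string) (string, error) {
	info, err := os.Stat(path)
	if err != nil {
		return "", fmt.Errorf("failed to get etcd-snapshot-dir: %w", err)
	}
	if !info.IsDir() {
		return "", fmt.Errorf("failed to get etcd-snapshot-dir: %s is not a directory", path)
	}
	return path, nil
}

func main() {
	if _, err := snapshotDir("/does/not/exist"); err != nil {
		fmt.Fprintln(os.Stderr, err)
		os.Exit(1)
	}
}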

@hdeadman

hdeadman commented Jan 30, 2024 via email

@aganesh-suse
Author

Issue found on master branch with commit 992194b

Environment Details

Infrastructure

  • Cloud
  • Hosted

Node(s) CPU architecture, OS, and Version:

$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.2 LTS"

$ uname -m
x86_64

Cluster Configuration:

HA : 3 server / 1 agent

or

1 server/ 1 agent

Config.yaml:

token: xxxx
write-kubeconfig-mode: "0644"
node-external-ip: 1.1.1.1

etcd-snapshot-retention: 5
etcd-snapshot-schedule-cron: "* * * * *"
etcd-s3: true
etcd-s3-access-key: xxxx
etcd-s3-secret-key: xxxx
etcd-s3-bucket: xxxx
etcd-s3-folder: xxxx
etcd-s3-region: xxxx

etcd-snapshot-dir: '/does/not/exist'
debug: true

Steps to reproduce:

  1. Copy config.yaml:
     $ sudo mkdir -p /etc/rancher/rke2 && sudo cp config.yaml /etc/rancher/rke2
  2. Install RKE2:
     curl -sfL https://get.rke2.io | sudo INSTALL_RKE2_COMMIT='992194bd7e2f1eb8a9bbd8f2f4ae78e3ee314b92' INSTALL_RKE2_TYPE='server' INSTALL_RKE2_METHOD=tar sh -
  3. Start the RKE2 service:
     $ sudo systemctl enable --now rke2-server
     or
     $ sudo systemctl enable --now rke2-agent
  4. Perform etcd-snapshot operations: save, prune, list, delete

Expected behavior:

Step 4: Performing the etcd-snapshot operations (save, prune, list, delete) should not hit a seg fault; each should exit gracefully.

Reproducing Results/Observations:

  • rke2 version used for replication:
$ rke2 -v
rke2 version v1.29.1+dev.992194bd (992194bd7e2f1eb8a9bbd8f2f4ae78e3ee314b92)
go version go1.21.6 X:boringcrypto

Prune operation seg faults:

$ sudo rke2 etcd-snapshot prune --snapshot-retention 3
time="2024-02-15T00:23:45Z" level=warning msg="Unknown flag --token found in config.yaml, skipping\n"
time="2024-02-15T00:23:45Z" level=warning msg="Unknown flag --etcd-snapshot-retention found in config.yaml, skipping\n"
time="2024-02-15T00:23:45Z" level=warning msg="Unknown flag --etcd-snapshot-schedule-cron found in config.yaml, skipping\n"
time="2024-02-15T00:23:45Z" level=warning msg="Unknown flag --write-kubeconfig-mode found in config.yaml, skipping\n"
time="2024-02-15T00:23:45Z" level=warning msg="Unknown flag --node-external-ip found in config.yaml, skipping\n"
time="2024-02-15T00:23:45Z" level=info msg="Applying snapshot retention=3 to local snapshots with prefix on-demand in /does/not/exist"
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x2f057bd]

goroutine 1 [running]:
github.com/k3s-io/k3s/pkg/etcd.snapshotRetention.func1({0xc000272e24?, 0xf?}, {0x0, 0x0}, {0x4679e40, 0xc00052fb90})
	/go/pkg/mod/github.com/k3s-io/k3s@v1.29.2-0.20240209222238-de825845b2f1/pkg/etcd/snapshot.go:930 +0x3d
path/filepath.Walk({0xc000272e24, 0xf}, 0xc000e3d958)
	/usr/local/go/src/path/filepath/path.go:570 +0x4a
github.com/k3s-io/k3s/pkg/etcd.snapshotRetention(0x3, {0x3e4816b, 0x9}, {0xc000272e24, 0xf})
	/go/pkg/mod/github.com/k3s-io/k3s@v1.29.2-0.20240209222238-de825845b2f1/pkg/etcd/snapshot.go:929 +0x1ad
github.com/k3s-io/k3s/pkg/etcd.(*ETCD).PruneSnapshots(0xc000d73540, {0x46b8430, 0xc000d72eb0})
	/go/pkg/mod/github.com/k3s-io/k3s@v1.29.2-0.20240209222238-de825845b2f1/pkg/etcd/snapshot.go:507 +0x6d
github.com/k3s-io/k3s/pkg/cli/etcdsnapshot.prune(0x0?, 0x6bed320)
	/go/pkg/mod/github.com/k3s-io/k3s@v1.29.2-0.20240209222238-de825845b2f1/pkg/cli/etcdsnapshot/etcd_snapshot.go:280 +0x7c
github.com/k3s-io/k3s/pkg/cli/etcdsnapshot.Prune(0xc000afb340?)
	/go/pkg/mod/github.com/k3s-io/k3s@v1.29.2-0.20240209222238-de825845b2f1/pkg/cli/etcdsnapshot/etcd_snapshot.go:267 +0x34
github.com/urfave/cli.HandleAction({0x3624820?, 0x41faac8?}, 0x5?)
	/go/pkg/mod/github.com/urfave/cli@v1.22.14/app.go:524 +0x50
github.com/urfave/cli.Command.Run({{0x3e111b0, 0x5}, {0x0, 0x0}, {0x0, 0x0, 0x0}, {0x4096cec, 0x56}, {0x0, ...}, ...}, ...)
	/go/pkg/mod/github.com/urfave/cli@v1.22.14/command.go:175 +0x63e
github.com/urfave/cli.(*App).RunAsSubcommand(0xc000a71880, 0xc000afb080)
	/go/pkg/mod/github.com/urfave/cli@v1.22.14/app.go:405 +0xe07
github.com/urfave/cli.Command.startApp({{0x3e54fda, 0xd}, {0x0, 0x0}, {0x0, 0x0, 0x0}, {0x0, 0x0}, {0x0, ...}, ...}, ...)
	/go/pkg/mod/github.com/urfave/cli@v1.22.14/command.go:380 +0xb58
github.com/urfave/cli.Command.Run({{0x3e54fda, 0xd}, {0x0, 0x0}, {0x0, 0x0, 0x0}, {0x0, 0x0}, {0x0, ...}, ...}, ...)
	/go/pkg/mod/github.com/urfave/cli@v1.22.14/command.go:103 +0x7e5
github.com/urfave/cli.(*App).Run(0xc000ba7a40, {0xc0009f00d0, 0xd, 0xd})
	/go/pkg/mod/github.com/urfave/cli@v1.22.14/app.go:277 +0xb27
main.main()
	/source/main.go:23 +0x97b

@brandond
Contributor

Moving this out to the next release to fix the prune subcommand.

@aganesh-suse
Author

Validated on master branch with commit c7cd05b

Environment Details

Infrastructure

  • Cloud
  • Hosted

Node(s) CPU architecture, OS, and Version:

$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.2 LTS"

$ uname -m
x86_64

Cluster Configuration:

HA : 3 server / 1 agent

Config.yaml:

token: xxxx
write-kubeconfig-mode: "0644"
node-external-ip: 1.1.1.1
debug: true

etcd-snapshot-retention: 5
etcd-snapshot-schedule-cron: "* * * * *"
etcd-s3: true
etcd-s3-access-key: xxxx
etcd-s3-secret-key: xxxx
etcd-s3-bucket: xxxx
etcd-s3-folder: xxxx
etcd-s3-region: xxxx

etcd-snapshot-dir: '/does/not/exist'

Testing Steps

  1. Copy config.yaml:
     $ sudo mkdir -p /etc/rancher/rke2 && sudo cp config.yaml /etc/rancher/rke2
  2. Install RKE2:
     curl -sfL https://get.rke2.io | sudo INSTALL_RKE2_COMMIT='c7cd05bf547712250bd7a47db69258dbf823c80a' INSTALL_RKE2_TYPE='server' INSTALL_RKE2_METHOD=tar sh -
  3. Start the RKE2 service:
     $ sudo systemctl enable --now rke2-server
     or
     $ sudo systemctl enable --now rke2-agent
  4. Verify cluster status:
     kubectl get nodes -o wide
     kubectl get pods -A
  5. Perform etcd-snapshot operations: list, save, prune, delete. They should exit gracefully.

Validation Results:

  • rke2 version used for validation:
$ rke2 -v
rke2 version v1.29.2+dev.c7cd05bf (c7cd05bf547712250bd7a47db69258dbf823c80a)
go version go1.21.7 X:boringcrypto

List:

$ sudo rke2 etcd-snapshot list
WARN[0000] Unknown flag --token found in config.yaml, skipping
WARN[0000] Unknown flag --server found in config.yaml, skipping
WARN[0000] Unknown flag --etcd-snapshot-schedule-cron found in config.yaml, skipping
WARN[0000] Unknown flag --write-kubeconfig-mode found in config.yaml, skipping
WARN[0000] Unknown flag --node-external-ip found in config.yaml, skipping
INFO[0000] Checking if S3 bucket sonobuoy-results exists
INFO[0000] S3 bucket sonobuoy-results exists
FATA[0000] failed to get etcd-snapshot-dir: stat /does/not/exist: no such file or directory

Save:

$ sudo rke2 etcd-snapshot save
WARN[0000] Unknown flag --token found in config.yaml, skipping
WARN[0000] Unknown flag --server found in config.yaml, skipping
WARN[0000] Unknown flag --etcd-snapshot-schedule-cron found in config.yaml, skipping
WARN[0000] Unknown flag --write-kubeconfig-mode found in config.yaml, skipping
WARN[0000] Unknown flag --node-external-ip found in config.yaml, skipping
DEBU[0000] Attempting to retrieve extra metadata from rke2-etcd-snapshot-extra-metadata ConfigMap
DEBU[0000] Error encountered attempting to retrieve extra metadata from rke2-etcd-snapshot-extra-metadata ConfigMap, error: configmaps "rke2-etcd-snapshot-extra-metadata" not found
FATA[0000] failed to get etcd-snapshot-dir: stat /does/not/exist: no such file or directory

Prune:

$ sudo rke2 etcd-snapshot prune
WARN[0000] Unknown flag --token found in config.yaml, skipping
WARN[0000] Unknown flag --server found in config.yaml, skipping
WARN[0000] Unknown flag --etcd-snapshot-schedule-cron found in config.yaml, skipping
WARN[0000] Unknown flag --write-kubeconfig-mode found in config.yaml, skipping
WARN[0000] Unknown flag --node-external-ip found in config.yaml, skipping
FATA[0000] failed to get etcd-snapshot-dir: stat /does/not/exist: no such file or directory

Delete:

$ sudo rke2 etcd-snapshot delete etcd-snapshot-x
WARN[0000] Unknown flag --token found in config.yaml, skipping
WARN[0000] Unknown flag --server found in config.yaml, skipping
WARN[0000] Unknown flag --etcd-snapshot-schedule-cron found in config.yaml, skipping
WARN[0000] Unknown flag --write-kubeconfig-mode found in config.yaml, skipping
WARN[0000] Unknown flag --node-external-ip found in config.yaml, skipping
FATA[0000] failed to get etcd-snapshot-dir: stat /does/not/exist: no such file or directory
