Panic when etcd-snapshot-dir does not exist #9316

Closed · brandond opened this issue Jan 30, 2024 · 3 comments

brandond commented Jan 30, 2024

K3s tracking issue for:

If the target snapshot dir does not exist, the etcd snapshot will fail, and the subsequent reconcile in listLocalSnapshots will panic when attempting to walk the nonexistent path.
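
The panic follows from how filepath.Walk reports errors: when the root itself cannot be stat'd, the callback is invoked once with a nil os.FileInfo and a non-nil error, so any callback that touches info before checking err dereferences a nil pointer. A minimal standalone reproduction (illustrative only, not k3s code):

package main

import (
	"fmt"
	"os"
	"path/filepath"
)

func main() {
	// For a nonexistent root, filepath.Walk calls the callback once with
	// info == nil and a non-nil err describing the failed stat.
	err := filepath.Walk("/does/not/exist", func(path string, info os.FileInfo, err error) error {
		if err != nil {
			// Guard first: info is nil here. Calling info.Name() without
			// this check reproduces the SIGSEGV in the stack trace
			// reported below.
			return err
		}
		fmt.Println(info.Name())
		return nil
	})
	fmt.Println("walk returned:", err)
}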

There is another race condition: if a scheduled snapshot and a manual snapshot run at the same time, one of them can prune files out from underneath the other. There is locking in the snapshot code path, but it is essentially useless, since it is a mutex within the server process and the cron scheduler already ensures only a single execution at a time; it does nothing to help with snapshots taken by separate processes. We should fix that as well, if possible.
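
Since the CLI (k3s etcd-snapshot ...) and the running server are separate processes, an in-process mutex cannot serialize them. One possibility is an advisory flock(2) lock taken in the snapshot directory by every writer. A minimal sketch, not necessarily the approach k3s ultimately took; the lock-file name is made up:

package main

import (
	"fmt"
	"os"
	"path/filepath"
	"syscall"
)

// lockSnapshotDir takes an exclusive advisory lock on a well-known file in
// the snapshot directory, blocking until any other holder (server or CLI)
// releases it. The returned function releases the lock.
func lockSnapshotDir(dir string) (func(), error) {
	f, err := os.OpenFile(filepath.Join(dir, ".lock"), os.O_CREATE|os.O_RDWR, 0o600)
	if err != nil {
		return nil, err
	}
	if err := syscall.Flock(int(f.Fd()), syscall.LOCK_EX); err != nil {
		f.Close()
		return nil, err
	}
	return func() {
		syscall.Flock(int(f.Fd()), syscall.LOCK_UN)
		f.Close()
	}, nil
}

func main() {
	// Default k3s snapshot location, shown here for illustration only.
	unlock, err := lockSnapshotDir("/var/lib/rancher/k3s/server/db/snapshots")
	if err != nil {
		fmt.Fprintln(os.Stderr, "lock failed:", err)
		os.Exit(1)
	}
	defer unlock()
	// ...save or prune snapshots while holding the lock...
}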

aganesh-suse commented

Issue found on master branch with commit de82584

Environment Details

Infrastructure

  • Cloud
  • Hosted

Node(s) CPU architecture, OS, and Version:

$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.2 LTS"

$ uname -m
x86_64

Cluster Configuration:

HA: 3 servers / 1 agent

Config.yaml:

token: xxxx
cluster-init: true
write-kubeconfig-mode: "0644"
node-external-ip: 1.1.1.1
node-label:
- k3s-upgrade=server

etcd-snapshot-retention: 2
etcd-snapshot-schedule-cron: "* * * * *"
etcd-s3: true
etcd-s3-access-key: xxx
etcd-s3-secret-key: xxx
etcd-s3-bucket: xxx
etcd-s3-folder: xxx
etcd-s3-region: xxx

etcd-snapshot-dir: '/does/not/exist'
debug: true

Testing Steps

  1. Copy config.yaml:
$ sudo mkdir -p /etc/rancher/k3s && sudo cp config.yaml /etc/rancher/k3s
  2. Install k3s:
curl -sfL https://get.k3s.io | sudo INSTALL_K3S_COMMIT='de825845b2f1eca82c19892c327ed274abfa8901' sh -s - server
  3. Perform etcd-snapshot operations: save, prune, list, delete

Expected behavior:

Step 3: None of the etcd-snapshot operations (save, prune, list, delete) should segfault; each should exit gracefully.

Reproducing Results/Observations:

  • k3s version used for reproduction:
$ k3s -v
k3s version v1.29.1+k3s-de825845 (de825845)
go version go1.21.6

Prune operation output:

$ sudo /usr/local/bin/k3s etcd-snapshot prune --snapshot-retention 2 
time="2024-02-15T19:32:35Z" level=warning msg="Unknown flag --token found in config.yaml, skipping\n"
time="2024-02-15T19:32:35Z" level=warning msg="Unknown flag --etcd-snapshot-retention found in config.yaml, skipping\n"
time="2024-02-15T19:32:35Z" level=warning msg="Unknown flag --etcd-snapshot-schedule-cron found in config.yaml, skipping\n"
time="2024-02-15T19:32:35Z" level=warning msg="Unknown flag --cluster-init found in config.yaml, skipping\n"
time="2024-02-15T19:32:35Z" level=warning msg="Unknown flag --write-kubeconfig-mode found in config.yaml, skipping\n"
time="2024-02-15T19:32:35Z" level=warning msg="Unknown flag --node-external-ip found in config.yaml, skipping\n"
time="2024-02-15T19:32:35Z" level=warning msg="Unknown flag --node-label found in config.yaml, skipping\n"
time="2024-02-15T19:32:35Z" level=info msg="Applying snapshot retention=2 to local snapshots with prefix on-demand in /does/not/exist"
panic: runtime error: invalid memory address or nil pointer dereference
[signal SIGSEGV: segmentation violation code=0x1 addr=0x18 pc=0x46f311d]

goroutine 1 [running]:
github.com/k3s-io/k3s/pkg/etcd.snapshotRetention.func1({0xc001220ee4?, 0xf?}, {0x0, 0x0}, {0x6f0bb80, 0xc0012980f0})
	/go/src/github.com/k3s-io/k3s/pkg/etcd/snapshot.go:930 +0x3d
path/filepath.Walk({0xc001220ee4, 0xf}, 0xc000f5f700)
	/usr/local/go/src/path/filepath/path.go:570 +0x4a
github.com/k3s-io/k3s/pkg/etcd.snapshotRetention(0x2, {0x604a43d, 0x9}, {0xc001220ee4, 0xf})
	/go/src/github.com/k3s-io/k3s/pkg/etcd/snapshot.go:929 +0x1ad
github.com/k3s-io/k3s/pkg/etcd.(*ETCD).PruneSnapshots(0xc00094af00, {0x6f64090, 0xc000943720})
	/go/src/github.com/k3s-io/k3s/pkg/etcd/snapshot.go:507 +0x6d
github.com/k3s-io/k3s/pkg/cli/etcdsnapshot.prune(0x0?, 0xa8b1a00)
	/go/src/github.com/k3s-io/k3s/pkg/cli/etcdsnapshot/etcd_snapshot.go:280 +0x7c
github.com/k3s-io/k3s/pkg/cli/etcdsnapshot.Prune(0xc000a9be40?)
	/go/src/github.com/k3s-io/k3s/pkg/cli/etcdsnapshot/etcd_snapshot.go:267 +0x34
github.com/urfave/cli.HandleAction({0x54002c0?, 0x66e58b8?}, 0x5?)
	/go/pkg/mod/github.com/urfave/cli@v1.22.14/app.go:524 +0x50
github.com/urfave/cli.Command.Run({{0x6010151, 0x5}, {0x0, 0x0}, {0x0, 0x0, 0x0}, {0x6376f09, 0x56}, {0x0, ...}, ...}, ...)
	/go/pkg/mod/github.com/urfave/cli@v1.22.14/command.go:175 +0x63e
github.com/urfave/cli.(*App).RunAsSubcommand(0xc000dcda40, 0xc000a9bb80)
	/go/pkg/mod/github.com/urfave/cli@v1.22.14/app.go:405 +0xe07
github.com/urfave/cli.Command.startApp({{0x605de37, 0xd}, {0x0, 0x0}, {0x0, 0x0, 0x0}, {0x0, 0x0}, {0x0, ...}, ...}, ...)
	/go/pkg/mod/github.com/urfave/cli@v1.22.14/command.go:380 +0xb58
github.com/urfave/cli.Command.Run({{0x605de37, 0xd}, {0x0, 0x0}, {0x0, 0x0, 0x0}, {0x0, 0x0}, {0x0, ...}, ...}, ...)
	/go/pkg/mod/github.com/urfave/cli@v1.22.14/command.go:103 +0x7e5
github.com/urfave/cli.(*App).Run(0xc000dcd880, {0xc000d56b60, 0xd, 0xd})
	/go/pkg/mod/github.com/urfave/cli@v1.22.14/app.go:277 +0xb27
main.main()
	/go/src/github.com/k3s-io/k3s/cmd/server/main.go:81 +0xbfb

brandond commented

Moving this out to the next release to fix the prune subcommand.

aganesh-suse commented

Validated on master branch with commit 364dfd8

Environment Details

Infrastructure

  • Cloud
  • Hosted

Node(s) CPU architecture, OS, and Version:

$ cat /etc/os-release
PRETTY_NAME="Ubuntu 22.04.2 LTS"

$ uname -m
x86_64

Cluster Configuration:

HA: 3 servers / 1 agent

Config.yaml:

token: xxxx
cluster-init: true
write-kubeconfig-mode: "0644"
node-external-ip: 1.1.1.1
node-label:
- k3s-upgrade=server

etcd-snapshot-retention: 2
etcd-snapshot-schedule-cron: "* * * * *"
etcd-s3: true
etcd-s3-access-key: xxxx
etcd-s3-secret-key: xxxx
etcd-s3-bucket: xxxx
etcd-s3-folder: xxxx
etcd-s3-region: xxxx

etcd-snapshot-dir: '/does/not/exist'

debug: true

Testing Steps

  1. Copy config.yaml:
$ sudo mkdir -p /etc/rancher/k3s && sudo cp config.yaml /etc/rancher/k3s
  2. Install k3s:
curl -sfL https://get.k3s.io | sudo INSTALL_K3S_COMMIT='364dfd8b89e62bc7816aa4843db8deb6f6fd2978' sh -s - server
  3. Verify Cluster Status:
kubectl get nodes -o wide
kubectl get pods -A
  4. Verify etcd-snapshot operations: save, prune, list, delete:
sudo /usr/local/bin/k3s etcd-snapshot list
sudo /usr/local/bin/k3s etcd-snapshot save
sudo /usr/local/bin/k3s etcd-snapshot prune
sudo /usr/local/bin/k3s etcd-snapshot delete <snapshot>

Validation Results:

  • k3s version used for validation:
$ k3s -v
k3s version v1.29.2+k3s-364dfd8b (364dfd8b)
go version go1.21.7

Save:

 $ sudo /usr/local/bin/k3s etcd-snapshot save 
time="2024-03-11T18:42:04Z" level=warning msg="Unknown flag --token found in config.yaml, skipping\n"
time="2024-03-11T18:42:04Z" level=warning msg="Unknown flag --etcd-snapshot-schedule-cron found in config.yaml, skipping\n"
time="2024-03-11T18:42:04Z" level=warning msg="Unknown flag --cluster-init found in config.yaml, skipping\n"
time="2024-03-11T18:42:04Z" level=warning msg="Unknown flag --write-kubeconfig-mode found in config.yaml, skipping\n"
time="2024-03-11T18:42:04Z" level=warning msg="Unknown flag --node-external-ip found in config.yaml, skipping\n"
time="2024-03-11T18:42:04Z" level=warning msg="Unknown flag --node-label found in config.yaml, skipping\n"
time="2024-03-11T18:42:04Z" level=debug msg="Attempting to retrieve extra metadata from k3s-etcd-snapshot-extra-metadata ConfigMap"
time="2024-03-11T18:42:04Z" level=debug msg="Error encountered attempting to retrieve extra metadata from k3s-etcd-snapshot-extra-metadata ConfigMap, error: configmaps \"k3s-etcd-snapshot-extra-metadata\" not found"
time="2024-03-11T18:42:04Z" level=fatal msg="failed to get etcd-snapshot-dir: stat /does/not/exist: no such file or directory"

Prune:

 $ sudo /usr/local/bin/k3s etcd-snapshot prune
time="2024-03-11T18:43:11Z" level=warning msg="Unknown flag --token found in config.yaml, skipping\n"
time="2024-03-11T18:43:11Z" level=warning msg="Unknown flag --etcd-snapshot-schedule-cron found in config.yaml, skipping\n"
time="2024-03-11T18:43:11Z" level=warning msg="Unknown flag --cluster-init found in config.yaml, skipping\n"
time="2024-03-11T18:43:11Z" level=warning msg="Unknown flag --write-kubeconfig-mode found in config.yaml, skipping\n"
time="2024-03-11T18:43:11Z" level=warning msg="Unknown flag --node-external-ip found in config.yaml, skipping\n"
time="2024-03-11T18:43:11Z" level=warning msg="Unknown flag --node-label found in config.yaml, skipping\n"
time="2024-03-11T18:43:11Z" level=fatal msg="failed to get etcd-snapshot-dir: stat /does/not/exist: no such file or directory"

List:

$ sudo k3s etcd-snapshot list
WARN[0000] Unknown flag --token found in config.yaml, skipping
WARN[0000] Unknown flag --server found in config.yaml, skipping
WARN[0000] Unknown flag --write-kubeconfig-mode found in config.yaml, skipping
WARN[0000] Unknown flag --node-external-ip found in config.yaml, skipping
WARN[0000] Unknown flag --node-label found in config.yaml, skipping
FATA[0000] failed to get etcd-snapshot-dir: stat /does/not/exist: no such file or directory

Delete:

$ sudo k3s etcd-snapshot delete on-demand-ip-x
WARN[0000] Unknown flag --token found in config.yaml, skipping
WARN[0000] Unknown flag --server found in config.yaml, skipping
WARN[0000] Unknown flag --write-kubeconfig-mode found in config.yaml, skipping
WARN[0000] Unknown flag --node-external-ip found in config.yaml, skipping
WARN[0000] Unknown flag --node-label found in config.yaml, skipping
FATA[0000] failed to get etcd-snapshot-dir: stat /does/not/exist: no such file or directory
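
All four operations now fail fast with the same stat error instead of panicking. This is consistent with validating the directory up front, along these lines (a hedged sketch; the function name and shape are assumptions, not the actual k3s source):

package main

import (
	"fmt"
	"os"
)

// snapshotDir checks the configured snapshot directory before any
// save/list/prune/delete work, so the command can exit gracefully rather
// than walking a nonexistent path.
func snapshotDir(dir string) (string, error) {
	if _, err := os.Stat(dir); err != nil {
		return "", fmt.Errorf("failed to get etcd-snapshot-dir: %w", err)
	}
	return dir, nil
}

func main() {
	if _, err := snapshotDir("/does/not/exist"); err != nil {
		fmt.Println("fatal:", err) // matches the FATA log lines above
	}
}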
