Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix etcd backup/restore test and add guardrail for etcd-snapshot #11314

Merged
merged 1 commit into from
Nov 28, 2024

Conversation

manuelbuil
Copy link
Contributor

@manuelbuil manuelbuil commented Nov 13, 2024

Proposed Changes

This PR does two things:
1 - Fixes the e2e test doing backup/restore from a snapshot. It was failing with:

FATA[0000] cannot perform cluster-reset while server URL is set - remove server from configuration before resetting

When comparing the steps it was following and the documentation, it was indeed wrong. The rest of the servers don't have to run --cluster-reset. When removing the step Resets non bootstrap nodes, the e2e test works again

2 - I must confess that I wasted an hour trying to use k3s etcd-snapshot in a k3s cluster that was not using etcd. When doing that, the error we get is super confusing:

FATA[0000] see server log for details: Unauthorized

However, there is nothing in the logs. The problem is that we are treating a 404 (url does not exist) as Unauthorized https://github.com/k3s-io/k3s/blob/master/pkg/server/router.go#L87. We have been doing that for a long time, so it is probably risky to change that. Therefore, I decided to check if kine.sock exists and db/etcd/config not in the dataDir. If that's the case, we print K3s is not deployed with an etcd datastore

Types of Changes

Test fix + guardrail

Verification

1 - Run the e2e test
2 - Run any k3s etcd-snapshot command without etcd as datastore

Testing

Linked Issues

#11388

User-Facing Change


Further Comments

@manuelbuil manuelbuil requested a review from a team as a code owner November 13, 2024 15:58
VestigeJ
VestigeJ previously approved these changes Nov 13, 2024
Copy link

codecov bot commented Nov 13, 2024

Codecov Report

Attention: Patch coverage is 72.72727% with 3 lines in your changes missing coverage. Please review.

Project coverage is 42.44%. Comparing base (b93fd98) to head (cd5478f).
Report is 13 commits behind head on master.

Files with missing lines Patch % Lines
pkg/cluster/managed.go 72.72% 2 Missing and 1 partial ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##           master   #11314      +/-   ##
==========================================
- Coverage   46.89%   42.44%   -4.45%     
==========================================
  Files         179      179              
  Lines       18587    18610      +23     
==========================================
- Hits         8716     7899     -817     
- Misses       8518     9505     +987     
+ Partials     1353     1206     -147     
Flag Coverage Δ
e2etests 34.33% <72.72%> (-7.61%) ⬇️
inttests 18.81% <0.00%> (-15.85%) ⬇️
unittests 13.82% <0.00%> (+0.17%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

dereknola
dereknola previously approved these changes Nov 13, 2024
Copy link
Member

@brandond brandond left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The problem is that we are treating a 404 (url does not exist) as Unauthorized https://github.com/k3s-io/k3s/blob/master/pkg/server/router.go#L87.

That's not what's happening here. The NotFoundHandler field is perhaps poorly named, but it is just how mux handles chaining request handlers. If a router handler does not find a matching route, it hands it off to the next handler. In this case you'll notice that section of code setting up a long chain of routers where each router in the chain sets the previous one as its NotFound handler, and the final router is whats returned. Because etcd isn't present to install any routes that match the snapshot path, you end up getting passed all the way through to the apiserver, which is what generates the Unauthorized response.

We do have other handlers that return proper errors if etcd isn't enabled, for example:

k3s/pkg/server/router.go

Lines 118 to 126 in b93fd98

func bootstrapHandler(runtime *config.ControlRuntime) http.Handler {
if runtime.HTTPBootstrap {
return bootstrap.Handler(&runtime.ControlRuntimeBootstrap)
}
return http.HandlerFunc(func(resp http.ResponseWriter, req *http.Request) {
logrus.Warnf("Received HTTP bootstrap request from %s, but embedded etcd is not enabled.", req.RemoteAddr)
util.SendError(errors.New("etcd disabled"), resp, req, http.StatusBadRequest)
})
}

Here's where the snapshot handlers are registered:

  • k3s/pkg/cluster/managed.go

    Lines 94 to 103 in b93fd98

    // registerDBHandlers registers managed-datastore-specific callbacks, and installs additional HTTP route handlers.
    // Note that for etcd, controllers only run on nodes with a local apiserver, in order to provide stable external
    // management of etcd cluster membership without being disrupted when a member is removed from the cluster.
    func (c *Cluster) registerDBHandlers(handler http.Handler) (http.Handler, error) {
    if c.managedDB == nil {
    return handler, nil
    }
    return c.managedDB.Register(handler)
    }
  • k3s/pkg/etcd/etcd.go

    Lines 632 to 633 in b93fd98

    // Register adds db info routes for the http request handler, and registers cluster controller callbacks
    func (e *ETCD) Register(handler http.Handler) (http.Handler, error) {
  • k3s/pkg/etcd/etcd.go

    Lines 696 to 697 in b93fd98

    // handler wraps the handler with routes for database info
    func (e *ETCD) handler(next http.Handler) http.Handler {

We do not need to worry about returning 404s for the snapshot handlers if etcd isn't enabled, since that's not what's currently happening. Nor should the CLI be checking for etcd datastore files - request validation should happen on the server side.

You could probably do a couple different things:

  • Add a default handler in router.go that matches /db/ and sends an appropriate error - counting on the fact that the etcd router is later in the chain and the requests won't ever make it to this handler if etcd is running.
  • Add a default handler in managed.go that is only added if etcd isn't enabled. This is probably technically safer, since it would avoid installing the error handler if etcd is running.

@manuelbuil
Copy link
Contributor Author

  • Add a default handler in managed.go that is only added if etcd isn't enabled. This is probably technically safer, since it would avoid installing the error handler if etcd is running.

Thanks for the explanation and the suggestion. I took this path, I also agree it is safer

@manuelbuil manuelbuil dismissed stale reviews from dereknola and VestigeJ via 8bbc8a6 November 14, 2024 11:52
@manuelbuil manuelbuil force-pushed the etcdsnapshotfailwhennoetcd branch from 17234f0 to 8bbc8a6 Compare November 14, 2024 11:52
pkg/cli/etcdsnapshot/etcd_snapshot.go Outdated Show resolved Hide resolved
pkg/cluster/managed.go Outdated Show resolved Hide resolved
@manuelbuil manuelbuil force-pushed the etcdsnapshotfailwhennoetcd branch 5 times, most recently from aec8aac to 83358d5 Compare November 20, 2024 09:20
@manuelbuil manuelbuil requested a review from brandond November 20, 2024 13:59
Copy link
Member

@brandond brandond left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

whitespace nit

pkg/cluster/managed.go Outdated Show resolved Hide resolved
@manuelbuil manuelbuil force-pushed the etcdsnapshotfailwhennoetcd branch from 83358d5 to cd5478f Compare November 21, 2024 08:58
Signed-off-by: manuelbuil <mbuil@suse.com>
@cwayne18
Copy link
Member

/trivy

Copy link
Contributor

❌ Trivy scan action failed, check logs ❌

1 similar comment
Copy link
Contributor

❌ Trivy scan action failed, check logs ❌

@manuelbuil
Copy link
Contributor Author

❌ Trivy scan action failed, check logs ❌

2024-11-21T18:45:05Z	ERROR	[vulndb] Failed to download artifact	repo="ghcr.io/aquasecurity/trivy-db:2" err="OCI repository error: 1 error occurred:\n\t* GET https://ghcr.io/v2/aquasecurity/trivy-db/manifests/2: TOOMANYREQUESTS: retry-after: 324.606µs, allowed: 44000/minute\n\n"

scale up trivy!

Copy link
Contributor

🌟 No High or Critical CVEs Found 🌟

@manuelbuil manuelbuil merged commit 4ec2617 into k3s-io:master Nov 28, 2024
30 checks passed
@manuelbuil manuelbuil deleted the etcdsnapshotfailwhennoetcd branch November 28, 2024 08:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants