
Migrate to using custom resource to store etcd snapshot metadata #8064

Merged (14 commits) on Oct 12, 2023

Conversation

@brandond (Contributor) commented Jul 28, 2023

Proposed Changes

Add ADR and implementation for etcd snapshot CRD migration.
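For context, the custom resource carrying per-snapshot metadata might look roughly like the following. This is a minimal sketch only: the API group (k3s.cattle.io/v1) appears in the coverage report further down, but the type and field names here are illustrative assumptions rather than the schema added by this PR.

```go
// Minimal sketch of an etcd snapshot custom resource type registered under
// k3s.cattle.io/v1. Field names are illustrative assumptions, not the actual
// schema added by this PR.
package v1

import (
	"k8s.io/apimachinery/pkg/api/resource"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
)

// ETCDSnapshotFile records metadata for a single etcd snapshot, replacing the
// per-snapshot entries previously packed into a ConfigMap.
type ETCDSnapshotFile struct {
	metav1.TypeMeta   `json:",inline"`
	metav1.ObjectMeta `json:"metadata,omitempty"`

	Spec   ETCDSnapshotSpec   `json:"spec,omitempty"`
	Status ETCDSnapshotStatus `json:"status,omitempty"`
}

// ETCDSnapshotSpec describes which node took the snapshot and where it is stored.
type ETCDSnapshotSpec struct {
	SnapshotName string            `json:"snapshotName"`
	NodeName     string            `json:"nodeName"`
	Location     string            `json:"location"`           // file path or S3 object key
	S3           bool              `json:"s3,omitempty"`       // true if stored in an S3 bucket
	Metadata     map[string]string `json:"metadata,omitempty"` // extra metadata written alongside the snapshot
}

// ETCDSnapshotStatus records the outcome of the snapshot operation.
type ETCDSnapshotStatus struct {
	Size         *resource.Quantity `json:"size,omitempty"`
	CreationTime *metav1.Time       `json:"creationTime,omitempty"`
	ReadyToUse   *bool              `json:"readyToUse,omitempty"`
	Error        *string            `json:"error,omitempty"`
}
```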

Types of Changes

ADR, bugfix, reimplementation

Verification

Run full snapshot create / delete / prune tests; confirm that all function normally and that snapshot CRs, ConfigMaps, and CLI output stay in sync.

Testing

n/a

Linked Issues

User-Facing Change

Further Comments

@brandond requested a review from a team as a code owner on July 28, 2023 00:24
@brandond force-pushed the etcd-snapshot-cr branch 2 times, most recently from 3eef133 to 4df89b4 on July 28, 2023 05:50
@manuelbuil (Contributor) left a comment

Thanks for writing this up, Brad. I have two (stupid?) questions:

1 - Will this impact how rke2 is doing snapshots too?
2 - IIRC, snapshots are stored in /var/lib/rancher/k3s/server/db/snapshots, right? So, what the configmap stores is information about the snapshots stored in the filesystem, right?

(Review comment on docs/adrs/etcd-snapshot-cr.md, resolved)
@brandond (Contributor, Author) commented Jul 28, 2023

@manuelbuil

> 1 - Will this impact how rke2 is doing snapshots too?

Yes, this code manages snapshots for both products.

> 2 - IIRC, snapshots are stored in /var/lib/rancher/k3s/server/db/snapshots, right? So, what the configmap stores is information about the snapshots stored in the filesystem, right?

Yes, along with the snapshots stored in S3.
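For illustration, here is a minimal client-go sketch of reading that snapshot metadata. The ConfigMap name and namespace (k3s-etcd-snapshots in kube-system), the kubeconfig path, and the per-key JSON layout are assumptions for this example, not confirmed details of this PR.

```go
// Sketch: read the legacy snapshot metadata ConfigMap and print each record.
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Assumed kubeconfig location for a k3s server node.
	config, err := clientcmd.BuildConfigFromFlags("", "/etc/rancher/k3s/k3s.yaml")
	if err != nil {
		panic(err)
	}
	client, err := kubernetes.NewForConfig(config)
	if err != nil {
		panic(err)
	}

	cm, err := client.CoreV1().ConfigMaps("kube-system").Get(context.TODO(), "k3s-etcd-snapshots", metav1.GetOptions{})
	if err != nil {
		panic(err)
	}
	// Each key is one snapshot; the value holds name, node, size, timestamp,
	// and (for S3 snapshots) the bucket/key, serialized as JSON.
	for name, data := range cm.Data {
		fmt.Printf("%s: %s\n", name, data)
	}
}
```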

@cwayne18 (Collaborator) commented:

@Oats87 I know we talked this over in design, but it would be good to get sign-off from the v2prov side to make sure y'all are in agreement with whatever we decide to do here

@brandond (Contributor, Author) commented Aug 14, 2023

I have a couple of changes to make based on what was discussed in the call:

  • Store metadata alongside the snapshots, both on disk and in S3, so that it doesn't live only in the snapshot list
  • Use the kube-system namespace UID as the cluster ID, to prevent listing/pruning of snapshots from different clusters in the same bucket (see the sketch below)
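
A minimal sketch of the cluster ID idea, assuming a client-go clientset is available; the helper name is hypothetical:

```go
// Sketch of deriving a cluster ID from the kube-system namespace UID. The UID
// is created with the cluster and never changes, so it can distinguish
// snapshots from different clusters sharing one S3 bucket.
package etcdsnapshot

import (
	"context"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// clusterID returns the UID of the kube-system namespace for use as a stable,
// cluster-unique identifier when listing or pruning snapshots in shared storage.
func clusterID(ctx context.Context, client kubernetes.Interface) (string, error) {
	ns, err := client.CoreV1().Namespaces().Get(ctx, metav1.NamespaceSystem, metav1.GetOptions{})
	if err != nil {
		return "", err
	}
	return string(ns.UID), nil
}
```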

@codecov (bot) commented Sep 28, 2023

Codecov Report

Attention: 932 lines in your changes are missing coverage. Please review.

Comparison: base (7d38b4a) 47.05% vs. head (4854dbf) 50.66%.

❗ The current head 4854dbf differs from the pull request's most recent head 174ff79. Consider uploading reports for commit 174ff79 to get more accurate results.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #8064      +/-   ##
==========================================
+ Coverage   47.05%   50.66%   +3.60%     
==========================================
  Files         144      147       +3     
  Lines       14964    15611     +647     
==========================================
+ Hits         7042     7909     +867     
+ Misses       6809     6453     -356     
- Partials     1113     1249     +136     
Flag Coverage Δ
e2etests 47.23% <10.25%> (?)
inttests 44.01% <39.27%> (-0.42%) ⬇️
unittests 17.94% <0.27%> (-1.62%) ⬇️

Flags with carried forward coverage won't be shown.

Files Coverage Δ
pkg/agent/config/config.go 52.59% <ø> (+3.39%) ⬆️
pkg/agent/containerd/containerd.go 20.39% <ø> (ø)
pkg/apis/k3s.cattle.io/v1/zz_generated_register.go 69.23% <100.00%> (+5.59%) ⬆️
pkg/cli/agent/agent.go 26.22% <ø> (+26.22%) ⬆️
pkg/cli/server/server.go 62.33% <ø> (+2.20%) ⬆️
pkg/cloudprovider/cloudprovider.go 65.33% <ø> (ø)
pkg/cloudprovider/servicelb.go 62.21% <ø> (+0.20%) ⬆️
pkg/cluster/address_controller.go 84.21% <ø> (ø)
pkg/cluster/bootstrap.go 48.74% <100.00%> (+11.38%) ⬆️
pkg/cluster/encrypt.go 42.85% <100.00%> (-4.97%) ⬇️
... and 30 more

... and 24 files with indirect coverage changes


@cwayne18 (Collaborator) commented:

cc @jakefhyde @Oats87 for any RM-related review

@brandond marked this pull request as draft on September 28, 2023 16:29
@brandond (Contributor, Author) commented:

Moving to draft; this is still a work in progress.

@brandond force-pushed the etcd-snapshot-cr branch 10 times, most recently from 886025d to 2eb5014 on October 6, 2023 00:43
@brandond force-pushed the etcd-snapshot-cr branch 4 times, most recently from 84eddb1 to 3ceae0c on October 7, 2023 02:27
@brandond force-pushed the etcd-snapshot-cr branch 3 times, most recently from 6bc6bc3 to 4854dbf on October 11, 2023 18:12
Commit notes from the 14 commits (all signed off by Brad Davidson <brad.davidson@rancher.com>):

  • Also adds a hack go script to print the embedded CRDs, for developer use.
  • …estEntityTooLarge
  • Consistently refer to object keys as such, simplify error handling.
  • Attempt to use timestamp from creation or filename instead of file/object modification times.
  • Also, don't list ONLY s3 snapshots if S3 is enabled.
  • Write the extra metadata both locally and to S3. These files are placed such that they will not be used by older versions of K3s that do not make use of them.
  • Fixes snapshot list coming out in non-deterministic order.
  • Reconcile snapshot CRs instead of ConfigMap; manage ConfigMap downstream from CR list (see the sketch after this list).
  • This required pulling the token hash stuff out of the cluster package, into util.
  • Snapshot delete/prune tests were only working because the delete command would report success even when deleting a snapshot that didn't exist, and the test regex was finding the snapshot name multiple times in the list output and deleting it twice. Snapshot restore tests seem to have expected the deployment to be rolled out immediately, which is not a reasonable expectation.
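
To make the "manage ConfigMap downstream from CR list" commit concrete, here is a hedged sketch of that reconciliation direction. It assumes the CR list has already been flattened into simple records; the record type, ConfigMap name, and JSON layout are illustrative, not the PR's actual implementation.

```go
// Sketch: rebuild the legacy ConfigMap from the authoritative list of snapshot
// records (which would come from the snapshot CRs), so older consumers of the
// ConfigMap keep working.
package etcdsnapshot

import (
	"context"
	"encoding/json"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
)

// snapshotRecord is an illustrative stand-in for the per-snapshot metadata
// carried by the custom resource.
type snapshotRecord struct {
	Name     string `json:"name"`
	NodeName string `json:"nodeName"`
	Location string `json:"location"`
	Size     int64  `json:"size"`
}

// reconcileConfigMap rewrites the ConfigMap so that it mirrors the given
// snapshot records, creating it if it does not yet exist.
func reconcileConfigMap(ctx context.Context, client kubernetes.Interface, records []snapshotRecord) error {
	data := map[string]string{}
	for _, r := range records {
		b, err := json.Marshal(r)
		if err != nil {
			return err
		}
		data[r.Name] = string(b)
	}

	cm := &corev1.ConfigMap{
		ObjectMeta: metav1.ObjectMeta{Name: "k3s-etcd-snapshots", Namespace: metav1.NamespaceSystem},
		Data:       data,
	}
	if _, err := client.CoreV1().ConfigMaps(cm.Namespace).Update(ctx, cm, metav1.UpdateOptions{}); err != nil {
		if apierrors.IsNotFound(err) {
			_, err = client.CoreV1().ConfigMaps(cm.Namespace).Create(ctx, cm, metav1.CreateOptions{})
			return err
		}
		return err
	}
	return nil
}
```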
@briandowns (Contributor) left a comment

Provisional approval pending CI.

@brandond (Contributor, Author) commented Oct 12, 2023

S390x is having issues and we do not have an ETA for restoration; all other tests are green. Merging.


5 participants