S3 Snapshot Retention policy is too aggressive #5216

Open · Oats87 opened this issue Jan 8, 2024 · 13 comments

Comments

@Oats87 (Contributor) commented Jan 8, 2024

Environmental Info:
RKE2 Version: v1.25.15+rke2r2, v1.27.7+rke2r2

Node(s) CPU architecture, OS, and Version:
N/A

Cluster Configuration:
Multi-master

Describe the bug:
According to the linked rancher/rancher issue, RKE2 is too aggressive at pruning S3 snapshots.

Steps To Reproduce:
See Rancher issue.

Additional context / logs:
rancher/rancher#43872

@brandond (Contributor) commented Jan 8, 2024

That is correct - the S3 retention count applies to ALL snapshots stored on S3, not to the number of snapshots per node.

On earlier releases of RKE2, snapshots were pruned from S3 only by the node that uploaded them, but this led to snapshots never being pruned when their source node was deleted or otherwise removed from the cluster. For clusters where nodes are upgraded or patched by replacement, this led to S3 snapshot counts growing without bound due to the orphaned snapshots never being cleaned up.

One option would be to provide separate settings and commands for pruning snapshots from nodes that are no longer cluster members, but this would probably have to go through the RFE process.
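
For reference, a minimal sketch of the server config these settings live in (flag names follow RKE2's etcd snapshot options; the endpoint, bucket, and folder values are placeholders):

```yaml
# /etc/rancher/rke2/config.yaml on each server node -- illustrative values only
etcd-snapshot-schedule-cron: "0 */12 * * *"   # each server node takes its own snapshot on this schedule
etcd-snapshot-retention: 5                    # kept locally on EACH node, and also the TOTAL kept on S3
etcd-s3: true
etcd-s3-endpoint: s3.example.com              # placeholder endpoint
etcd-s3-bucket: rke2-etcd-snapshots           # placeholder bucket
etcd-s3-folder: my-cluster                    # placeholder folder
etcd-s3-access-key: <access-key>
etcd-s3-secret-key: <secret-key>
```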

@Oats87 (Contributor, Author) commented Jan 8, 2024

OK, so it sounds like the default behavior was changed, and this is affecting our users because the change was not clearly communicated to them. Perhaps we need to look at our process to see how to avoid this in the future.

cc: @snasovich @cwayne18

@brandond (Contributor) commented Jan 8, 2024

@thomashoell commented Jan 17, 2024

> That is correct - the S3 retention count applies to ALL snapshots stored on S3, not to the number of snapshots per node.
>
> On earlier releases of RKE2, snapshots were pruned from S3 only by the node that uploaded them, but this led to snapshots never being pruned when their source node was deleted or otherwise removed from the cluster. For clusters where nodes are upgraded or patched by replacement, this led to S3 snapshot counts growing without bound due to the orphaned snapshots never being cleaned up.
>
> One option would be to provide separate settings and commands for pruning snapshots from nodes that are no longer cluster members, but this would probably have to go through the RFE process.

But right now the behaviour is not consistent with what the UI says, nor with what you'd expect.
I would be totally fine with ONE consistent snapshot per snapshot cycle in S3, but right now RKE2 removes anything except ${num_master_nodes}*${snapshot_retention_count}. So if I have 3 master nodes and a retention count of 5, I'm left with 3 snapshots of the latest cycle and 2 snapshots of the second-latest.

We've worked around this by increasing the retention count from 5 to 15. This leads to 15 snapshots per master node, but at least we have 5 usable snapshots in S3 instead of 2...

@brandond (Contributor) commented Jan 17, 2024

> right now the behaviour is not consistent with what the UI says

That would be the Rancher UI, not the RKE2 UI. It sounds like the Rancher UI text needs to be updated.

> RKE2 removes anything except ${num_master_nodes}*${snapshot_retention_count}

That is not correct. RKE2 stores snapshot_retention_count snapshots locally on each node, and snapshot_retention_count snapshots on s3. So your total COMBINED snapshot count, across both nodes and s3, would be (num_master_nodes * snapshot_retention_count) + snapshot_retention_count.

If you are looking at just s3, then yes, you would probably want to align your retention count so that you keep the desired number of snapshots on s3 - if you have 3 nodes and want 2 cycles' worth of snapshots, then you should set retention to 6, for example.

We will discuss how to enhance this to provide more granular control over snapshot retention, but it will probably require splitting the settings to allow separate counts for local and s3.
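
To make the arithmetic above concrete while local and s3 retention still share a single setting, a sketch for the 3-node example (values are illustrative):

```yaml
# /etc/rancher/rke2/config.yaml -- illustrative
etcd-snapshot-retention: 6   # S3 keeps the 6 newest snapshots = 2 full cycles from 3 nodes
# Resulting counts with this value:
#   local:    6 per node x 3 nodes = 18
#   S3:       6
#   combined: (3 * 6) + 6 = 24
```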

@horihel commented Jan 18, 2024

> We will discuss how to enhance this to provide more granular control over snapshot retention, but it will probably require splitting the settings to allow separate counts for local and s3.

Splitting the settings is probably the most sensible approach.

github-actions bot (Contributor) commented Apr 5, 2024

This repository uses a bot to automatically label issues which have not had any activity (commit/comment/label) for 45 days. This helps us manage the community issues better. If the issue is still relevant, please add a comment to the issue so the bot can remove the label and we know it is still valid. If it is no longer relevant (or possibly fixed in the latest release), the bot will automatically close the issue in 14 days. Thank you for your contributions.

@brandond (Contributor) commented Apr 6, 2024

Unstale

@fayak commented Apr 19, 2024

I'm not sure this is exactly related, but even if I set the "Keep the last X" option to some value like 15, I still have 1 snapshot from each of my 5 nodes, not 15 per node nor 15 in total (3 per node).

github-actions bot (Contributor) commented Jun 3, 2024

(Same automated stale notice as above.)

@fayak commented Jun 3, 2024

This is still relevant. S3 backups are messed up and don't work as expected.

github-actions bot (Contributor)

(Same automated stale notice as above.)

@horihel commented Jul 22, 2024

Still very much an issue - a workaround is to enable versioning and/or soft-delete on your S3 storage. Not sure if RKE2 registers recovered backups automatically, though.
