-
Notifications
You must be signed in to change notification settings - Fork 25k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Fail Snapshot Delete if Metadata can't be Read #57786
Fail Snapshot Delete if Metadata can't be Read #57786
Conversation
In case we use shard generation uuids, there is no good reason whatsoever to be resilient/lenient here when deleting a snapshot. All we do is put writes on top of a corrupted repository state. In line with the behavior elsewhere in snapshotting, we should fail-fast here to stop further writes to the repository.
Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm undecided if we should do this. Let's discuss it in the coming days.
@ywelsch it's been a while on this one. Given the recent discussions on hardening the repo I'm a bigger fan than ever of this one. I really don't see how anything good could come out of deleting from a repo when a shard folder isn't readable properly. You can still get rid of the snapshots with this change by deleting all snapshots referencing the shard, but you can't just quietly keep going in this broken state => I think it's a neat safe-guard against concurrent writing, especially (but not limited to) on S3. |
Pinging @elastic/es-distributed (Team:Distributed) |
closing this now, this is irrelevant since we decided on doing #89163 which makes this impossible |
In case we use shard generation UUIDS, there is no good reason
whatsoever to be resilient/lenient here when deleting a snapshot.
All we do is put writes on top of a corrupted repository state.
In line with the behavior elsewhere in snapshotting, we should
fail-fast here to stop further writes to the repository in a situation
where something clearly went wrong.
Also, by failing fast here we prevent garbage from accumulating in the repository because it allows the user to rerun the delete if it failed because of some transient IO issue (we couldn't/can't do this in the old metadata format because we update the root level
RepositoryData
first).Relates #57785 (the impact of this bug would've been much lower with this change,
it's not as easy to make a similar change to snapshot creation but I'm looking into it as well).