Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fail Snapshot Delete if Metadata can't be Read #57786

Conversation

original-brownbear
Copy link
Member

@original-brownbear original-brownbear commented Jun 7, 2020

In case we use shard generation UUIDS, there is no good reason
whatsoever to be resilient/lenient here when deleting a snapshot.
All we do is put writes on top of a corrupted repository state.
In line with the behavior elsewhere in snapshotting, we should
fail-fast here to stop further writes to the repository in a situation
where something clearly went wrong.
Also, by failing fast here we prevent garbage from accumulating in the repository because it allows the user to rerun the delete if it failed because of some transient IO issue (we couldn't/can't do this in the old metadata format because we update the root level RepositoryData first).

Relates #57785 (the impact of this bug would've been much lower with this change,
it's not as easy to make a similar change to snapshot creation but I'm looking into it as well).

In case we use shard generation uuids, there is no good reason
whatsoever to be resilient/lenient here when deleting a snapshot.
All we do is put writes on top of a corrupted repository state.
In line with the behavior elsewhere in snapshotting, we should
fail-fast here to stop further writes to the repository.
@elasticmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

@elasticmachine elasticmachine added the Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. label Jun 7, 2020
Copy link
Member

@tlrx tlrx left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

Copy link
Contributor

@ywelsch ywelsch left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm undecided if we should do this. Let's discuss it in the coming days.

@original-brownbear
Copy link
Member Author

@ywelsch it's been a while on this one. Given the recent discussions on hardening the repo I'm a bigger fan than ever of this one. I really don't see how anything good could come out of deleting from a repo when a shard folder isn't readable properly. You can still get rid of the snapshots with this change by deleting all snapshots referencing the shard, but you can't just quietly keep going in this broken state => I think it's a neat safe-guard against concurrent writing, especially (but not limited to) on S3.

@ywelsch ywelsch removed their request for review August 26, 2021 13:34
@arteam arteam added v8.1.0 and removed v8.0.0 labels Jan 12, 2022
@mark-vieira mark-vieira added v8.2.0 and removed v8.1.0 labels Feb 2, 2022
@elasticsearchmachine elasticsearchmachine changed the base branch from master to main July 22, 2022 23:13
@mark-vieira mark-vieira added v8.5.0 and removed v8.4.0 labels Jul 27, 2022
@elasticsearchmachine
Copy link
Collaborator

Pinging @elastic/es-distributed (Team:Distributed)

@original-brownbear
Copy link
Member Author

closing this now, this is irrelevant since we decided on doing #89163 which makes this impossible

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
:Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >non-issue Team:Distributed (Obsolete) Meta label for distributed team (obsolete). Replaced by Distributed Indexing/Coordination. team-discuss v7.9.0 v8.5.0
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants