Crashing during snapshot deletion might result in unreferenced data left in repository #13159

nkvoll · 2015-08-27T18:32:22Z

It's currently possible to end up with unreferenced data in a snapshot repository, given the following steps:

Create a index, foobar, with size X bytes
Snapshot cluster
Start deleting the snapshot, crash after deleting snapshot-{} and metadata-{} files
Delete index foobar
Snapshot cluster again

Normally, step 5 would cause the files no longer referenced by any snapshots do be deleted, but if the underlying index is deleted as well, they won't get cleaned up. In the example above, there would be X bytes of disk space used without any snapshot referencing them. Given sufficiently large values of X, this could be a significant amount of storage wasted. Even with small amounts of data, this might accrue over time to become significant.

Suggestion: create a deleting-{} file as a sibling of the snapshot-{} file that gets written before the files referenced by the snapshot gets deleted. When the deletion has been completed, this file should be the last one deleted. These files indicates that a deletion is in progress or have been attempted so it's possible to tell that the snapshot might be in a half-deleted state (so we can avoid using it). It should also enable later snapshot processes to continue the deletion process where the previous left off.

The text was updated successfully, but these errors were encountered:

tlrx · 2018-03-21T16:56:18Z

We talked about it today with @ywelsch and we still think that this is a real issue. We don't have any plan about it yet but this is something we need to think about for the future of the snapshot/restore functionality.

original-brownbear · 2019-03-20T08:03:33Z

There is another scenario here that can lead to the same stale data that I found:

Create a index, foobar, with size X bytes
Start snapshotting cluster
Abort snapshot
4 .Temporarily disconnect datanode from master (via network partition or so) while it is writing shard data to repository
Master node finishes deleting snapshot

If in this scenario the data node continues to write segment files (as seen in this test failure #39852) we get unreferenced data since step 5 has removed any metadata reference to the index and snapshotting it again will lead to a new index uuid.

original-brownbear · 2019-03-20T08:05:28Z

I opened #40228 with a suggestion on how to write the tombstones to get a better handle on this situation.

* Use ability to list child "folders" in the blob store to implement recursive delete on all stale index folders when cleaning up instead of using the diff between two `RepositoryData` instances to cover aborted deletes * Runs after ever delete operation * Relates #13159 (fixing most of this issues caused by unreferenced indices, leaving some meta files to be cleaned up only)

original-brownbear · 2019-06-21T17:51:26Z

Leaving this open, since there's still a little work left here in cleaning up unreferenced top level blobs. I'll raise a PR for that shortly.

* Use ability to list child "folders" in the blob store to implement recursive delete on all stale index folders when cleaning up instead of using the diff between two `RepositoryData` instances to cover aborted deletes * Runs after ever delete operation * Relates elastic#13159 (fixing most of this issues caused by unreferenced indices, leaving some meta files to be cleaned up only)

* Use ability to list child "folders" in the blob store to implement recursive delete on all stale index folders when cleaning up instead of using the diff between two `RepositoryData` instances to cover aborted deletes * Runs after ever delete operation * Relates #13159 (fixing most of this issues caused by unreferenced indices, leaving some meta files to be cleaned up only)

original-brownbear · 2019-08-18T19:56:27Z

Closing this as #42189 (and numerous follow-ups) resolved the bulk of the un-referenced file leaks on every delete operation and #43900 will bring a solution to cleanup the remainder (while acknowledging that some leaking can occur on errors that must be cleaned up via the cleanup endpoint that #43900 introdcues)

nkvoll added the :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs label Aug 27, 2015

clintongormley added >bug help wanted adoptme labels Aug 28, 2015

abeyad mentioned this issue May 5, 2016

Use UUIDs in working with snapshots #18156

Closed

8 tasks

original-brownbear self-assigned this Jan 28, 2019

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Feb 4, 2019

elastic#13159 start

afae671

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Feb 12, 2019

more elastic#13159

3f64e5f

original-brownbear mentioned this issue Mar 20, 2019

Add Bulk Delete API to Blob Store Interface #40250

Closed

original-brownbear mentioned this issue May 17, 2019

Recursively Delete Unreferenced Index Directories #42189

Merged

original-brownbear closed this as completed in #42189 Jun 21, 2019

original-brownbear reopened this Jun 21, 2019

original-brownbear mentioned this issue Jul 8, 2019

Recursively Delete Unreferenced Index Directories (#42189) #44051

Merged

original-brownbear closed this as completed Aug 18, 2019

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Crashing during snapshot deletion might result in unreferenced data left in repository #13159

Crashing during snapshot deletion might result in unreferenced data left in repository #13159

nkvoll commented Aug 27, 2015

tlrx commented Mar 21, 2018

original-brownbear commented Mar 20, 2019

original-brownbear commented Mar 20, 2019

original-brownbear commented Jun 21, 2019 •

edited

Loading

original-brownbear commented Aug 18, 2019

Crashing during snapshot deletion might result in unreferenced data left in repository #13159

Crashing during snapshot deletion might result in unreferenced data left in repository #13159

Comments

nkvoll commented Aug 27, 2015

tlrx commented Mar 21, 2018

original-brownbear commented Mar 20, 2019

original-brownbear commented Mar 20, 2019

original-brownbear commented Jun 21, 2019 • edited Loading

original-brownbear commented Aug 18, 2019

original-brownbear commented Jun 21, 2019 •

edited

Loading