
Snapshot restore does not fail properly when files are missing. #9433

Closed
S-Callier opened this issue Jan 27, 2015 · 5 comments
Labels
>bug · :Distributed Coordination/Snapshot/Restore · help wanted · adoptme

Comments

@S-Callier

In some cases where files are deleted or lost from a snapshot, Elasticsearch can go into an infinite loop or never return an answer to the `restoreSnapshot` request.

Steps to reproduce:
Missing segment file:

  • Create an index with multiple shards and index a few documents
  • Create a snapshot of this index
  • In the snapshot folder, delete one of the segment files for one shard
  • Restore this index from the snapshot with `setWaitForCompletion(true).execute().actionGet();`
  • The corruption is detected, but the `ListenableActionFuture` never returns (see the sketch after this list).
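For reference, a minimal sketch of the restore call from the last step, using the Java transport client of that era. The repository name "my_repo", snapshot name "snap_1", and the pre-built `client` are illustrative assumptions, not details from the report:

```java
import org.elasticsearch.action.admin.cluster.snapshots.restore.RestoreSnapshotResponse;
import org.elasticsearch.client.Client;

// `client` is an already-connected Client; "my_repo" and "snap_1" are placeholder names.
RestoreSnapshotResponse response = client.admin().cluster()
        .prepareRestoreSnapshot("my_repo", "snap_1")
        .setWaitForCompletion(true) // block until the restore finishes
        .execute()
        .actionGet();               // with a missing segment file, this call never returns
```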

Missing index data:

  • Create an index with multiple shards and index a few documents
  • Create a snapshot of this index
  • In the snapshot folder, delete each of the shard folders but keep the metadata file named `snapshot-<snapshot name>`
  • Restore this index from the snapshot with `setWaitForCompletion(false)`
  • The restore task seems to go into an infinite loop (see the sketch after this list).
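A sketch of this second scenario end to end, again with placeholder names; it assumes an "fs"-type repository is already registered as "my_repo":

```java
// Snapshot the index (all names here are placeholders).
client.admin().cluster().prepareCreateSnapshot("my_repo", "snap_1")
        .setIndices("test-index")
        .setWaitForCompletion(true)
        .get();

// ... on disk: delete the shard folders, keeping only the snapshot-<snapshot name> metadata file ...

// Kick off the restore without waiting for completion.
client.admin().cluster().prepareRestoreSnapshot("my_repo", "snap_1")
        .setWaitForCompletion(false) // returns immediately; restore runs in the background
        .get();
// The background restore task then appears to loop forever instead of failing.
```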
@clintongormley clintongormley added the :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs label Jan 27, 2015
@imotov
Contributor

imotov commented Jan 28, 2015

Currently we don't differentiate between retryable and non-retryable errors; instead, we keep retrying the restore until it succeeds or the index is deleted. The two main reasons for this behavior are that 1) identifying whether an error is retryable can be really tricky, and 2) up until #5924 there was no simple way to fail a recovery. Now that we have this mechanism, we can make retry logic the responsibility of repositories and fail the shard instead of retrying the recovery. This way the index will remain closed, and if a replica is available it will still be possible to fall back to the previous state of the index by recovering from the non-corrupted replicas.
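Schematically, the proposed split could look something like the following pseudo-code (purely illustrative; none of these names are actual Elasticsearch internals):

```java
// Hypothetical sketch of retry-vs-fail handling; every name here is invented.
void onRestoreFailure(ShardId shardId, Exception e) {
    if (repository.isRetryable(e)) {
        // Transient problem (e.g. a network hiccup): let the repository retry.
        scheduleRetry(shardId);
    } else {
        // Permanent problem (e.g. a missing segment file): fail the shard so the
        // restore terminates and a non-corrupted replica can be recovered instead.
        failShard(shardId, e);
    }
}
```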

@clintongormley
Contributor

@abeyad has this already been improved or do we still need to do something more?

abeyad pushed a commit to abeyad/elasticsearch that referenced this issue Nov 27, 2016
…gracefully when the repository is missing some data that is required for the restore operation. This test currently fails due to elastic#9433.
@abeyad

abeyad commented Nov 27, 2016

@clintongormley This has not been improved. I wrote this simple test that proves we do not gracefully terminate the restore process in the case of lost/missing snapshot data: https://github.com/elastic/elasticsearch/compare/master...abeyad:handle_restore_missing_files_gracefully?expand=1. Removing the `AwaitsFix` annotation and running the test causes it to run indefinitely.
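In the meantime, a caller can at least bound the wait instead of blocking forever. A sketch, reusing the placeholder names from the repro above:

```java
import org.elasticsearch.ElasticsearchTimeoutException;
import org.elasticsearch.common.unit.TimeValue;

try {
    client.admin().cluster()
            .prepareRestoreSnapshot("my_repo", "snap_1")
            .setWaitForCompletion(true)
            .execute()
            .actionGet(TimeValue.timeValueMinutes(5)); // give up rather than hang forever
} catch (ElasticsearchTimeoutException e) {
    // The restore did not complete in time; likely stuck on missing snapshot files.
}
```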

I will put it on my list of things to work on for snapshot/restore and brainstorm with @imotov and @s1monw the best approaches for solving it.

@abeyad abeyad self-assigned this Nov 27, 2016
@S-Callier
Author

Just as @abeyad mentioned, this issue is still present in our code too.
It's not really a critical issue for us, but thank you for having a look at it!

@abeyad abeyad removed their assignment Jul 20, 2017
@tlrx
Member

tlrx commented Jan 9, 2018

I'm happy to say that the situation has now improved for both of the scenarios described in this issue.

The restore process no longer hangs and now returns the number of failed shards (which should be greater than 0 in the situations described here). When a shard file is missing from the snapshot, the shard will fail to be allocated; the Cluster Allocation Explain API can then be used to retrieve details on why the shard failed to restore (see the sketch below).

See #27493 and #27476 for more details.
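For example, a sketch of querying the allocation explanation through the Java client ("test-index" is a placeholder for the restored index):

```java
import org.elasticsearch.action.admin.cluster.allocation.ClusterAllocationExplainResponse;

// Ask why primary shard 0 of the restored index failed to allocate.
ClusterAllocationExplainResponse explanation = client.admin().cluster()
        .prepareAllocationExplain()
        .setIndex("test-index")
        .setShard(0)
        .setPrimary(true)
        .get();
System.out.println(explanation.getExplanation());
```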

@tlrx tlrx closed this as completed Jan 9, 2018