
Snapshot restore does not fail properly when files are missing. #9433

Closed
S-Callier opened this issue Jan 27, 2015 · 5 comments
Labels
>bug · :Distributed Coordination/Snapshot/Restore · help wanted · adoptme

Comments

@S-Callier

In some cases where files are deleted or lost from a snapshot, Elasticsearch can go into an infinite loop or never return an answer to the `restoreSnapshot` request.

Steps to reproduce:
Missing segment file:

  • Create an index with multiple shards and index a few documents
  • Create a snapshot of this index
  • In the snapshot folder, delete one of the segment files for one shard
  • Restore this index from the snapshot with `setWaitForCompletion(true).execute().actionGet();`
  • The corruption is detected, but the `ListenableActionFuture` never returns (see the sketch after this list).
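For reference, a minimal sketch of the restore call from the last step, using the Java transport client of that era. The repository name "my_repo", snapshot name "snap_1", and the pre-built `client` are illustrative assumptions, not details from the report:

```java
import org.elasticsearch.action.admin.cluster.snapshots.restore.RestoreSnapshotResponse;
import org.elasticsearch.client.Client;

// `client` is an already-connected Client; "my_repo" and "snap_1" are placeholder names.
RestoreSnapshotResponse response = client.admin().cluster()
        .prepareRestoreSnapshot("my_repo", "snap_1")
        .setWaitForCompletion(true) // block until the restore finishes
        .execute()
        .actionGet();               // with a missing segment file, this call never returns
```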

Missing index data:

  • Create an index with multiple shards and index a few documents
  • Create a snapshot of this index
  • In the snapshot folder, delete each of the shard folders but keep the metadata file named `snapshot-<snapshot name>`
  • Restore this index from the snapshot with `setWaitForCompletion(false)`
  • The restore task seems to go into an infinite loop (see the sketch after this list).
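A sketch of this second scenario end to end, again with placeholder names; it assumes an "fs"-type repository is already registered as "my_repo":

```java
// Snapshot the index (all names here are placeholders).
client.admin().cluster().prepareCreateSnapshot("my_repo", "snap_1")
        .setIndices("test-index")
        .setWaitForCompletion(true)
        .get();

// ... on disk: delete the shard folders, keeping only the snapshot-<snapshot name> metadata file ...

// Kick off the restore without waiting for completion.
client.admin().cluster().prepareRestoreSnapshot("my_repo", "snap_1")
        .setWaitForCompletion(false) // returns immediately; restore runs in the background
        .get();
// The background restore task then appears to loop forever instead of failing.
```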
@clintongormley clintongormley added the :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs label Jan 27, 2015
@imotov
Contributor

imotov commented Jan 28, 2015

Currently we don't differentiate between retryable and non-retryable errors; instead, we keep retrying the restore until it succeeds or the index is deleted. The two main reasons for this behavior are that 1) identifying whether an error is retryable can be really tricky, and 2) up until #5924 there was no simple way to fail a recovery. Now that we have this mechanism, we can make retry logic the responsibility of repositories and fail the shard instead of retrying the recovery. This way the index will remain closed, and if a replica is available it will still be possible to fall back to the previous state of the index by recovering from the non-corrupted replicas.
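Schematically, the proposed split could look something like the following pseudo-code (purely illustrative; none of these names are actual Elasticsearch internals):

```java
// Hypothetical sketch of retry-vs-fail handling; every name here is invented.
void onRestoreFailure(ShardId shardId, Exception e) {
    if (repository.isRetryable(e)) {
        // Transient problem (e.g. a network hiccup): let the repository retry.
        scheduleRetry(shardId);
    } else {
        // Permanent problem (e.g. a missing segment file): fail the shard so the
        // restore terminates and a non-corrupted replica can be recovered instead.
        failShard(shardId, e);
    }
}
```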

@clintongormley
Contributor

@abeyad has this already been improved or do we still need to do something more?

abeyad pushed a commit to abeyad/elasticsearch that referenced this issue Nov 27, 2016
…gracefully when the repository is missing some data that is required for the restore operation. This test currently fails due to elastic#9433.
@abeyad

abeyad commented Nov 27, 2016

@clintongormley This has not been improved. I wrote this simple test that proves we do not gracefully terminate the restore process in the case of lost/missing snapshot data: https://github.com/elastic/elasticsearch/compare/master...abeyad:handle_restore_missing_files_gracefully?expand=1. Removing the `AwaitsFix` annotation and running the test causes it to run indefinitely.
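In the meantime, a caller can at least bound the wait instead of blocking forever. A sketch, reusing the placeholder names from the repro above:

```java
import org.elasticsearch.ElasticsearchTimeoutException;
import org.elasticsearch.common.unit.TimeValue;

try {
    client.admin().cluster()
            .prepareRestoreSnapshot("my_repo", "snap_1")
            .setWaitForCompletion(true)
            .execute()
            .actionGet(TimeValue.timeValueMinutes(5)); // give up rather than hang forever
} catch (ElasticsearchTimeoutException e) {
    // The restore did not complete in time; likely stuck on missing snapshot files.
}
```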

I will put it on my list of things to work on for snapshot/restore and brainstorm with @imotov and @s1monw the best approaches for solving it.

@abeyad abeyad self-assigned this Nov 27, 2016
@S-Callier
Author

Just as @abeyad mentioned, this issue is still present in our code too.
It's not really a critical issue for us, but thank you for having a look at it!

@abeyad abeyad removed their assignment Jul 20, 2017
@tlrx
Member

tlrx commented Jan 9, 2018

I'm happy to say that the situation has now improved for both of the scenarios described in this issue.

The restore process no longer hangs and now returns the number of failed shards (which should be greater than 0 in the situations described here). When a shard file is missing from the snapshot, the shard will fail to be allocated; the Cluster Allocation Explain API can then be used to retrieve details on why the shard failed to restore (see the sketch below).

See #27493 and #27476 for more details.
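For example, a sketch of querying the allocation explanation through the Java client ("test-index" is a placeholder for the restored index):

```java
import org.elasticsearch.action.admin.cluster.allocation.ClusterAllocationExplainResponse;

// Ask why primary shard 0 of the restored index failed to allocate.
ClusterAllocationExplainResponse explanation = client.admin().cluster()
        .prepareAllocationExplain()
        .setIndex("test-index")
        .setShard(0)
        .setPrimary(true)
        .get();
System.out.println(explanation.getExplanation());
```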

@tlrx tlrx closed this as completed Jan 9, 2018