Snapshot restore does not fail properly when files are missing. #9433
Comments
Currently we don't differentiate between retryable and non-retryable errors; instead we keep retrying the restore until it succeeds or the index is deleted. The two main reasons for this behavior are that 1) identifying whether an error is retryable can be really tricky and 2) up until #5924 there was no simple way to fail a recovery. Now that we have this mechanism, we can make retry logic the responsibility of repositories and fail the shard instead of trying to recover from it. This way the index will remain closed, and if a replica is available it will still be possible to fall back to the previous state of the index by recovering from the non-corrupted replicas.
@abeyad has this already been improved or do we still need to do something more?
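To make the proposal concrete, here is a minimal sketch of what "classify the error, retry only when it makes sense, otherwise fail the shard" could look like. It is not Elasticsearch code: the class `RestoreRetrySketch`, the `Outcome` enum, and the rule that a missing file is non-retryable are all assumptions made for illustration.

```java
import java.io.FileNotFoundException;
import java.io.IOException;
import java.nio.file.NoSuchFileException;
import java.util.concurrent.Callable;

// Hypothetical sketch, not the actual Elasticsearch restore code path.
public class RestoreRetrySketch {

    /** Result of trying to restore one shard. */
    public enum Outcome { RESTORED, SHARD_FAILED }

    // Assumption: a file that is missing from the repository will not reappear,
    // so retrying cannot help. Other I/O errors are treated as transient.
    public static boolean isRetryable(Exception e) {
        if (e instanceof NoSuchFileException || e instanceof FileNotFoundException) {
            return false;
        }
        return e instanceof IOException;
    }

    // Runs the restore attempt at most maxAttempts times. Gives up immediately
    // on a non-retryable error instead of looping until the index is deleted.
    public static Outcome restoreShard(Callable<Void> attempt, int maxAttempts) {
        for (int i = 1; i <= maxAttempts; i++) {
            try {
                attempt.call();
                return Outcome.RESTORED;
            } catch (Exception e) {
                if (!isRetryable(e) || i == maxAttempts) {
                    return Outcome.SHARD_FAILED;
                }
                // Retryable error with attempts left: try again.
            }
        }
        return Outcome.SHARD_FAILED;
    }
}
```

With a bounded attempt count and an explicit failure outcome, a restore that hits a missing segment file ends after one attempt rather than running indefinitely.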
gracefully when the repository is missing some data that is required for the restore operation. This test currently fails due to elastic#9433.
@clintongormley This has not been improved. I wrote this simple test that proves we do not gracefully terminate the restore process in the case of lost/missing snapshot data: https://github.com/elastic/elasticsearch/compare/master...abeyad:handle_restore_missing_files_gracefully?expand=1. Removing the AwaitsFix and running the test causes the test to run indefinitely. I will put it on my list of things to work on for snapshot/restore and brainstorm with @imotov and @s1monw the best approaches for solving it.
Just as abeyad mentioned, this issue is still present in our code too.
I'm happy to say that things have now improved for both situations explained in this issue. The restore process no longer hangs and now returns the number of failed shards (which should be greater than 0 in the situations described here). When a shard file is missing from the snapshot, the shard will fail to be allocated. The Cluster Allocation Explain API can then be used to retrieve details on why the shard failed to be restored.
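As a rough illustration of the last step, the sketch below builds (but does not send) a request to the Cluster Allocation Explain endpoint, `_cluster/allocation/explain`, asking about one specific shard. The class name, the base URL, and the index name are placeholders, and the body is assembled without JSON escaping, so it assumes an index name with no special characters.

```java
import java.net.URI;
import java.net.http.HttpRequest;

// Hypothetical helper; only the endpoint path and the body fields
// (index, shard, primary) come from the Cluster Allocation Explain API.
public class AllocationExplainSketch {

    // Builds the JSON body naming the shard to explain.
    // Assumption: the index name needs no JSON escaping.
    public static String explainBody(String index, int shard, boolean primary) {
        return String.format("{\"index\":\"%s\",\"shard\":%d,\"primary\":%b}",
                index, shard, primary);
    }

    // Builds the HTTP request without sending it. baseUrl is a placeholder
    // such as "http://localhost:9200".
    public static HttpRequest explainRequest(String baseUrl, String index,
                                             int shard, boolean primary) {
        return HttpRequest.newBuilder(URI.create(baseUrl + "/_cluster/allocation/explain"))
                .header("Content-Type", "application/json")
                .POST(HttpRequest.BodyPublishers.ofString(explainBody(index, shard, primary)))
                .build();
    }
}
```

Sending the request with `java.net.http.HttpClient` against a cluster where the restore reported failed shards should return the allocation decision for that shard, including the reason it could not be allocated.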
In some cases where files are deleted or lost from a snapshot, Elasticsearch can go into an infinite loop or never return a response to a restoreSnapshot request.
Steps to reproduce:
Missing segment file:
Missing index data: