Cleanup Concurrent RepositoryData Loading (#48329) #48837

original-brownbear · 2019-11-02T17:45:39Z

The loading of RepositoryData is not an atomic operation.
It uses a list + get combination of calls.
This led to accidentally returning an empty repository data
for generations >= 0, which can never be missing unless the repository
is corrupted.
In the test #48122 (and other SLM tests) there was a low chance of
running into this concurrent modification scenario, with the repository
actually moving two index generations between listing out the
index-N blobs and loading the latest version. Since we only keep
two index-N blobs around at a time, this led to unexpectedly absent
snapshots in status APIs.
Fixing the behavior to be more resilient is non-trivial but in the works.
For now I think we should simply throw in this scenario. This will also
help prevent corruption in the unlikely but possible event of running into this
issue in a snapshot create or delete operation on master failover on a
repository like S3, which doesn't have the "no overwrites" protection on
writing a new index-N.
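
To make the race concrete, here is a minimal, self-contained sketch of the list + get pattern and of the "throw instead of returning empty" behavior described above. It is not the code from this PR: the class `RepositoryDataLoadSketch`, the in-memory `InMemoryBlobStore`, the exception type, and the `EMPTY` marker are all hypothetical names made up for illustration, and repository data is modelled as a plain string.

```java
import java.util.Map;
import java.util.concurrent.ConcurrentSkipListMap;

// Hypothetical sketch: none of these names come from the Elasticsearch code base.
final class RepositoryDataLoadSketch {

    /** Marker standing in for "empty repository data" (assumption for this sketch). */
    static final String EMPTY = "<empty>";

    /** Thrown when index-N disappeared between the list call and the get call. */
    static final class ConcurrentRepositoryModificationException extends RuntimeException {
        ConcurrentRepositoryModificationException(long generation) {
            super("index-" + generation + " was listed but is gone; repository was modified concurrently");
        }
    }

    /** In-memory stand-in for a blob container that holds the index-N blobs. */
    static final class InMemoryBlobStore {
        private final ConcurrentSkipListMap<Long, String> indexBlobs = new ConcurrentSkipListMap<>();

        /** The "list" step: highest generation present, or -1 if there is no index-N yet. */
        long latestGeneration() {
            Map.Entry<Long, String> last = indexBlobs.lastEntry();
            return last == null ? -1L : last.getKey();
        }

        /** The "get" step: content of index-N, or null if that blob no longer exists. */
        String read(long generation) {
            return indexBlobs.get(generation);
        }

        /** Writes the next index-N and keeps only the two most recent generations. */
        void writeNewGeneration(String content) {
            long next = latestGeneration() + 1;
            indexBlobs.put(next, content);
            // Drop every generation older than next - 1.
            indexBlobs.headMap(next - 1).clear();
        }
    }

    /**
     * Loads the data for a generation obtained from an earlier list call.
     * Only generation -1 (nothing written yet) maps to EMPTY. A missing blob
     * for a generation >= 0 is reported as an error instead of being treated as empty.
     */
    static String load(InMemoryBlobStore store, long listedGeneration) {
        if (listedGeneration < 0) {
            return EMPTY;
        }
        String content = store.read(listedGeneration);
        if (content == null) {
            throw new ConcurrentRepositoryModificationException(listedGeneration);
        }
        return content;
    }
}
```

In this sketch a reader that lists generation N and then sees two more writes before its get will find index-N already deleted. Throwing at that point surfaces the concurrent modification to the caller rather than reporting a repository with no snapshots.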

Fixes #48122

backport of #48329


elasticmachine · 2019-11-02T17:45:40Z

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

original-brownbear · 2019-11-25T08:09:25Z

Closing this, as in hindsight it's too risky for 6.8. It added some new failures in 7.x that we then had to fix in additional changes that can't easily be backported to 6.8.

original-brownbear added the :Distributed Coordination/Snapshot/Restore and backport labels Nov 2, 2019

original-brownbear closed this Nov 25, 2019
