Cleanup Concurrent RepositoryData Loading #48329
Conversation
The loading of `RepositoryData` is not an atomic operation. It uses a list + get combination of calls. This led to accidentally returning empty repository data for generations >= 0, which can never be missing unless the repository is corrupted. In the test elastic#48122 (and other SLM tests) there was a low chance of running into this concurrent modification scenario, with the repository actually moving two index generations between listing out the index-N and loading the latest version of it. Since we only keep two index-N blobs around at a time, this led to unexpectedly absent snapshots in status APIs. Fixing the behavior to be more resilient is non-trivial but in the works. For now I think we should simply throw in this scenario. This will also help prevent corruption in the unlikely but possible event of running into this issue in a snapshot create or delete operation on master failover on a repository like S3, which doesn't have the "no overwrites" protection on writing a new index-N. Fixes elastic#48122
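To make the race concrete, here is a minimal, self-contained toy model of the "list + get" pattern described above. All names here (the map-backed blob store, listLatestGeneration, readIndexBlob) are illustrative and are not the actual `BlobStoreRepository` API; the point is only that once the "list" step has picked a generation >= 0, a missing blob in the "get" step signals a concurrent modification and should be thrown, not silently turned into empty repository data.

    import java.nio.file.NoSuchFileException;
    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    // Toy model of the "list + get" loading described above. Everything here is
    // illustrative and NOT the actual BlobStoreRepository code.
    public class RepositoryDataLoadingSketch {

        static final long EMPTY_REPO_GEN = -1L; // "no index-N found" marker

        private final Map<Long, String> indexBlobs = new ConcurrentHashMap<>();

        // The "list" step: find the newest index-N generation, or -1 if none exist.
        long listLatestGeneration() {
            return indexBlobs.keySet().stream().mapToLong(Long::longValue).max().orElse(EMPTY_REPO_GEN);
        }

        // The "get" step: a concurrent writer may have moved the repository ahead
        // and pruned this generation in the meantime.
        String readIndexBlob(long gen) throws NoSuchFileException {
            final String data = indexBlobs.get(gen);
            if (data == null) {
                throw new NoSuchFileException("index-" + gen);
            }
            return data;
        }

        String getRepositoryData() throws NoSuchFileException {
            final long gen = listLatestGeneration();
            if (gen == EMPTY_REPO_GEN) {
                return ""; // only a genuinely empty repository may yield empty data
            }
            // Before this change the NoSuchFileException from the "get" step was
            // swallowed and empty data returned; now it propagates (wrapped in a
            // RepositoryException in the real code) so callers see the concurrent
            // modification instead of an unexpectedly empty repository.
            return readIndexBlob(gen);
        }
    }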
Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)
I didn't get around to a full test run here yet, so if this doesn't go green please hold off on reviewing until I can fix the tests I may have missed :)
@@ -812,9 +812,6 @@ public void endVerification(String seed) {
    public RepositoryData getRepositoryData() {
        try {
            return getRepositoryData(latestIndexBlobId());
        } catch (NoSuchFileException ex) {
There is no good reason to catch here or below. We break out on generation -1 and return empty data, so whenever we fail to find data for gen. >= 0 it's a problem and we mustn't ignore it.
Yup, not entirely surprisingly, this is failing some REST tests:
-> WIP

EDIT: #46250 fixed the tests here, sort of by accident, by changing the order of shard- and root-level meta updates :) -> should be good to review now
LGTM, thanks Armin
@@ -925,9 +922,6 @@ private RepositoryData getRepositoryData(long indexGen) {
            repositoryData = RepositoryData.snapshotsFromXContent(parser, indexGen);
        }
        return repositoryData;
    } catch (NoSuchFileException ex) {
nit: while we are at it, there are unneeded `Long.toString()` and local variable `repositoryData` usages
I can't find it right now, but others disagreed with removing the `Long.toString` in another PR :( But yeah, let's clean up the redundant locals :)
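To make the two nits concrete, here is a tiny standalone sketch; the names are made up and this is not the actual diff from this PR. The explicit `Long.toString()` is redundant only because string concatenation already converts the `long`, and the throwaway local can simply be returned directly (and per the above, only the local was slated for cleanup here).

    // Standalone illustration of the nits discussed above; not the PR's code.
    public class NitCleanupSketch {

        static String indexBlobName(long indexGen) {
            // Equivalent to "index-" + Long.toString(indexGen): string
            // concatenation converts the long either way.
            return "index-" + indexGen;
        }

        static long parseGeneration(String blobName) {
            // Instead of assigning to a throwaway local and returning it,
            // return the parsed value directly.
            return Long.parseLong(blobName.substring("index-".length()));
        }

        public static void main(String[] args) {
            System.out.println(indexBlobName(42));            // index-42
            System.out.println(parseGeneration("index-42"));  // 42
        }
    }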
Thanks Tanguy!
Just like in elastic#48329 (and using the changes in that PR), we can run into a concurrent repo modification that we will now throw on and must retry until consistent handling of this situation is implemented. Closes elastic#47384
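For context, the retry described above boils down to a loop along these lines. This is only a hedged sketch: the helper name, the exception handling, and the attempt limit are assumptions, not the actual test code referenced by elastic#47384.

    import java.util.function.Supplier;

    // Retry a repository read that may now throw because of a concurrent
    // repository modification. Illustrative helper, not Elasticsearch test code.
    public class RetryOnConcurrentRepoModification {

        static <T> T retry(Supplier<T> read, int maxAttempts) {
            if (maxAttempts < 1) {
                throw new IllegalArgumentException("maxAttempts must be >= 1");
            }
            RuntimeException last = null;
            for (int attempt = 0; attempt < maxAttempts; attempt++) {
                try {
                    return read.get();
                } catch (RuntimeException e) { // e.g. a RepositoryException for a pruned index-N
                    last = e;
                }
            }
            throw last; // give up after exhausting the attempts
        }
    }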
The loading of `RepositoryData` is not an atomic operation. It uses a list + get combination of calls. This led to accidentally returning empty repository data for generations >= 0, which can never be missing unless the repository is corrupted.

In the test #48122 (and other SLM tests) there was a low chance of running into this concurrent modification scenario, with the repository actually moving two index generations between listing out the index-N and loading the latest version of it. Since we only keep two index-N blobs around at a time, this led to unexpectedly absent snapshots in status APIs.

Fixing the behavior to be more resilient is non-trivial but in the works. For now I think we should simply throw in this scenario. This will also help prevent corruption in the unlikely but possible event of running into this issue in a snapshot create or delete operation on master failover on a repository like S3, which doesn't have the "no overwrites" protection on writing a new index-N.

I would suggest backporting this all the way since it can theoretically have repository-corrupting effects on S3.

Fixes #48122