Use ClusterState as Consistency Source for Snapshot Repositories #49060

original-brownbear · 2019-11-13T21:50:38Z

Follow up to #49729

This change removes falling back to listing out the repository contents to find the latest index-N in write-mounted blob store repositories.
This saves 2-3 list operations on each snapshot create and delete operation. Also it makes all the snapshot status APIs cheaper (and faster) by saving one list operation there as well in many cases.
This removes the resiliency to concurrent modifications of the repository as a result and puts a repository in a corrupted state in case loading RepositoryData failed from the assumed generation.

Note to reviewers:

I suggest starting from the new test ``CorruptedBlobStoreRepositoryIT` illustrating the mechanics of a repository being marked as "corrupted" to make sure we're in agreement on the behavior change here. I know we discussed in the snapshot meeting a while back, but let's make sure we're all in agreement on the tradeoffs made here:

(good) Slightly faster and cheaper (cloud repos) repository operations (both dimensions are somewhat irrelevant I'd say for most users)
(good-neutral) Concurrent modifications to the repository by multiple clusters are detected much more reliably and force the user to re-add the repo by hand, making it so that the dangerous situation of having two clusters write to the repo at the same time is impossible. This seems like mostly a win, but he off situation where a user may use the repo for writes across two clusters and manually ensures that operations don't run at the same time would be broken by this change

elasticmachine · 2019-11-13T21:50:40Z

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

original-brownbear · 2019-11-19T09:04:34Z

Jenkins run elasticsearch-ci/1

This API call in most implementations is fairly IO heavy and slow so it is more natural to be async in the first place. Concretely though, this change is a prerequisite of elastic#49060 since determining the repository generation from the cluster state introduces situations where this call would have to wait for other operations to finish. Doing so in a blocking manner would break `SnapshotResiliencyTests` and waste a thread. Also, this sets up the possibility to in the future make use of async IO where provided by the underlying Repository implementation. In a follow-up `SnapshotsService#getRepositoryData` will be made async as well (did not do it here, since it's another huge change to do so). Note: This change for now does not alter the threading behavior in any way with the exception of blocking a GENERIC thread in `SnapshotsService#getRepositoryData` (this should be fine since this API is only used by status APIs) and is purely mechanical otherwise.

This API call in most implementations is fairly IO heavy and slow so it is more natural to be async in the first place. Concretely though, this change is a prerequisite of #49060 since determining the repository generation from the cluster state introduces situations where this call would have to wait for other operations to finish. Doing so in a blocking manner would break `SnapshotResiliencyTests` and waste a thread. Also, this sets up the possibility to in the future make use of async IO where provided by the underlying Repository implementation. In a follow-up `SnapshotsService#getRepositoryData` will be made async as well (did not do it here, since it's another huge change to do so). Note: This change for now does not alter the threading behaviour in any way (since `Repository#getRepositoryData` isn't forking) and is purely mechanical.

This API call in most implementations is fairly IO heavy and slow so it is more natural to be async in the first place. Concretely though, this change is a prerequisite of elastic#49060 since determining the repository generation from the cluster state introduces situations where this call would have to wait for other operations to finish. Doing so in a blocking manner would break `SnapshotResiliencyTests` and waste a thread. Also, this sets up the possibility to in the future make use of async IO where provided by the underlying Repository implementation. In a follow-up `SnapshotsService#getRepositoryData` will be made async as well (did not do it here, since it's another huge change to do so). Note: This change for now does not alter the threading behaviour in any way (since `Repository#getRepositoryData` isn't forking) and is purely mechanical.

This API call in most implementations is fairly IO heavy and slow so it is more natural to be async in the first place. Concretely though, this change is a prerequisite of #49060 since determining the repository generation from the cluster state introduces situations where this call would have to wait for other operations to finish. Doing so in a blocking manner would break `SnapshotResiliencyTests` and waste a thread. Also, this sets up the possibility to in the future make use of async IO where provided by the underlying Repository implementation. In a follow-up `SnapshotsService#getRepositoryData` will be made async as well (did not do it here, since it's another huge change to do so). Note: This change for now does not alter the threading behaviour in any way (since `Repository#getRepositoryData` isn't forking) and is purely mechanical.

original-brownbear · 2019-11-24T17:51:02Z

Need to get #49514 in first to stabilize SnapshotResiliencyTests enough so that they don't become unstable from this change

A few enhancements to `SnapshotResiliencyTests`: 1. Test running requests from random nodes in more spots to enhance coverage (this is particularly motivated by #49060 where the additional number of cluster state updates makes it more interesting to fully cover all kinds of network failures) 2. Fix issue with restarting only master node in one test (doing so breaks the test at an incredibly low frequency, that becomes not so low in #49060 with the additional cluster state updates between request and response) 3. Improved cluster formation checks (now properly checks the term as well when forming cluster) + makes sure all nodes are connected to all other nodes (previously the data nodes would at times not be connected to other data nodes, which was shaken out now by adding the `client()` method 4. Make sure the cluster left behind by the test makes sense by running the repo cleanup action on it (this also increases coverage of the repository cleanup action obviously and adds the basis of making it part of more resiliency tests)

A few enhancements to `SnapshotResiliencyTests`: 1. Test running requests from random nodes in more spots to enhance coverage (this is particularly motivated by elastic#49060 where the additional number of cluster state updates makes it more interesting to fully cover all kinds of network failures) 2. Fix issue with restarting only master node in one test (doing so breaks the test at an incredibly low frequency, that becomes not so low in elastic#49060 with the additional cluster state updates between request and response) 3. Improved cluster formation checks (now properly checks the term as well when forming cluster) + makes sure all nodes are connected to all other nodes (previously the data nodes would at times not be connected to other data nodes, which was shaken out now by adding the `client()` method 4. Make sure the cluster left behind by the test makes sense by running the repo cleanup action on it (this also increases coverage of the repository cleanup action obviously and adds the basis of making it part of more resiliency tests)

A few enhancements to `SnapshotResiliencyTests`: 1. Test running requests from random nodes in more spots to enhance coverage (this is particularly motivated by #49060 where the additional number of cluster state updates makes it more interesting to fully cover all kinds of network failures) 2. Fix issue with restarting only master node in one test (doing so breaks the test at an incredibly low frequency, that becomes not so low in #49060 with the additional cluster state updates between request and response) 3. Improved cluster formation checks (now properly checks the term as well when forming cluster) + makes sure all nodes are connected to all other nodes (previously the data nodes would at times not be connected to other data nodes, which was shaken out now by adding the `client()` method 4. Make sure the cluster left behind by the test makes sense by running the repo cleanup action on it (this also increases coverage of the repository cleanup action obviously and adds the basis of making it part of more resiliency tests)

Preliminary to shorten the diff of elastic#49060. In elastic#49060 we execute cluster state updates during the writing of a new index gen and thus it must be an async API.

Preliminary to shorten the diff of #49060. In #49060 we execute cluster state updates during the writing of a new index gen and thus it must be an async API.

Preliminary to shorten the diff of elastic#49060. In elastic#49060 we execute cluster state updates during the writing of a new index gen and thus it must be an async API.

Preliminary to shorten the diff of #49060. In #49060 we execute cluster state updates during the writing of a new index gen and thus it must be an async API.

The repo factories aren't supposed to start the repository they create, that happens in `RepositoriesService`. This has no effect in `master` currently but I noticed it in elastic#49060 which will introduce logic to the repo start.

original-brownbear · 2019-11-27T11:26:43Z

Jenkins run elasticsearch-ci/1

This is a preliminary to elastic#49060. It does not introduce any substantial behavior change to how the blob store repository operates. What it does is to add all the infrastructure changes around passing the cluster service to the blob store, associated test changes and a best effort approach to tracking the latest repository generation on all nodes from cluster state updates. This brings a slight improvement to the consistency by which non-master nodes (or master directly after a failover) will be able to determine the latest repository generation. It does not however do any tricky checks for the situation after a repository operation (create, delete or cleanup) that could theoretically be used to get even greater accuracy to keep this change simple. This change does not in any way alter the behavior of the blobstore repository other than adding a better "guess" for the value of the latest repo generation and is mainly intended to isolate the actual logical change to how the repository operates in elastic#49060

ywelsch · 2019-12-13T16:45:23Z

it will be retried by the next master that will just finalize the still ongoing operation

probably only unless you have a full cluster restart. In that case, SnapshotInProgress and friends are lost whereas RepositoriesMetaData is not.

I want to mainly understand the worst case here, and how that could possibly affect a user. In theory we could fix the concurrent master issue by appending the master term as an additional tie breaker to the file name.

original-brownbear · 2019-12-13T16:52:42Z

I want to mainly understand the worst case here, and how that could possibly affect a user. In theory we could fix the concurrent master issue by appending the master term as an additional tie breaker to the file name.

I think the worst case is as you mention the full cluster restart right before a snapshot finishes (i.e. before RepositoryMetadata generations get aligned). In that case, if you were to restore last snapshot taken and concurrently ran either a repository cleanup or snapshot delete the restore may fail mid-way because the cleanup or delete will interpret the last snapshot's files as unreferenced and remove them.
I feel like this is a really fringe case though. Right now you have a bunch of failure scenarios in various ES versions when restoring while deleting from a repository (the step of reading RepositoryData which happens on the data nodes restoring is error prone in any version if a new one is written concurrently ... ) so I'm not sure it's worth it worrying about this? It's super unlikely and even without this change deleting anything while restoring has the potential to kill the restore, we're just adding one new very unlikely scenario.

original-brownbear · 2019-12-13T16:59:20Z

I want to mainly understand the worst case here, and how that could possibly affect a user. In theory we could fix the concurrent master issue by appending the master term as an additional tie breaker to the file name.

I think the worst case is as you mention the full cluster restart right before a snapshot finishes (i.e. before RepositoryMetadata generations get aligned). In that case, if you were to restore last snapshot taken and concurrently ran either a repository cleanup or snapshot delete the restore may fail mid-way because the cleanup or delete will interpret the last snapshot's files as unreferenced and remove them.
I feel like this is a really fringe case though. Right now you have a bunch of failure scenarios in various ES versions when restoring while deleting from a repository (the step of reading RepositoryData which happens on the data nodes restoring is error prone in any version if a new one is written concurrently ... ) so I'm not sure it's worth it worrying about this? It's super unlikely and even without this change deleting anything while restoring has the potential to kill the restore, we're just adding one new very unlikely scenario.

appending the master term as an additional tie breaker to the file name

So basically we'd have index-{N}-{term} and if we run into say pending generation 5, we'd list index-5-* and move to that as the safe generation if we were to find a blob for index-5-{someTerm}? I think we could do that, but since it really only helps the concurrent repository access case I'm not sure it's worth the changes (we'd have to change what we store in index.latest and refactor the generation to be two long or a String wouldn't we?).

Scratch this I was being stupid, incrementing the pending generation ensure we never collide on a generation anyway. Even if we add the term it doesn't change anything regarding the failure scenario of reading a not-yet-created index-N IMO does it? Either it makes us dependt on listing which may be inconsistent anyway because we only store the generation but not the term in the CS or it makes us potentially read a non existing blob and run into the mentioned consistency trouble.

original-brownbear · 2019-12-16T21:42:05Z

@ywelsch as discussed earlier today, I added logic to identify a "dirty" startup of the repository now in e3f7ff2 (code and test) and 564202b (docs) that shoul make us never lose the latest state (assuming the list operation works well enough on full cluster restart :) but it should be fine and even if it isn't once in a blue moon, the repo can not become corrupted as a result still).

ywelsch

LGTM. Can you, in a follow-up, also add more tests for read-only repositories, given that they have a slightly different way to load stuff?

original-brownbear · 2019-12-17T08:45:23Z

Thanks Yannick!

Can you, in a follow-up, also add more tests for read-only repositories, given that they have a slightly different way to load stuff?

Will do :)

…stic#49060) Follow up to elastic#49729 This change removes falling back to listing out the repository contents to find the latest `index-N` in write-mounted blob store repositories. This saves 2-3 list operations on each snapshot create and delete operation. Also it makes all the snapshot status APIs cheaper (and faster) by saving one list operation there as well in many cases. This removes the resiliency to concurrent modifications of the repository as a result and puts a repository in a `corrupted` state in case loading `RepositoryData` failed from the assumed generation.

) (#50267) Follow up to #49729 This change removes falling back to listing out the repository contents to find the latest `index-N` in write-mounted blob store repositories. This saves 2-3 list operations on each snapshot create and delete operation. Also it makes all the snapshot status APIs cheaper (and faster) by saving one list operation there as well in many cases. This removes the resiliency to concurrent modifications of the repository as a result and puts a repository in a `corrupted` state in case loading `RepositoryData` failed from the assumed generation.

This is a preliminary to elastic#49060. It does not introduce any substantial behavior change to how the blob store repository operates. What it does is to add all the infrastructure changes around passing the cluster service to the blob store, associated test changes and a best effort approach to tracking the latest repository generation on all nodes from cluster state updates. This brings a slight improvement to the consistency by which non-master nodes (or master directly after a failover) will be able to determine the latest repository generation. It does not however do any tricky checks for the situation after a repository operation (create, delete or cleanup) that could theoretically be used to get even greater accuracy to keep this change simple. This change does not in any way alter the behavior of the blobstore repository other than adding a better "guess" for the value of the latest repo generation and is mainly intended to isolate the actual logical change to how the repository operates in elastic#49060

Step on the road to elastic#49060. This commit adds the logic to keep track of a repository's generation across repository operations. See changes to package level Javadoc for the concrete changes in the distributed state machine. It updates the write side of new repository generations to be fully consistent via the cluster state. With this change, no `index-N` will be overwritten for the same repository ever. So eventual consistency issues around conflicting updates to the same `index-N` are not a possibility any longer. With this change the read side will still use listing of repository contents instead of relying solely on the cluster state contents. The logic for that will be introduced in elastic#49060. This retains the ability to externally delete the contents of a repository and continue using it afterwards for the time being. In elastic#49060 the use of listing to determine the repository generation will be removed in all cases (except for full-cluster restart) as the last step in this effort.

…stic#49060) Follow up to elastic#49729 This change removes falling back to listing out the repository contents to find the latest `index-N` in write-mounted blob store repositories. This saves 2-3 list operations on each snapshot create and delete operation. Also it makes all the snapshot status APIs cheaper (and faster) by saving one list operation there as well in many cases. This removes the resiliency to concurrent modifications of the repository as a result and puts a repository in a `corrupted` state in case loading `RepositoryData` failed from the assumed generation.

SwiftRepository extends BlobStoreRepository which, as of this commit[1], is aware of the ClusterState. This was done to facilitate this change[2]. [1]: elastic/elasticsearch@459d8ed [2]: elastic/elasticsearch#49060

original-brownbear added WIP :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs labels Nov 13, 2019

original-brownbear changed the title ~~Use ClusterState as Consistency Source~~ Use ClusterState as Consistency Source for Snapshot Repositories Nov 13, 2019

original-brownbear mentioned this pull request Nov 18, 2019

Make Snapshot Delete Concurrency Exception Consistent #49266

Merged

original-brownbear mentioned this pull request Nov 19, 2019

Make Repository.getRepositoryData an Async API #49299

Merged

original-brownbear mentioned this pull request Nov 19, 2019

Make Repository.getRepositoryData an Async API (#49299) #49312

Merged

original-brownbear mentioned this pull request Nov 24, 2019

Enhance SnapshotResiliencyTests #49514

Merged

original-brownbear mentioned this pull request Nov 25, 2019

Enhance SnapshotResiliencyTests (#49514) #49541

Merged

original-brownbear mentioned this pull request Nov 26, 2019

Make BlobStoreRepository#writeIndexGen API Async #49584

Merged

original-brownbear mentioned this pull request Nov 26, 2019

Make BlobStoreRepository#writeIndexGen API Async (#49584) #49610

Merged

original-brownbear mentioned this pull request Nov 27, 2019

Fix Incorrect Repo Start in SnapshotResiliencyTests #49626

Closed

original-brownbear mentioned this pull request Nov 27, 2019

Make BlobStoreRepository Aware of ClusterState #49639

Merged

original-brownbear added 2 commits December 13, 2019 17:21

Merge remote-tracking branch 'elastic/master' into repo-uses-cs

7313d38

rename an remove pointless prop

cb84ee1

original-brownbear requested a review from ywelsch December 13, 2019 16:26

original-brownbear added 4 commits December 16, 2019 16:13

Merge remote-tracking branch 'elastic/master' into repo-uses-cs

2158ea7

Merge remote-tracking branch 'elastic/master' into repo-uses-cs

3dfcf10

correct handling unclean full cluster restart

e3f7ff2

docs

564202b

original-brownbear added 2 commits December 17, 2019 08:12

stable

028e860

Merge remote-tracking branch 'elastic/master' into repo-uses-cs

edde42f

ywelsch approved these changes Dec 17, 2019

View reviewed changes

original-brownbear merged commit 8402ad9 into elastic:master Dec 17, 2019

original-brownbear deleted the repo-uses-cs branch December 17, 2019 08:45

original-brownbear mentioned this pull request Dec 17, 2019

Use ClusterState as Consistency Source for Snapshot Repositories (#49060) #50267

Merged

This was referenced Feb 3, 2020

[meta] 7.6 release elastic/elasticsearch-net#4340

Closed

[meta] 7.6 release elastic/elasticsearch-net#4341

Closed

original-brownbear restored the repo-uses-cs branch August 6, 2020 18:23

jakelandis added v8.0.0-alpha1 and removed v8.0.0 labels Jul 26, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Use ClusterState as Consistency Source for Snapshot Repositories #49060

Use ClusterState as Consistency Source for Snapshot Repositories #49060

original-brownbear commented Nov 13, 2019 •

edited

Loading

elasticmachine commented Nov 13, 2019

original-brownbear commented Nov 19, 2019

original-brownbear commented Nov 24, 2019

original-brownbear commented Nov 27, 2019

ywelsch commented Dec 13, 2019

original-brownbear commented Dec 13, 2019

original-brownbear commented Dec 13, 2019 •

edited

Loading

original-brownbear commented Dec 16, 2019

ywelsch left a comment

original-brownbear commented Dec 17, 2019

Use ClusterState as Consistency Source for Snapshot Repositories #49060

Use ClusterState as Consistency Source for Snapshot Repositories #49060

Conversation

original-brownbear commented Nov 13, 2019 • edited Loading

elasticmachine commented Nov 13, 2019

original-brownbear commented Nov 19, 2019

original-brownbear commented Nov 24, 2019

original-brownbear commented Nov 27, 2019

ywelsch commented Dec 13, 2019

original-brownbear commented Dec 13, 2019

original-brownbear commented Dec 13, 2019 • edited Loading

original-brownbear commented Dec 16, 2019

ywelsch left a comment

Choose a reason for hiding this comment

original-brownbear commented Dec 17, 2019

original-brownbear commented Nov 13, 2019 •

edited

Loading

original-brownbear commented Dec 13, 2019 •

edited

Loading