-
Notifications
You must be signed in to change notification settings - Fork 24.9k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Use ClusterState as Consistency Source for Snapshot Repositories #49060
Use ClusterState as Consistency Source for Snapshot Repositories #49060
Conversation
Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore) |
Jenkins run elasticsearch-ci/1 |
This API call in most implementations is fairly IO heavy and slow so it is more natural to be async in the first place. Concretely though, this change is a prerequisite of elastic#49060 since determining the repository generation from the cluster state introduces situations where this call would have to wait for other operations to finish. Doing so in a blocking manner would break `SnapshotResiliencyTests` and waste a thread. Also, this sets up the possibility to in the future make use of async IO where provided by the underlying Repository implementation. In a follow-up `SnapshotsService#getRepositoryData` will be made async as well (did not do it here, since it's another huge change to do so). Note: This change for now does not alter the threading behavior in any way with the exception of blocking a GENERIC thread in `SnapshotsService#getRepositoryData` (this should be fine since this API is only used by status APIs) and is purely mechanical otherwise.
This API call in most implementations is fairly IO heavy and slow so it is more natural to be async in the first place. Concretely though, this change is a prerequisite of #49060 since determining the repository generation from the cluster state introduces situations where this call would have to wait for other operations to finish. Doing so in a blocking manner would break `SnapshotResiliencyTests` and waste a thread. Also, this sets up the possibility to in the future make use of async IO where provided by the underlying Repository implementation. In a follow-up `SnapshotsService#getRepositoryData` will be made async as well (did not do it here, since it's another huge change to do so). Note: This change for now does not alter the threading behaviour in any way (since `Repository#getRepositoryData` isn't forking) and is purely mechanical.
This API call in most implementations is fairly IO heavy and slow so it is more natural to be async in the first place. Concretely though, this change is a prerequisite of elastic#49060 since determining the repository generation from the cluster state introduces situations where this call would have to wait for other operations to finish. Doing so in a blocking manner would break `SnapshotResiliencyTests` and waste a thread. Also, this sets up the possibility to in the future make use of async IO where provided by the underlying Repository implementation. In a follow-up `SnapshotsService#getRepositoryData` will be made async as well (did not do it here, since it's another huge change to do so). Note: This change for now does not alter the threading behaviour in any way (since `Repository#getRepositoryData` isn't forking) and is purely mechanical.
This API call in most implementations is fairly IO heavy and slow so it is more natural to be async in the first place. Concretely though, this change is a prerequisite of #49060 since determining the repository generation from the cluster state introduces situations where this call would have to wait for other operations to finish. Doing so in a blocking manner would break `SnapshotResiliencyTests` and waste a thread. Also, this sets up the possibility to in the future make use of async IO where provided by the underlying Repository implementation. In a follow-up `SnapshotsService#getRepositoryData` will be made async as well (did not do it here, since it's another huge change to do so). Note: This change for now does not alter the threading behaviour in any way (since `Repository#getRepositoryData` isn't forking) and is purely mechanical.
Need to get #49514 in first to stabilize |
A few enhancements to `SnapshotResiliencyTests`: 1. Test running requests from random nodes in more spots to enhance coverage (this is particularly motivated by #49060 where the additional number of cluster state updates makes it more interesting to fully cover all kinds of network failures) 2. Fix issue with restarting only master node in one test (doing so breaks the test at an incredibly low frequency, that becomes not so low in #49060 with the additional cluster state updates between request and response) 3. Improved cluster formation checks (now properly checks the term as well when forming cluster) + makes sure all nodes are connected to all other nodes (previously the data nodes would at times not be connected to other data nodes, which was shaken out now by adding the `client()` method 4. Make sure the cluster left behind by the test makes sense by running the repo cleanup action on it (this also increases coverage of the repository cleanup action obviously and adds the basis of making it part of more resiliency tests)
A few enhancements to `SnapshotResiliencyTests`: 1. Test running requests from random nodes in more spots to enhance coverage (this is particularly motivated by elastic#49060 where the additional number of cluster state updates makes it more interesting to fully cover all kinds of network failures) 2. Fix issue with restarting only master node in one test (doing so breaks the test at an incredibly low frequency, that becomes not so low in elastic#49060 with the additional cluster state updates between request and response) 3. Improved cluster formation checks (now properly checks the term as well when forming cluster) + makes sure all nodes are connected to all other nodes (previously the data nodes would at times not be connected to other data nodes, which was shaken out now by adding the `client()` method 4. Make sure the cluster left behind by the test makes sense by running the repo cleanup action on it (this also increases coverage of the repository cleanup action obviously and adds the basis of making it part of more resiliency tests)
A few enhancements to `SnapshotResiliencyTests`: 1. Test running requests from random nodes in more spots to enhance coverage (this is particularly motivated by #49060 where the additional number of cluster state updates makes it more interesting to fully cover all kinds of network failures) 2. Fix issue with restarting only master node in one test (doing so breaks the test at an incredibly low frequency, that becomes not so low in #49060 with the additional cluster state updates between request and response) 3. Improved cluster formation checks (now properly checks the term as well when forming cluster) + makes sure all nodes are connected to all other nodes (previously the data nodes would at times not be connected to other data nodes, which was shaken out now by adding the `client()` method 4. Make sure the cluster left behind by the test makes sense by running the repo cleanup action on it (this also increases coverage of the repository cleanup action obviously and adds the basis of making it part of more resiliency tests)
Preliminary to shorten the diff of elastic#49060. In elastic#49060 we execute cluster state updates during the writing of a new index gen and thus it must be an async API.
Preliminary to shorten the diff of elastic#49060. In elastic#49060 we execute cluster state updates during the writing of a new index gen and thus it must be an async API.
The repo factories aren't supposed to start the repository they create, that happens in `RepositoriesService`. This has no effect in `master` currently but I noticed it in elastic#49060 which will introduce logic to the repo start.
Jenkins run elasticsearch-ci/1 |
This is a preliminary to elastic#49060. It does not introduce any substantial behavior change to how the blob store repository operates. What it does is to add all the infrastructure changes around passing the cluster service to the blob store, associated test changes and a best effort approach to tracking the latest repository generation on all nodes from cluster state updates. This brings a slight improvement to the consistency by which non-master nodes (or master directly after a failover) will be able to determine the latest repository generation. It does not however do any tricky checks for the situation after a repository operation (create, delete or cleanup) that could theoretically be used to get even greater accuracy to keep this change simple. This change does not in any way alter the behavior of the blobstore repository other than adding a better "guess" for the value of the latest repo generation and is mainly intended to isolate the actual logical change to how the repository operates in elastic#49060
probably only unless you have a full cluster restart. In that case, SnapshotInProgress and friends are lost whereas RepositoriesMetaData is not. I want to mainly understand the worst case here, and how that could possibly affect a user. In theory we could fix the concurrent master issue by appending the master term as an additional tie breaker to the file name. |
I think the worst case is as you mention the full cluster restart right before a snapshot finishes (i.e. before |
I think the worst case is as you mention the full cluster restart right before a snapshot finishes (i.e. before
Scratch this I was being stupid, incrementing the pending generation ensure we never collide on a generation anyway. Even if we add the term it doesn't change anything regarding the failure scenario of reading a not-yet-created index-N IMO does it? Either it makes us dependt on listing which may be inconsistent anyway because we only store the generation but not the term in the CS or it makes us potentially read a non existing blob and run into the mentioned consistency trouble. |
@ywelsch as discussed earlier today, I added logic to identify a "dirty" startup of the repository now in e3f7ff2 (code and test) and 564202b (docs) that shoul make us never lose the latest state (assuming the list operation works well enough on full cluster restart :) but it should be fine and even if it isn't once in a blue moon, the repo can not become corrupted as a result still). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Can you, in a follow-up, also add more tests for read-only repositories, given that they have a slightly different way to load stuff?
Thanks Yannick!
Will do :) |
…stic#49060) Follow up to elastic#49729 This change removes falling back to listing out the repository contents to find the latest `index-N` in write-mounted blob store repositories. This saves 2-3 list operations on each snapshot create and delete operation. Also it makes all the snapshot status APIs cheaper (and faster) by saving one list operation there as well in many cases. This removes the resiliency to concurrent modifications of the repository as a result and puts a repository in a `corrupted` state in case loading `RepositoryData` failed from the assumed generation.
) (#50267) Follow up to #49729 This change removes falling back to listing out the repository contents to find the latest `index-N` in write-mounted blob store repositories. This saves 2-3 list operations on each snapshot create and delete operation. Also it makes all the snapshot status APIs cheaper (and faster) by saving one list operation there as well in many cases. This removes the resiliency to concurrent modifications of the repository as a result and puts a repository in a `corrupted` state in case loading `RepositoryData` failed from the assumed generation.
This is a preliminary to elastic#49060. It does not introduce any substantial behavior change to how the blob store repository operates. What it does is to add all the infrastructure changes around passing the cluster service to the blob store, associated test changes and a best effort approach to tracking the latest repository generation on all nodes from cluster state updates. This brings a slight improvement to the consistency by which non-master nodes (or master directly after a failover) will be able to determine the latest repository generation. It does not however do any tricky checks for the situation after a repository operation (create, delete or cleanup) that could theoretically be used to get even greater accuracy to keep this change simple. This change does not in any way alter the behavior of the blobstore repository other than adding a better "guess" for the value of the latest repo generation and is mainly intended to isolate the actual logical change to how the repository operates in elastic#49060
Step on the road to elastic#49060. This commit adds the logic to keep track of a repository's generation across repository operations. See changes to package level Javadoc for the concrete changes in the distributed state machine. It updates the write side of new repository generations to be fully consistent via the cluster state. With this change, no `index-N` will be overwritten for the same repository ever. So eventual consistency issues around conflicting updates to the same `index-N` are not a possibility any longer. With this change the read side will still use listing of repository contents instead of relying solely on the cluster state contents. The logic for that will be introduced in elastic#49060. This retains the ability to externally delete the contents of a repository and continue using it afterwards for the time being. In elastic#49060 the use of listing to determine the repository generation will be removed in all cases (except for full-cluster restart) as the last step in this effort.
…stic#49060) Follow up to elastic#49729 This change removes falling back to listing out the repository contents to find the latest `index-N` in write-mounted blob store repositories. This saves 2-3 list operations on each snapshot create and delete operation. Also it makes all the snapshot status APIs cheaper (and faster) by saving one list operation there as well in many cases. This removes the resiliency to concurrent modifications of the repository as a result and puts a repository in a `corrupted` state in case loading `RepositoryData` failed from the assumed generation.
SwiftRepository extends BlobStoreRepository which, as of this commit[1], is aware of the ClusterState. This was done to facilitate this change[2]. [1]: elastic/elasticsearch@459d8ed [2]: elastic/elasticsearch#49060
SwiftRepository extends BlobStoreRepository which, as of this commit[1], is aware of the ClusterState. This was done to facilitate this change[2]. [1]: elastic/elasticsearch@459d8ed [2]: elastic/elasticsearch#49060
Follow up to #49729
This change removes falling back to listing out the repository contents to find the latest
index-N
in write-mounted blob store repositories.This saves 2-3 list operations on each snapshot create and delete operation. Also it makes all the snapshot status APIs cheaper (and faster) by saving one list operation there as well in many cases.
This removes the resiliency to concurrent modifications of the repository as a result and puts a repository in a
corrupted
state in case loadingRepositoryData
failed from the assumed generation.Note to reviewers:
I suggest starting from the new test ``CorruptedBlobStoreRepositoryIT` illustrating the mechanics of a repository being marked as "corrupted" to make sure we're in agreement on the behavior change here. I know we discussed in the snapshot meeting a while back, but let's make sure we're all in agreement on the tradeoffs made here:
(good) Slightly faster and cheaper (cloud repos) repository operations (both dimensions are somewhat irrelevant I'd say for most users)
(good-neutral) Concurrent modifications to the repository by multiple clusters are detected much more reliably and force the user to re-add the repo by hand, making it so that the dangerous situation of having two clusters write to the repo at the same time is impossible. This seems like mostly a win, but he off situation where a user may use the repo for writes across two clusters and manually ensures that operations don't run at the same time would be broken by this change