
Use ClusterState as Consistency Source for Snapshot Repositories #49060

Merged
merged 26 commits into elastic:master from the repo-uses-cs branch on Dec 17, 2019

Conversation

@original-brownbear (Member) commented Nov 13, 2019

Follow up to #49729

This change removes falling back to listing out the repository contents to find the latest `index-N` in write-mounted blob store repositories.
This saves 2-3 list operations on each snapshot create and delete operation. It also makes all the snapshot status APIs cheaper (and faster) by saving one list operation there as well in many cases.
As a result, this removes the resiliency to concurrent modifications of the repository and puts a repository in a `corrupted` state if loading `RepositoryData` from the assumed generation fails.
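
For readers skimming the diff, here is a minimal sketch of the behavior change described above, written in plain Java with hypothetical stand-in types rather than the actual `BlobStoreRepository` internals: the repository generation comes from the cluster state, and a failure to load `RepositoryData` at that generation marks the repository as corrupted instead of falling back to listing `index-N` blobs.

```java
import java.io.IOException;

// Hypothetical stand-in types for illustration only; not the real Elasticsearch classes.
interface RepoBlobReader {
    byte[] readBlob(String name) throws IOException;          // e.g. reads "index-42"
    long findLatestGenerationByListing() throws IOException;  // the old fallback path
}

final class RepositoryGenerationResolver {
    static final long CORRUPTED_GENERATION = -3L; // sentinel: repo must be re-added by hand

    private final RepoBlobReader blobs;
    private volatile long generationFromClusterState; // tracked via the cluster state

    RepositoryGenerationResolver(RepoBlobReader blobs, long generationFromClusterState) {
        this.blobs = blobs;
        this.generationFromClusterState = generationFromClusterState;
    }

    /** Loads the serialized RepositoryData at the generation the cluster state says is safe. */
    byte[] loadRepositoryData() throws IOException {
        final long gen = generationFromClusterState;
        try {
            // New behavior: trust the cluster state generation, no list operation needed.
            return blobs.readBlob("index-" + gen);
        } catch (IOException e) {
            // Old behavior would fall back to blobs.findLatestGenerationByListing().
            // New behavior: mark the repository corrupted; the user has to re-add it.
            generationFromClusterState = CORRUPTED_GENERATION;
            throw new IllegalStateException("Repository corrupted: failed to load generation " + gen, e);
        }
    }
}
```
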

Note to reviewers:

I suggest starting from the new test `CorruptedBlobStoreRepositoryIT`, which illustrates the mechanics of a repository being marked as "corrupted", to make sure we're in agreement on the behavior change here. I know we discussed this in the snapshot meeting a while back, but let's make sure we're all in agreement on the tradeoffs made here:

  • (good) Slightly faster and cheaper (for cloud repositories) repository operations (both dimensions are somewhat irrelevant for most users, I'd say)

  • (good-neutral) Concurrent modifications to the repository by multiple clusters are detected much more reliably and force the user to re-add the repo by hand, making the dangerous situation of having two clusters write to the repo at the same time impossible. This seems like mostly a win, but the odd situation where a user uses the repo for writes across two clusters and manually ensures that operations don't run at the same time would be broken by this change.

@original-brownbear added the WIP and :Distributed Coordination/Snapshot/Restore (Anything directly related to the `_snapshot/*` APIs) labels Nov 13, 2019
@elasticmachine (Collaborator)

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

@original-brownbear changed the title from "Use ClusterState as Consistency Source" to "Use ClusterState as Consistency Source for Snapshot Repositories" Nov 13, 2019
@original-brownbear (Member, Author)

Jenkins run elasticsearch-ci/1

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Nov 19, 2019
This API call is fairly IO-heavy and slow in most implementations, so it is more natural for it to be async in the first place.
Concretely, this change is a prerequisite of elastic#49060, since determining the repository generation from the cluster state introduces situations where this call would have to wait for other operations to finish. Doing so in a blocking manner would break `SnapshotResiliencyTests` and waste a thread.
This also sets up the possibility of making use of async IO in the future, where the underlying Repository implementation provides it.

In a follow-up, `SnapshotsService#getRepositoryData` will be made async as well (not done here, since it would be another huge change).
Note: For now this change does not alter the threading behavior in any way, with the exception of blocking a GENERIC thread in `SnapshotsService#getRepositoryData` (this should be fine since that API is only used by the status APIs), and is purely mechanical otherwise.
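
As a rough illustration of the API shape change this commit describes, here is a sketch with generic stand-in types; the real code uses `ActionListener` and the `Repository` interface, whose exact signatures are not reproduced here.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

// Stand-in types for illustration; not the actual Repository/ActionListener signatures.
final class AsyncRepositoryDataSketch {

    interface Listener<T> {
        void onResponse(T value);
        void onFailure(Exception e);
    }

    private final ExecutorService generic = Executors.newFixedThreadPool(2); // "GENERIC"-like pool

    /** Old style: blocks the calling thread for the duration of the (slow) blob store IO. */
    long getRepositoryGenerationBlocking() {
        return slowLoadFromBlobStore();
    }

    /** New style: the caller hands in a listener and the result is delivered asynchronously. */
    void getRepositoryGeneration(Listener<Long> listener) {
        generic.execute(() -> {
            try {
                listener.onResponse(slowLoadFromBlobStore());
            } catch (Exception e) {
                listener.onFailure(e);
            }
        });
    }

    private long slowLoadFromBlobStore() {
        return 42L; // placeholder for the IO-heavy read of the repository metadata
    }
}
```
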
@original-brownbear (Member, Author)

Need to get #49514 in first to stabilize `SnapshotResiliencyTests` enough that they don't become unstable as a result of this change.

original-brownbear added a commit that referenced this pull request Nov 25, 2019
A few enhancements to `SnapshotResiliencyTests`:
1. Test running requests from random nodes in more spots to enhance coverage (this is particularly motivated by #49060, where the additional number of cluster state updates makes it more interesting to fully cover all kinds of network failures)
2. Fix an issue with restarting only the master node in one test (doing so breaks the test at an incredibly low frequency, which becomes not so low in #49060 with the additional cluster state updates between request and response)
3. Improved cluster formation checks (now properly checks the term as well when forming the cluster) and makes sure all nodes are connected to all other nodes (previously the data nodes would at times not be connected to other data nodes, which was shaken out by adding the `client()` method)
4. Make sure the cluster left behind by the test makes sense by running the repo cleanup action on it (this also increases coverage of the repository cleanup action and lays the groundwork for making it part of more resiliency tests)
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Nov 26, 2019
Preliminary change to shorten the diff of elastic#49060. In elastic#49060 we execute cluster state updates during the writing of a new index generation, and thus it must be an async API.
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Nov 27, 2019
The repo factories aren't supposed to start the repository they create; that happens in `RepositoriesService`. This has no effect in `master` currently, but I noticed it in elastic#49060, which will introduce logic to the repository start.
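
A tiny sketch of the intended split, using hypothetical names rather than the real plugin API: factories only construct the repository, and the service that registers it owns the lifecycle.

```java
import java.util.Map;
import java.util.function.Function;

// Hypothetical sketch: factories construct repositories, the service starts them.
final class RepositoryLifecycleSketch {

    interface Repository {
        void start(); // elastic#49060 adds logic here, so double-starting would matter
    }

    // A factory creates the repository but must not call start() itself.
    interface RepositoryFactory extends Function<String, Repository> {
    }

    static final class RepositoriesServiceSketch {
        private final Map<String, RepositoryFactory> factories;

        RepositoriesServiceSketch(Map<String, RepositoryFactory> factories) {
            this.factories = factories;
        }

        Repository registerRepository(String type, String name) {
            final Repository repository = factories.get(type).apply(name);
            repository.start(); // the lifecycle is owned by the service, not the factory
            return repository;
        }
    }
}
```
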
@original-brownbear (Member, Author)

Jenkins run elasticsearch-ci/1

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Nov 27, 2019
This is a preliminary to elastic#49060.

It does not introduce any substantial behavior change to how the blob store repository operates. What it does is add all the infrastructure changes around passing the cluster service to the blob store, the associated test changes, and a best-effort approach to tracking the latest repository generation on all nodes from cluster state updates. This brings a slight improvement to the consistency with which non-master nodes (or the master directly after a failover) will be able to determine the latest repository generation. It does not, however, do any tricky checks for the situation after a repository operation (create, delete or cleanup) that could theoretically be used to get even greater accuracy, in order to keep this change simple.
This change does not in any way alter the behavior of the blobstore repository other than adding a better "guess" for the value of the latest repo generation, and is mainly intended to isolate the actual logical change to how the repository operates in elastic#49060.
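
A condensed sketch of the "best effort" tracking this commit describes, using stand-in types; the real change wires cluster state application into the blob store repository rather than a free-standing class like this.

```java
import java.util.concurrent.atomic.AtomicLong;

// Stand-in sketch of best-effort tracking of the latest repository generation on every node.
final class LatestGenerationTrackerSketch {

    // Best-effort view of the newest repository generation this node has seen so far.
    private final AtomicLong latestKnownRepoGen = new AtomicLong(-1L);

    /** Called whenever a new cluster state carrying the repo generation is applied on this node. */
    void onNewClusterState(long generationInClusterState) {
        // Only ever move forward; applying an older cluster state must not lower the guess.
        latestKnownRepoGen.accumulateAndGet(generationInClusterState, Math::max);
    }

    /** Used as the starting point for loading RepositoryData instead of listing index-N blobs. */
    long bestGuessGeneration() {
        return latestKnownRepoGen.get();
    }
}
```
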
@ywelsch (Contributor) commented Dec 13, 2019

> it will be retried by the next master that will just finalize the still ongoing operation

Probably, unless you have a full cluster restart. In that case, SnapshotInProgress and friends are lost whereas RepositoriesMetaData is not.

I want to mainly understand the worst case here, and how that could possibly affect a user. In theory we could fix the concurrent master issue by appending the master term as an additional tie breaker to the file name.

@original-brownbear (Member, Author) commented Dec 13, 2019

> I want to mainly understand the worst case here, and how that could possibly affect a user. In theory we could fix the concurrent master issue by appending the master term as an additional tie breaker to the file name.

I think the worst case is, as you mention, a full cluster restart right before a snapshot finishes (i.e. before the RepositoryMetadata generations get aligned). In that case, if you were to restore the last snapshot taken and concurrently ran either a repository cleanup or a snapshot delete, the restore might fail mid-way because the cleanup or delete will interpret the last snapshot's files as unreferenced and remove them.
I feel like this is a really fringe case though. Right now you have a bunch of failure scenarios in various ES versions when restoring while deleting from a repository (the step of reading RepositoryData that happens on the data nodes during a restore is error-prone in any version if a new one is written concurrently ...), so I'm not sure it's worth worrying about this? It's super unlikely, and even without this change deleting anything while restoring has the potential to kill the restore; we're just adding one new, very unlikely scenario.

> appending the master term as an additional tie breaker to the file name

So basically we'd have `index-{N}-{term}`, and if we run into, say, pending generation 5, we'd list `index-5-*` and move to that as the safe generation if we were to find a blob for `index-5-{someTerm}`? I think we could do that, but since it really only helps the concurrent repository access case, I'm not sure it's worth the changes (we'd have to change what we store in `index.latest` and refactor the generation to be two longs or a String, wouldn't we?).

Scratch this, I was being stupid: incrementing the pending generation ensures we never collide on a generation anyway. Even if we add the term, it doesn't change anything regarding the failure scenario of reading a not-yet-created index-N, IMO, does it? Either it makes us dependent on listing, which may be inconsistent anyway because we only store the generation but not the term in the cluster state, or it makes us potentially read a non-existent blob and run into the mentioned consistency trouble.

@original-brownbear (Member, Author)

@ywelsch as discussed earlier today, I added logic to identify a "dirty" startup of the repository in e3f7ff2 (code and test) and 564202b (docs) that should make us never lose the latest state (assuming the list operation works well enough on full cluster restart :) but it should be fine, and even if it isn't once in a blue moon, the repo still cannot become corrupted as a result).
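
For clarity, a rough sketch of what the "dirty startup" handling might look like, with a hypothetical helper and under the assumption that the reconciliation works roughly as described above; the actual logic lives in e3f7ff2.

```java
import java.io.IOException;
import java.util.Set;

// Hypothetical sketch: on a "dirty" startup, fall back to one listing and never go backwards.
final class DirtyStartupSketch {

    interface BlobLister {
        Set<String> listBlobNamesByPrefix(String prefix) throws IOException; // e.g. {"index-4", "index-5"}
    }

    static long resolveStartupGeneration(long generationFromClusterState,
                                         boolean dirtyStartup,
                                         BlobLister lister) throws IOException {
        if (dirtyStartup == false) {
            return generationFromClusterState; // normal case: trust the cluster state
        }
        final long listedGeneration = lister.listBlobNamesByPrefix("index-").stream()
            .map(name -> name.substring("index-".length()))
            .mapToLong(Long::parseLong)
            .max()
            .orElse(generationFromClusterState);
        // Never go backwards: take the max so the latest state is not lost even if the listing is flaky.
        return Math.max(listedGeneration, generationFromClusterState);
    }
}
```
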

@ywelsch (Contributor) left a comment

LGTM. Can you, in a follow-up, also add more tests for read-only repositories, given that they have a slightly different way to load stuff?

@original-brownbear (Member, Author)

Thanks Yannick!

> Can you, in a follow-up, also add more tests for read-only repositories, given that they have a slightly different way to load stuff?

Will do :)

@original-brownbear original-brownbear merged commit 8402ad9 into elastic:master Dec 17, 2019
@original-brownbear original-brownbear deleted the repo-uses-cs branch December 17, 2019 08:45
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this pull request Dec 17, 2019
Use ClusterState as Consistency Source for Snapshot Repositories (elastic#49060)

Follow up to elastic#49729

This change removes falling back to listing out the repository contents to find the latest `index-N` in write-mounted blob store repositories.
This saves 2-3 list operations on each snapshot create and delete operation. It also makes all the snapshot status APIs cheaper (and faster) by saving one list operation there as well in many cases.
As a result, this removes the resiliency to concurrent modifications of the repository and puts a repository in a `corrupted` state if loading `RepositoryData` from the assumed generation fails.
original-brownbear added a commit that referenced this pull request Dec 17, 2019
Use ClusterState as Consistency Source for Snapshot Repositories (#49060) (#50267)

SivagurunathanV pushed a commit to SivagurunathanV/elasticsearch that referenced this pull request Jan 23, 2020
Step on the road to elastic#49060.

This commit adds the logic to keep track of a repository's generation across repository operations. See the changes to the package-level Javadoc for the concrete changes in the distributed state machine.

It updates the write side of new repository generations to be fully consistent via the cluster state. With this change, no `index-N` will ever be overwritten for the same repository. So eventual-consistency issues around conflicting updates to the same `index-N` are no longer a possibility.

With this change, the read side will still use listing of the repository contents instead of relying solely on the cluster state contents. The logic for that will be introduced in elastic#49060. This retains the ability to externally delete the contents of a repository and continue using it afterwards for the time being. In elastic#49060 the use of listing to determine the repository generation will be removed in all cases (except for full-cluster restart) as the last step in this effort.
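
A compressed sketch of the write-side handshake described in this commit message; it is an illustration with in-memory stand-ins, whereas in the real change the pending and safe generations live in the cluster state and are moved by master-driven cluster state updates.

```java
import java.util.concurrent.atomic.AtomicLong;

// Sketch of the pending/safe generation handshake that ensures no index-N is ever overwritten.
final class GenerationHandshakeSketch {

    interface BlobWriter {
        void writeBlob(String name, byte[] contents);
    }

    // In the real implementation both values are part of the cluster state, not local fields.
    private final AtomicLong pendingGeneration = new AtomicLong(0L);
    private final AtomicLong safeGeneration = new AtomicLong(0L);

    long writeNewRepositoryData(BlobWriter writer, byte[] serializedRepositoryData) {
        // Step 1: reserve the next generation *before* writing anything. Even if this master
        // dies right after, no later master will ever reuse this N for a different write.
        final long newGeneration = pendingGeneration.incrementAndGet();

        // Step 2: write the new index-N blob under the reserved generation.
        writer.writeBlob("index-" + newGeneration, serializedRepositoryData);

        // Step 3: only after the blob write succeeded, publish the generation as safe.
        safeGeneration.set(newGeneration);
        return newGeneration;
    }

    long safeGeneration() {
        return safeGeneration.get();
    }
}
```
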
@original-brownbear original-brownbear restored the repo-uses-cs branch August 6, 2020 18:23
avgerin0s added a commit to skroutz/elasticsearch-repository-swift that referenced this pull request Aug 13, 2020
SwiftRepository extends BlobStoreRepository which, as of this commit[1],
is aware of the ClusterState. This was done to facilitate this
change[2].

[1]: elastic/elasticsearch@459d8ed
[2]: elastic/elasticsearch#49060