Make BlobStoreRepository Aware of ClusterState #49639
Conversation
This is a preliminary to elastic#49060. It does not introduce any substantial behavior change to how the blob store repository operates. What it does is add all the infrastructure changes around passing the cluster service to the blob store, the associated test changes, and a best-effort approach to tracking the latest repository generation on all nodes from cluster state updates. This brings a slight improvement to the consistency with which non-master nodes (or the master directly after a failover) will be able to determine the latest repository generation. It does not, however, do any tricky checks for the situation after a repository operation (create, delete or cleanup) that could theoretically be used to get even greater accuracy, in order to keep this change simple. This change does not in any way alter the behavior of the blob store repository other than adding a better "guess" for the value of the latest repo generation, and is mainly intended to isolate the actual logical change to how the repository operates in elastic#49060.
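The "best effort" tracking described above can be sketched roughly like this — a hypothetical, stripped-down stand-in for the real `BlobStoreRepository` field, showing only the ratcheting behavior (the class and method names here are illustrative, not the actual Elasticsearch API):

```java
import java.util.concurrent.atomic.AtomicLong;

// Simplified sketch: a repository keeps a best-effort record of the latest
// generation it has seen in cluster state updates, only ever moving forward.
class GenerationTracker {
    static final long EMPTY_REPO_GEN = -1L; // mirrors RepositoryData.EMPTY_REPO_GEN

    private final AtomicLong latestKnownRepoGen = new AtomicLong(EMPTY_REPO_GEN);

    // Called on every cluster state update with whatever generation hint was found.
    void updateFromClusterState(long genFromClusterState) {
        // Only ratchet upwards: an older hint must never overwrite a newer one.
        latestKnownRepoGen.updateAndGet(known -> Math.max(known, genFromClusterState));
    }

    long latestKnownGeneration() {
        return latestKnownRepoGen.get();
    }
}
```

Because the value only moves forward, out-of-order or stale cluster state hints are harmless, which is what makes a "guess" of this kind safe to maintain on every node.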
Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)
@ywelsch @tlrx I know that the practical improvement to resiliency from this change is pretty small (see PR description ... we're always picking up the best known state before an operation), but I think it should take a lot of the pain out of reviewing #49060 by containing almost all non-functional changes in that PR. Let me know what you think :)
Left some comments. Thanks!
server/src/main/java/org/elasticsearch/repositories/blobstore/BlobStoreRepository.java
// Inspects all cluster state elements that contain a hint about what the current repository generation is and updates
// #latestKnownRepoGen if a newer than currently known generation is found
@Override
public void applyClusterState(ClusterChangedEvent event) {
I wonder if RepositoriesService should update the relevant repository about changes to their snapshots. RepositoriesService is already a cluster state listener, which means that we don't need an additional lifecycle here. This would also be closer to how we inform shards about updates (see IndexShard.updateShardState).
I thought about that initially but then dropped the idea, because that would've meant leaking the specifics of BlobStoreRepository into RepositoriesService.
But come to think of it now ... doing this would also save a bit of tricky code in #49060 to get the initialization of the repo on non-master nodes right.
Let's do it -> will do :)
> I wonder if RepositoriesService should update the relevant repository about changes to their snapshots.

I agree, that seems to be the right thing to do. I'm also not super happy to have BlobStoreRepository have its own lifecycle.
Could I warm your heart for cfb104f maybe? :)
I think in the case of the repositories it's better to pass the whole ClusterState to the repo instead of figuring out what parts go to what repo in RepositoriesService. Otherwise we needlessly waste cycles on the repos that don't require the cluster state (e.g. read-only repos, CCR, ...). Also, in #49060 the parts of the cluster state that need to be inspected will change, but we may want to keep the current logic from here as a BwC fallback, which is much easier if we just pass the full ClusterState down.
Also, I realized that I actually had to move the cluster state application method to Repository because of FilterRepository (and the upcoming encrypted repo wrapper etc.), so this didn't leak any blob store repo specifics into RepositoriesService after all :)
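The shape of the approach agreed on above can be sketched as follows — a hypothetical, heavily simplified stand-in (none of these are the real Elasticsearch classes): the service is the single cluster state listener and forwards the full state to every registered repository, leaving each implementation to pick out the parts it cares about.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch only: repositories receive the whole cluster state rather
// than having RepositoriesService decide which parts are relevant to which repo.
interface Repository {
    void updateState(ClusterState state);
}

final class ClusterState {
    final long version;

    ClusterState(long version) {
        this.version = version;
    }
}

class RepositoriesService {
    final List<Repository> repositories = new ArrayList<>();

    void applyClusterState(ClusterState state) {
        for (Repository repo : repositories) {
            // Pass the whole state; each repo decides what is relevant to it.
            repo.updateState(state);
        }
    }
}
```

Passing the full state keeps RepositoriesService ignorant of repository internals, and read-only repositories can simply ignore the update.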
final SnapshotsInProgress snapshotsInProgress = state.custom(SnapshotsInProgress.TYPE);
if (snapshotsInProgress != null) {
    final SnapshotsInProgress.Entry entry = snapshotsInProgress.entries().stream()
        .filter(e -> e.snapshot().getRepository().equals(repoName)).findFirst()
why findFirst? Let's take the max of all ongoing snapshots for this repo?
Well, currently there's just one entry here at all times. I don't think that will change before this logic gets replaced by the more involved logic in #49060, and I'm not sure, if and when we move to parallel snapshot taking, that those entries will in fact have different repository generations set (in my prototype for parallel ops they didn't, at least ...). Could just assert that the count of snapshots in progress is always 1 here? :)
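For illustration, the max-over-all-entries variant suggested in the review could look roughly like this (Entry is a hypothetical stand-in for SnapshotsInProgress.Entry, not the real class):

```java
import java.util.List;

// Illustrative only: take the maximum repository generation across all
// in-progress snapshot entries for a repository instead of relying on
// findFirst(), which only works while at most one entry exists per repo.
record Entry(String repoName, long repositoryStateId) {}

class Generations {
    static long bestGeneration(List<Entry> entries, String repoName, long fallback) {
        return entries.stream()
            .filter(e -> e.repoName().equals(repoName))
            .mapToLong(Entry::repositoryStateId)
            .max()
            .orElse(fallback);
    }
}
```

With a single entry per repository the two approaches behave identically, which is why the discussion settles on asserting the single-entry invariant instead.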
}

final SnapshotDeletionsInProgress deletionsInProgress = state.custom(SnapshotDeletionsInProgress.TYPE);
if (bestGenerationFromCS == RepositoryData.EMPTY_REPO_GEN && deletionsInProgress != null) {
Let's remove this extra condition (bestGenerationFromCS == RepositoryData.EMPTY_REPO_GEN); it's an optimization which does not matter I think and could hurt us later.
No, actually that's not true I think. We have the situation where, if you abort a snapshot, the delete entry will be created with currentRepoGeneration + 1, since that's what the repo will be at when the delete actually runs. That's why I added this.
I don't understand this condition. Why are we not just taking the max of all the entries that we see? Also, I suppose that we don't allow concurrent operations right now, so we assume that we have only one of these 3 metadata entries for the current repo?
There is the one special case of aborting a snapshot, where you can have an in-progress snapshot and a delete. The delete will contain the future repository generation for after the snapshot finishes -> we can't use that. That's why I added this condition.
Can we fix that situation? And can you add a comment to that effect? The condition as it stands here today is very unintuitive.
> Can we fix that situation?

Yes, once we have #49060 we can clean this up when implementing concurrent operations on the repo. That will require the whole business of associating an operation with a strict repo generation to go away anyway.
=> Added a comment for now :)
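The guard being debated here can be illustrated in isolation — a simplified sketch, not the real BlobStoreRepository logic, with illustrative names: the deletion entry for an aborted snapshot carries the future generation (currentRepoGeneration + 1), so it is only consulted when no other cluster state element produced a hint.

```java
// Sketch of the EMPTY_REPO_GEN guard: only fall back to the deletion entry's
// generation when nothing better was found, because that generation may point
// one past the generation that is actually written in the repository.
class DeletionHint {
    static final long EMPTY_REPO_GEN = -1L; // mirrors RepositoryData.EMPTY_REPO_GEN

    static long bestGeneration(long bestGenerationFromCS, Long deletionEntryGen) {
        if (bestGenerationFromCS == EMPTY_REPO_GEN && deletionEntryGen != null) {
            return deletionEntryGen;
        }
        return bestGenerationFromCS;
    }
}
```

Since the tracked value is only a best-effort hint that is ratcheted upwards, preferring the snapshot entry's generation over the deletion entry's speculative one keeps the guess from running ahead of reality.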
server/src/main/java/org/elasticsearch/repositories/blobstore/BlobStoreRepository.java
This looks good already, I'm waiting for Yannick's feedback to be addressed before LGTM :)
...y-azure/src/test/java/org/elasticsearch/repositories/azure/AzureRepositorySettingsTests.java
@@ -411,14 +415,21 @@ private Repository createRepository(RepositoryMetaData repositoryMetaData, Map<S
            throw new RepositoryException(repositoryMetaData.name(),
                "repository type [" + repositoryMetaData.type() + "] does not exist");
        }
        boolean success = false;
This isn't strictly necessary with the changes just now, but maybe fine to keep it here since we discovered it here, and I think we should definitely clean up on a failed start out of principle?
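The success-flag pattern from the diff above can be shown in a self-contained form — StubRepository is a hypothetical stand-in, not the real Repository implementation:

```java
// Minimal sketch of the cleanup-on-failed-start pattern: if starting the
// repository throws, it is closed again in the finally block so that no
// half-initialized repository leaks.
class StubRepository implements AutoCloseable {
    final boolean failOnStart;
    boolean started;
    boolean closed;

    StubRepository(boolean failOnStart) {
        this.failOnStart = failOnStart;
    }

    void start() {
        if (failOnStart) {
            throw new IllegalStateException("failed to start repository");
        }
        started = true;
    }

    @Override
    public void close() {
        closed = true;
    }

    static StubRepository createRepository(boolean failOnStart) {
        StubRepository repository = new StubRepository(failOnStart);
        boolean success = false;
        try {
            repository.start();
            success = true;
            return repository;
        } finally {
            if (success == false) {
                repository.close(); // clean up on a failed start
            }
        }
    }
}
```

The flag-plus-finally shape (rather than a catch block) ensures cleanup runs for any throwable while still letting the original exception propagate unchanged.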
@@ -345,6 +346,9 @@ public void applyClusterState(ClusterChangedEvent event) {
            }
        }
        repositories = Collections.unmodifiableMap(builder);
        for (Repository repo : repositories.values()) {
            repo.updateState(state);
I think we need to catch a potential RepositoryException here, log it and continue to update the other repositories.
I misread the code, sorry... But it may still be worth it, just in case?
I wonder: looking at the code above, we aren't defensive around closeRepository either (though that might be called in a loop as well), but we do catch on createRepository (which has a TODO about it).
Maybe we should rather wrap this in a

try {
    repo.updateState(state);
} catch (Exception e) {
    assert false;
    throw e;
}

and do the same for closeRepository? Could do it in a follow-up and remove that TODO while we're at it :)
My main objection to just catching and logging here would be: once the state update is important for the proper functioning of blob store repositories, what would an uncaught exception even mean? (IMO it would mean that the repo can't be used for writes any longer ... but that would be something the repo would have to set in its internal state when handling exceptions.)
The bigger issue I think is to get rid of the condition:
if ((oldMetaData == null && newMetaData == null) || (oldMetaData != null && oldMetaData.equals(newMetaData))) {
which currently compares the current with the previous cluster state. If anything went wrong applying the previous cluster state, we will not update the repo again.
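To make the concern concrete, the short-circuit can be isolated like this (class and parameter names illustrative): when the metadata compares equal between the old and new cluster state, the update is skipped entirely, so a repository that failed to apply the previous state never gets a second chance to catch up.

```java
// Sketch of the condition being questioned above: returns true (skip the
// repository update) whenever the repository metadata is unchanged between
// the previous and the current cluster state.
class MetaDataCheck {
    static boolean shouldSkipUpdate(Object oldMetaData, Object newMetaData) {
        return (oldMetaData == null && newMetaData == null)
            || (oldMetaData != null && oldMetaData.equals(newMetaData));
    }
}
```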
LGTM, thanks Armin. I left a small comment where I think we should be defensive, just in case an exception is thrown while updating the repository states.
Thanks Armin! As discussed, I think we should have a look at the other issues in the applyClusterState() method (like maybe only create + update repositories before closing the existing ones), and that could be done in a follow-up PR.
Jenkins run elasticsearch-ci/1 (unrelated transport test failure)
server/src/main/java/org/elasticsearch/repositories/RepositoriesService.java
server/src/main/java/org/elasticsearch/repositories/blobstore/BlobStoreRepository.java
Jenkins run elasticsearch-ci/bwc (seems like a messed up version got built ...)
Something's messed up with the BwC tests here (some version constant test), not related to my changes.
LGTM
Thanks Yannick + Tanguy!
The current forwarding of the state is disabled if the repo metadata did not change, so it's somewhat useless. Relates elastic#49639