Enable Fully Concurrent Snapshot Operations #56911
Conversation
Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)
Implements fully concurrent snapshot operations. See documentation changes to snapshot package level JavaDoc for details.
Force-pushed from 923aebd to 346f306
 * in the thread pool (for example, tests that use the mock repository that
 * block on master).
 */
public class MinThreadsSnapshotRestoreIT extends AbstractSnapshotIntegTestCase {
All scenarios covered by this test become obsolete. The actual premise of this test (checking that we don't deadlock from blocked threads) is covered by the fact that SnapshotResiliencyTests work, for the most part, anyway.
server/src/main/java/org/elasticsearch/snapshots/SnapshotShardsService.java (outdated; resolved)
LGTM. Thanks for tackling this challenge, and good luck with the backport.
LGTM. This is great work, thanks Armin! I carefully reviewed most parts, though I must admit that I only lightly reviewed the "repository loop" part.
I apologize for the time it took me to review this PR. The reviewing experience was not great for me due to the amount of code changes; I think using a dedicated branch would have made sense here.
private static String startDataNodeWithLargeSnapshotPool() {
    return internalCluster().startDataOnlyNode(LARGE_SNAPSHOT_POOL_SETTINGS);
}

public void testSnapshotRunsAfterInProgressDelete() throws Exception {
nit: add an extra blank line
assertThat(secondSnapshotResponse.isDone(), is(false));

unblockNode(repoName, dataNode);
assertThat(firstSnapshotResponse.get().getSnapshotInfo().state(), is(SnapshotState.FAILED));
Can we check that the 1st snapshot failed because it was aborted?
ensureStableCluster(3);

awaitNoMoreRunningOperations();
expectThrows(RepositoryException.class, deleteFuture::actionGet);
Is there a meaningful error message we could check here?
Not really, it's just "failed to update repository". It's all in the cause here, but that's also just a JSON parse failure.
this(in.readList(Entry::new));
}

private static boolean assertConsistency(List<Entry> entries) {
nit: assertConsistency -> assertNoConcurrentDeletionsForSameRepository() ?
try {
    assert assertConsistentEntries(entries);
} catch (AssertionError e) {
    throw e;
I'm not sure I understand why we catch and rethrow here.
So I could put a debug breakpoint there :D Thanks for spotting!
private final OngoingRepositoryOperations repositoryOperations = new OngoingRepositoryOperations();

/**
 * Setting that specifies the maximum number of allow concurrent snapshot create and delete operations in the
allow -> allowed
  currentState.custom(SnapshotDeletionsInProgress.TYPE, SnapshotDeletionsInProgress.EMPTY);
- if (deletionsInProgress.hasDeletionsInProgress()) {
+ if (deletionsInProgress.hasDeletionsInProgress() && concurrentOperationsAllowed == false) {
      throw new ConcurrentSnapshotExecutionException(repositoryName, snapshotName,
          "cannot snapshot while a snapshot deletion is in-progress in [" + deletionsInProgress + "]");
When backporting, we could maybe indicate in the error message that concurrent snapshot/deletions are possible in version 7.9?
👍 Right, I put down a note for that when doing the back-port work.
throw new ConcurrentSnapshotExecutionException(repositoryName, snapshotName, " a snapshot is already running");
}
ensureBelowConcurrencyLimit(repositoryName, snapshotName, snapshots, deletionsInProgress);
What happens if multiple snapshot operations are started but the maxConcurrentOperations setting is updated to a value lower than the current number of concurrent ops? Would it still be possible to enqueue more ops?
> What happens if multiple snapshot operations are started but the maxConcurrentOperations setting is updated to a value lower than the current number of concurrent ops?

Existing ops won't be affected, but you can't start new ones.

> Would it still be possible to enqueue more ops?

No, the number of ops has to come down below the new limit first before we can enqueue more.
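The behavior described in this exchange can be sketched as follows. This is a hypothetical illustration, not the actual SnapshotsService code; the names `maxConcurrentOperations`, `tryStart`, and `updateLimit` are invented for the example. The key property is that lowering the limit never affects operations already running, it only blocks new ones until the running count drops below the new limit.

```java
import java.util.ArrayList;
import java.util.List;

// Illustrative sketch of a dynamic concurrency limit for snapshot operations.
public class ConcurrencyLimitSketch {
    private int maxConcurrentOperations;
    private final List<String> runningOperations = new ArrayList<>();

    public ConcurrencyLimitSketch(int limit) {
        this.maxConcurrentOperations = limit;
    }

    // Lowering the limit only affects future enqueue attempts; running
    // operations are never cancelled by a settings update.
    public void updateLimit(int newLimit) {
        this.maxConcurrentOperations = newLimit;
    }

    // Returns false (where Elasticsearch would throw a
    // ConcurrentSnapshotExecutionException) if the limit is reached.
    public boolean tryStart(String operation) {
        if (runningOperations.size() >= maxConcurrentOperations) {
            return false;
        }
        runningOperations.add(operation);
        return true;
    }

    public void finish(String operation) {
        runningOperations.remove(operation);
    }

    public int running() {
        return runningOperations.size();
    }
}
```

If the limit is lowered from 2 to 1 while 2 ops run, both keep running, but no new op can start until both have finished and the count is below 1 again.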
Enables fully concurrent snapshot operations:

* Snapshot create and delete operations can be started in any order.
* Delete operations wait for snapshot finalization to finish. They are batched as much as possible to improve efficiency and, once enqueued in the cluster state, prevent new snapshots from starting on data nodes until executed.
* We could be even more concurrent here in a follow-up by interleaving deletes and snapshots on a per-shard level. I decided not to do this for now since it did not seem worth the added complexity yet. Due to batching and deduplication of deletes, the pain of having a delete stuck behind a long-running snapshot seems manageable: dropped client connections and the resulting retries don't cause issues thanks to deduplication of delete jobs, and batching allows enqueuing more and more deletes even if a snapshot blocks for a long time; they will all be executed in essentially constant time (due to bulk snapshot deletion, deleting multiple snapshots is about as fast as deleting a single one).
* Snapshot creation is completely concurrent across shards, but per-shard snapshots are linearized for each repository, as are snapshot finalizations.

See the updated JavaDoc and the added test cases for more details and an illustration of the functionality.

Some notes:

The queuing of snapshot finalizations and deletes and the related locking/synchronization is a little awkward in this version but can be much simplified with some refactoring. The problem is that snapshot finalizations resolve their listeners on the `SNAPSHOT` pool while deletes resolve theirs on the master update thread. With some refactoring, both of these could be moved to the master update thread, effectively removing the need for any synchronization around the `SnapshotService` state. I didn't do this refactoring here because it's a fairly large change that isn't necessary for the functionality, but I plan to do it in a follow-up.

This change allows completely removing the trickery around synchronizing deletes and snapshots from SLM and fully does away with SLM errors from collisions between deletes and snapshots. Snapshotting a single index in parallel with a long-running full backup will execute without having to wait for the long-running backup, as required by the ILM/SLM use case of moving indices to a "snapshot tier". Finalizations are linearized, but ordered according to which snapshot saw all of its shards complete first.
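The finalization ordering described in the PR can be sketched as a small per-repository queue. This is a hypothetical illustration, not the actual SnapshotsService implementation; the class and method names are invented. Shard work may run concurrently, but finalizations run one at a time, in the order in which snapshots saw all of their shards complete, so a quick single-index snapshot can finalize before a long-running full backup even if the backup started first.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Deque;
import java.util.List;

// Illustrative sketch: finalizations are linearized per repository and
// ordered by which snapshot completed all of its shards first.
public class FinalizationQueueSketch {
    private final Deque<String> pending = new ArrayDeque<>();
    private final List<String> startedOrder = new ArrayList<>();
    private String currentlyFinalizing = null;

    // Called when a snapshot has seen all of its shards complete.
    public synchronized void onAllShardsCompleted(String snapshot) {
        pending.addLast(snapshot);
        maybeStartNext();
    }

    // Called when the currently running finalization has been written
    // to the repository, allowing the next one to start.
    public synchronized void finalizationFinished() {
        currentlyFinalizing = null;
        maybeStartNext();
    }

    private void maybeStartNext() {
        if (currentlyFinalizing == null && pending.isEmpty() == false) {
            currentlyFinalizing = pending.pollFirst();
            startedOrder.add(currentlyFinalizing);
        }
    }

    public synchronized String current() {
        return currentlyFinalizing;
    }

    public synchronized List<String> startedOrder() {
        return new ArrayList<>(startedOrder);
    }
}
```

A snapshot of a small index that completes its shards before a long-running full backup enters the queue first and therefore finalizes first, while the backup's finalization waits its turn.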
There were two subtle bugs here from backporting #56911 to 7.x:

1. We passed `null` for the `shards` map, which is no longer nullable when creating `SnapshotsInProgress.Entry`. Fixed by just passing an empty map, like the `null` handling did in the past.
2. The removal of a failed `INIT`-state snapshot from the cluster state tried removing it from the finalization loop (the set of repository names that are currently finalizing). This trips an assertion since the snapshot failed before its repository was put into the set. I made the logic ignore the set when removing a failed `INIT`-state snapshot, restoring the old logic to exactly what it was before the concurrent snapshots backport, to be on the safe side here.

Also added tests that explicitly exercise the old code paths because, as can be seen from initially missing this, the BwC tests only run in the new-version-master/old-version-nodes configuration every so often, and having a deterministic test for the old state machine seems the safest bet here.

Closes #59986
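The fix for bug 1 amounts to normalizing a `null` shards map to an empty map before constructing the entry, matching what the old `null` handling did. The sketch below uses a hypothetical helper name; it is not the actual `SnapshotsInProgress.Entry` code.

```java
import java.util.Map;

// Illustrative helper: callers that previously passed a null shards map
// get an empty immutable map instead, so the non-nullable constructor
// sees the same effective value the old null handling produced.
public class ShardsMapFix {
    public static <K, V> Map<K, V> normalizeShards(Map<K, V> shards) {
        return shards == null ? Map.of() : shards;
    }
}
```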