testDataNodeRestartAfterShardSnapshotFailure fails by leaking a shard snapshot on a failed node #48526
Comments
Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)
I worked out how to reproduce this:

```diff
diff --git a/server/src/main/java/org/elasticsearch/cluster/service/ClusterApplierService.java b/server/src/main/java/org/elasticsearch/cluster/service/ClusterApplierService.java
index f5bbe2d420b..644a349b989 100644
--- a/server/src/main/java/org/elasticsearch/cluster/service/ClusterApplierService.java
+++ b/server/src/main/java/org/elasticsearch/cluster/service/ClusterApplierService.java
@@ -466,6 +466,13 @@ public class ClusterApplierService extends AbstractLifecycleComponent implements
                     summary, newClusterState.term(), newClusterState.version(), task.source);
             }
         }
+        if (nodesDelta.removed()) {
+            try {
+                Thread.sleep(10000);
+            } catch (InterruptedException e) {
+                throw new AssertionError("unexpected", e);
+            }
+        }
         logger.trace("connecting to nodes of cluster state with version {}", newClusterState.version());
         try (Releasable ignored = stopWatch.timing("connecting to new nodes")) {
```

With a sufficiently long pause here, the master processes the […]
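The trick in the diff above is a general one for making a rare interleaving reproducible: inject an artificial delay at the point where one actor is expected to react (here, the master applying a node-removal), so the other actor (the restarting data node) always wins the race. Below is a minimal, self-contained sketch of that idea in plain Java; it is only an illustration of the timing technique, not Elasticsearch code, and all names in it are made up.

```java
import java.util.concurrent.CountDownLatch;
import java.util.concurrent.TimeUnit;

// Minimal model of widening a race window with an injected delay: the "master"
// reacts to a node-left event only after a pause, so the "data node" has always
// restarted and re-joined before the removal is processed.
public class RaceWindowExample {

    public static void main(String[] args) throws Exception {
        CountDownLatch nodeLeft = new CountDownLatch(1);

        Thread master = new Thread(() -> {
            try {
                nodeLeft.await();
                // The injected pause, playing the role of the Thread.sleep(10000) in the
                // diff above: without it the removal is usually processed before the node
                // has had a chance to come back.
                TimeUnit.SECONDS.sleep(2);
                System.out.println("master: processing node-left (node has already re-joined)");
            } catch (InterruptedException e) {
                throw new AssertionError(e);
            }
        });

        master.start();
        System.out.println("data node: shutting down mid-snapshot");
        nodeLeft.countDown();
        System.out.println("data node: restarted and re-joined");
        master.join();
    }
}
```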
@DaveCTurner yup, that's how far I got today as well. Unfortunately, there are a number of related issues here, because the re-joining node may start the snapshot again if it's still in […]
Fixes the shard snapshot status reporting for failed shards in the corner case of failing the shard because of an exception thrown in `SnapshotShardsService` and not the repository. We were missing the update on the `snapshotStatus` instance in this case, which made the transport APIs using this field report back an incorrect status. Fixed by moving the failure handling to the `SnapshotShardsService` for all cases (which also simplifies the code; the exception wrapping in the repository was pointless, as we only used the exception's stack trace upstream anyway). Also added an assertion to another test that already explicitly checks this failure situation (an exception in the `SnapshotShardsService`). Closes elastic#48526
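To illustrate the shape of the fix as described above, here is a small self-contained Java sketch, not the actual change in #48556: the class, field, and method names are invented for this example, and the real `SnapshotShardsService` code is considerably more involved. The point is only that the catch block lives in the caller that owns the status object, so the status cannot be left stale when the failure happens before the repository is ever invoked.

```java
import java.util.concurrent.atomic.AtomicReference;

// Self-contained illustration of the pattern (not Elasticsearch code): the caller
// that drives the shard snapshot owns the failure handling, so the shared status
// object is updated no matter where the exception is thrown.
public class ShardSnapshotStatusExample {

    enum Stage { INIT, STARTED, DONE, FAILURE }

    static final class ShardSnapshotStatus {
        private final AtomicReference<Stage> stage = new AtomicReference<>(Stage.INIT);
        private volatile String failure;

        void moveToStarted() { stage.set(Stage.STARTED); }
        void moveToDone() { stage.set(Stage.DONE); }
        void moveToFailed(String reason) { failure = reason; stage.set(Stage.FAILURE); }

        @Override
        public String toString() {
            return stage.get() + (failure == null ? "" : " (" + failure + ")");
        }
    }

    interface ShardSnapshotTask {
        void run(ShardSnapshotStatus status) throws Exception;
    }

    // All failure handling lives here in the caller: the status is moved to FAILURE
    // even if the task throws before the "repository" step is ever reached.
    static void snapshotShard(ShardSnapshotStatus status, ShardSnapshotTask task) {
        try {
            status.moveToStarted();
            task.run(status);
            status.moveToDone();
        } catch (Exception e) {
            status.moveToFailed(e.toString());
        }
    }

    public static void main(String[] args) {
        ShardSnapshotStatus status = new ShardSnapshotStatus();
        snapshotShard(status, s -> {
            throw new IllegalStateException("shard not allocated on this node");
        });
        // Prints FAILURE (...) instead of leaving the status stuck in INIT/STARTED,
        // which is what a status API would otherwise keep reporting.
        System.out.println(status);
    }
}
```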
Status APIs struck again, it turns out :) -> fix incoming in #48556. The fact that we were missing the node-left event turns out to be irrelevant, since that build still had a proper failure on the primary node (from missing the index allocation after the restart). I don't think we want to change the logic here to make sure we always catch the node-left somehow ... in practice, if the node restarts and picks up the snapshot for its shards correctly, that's fine and not something to fail on after all :)
This test failed in this CI run: https://gradle-enterprise.elastic.co/s/rlbqh5tapantg/ (in 7.x). It shuts down a node while that node is snapshotting a shard, but the shard snapshot does not fail: