
testDataNodeRestartAfterShardSnapshotFailure fails by leaking a shard snapshot on a failed node #48526

Closed
DaveCTurner opened this issue Oct 25, 2019 · 4 comments · Fixed by #48556
Assignees
Labels
:Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >test-failure Triaged test failures from CI

Comments

@DaveCTurner
Contributor

This test failed in this CI run: https://gradle-enterprise.elastic.co/s/rlbqh5tapantg/ (in 7.x). The test shuts down a node while that node is snapshotting a shard, but the shard snapshot is never marked as failed:

  1> [2019-10-25T01:22:40,150][INFO ][o.e.s.DedicatedClusterSnapshotRestoreIT] [testDataNodeRestartAfterShardSnapshotFailure] -->  snapshot
  1> [2019-10-25T01:22:40,160][INFO ][o.e.s.SnapshotsService   ] [node_tm0] snapshot [test-repo:test-snap/8CMeKZucSfKdomiV2gX9Lw] started
  1> [2019-10-25T01:22:40,217][INFO ][o.e.s.DedicatedClusterSnapshotRestoreIT] [testDataNodeRestartAfterShardSnapshotFailure] -->  restarting first data node, which should cause the primary shard on it to be failed
  1> [2019-10-25T01:22:40,217][INFO ][o.e.t.InternalTestCluster] [testDataNodeRestartAfterShardSnapshotFailure] Restarting node [node_td1] 
  1> [2019-10-25T01:22:40,218][INFO ][o.e.n.Node               ] [testDataNodeRestartAfterShardSnapshotFailure] stopping ...
  1> [2019-10-25T01:22:40,220][INFO ][o.e.c.c.Coordinator      ] [node_td1] master node [{node_tm0}{ztB71aT-Qfmpv14TI36OOg}{JRG4MuoYREqTFOGkDH3D2g}{127.0.0.1}{127.0.0.1:34059}{m}] failed, restarting discovery
  1> org.elasticsearch.transport.NodeDisconnectedException: [node_tm0][127.0.0.1:34059][disconnected] disconnected
  1> [2019-10-25T01:22:40,223][INFO ][o.e.s.m.MockRepository   ] [node_td2] [test-repo] blocking I/O operation for file [__Gee36ShzRIWE9EIdskMBfw.part0] at path [[indices][Md-z-wvRQPuh-C_U1reAfQ][1]]
  1> [2019-10-25T01:22:40,224][INFO ][o.e.c.s.MasterService    ] [node_tm0] node-left[{node_td1}{Kn1zpH90QPqXMtL1NqqVOg}{DKlTH9pkRymZtUbRoe8_rA}{127.0.0.1}{127.0.0.1:37309}{di} reason: disconnected], term: 1, version: 12, delta: removed {{node_td1}{Kn1zpH90QPqXMtL1NqqVOg}{DKlTH9pkRymZtUbRoe8_rA}{127.0.0.1}{127.0.0.1:37309}{di}}
  1> [2019-10-25T01:22:40,224][WARN ][o.e.s.SnapshotShardsService] [node_td1] [[test-idx][0]][test-repo:test-snap/8CMeKZucSfKdomiV2gX9Lw] failed to snapshot shard
  1> org.elasticsearch.index.snapshots.IndexShardSnapshotFailedException: Aborted
  1>  at org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:1112) ~[main/:?]
  1>  at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:337) ~[main/:?]
  1>  at org.elasticsearch.snapshots.SnapshotShardsService.lambda$startNewShards$1(SnapshotShardsService.java:285) ~[main/:?]
  1>  at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703) [main/:?]
  1>  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_221]
  1>  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_221]
  1>  at java.lang.Thread.run(Thread.java:748) [?:1.8.0_221]
  1> [2019-10-25T01:22:40,226][INFO ][o.e.n.Node               ] [testDataNodeRestartAfterShardSnapshotFailure] stopped
  1> [2019-10-25T01:22:40,227][INFO ][o.e.n.Node               ] [testDataNodeRestartAfterShardSnapshotFailure] closing ...
  1> [2019-10-25T01:22:40,229][WARN ][o.e.s.SnapshotShardsService] [node_td1] [test-repo:test-snap/8CMeKZucSfKdomiV2gX9Lw] [ShardSnapshotStatus[state=FAILED, nodeId=Kn1zpH90QPqXMtL1NqqVOg, reason=[test-idx/l-lv0KpjQX6MGCl5PTKXUw][[test-idx][0]] IndexShardSnapshotFailedException[Aborted]
  1>  at org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:1112)
  1>  at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:337)
  1>  at org.elasticsearch.snapshots.SnapshotShardsService.lambda$startNewShards$1(SnapshotShardsService.java:285)
  1>  at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703)
  1>  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  1>  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  1>  at java.lang.Thread.run(Thread.java:748)
  1> , generation=null]] failed to update snapshot state
  1> org.elasticsearch.transport.SendRequestTransportException: [node_td1][127.0.0.1:37309][internal:cluster/snapshot/update_snapshot_status]
  1>  at org.elasticsearch.transport.TransportService.sendRequestInternal(TransportService.java:704) ~[main/:?]
  1>  at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:602) ~[main/:?]
  1>  at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:577) ~[main/:?]
  1>  at org.elasticsearch.snapshots.SnapshotShardsService.lambda$sendSnapshotShardUpdate$2(SnapshotShardsService.java:471) ~[main/:?]
  1>  at org.elasticsearch.transport.TransportRequestDeduplicator.executeOnce(TransportRequestDeduplicator.java:52) ~[main/:?]
  1>  at org.elasticsearch.snapshots.SnapshotShardsService.sendSnapshotShardUpdate(SnapshotShardsService.java:457) ~[main/:?]
  1>  at org.elasticsearch.snapshots.SnapshotShardsService.notifyFailedSnapshotShard(SnapshotShardsService.java:451) ~[main/:?]
  1>  at org.elasticsearch.snapshots.SnapshotShardsService.access$400(SnapshotShardsService.java:91) ~[main/:?]
  1>  at org.elasticsearch.snapshots.SnapshotShardsService$1.onFailure(SnapshotShardsService.java:301) ~[main/:?]
  1>  at org.elasticsearch.action.ActionListener$5.onFailure(ActionListener.java:256) ~[main/:?]
  1>  at org.elasticsearch.repositories.blobstore.BlobStoreRepository.lambda$snapshotShard$32(BlobStoreRepository.java:1061) ~[main/:?]
  1>  at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:71) ~[main/:?]
  1>  at org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:1239) ~[main/:?]
  1>  at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:337) ~[main/:?]
  1>  at org.elasticsearch.snapshots.SnapshotShardsService.lambda$startNewShards$1(SnapshotShardsService.java:285) ~[main/:?]
  1>  at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703) ~[main/:?]
  1>  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_221]
  1>  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_221]
  1>  at java.lang.Thread.run(Thread.java:748) [?:1.8.0_221]
  1> Caused by: org.elasticsearch.transport.TransportException: TransportService is closed stopped can't send request
  1>  at org.elasticsearch.transport.TransportService.sendRequestInternal(TransportService.java:686) ~[main/:?]
  1>  ... 18 more
  1> [2019-10-25T01:22:40,232][INFO ][o.e.n.Node               ] [testDataNodeRestartAfterShardSnapshotFailure] closed
  1> [2019-10-25T01:22:40,240][INFO ][o.e.e.NodeEnvironment    ] [testDataNodeRestartAfterShardSnapshotFailure] using [1] data paths, mounts [[/ (/dev/sda1)]], net usable_space [307gb], net total_space [349.9gb], types [xfs]
  1> [2019-10-25T01:22:40,240][INFO ][o.e.e.NodeEnvironment    ] [testDataNodeRestartAfterShardSnapshotFailure] heap size [491mb], compressed ordinary object pointers [true]
  1> [2019-10-25T01:22:40,249][INFO ][o.e.n.Node               ] [testDataNodeRestartAfterShardSnapshotFailure] node name [node_td1], node ID [Kn1zpH90QPqXMtL1NqqVOg], cluster name [TEST-TEST_WORKER_VM=[435]-CLUSTER_SEED=[-2213907199717514849]-HASH=[17AAA92C130]-cluster]
  1> [2019-10-25T01:22:40,249][INFO ][o.e.n.Node               ] [testDataNodeRestartAfterShardSnapshotFailure] version[7.6.0], pid[217518], build[unknown/unknown/unknown/unknown], OS[Linux/4.14.35-1902.6.6.el7uek.x86_64/amd64], JVM[Oracle Corporation/Java HotSpot(TM) 64-Bit Server VM/1.8.0_221/25.221-b11]
  1> [2019-10-25T01:22:40,249][INFO ][o.e.n.Node               ] [testDataNodeRestartAfterShardSnapshotFailure] JVM home [/var/lib/jenkins/.java/oracle-8u221-linux/jre]
  1> [2019-10-25T01:22:40,249][INFO ][o.e.n.Node               ] [testDataNodeRestartAfterShardSnapshotFailure] JVM arguments [-Dfile.encoding=UTF8, -Dcompiler.java=12, -Des.scripting.update.ctx_in_params=false, -Des.transport.cname_in_publish_address=true, -Dgradle.dist.lib=/var/lib/jenkins/.gradle/wrapper/dists/gradle-5.6.2-all/9st6wgf78h16so49nn74lgtbb/gradle-5.6.2/lib, -Dgradle.user.home=/var/lib/jenkins/.gradle, -Dgradle.worker.jar=/var/lib/jenkins/.gradle/caches/5.6.2/workerMain/gradle-worker.jar, -Dio.netty.noKeySetOptimization=true, -Dio.netty.noUnsafe=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Djava.awt.headless=true, -Djava.locale.providers=SPI,JRE, -Djava.security.manager=worker.org.gradle.process.internal.worker.child.BootstrapSecurityManager, -Djna.nosys=true, -Dorg.gradle.native=false, -Druntime.java=8, -Dtests.artifact=server, -Dtests.gradle=true, -Dtests.logger.level=WARN, -Dtests.security.manager=true, -Dtests.seed=14AC7937BAB45A6E, -Dtests.task=:server:integTest, -esa, -XX:+HeapDumpOnOutOfMemoryError, -XX:HeapDumpPath=/var/lib/jenkins/workspace/elastic+elasticsearch+7.x+multijob-unix-compatibility/os/oraclelinux-7&&immutable/server/build/heapdump, -Xms512m, -Xmx512m, -Dfile.encoding=UTF-8, -Djava.io.tmpdir=./temp, -Duser.country=US, -Duser.language=en, -Duser.variant, -ea]
  1> [2019-10-25T01:22:40,250][INFO ][o.e.p.PluginsService     ] [testDataNodeRestartAfterShardSnapshotFailure] no modules loaded
  1> [2019-10-25T01:22:40,250][INFO ][o.e.p.PluginsService     ] [testDataNodeRestartAfterShardSnapshotFailure] loaded plugin [org.elasticsearch.snapshots.DedicatedClusterSnapshotRestoreIT$BrokenSettingPlugin]
  1> [2019-10-25T01:22:40,250][INFO ][o.e.p.PluginsService     ] [testDataNodeRestartAfterShardSnapshotFailure] loaded plugin [org.elasticsearch.snapshots.DedicatedClusterSnapshotRestoreIT$TestCustomMetaDataPlugin]
  1> [2019-10-25T01:22:40,250][INFO ][o.e.p.PluginsService     ] [testDataNodeRestartAfterShardSnapshotFailure] loaded plugin [org.elasticsearch.snapshots.mockstore.MockRepository$Plugin]
  1> [2019-10-25T01:22:40,250][INFO ][o.e.p.PluginsService     ] [testDataNodeRestartAfterShardSnapshotFailure] loaded plugin [org.elasticsearch.test.ESIntegTestCase$AssertActionNamePlugin]
  1> [2019-10-25T01:22:40,250][INFO ][o.e.p.PluginsService     ] [testDataNodeRestartAfterShardSnapshotFailure] loaded plugin [org.elasticsearch.test.ESIntegTestCase$TestSeedPlugin]
  1> [2019-10-25T01:22:40,250][INFO ][o.e.p.PluginsService     ] [testDataNodeRestartAfterShardSnapshotFailure] loaded plugin [org.elasticsearch.test.MockHttpTransport$TestPlugin]
  1> [2019-10-25T01:22:40,250][INFO ][o.e.p.PluginsService     ] [testDataNodeRestartAfterShardSnapshotFailure] loaded plugin [org.elasticsearch.transport.nio.MockNioTransportPlugin]
  1> [2019-10-25T01:22:40,269][INFO ][o.e.d.DiscoveryModule    ] [testDataNodeRestartAfterShardSnapshotFailure] using discovery type [zen] and seed hosts providers [settings, file]
  1> [2019-10-25T01:22:40,290][INFO ][o.e.n.Node               ] [testDataNodeRestartAfterShardSnapshotFailure] initialized
  1> [2019-10-25T01:22:40,290][INFO ][o.e.n.Node               ] [testDataNodeRestartAfterShardSnapshotFailure] starting ...
  1> [2019-10-25T01:22:40,294][INFO ][o.e.t.TransportService   ] [testDataNodeRestartAfterShardSnapshotFailure] publish_address {127.0.0.1:38475}, bound_addresses {[::1]:35853}, {127.0.0.1:38475}
  1> [2019-10-25T01:22:40,301][INFO ][o.e.c.s.ClusterApplierService] [node_td2] removed {{node_td1}{Kn1zpH90QPqXMtL1NqqVOg}{DKlTH9pkRymZtUbRoe8_rA}{127.0.0.1}{127.0.0.1:37309}{di}}, term: 1, version: 12, reason: ApplyCommitRequest{term=1, version=12, sourceNode={node_tm0}{ztB71aT-Qfmpv14TI36OOg}{JRG4MuoYREqTFOGkDH3D2g}{127.0.0.1}{127.0.0.1:34059}{m}}
  1> [2019-10-25T01:22:40,305][INFO ][o.e.c.s.ClusterApplierService] [node_tm0] removed {{node_td1}{Kn1zpH90QPqXMtL1NqqVOg}{DKlTH9pkRymZtUbRoe8_rA}{127.0.0.1}{127.0.0.1:37309}{di}}, term: 1, version: 12, reason: Publication{term=1, version=12}
  1> [2019-10-25T01:22:40,309][INFO ][o.e.c.c.Coordinator      ] [testDataNodeRestartAfterShardSnapshotFailure] cluster UUID [6WUNaFxyS9e_GZgDQ7k3FQ]
  1> [2019-10-25T01:22:40,356][INFO ][o.e.c.s.MasterService    ] [node_tm0] node-join[{node_td1}{Kn1zpH90QPqXMtL1NqqVOg}{AdrMl80YTKu-MHo2vIUEhw}{127.0.0.1}{127.0.0.1:38475}{di} join existing leader], term: 1, version: 14, delta: added {{node_td1}{Kn1zpH90QPqXMtL1NqqVOg}{AdrMl80YTKu-MHo2vIUEhw}{127.0.0.1}{127.0.0.1:38475}{di}}
  1> [2019-10-25T01:22:40,359][INFO ][o.e.c.s.ClusterApplierService] [node_td2] added {{node_td1}{Kn1zpH90QPqXMtL1NqqVOg}{AdrMl80YTKu-MHo2vIUEhw}{127.0.0.1}{127.0.0.1:38475}{di}}, term: 1, version: 14, reason: ApplyCommitRequest{term=1, version=14, sourceNode={node_tm0}{ztB71aT-Qfmpv14TI36OOg}{JRG4MuoYREqTFOGkDH3D2g}{127.0.0.1}{127.0.0.1:34059}{m}}
  1> [2019-10-25T01:22:40,359][INFO ][o.e.c.s.ClusterApplierService] [node_td1] master node changed {previous [], current [{node_tm0}{ztB71aT-Qfmpv14TI36OOg}{JRG4MuoYREqTFOGkDH3D2g}{127.0.0.1}{127.0.0.1:34059}{m}]}, added {{node_td2}{mpwldoJOQTucDLY79bMP6w}{Ge9Glg2TTyCthw9_1wtfnQ}{127.0.0.1}{127.0.0.1:36515}{di},{node_tm0}{ztB71aT-Qfmpv14TI36OOg}{JRG4MuoYREqTFOGkDH3D2g}{127.0.0.1}{127.0.0.1:34059}{m}}, term: 1, version: 14, reason: ApplyCommitRequest{term=1, version=14, sourceNode={node_tm0}{ztB71aT-Qfmpv14TI36OOg}{JRG4MuoYREqTFOGkDH3D2g}{127.0.0.1}{127.0.0.1:34059}{m}}
  1> [2019-10-25T01:22:40,362][INFO ][o.e.s.m.MockRepository   ] [node_td1] starting mock repository with random prefix default
  1> [2019-10-25T01:22:40,442][WARN ][o.e.s.SnapshotShardsService] [node_td1] [[test-idx][0]][test-repo:test-snap/8CMeKZucSfKdomiV2gX9Lw] failed to snapshot shard
  1> org.elasticsearch.index.IndexNotFoundException: no such index [test-idx]
  1>  at org.elasticsearch.indices.IndicesService.indexServiceSafe(IndicesService.java:458) ~[main/:?]
  1>  at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:317) ~[main/:?]
  1>  at org.elasticsearch.snapshots.SnapshotShardsService.lambda$startNewShards$1(SnapshotShardsService.java:285) ~[main/:?]
  1>  at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703) [main/:?]
  1>  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_221]
  1>  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_221]
  1>  at java.lang.Thread.run(Thread.java:748) [?:1.8.0_221]
  1> [2019-10-25T01:22:40,444][INFO ][o.e.n.Node               ] [testDataNodeRestartAfterShardSnapshotFailure] started
  1> [2019-10-25T01:22:40,444][INFO ][o.e.c.s.ClusterApplierService] [node_tm0] added {{node_td1}{Kn1zpH90QPqXMtL1NqqVOg}{AdrMl80YTKu-MHo2vIUEhw}{127.0.0.1}{127.0.0.1:38475}{di}}, term: 1, version: 14, reason: Publication{term=1, version=14}
  1> [2019-10-25T01:22:40,447][INFO ][o.e.s.DedicatedClusterSnapshotRestoreIT] [testDataNodeRestartAfterShardSnapshotFailure] -->  wait for shard snapshot of first primary to show as failed
  1> [2019-10-25T01:22:40,632][INFO ][o.e.c.r.a.AllocationService] [node_tm0] Cluster health status changed from [RED] to [GREEN] (reason: [shards started [[test-idx][0]]]).
  1> [2019-10-25T01:23:46,034][INFO ][o.e.a.a.c.r.c.TransportCleanupRepositoryAction] [node_tm0] Running cleanup operations on repository [test-repo][-1]
  1> [2019-10-25T01:23:46,036][WARN ][o.e.a.a.c.r.c.TransportCleanupRepositoryAction] [node_tm0] Failed to run repository cleanup operations on [test-repo][-1]
  1> java.lang.IllegalStateException: Cannot cleanup [test-repo] - a snapshot is currently running
  1>  at org.elasticsearch.action.admin.cluster.repositories.cleanup.TransportCleanupRepositoryAction$2.execute(TransportCleanupRepositoryAction.java:187) ~[main/:?]
  1>  at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:47) ~[main/:?]
  1>  at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:702) ~[main/:?]
  1>  at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:324) ~[main/:?]
  1>  at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:219) [main/:?]
  1>  at org.elasticsearch.cluster.service.MasterService.access$000(MasterService.java:73) [main/:?]
  1>  at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:151) [main/:?]
  1>  at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [main/:?]
  1>  at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [main/:?]
  1>  at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703) [main/:?]
  1>  at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [main/:?]
  1>  at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [main/:?]
  1>  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_221]
  1>  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_221]
  1>  at java.lang.Thread.run(Thread.java:748) [?:1.8.0_221]
  1> [2019-10-25T01:23:46,039][INFO ][o.e.s.DedicatedClusterSnapshotRestoreIT] [testDataNodeRestartAfterShardSnapshotFailure] [DedicatedClusterSnapshotRestoreIT#testDataNodeRestartAfterShardSnapshotFailure]: cleaning up after test
@DaveCTurner DaveCTurner added :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >test-failure Triaged test failures from CI labels Oct 25, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

@DaveCTurner
Contributor Author

I worked out how to reproduce this:

diff --git a/server/src/main/java/org/elasticsearch/cluster/service/ClusterApplierService.java b/server/src/main/java/org/elasticsearch/cluster/service/ClusterApplierService.java
index f5bbe2d420b..644a349b989 100644
--- a/server/src/main/java/org/elasticsearch/cluster/service/ClusterApplierService.java
+++ b/server/src/main/java/org/elasticsearch/cluster/service/ClusterApplierService.java
@@ -466,6 +466,13 @@ public class ClusterApplierService extends AbstractLifecycleComponent implements
                     summary, newClusterState.term(), newClusterState.version(), task.source);
             }
         }
+        if (nodesDelta.removed()) {
+            try {
+                Thread.sleep(10000);
+            } catch (InterruptedException e) {
+                throw new AssertionError("unexpected", e);
+            }
+        }

         logger.trace("connecting to nodes of cluster state with version {}", newClusterState.version());
         try (Releasable ignored = stopWatch.timing("connecting to new nodes")) {

With a sufficiently long pause here, the master processes the node-left and node-join updates without processing any snapshot-related updates in between, which means it skips the updates that would fail the shard snapshot.
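To make the race concrete, here is a toy model of the master's update ordering (hypothetical names and a deliberately simplified state machine, not Elasticsearch code): if the snapshot-failure update runs between node-left and node-join it observes the missing node and fails the shard entry, but if the two membership updates are processed back-to-back the shard snapshot entry is never failed and is effectively leaked.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model of the race: one shard snapshot entry plus a queue of pending
// master-service updates. All names here are illustrative, not ES internals.
public class SnapshotRaceSketch {
    enum ShardState { INIT, FAILED }

    static ShardState applyAll(Deque<String> updates) {
        ShardState shard = ShardState.INIT;
        boolean nodePresent = true;
        while (!updates.isEmpty()) {
            switch (updates.poll()) {
                case "node-left":
                    nodePresent = false;
                    break;
                case "node-join":
                    nodePresent = true;
                    break;
                case "fail-shards-on-missing-nodes":
                    // Only has an effect if it runs while the node is absent.
                    if (!nodePresent) shard = ShardState.FAILED;
                    break;
            }
        }
        return shard;
    }

    public static void main(String[] args) {
        // Expected ordering: the snapshot update runs between leave and join,
        // so the shard snapshot is failed.
        Deque<String> interleaved = new ArrayDeque<>();
        interleaved.add("node-left");
        interleaved.add("fail-shards-on-missing-nodes");
        interleaved.add("node-join");
        System.out.println("interleaved: " + applyAll(interleaved)); // interleaved: FAILED

        // Ordering seen in the CI failure: node-left and node-join are
        // processed back-to-back, so the shard entry stays INIT ("leaked").
        Deque<String> batched = new ArrayDeque<>();
        batched.add("node-left");
        batched.add("node-join");
        batched.add("fail-shards-on-missing-nodes");
        System.out.println("batched: " + applyAll(batched)); // batched: INIT
    }
}
```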

@original-brownbear
Member

original-brownbear commented Oct 25, 2019

@DaveCTurner Yup, that's how far I got today as well. Unfortunately, there are a number of related issues here, because the re-joining node may start the snapshot again if it's still in INIT state, depending on the timing. I'll look into how to clean this up tomorrow :)
Also, the snapshot status API is broken in this scenario: it fails to properly translate the failed shard in the cluster state into a response for a node that rejoined the cluster.

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Oct 27, 2019
Fixes the shard snapshot status reporting for failed shards
in the corner case where the shard is failed by an exception
thrown in `SnapshotShardsService` and not in the repository.
We were missing the update on the `snapshotStatus` instance in
this case, which made the transport APIs that use this field
report back an incorrect status.
Fixed by moving the failure handling to the `SnapshotShardsService`
for all cases (which also simplifies the code; the exception
wrapping in the repository was pointless, as we only used the
exception trace upstream anyway).
Also added an assertion to another test that already explicitly
checks this failure situation (an exception in the `SnapshotShardsService`).

Closes elastic#48526
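The shape of the fix described in the commit message above can be sketched roughly as follows. These are hypothetical, heavily simplified classes; the real `SnapshotShardsService` and `BlobStoreRepository` are far more involved, and the master notification is only hinted at in a comment.

```java
// Simplified sketch: the service, not the repository, owns failure handling,
// so the local status instance is always updated before the master is
// notified, regardless of where the exception was thrown.
public class FailureHandlingSketch {
    static class ShardSnapshotStatus {
        String state = "STARTED";
        String failure;
    }

    interface Repository {
        // May throw for any reason; no longer wraps exceptions itself.
        void snapshotShard() throws Exception;
    }

    static ShardSnapshotStatus snapshot(Repository repo) {
        ShardSnapshotStatus status = new ShardSnapshotStatus();
        try {
            repo.snapshotShard();
            status.state = "SUCCESS";
        } catch (Exception e) {
            // The previously missing step: update the status instance here so
            // that status APIs report FAILED even when the exception was
            // thrown in the service rather than in the repository.
            status.state = "FAILED";
            status.failure = e.getMessage();
        }
        // ... send the final state to the master here (omitted) ...
        return status;
    }

    public static void main(String[] args) {
        ShardSnapshotStatus s = snapshot(() -> {
            throw new IllegalStateException("no such index [test-idx]");
        });
        System.out.println(s.state + ": " + s.failure);
    }
}
```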
@original-brownbear
Member

original-brownbear commented Oct 27, 2019

Status APIs struck again, it turns out :) -> fix incoming in #48556

The fact that we were missing the node-left event turns out to be irrelevant, since that build still had a proper failure on the primary node (from missing the index allocation after the restart). I don't think we want to change the logic here to make sure we always catch the node-left somehow ... in practice, if the node restarts and correctly picks up the snapshot for its shards, that's fine and not something to fail on after all :)

original-brownbear added a commit that referenced this issue Oct 29, 2019
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Oct 30, 2019
original-brownbear added a commit that referenced this issue Oct 30, 2019
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Oct 30, 2019
original-brownbear added a commit that referenced this issue Oct 30, 2019