
testDataNodeRestartAfterShardSnapshotFailure fails by leaking a shard snapshot on a failed node #48526

Closed
DaveCTurner opened this issue Oct 25, 2019 · 4 comments · Fixed by #48556
Assignees
Labels
:Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >test-failure Triaged test failures from CI

Comments

@DaveCTurner
Contributor

This test failed in this CI run: https://gradle-enterprise.elastic.co/s/rlbqh5tapantg/ (in 7.x). The test shuts down a node while that node is snapshotting a shard, but the shard snapshot is never marked as failed:

  1> [2019-10-25T01:22:40,150][INFO ][o.e.s.DedicatedClusterSnapshotRestoreIT] [testDataNodeRestartAfterShardSnapshotFailure] -->  snapshot
  1> [2019-10-25T01:22:40,160][INFO ][o.e.s.SnapshotsService   ] [node_tm0] snapshot [test-repo:test-snap/8CMeKZucSfKdomiV2gX9Lw] started
  1> [2019-10-25T01:22:40,217][INFO ][o.e.s.DedicatedClusterSnapshotRestoreIT] [testDataNodeRestartAfterShardSnapshotFailure] -->  restarting first data node, which should cause the primary shard on it to be failed
  1> [2019-10-25T01:22:40,217][INFO ][o.e.t.InternalTestCluster] [testDataNodeRestartAfterShardSnapshotFailure] Restarting node [node_td1] 
  1> [2019-10-25T01:22:40,218][INFO ][o.e.n.Node               ] [testDataNodeRestartAfterShardSnapshotFailure] stopping ...
  1> [2019-10-25T01:22:40,220][INFO ][o.e.c.c.Coordinator      ] [node_td1] master node [{node_tm0}{ztB71aT-Qfmpv14TI36OOg}{JRG4MuoYREqTFOGkDH3D2g}{127.0.0.1}{127.0.0.1:34059}{m}] failed, restarting discovery
  1> org.elasticsearch.transport.NodeDisconnectedException: [node_tm0][127.0.0.1:34059][disconnected] disconnected
  1> [2019-10-25T01:22:40,223][INFO ][o.e.s.m.MockRepository   ] [node_td2] [test-repo] blocking I/O operation for file [__Gee36ShzRIWE9EIdskMBfw.part0] at path [[indices][Md-z-wvRQPuh-C_U1reAfQ][1]]
  1> [2019-10-25T01:22:40,224][INFO ][o.e.c.s.MasterService    ] [node_tm0] node-left[{node_td1}{Kn1zpH90QPqXMtL1NqqVOg}{DKlTH9pkRymZtUbRoe8_rA}{127.0.0.1}{127.0.0.1:37309}{di} reason: disconnected], term: 1, version: 12, delta: removed {{node_td1}{Kn1zpH90QPqXMtL1NqqVOg}{DKlTH9pkRymZtUbRoe8_rA}{127.0.0.1}{127.0.0.1:37309}{di}}
  1> [2019-10-25T01:22:40,224][WARN ][o.e.s.SnapshotShardsService] [node_td1] [[test-idx][0]][test-repo:test-snap/8CMeKZucSfKdomiV2gX9Lw] failed to snapshot shard
  1> org.elasticsearch.index.snapshots.IndexShardSnapshotFailedException: Aborted
  1>  at org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:1112) ~[main/:?]
  1>  at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:337) ~[main/:?]
  1>  at org.elasticsearch.snapshots.SnapshotShardsService.lambda$startNewShards$1(SnapshotShardsService.java:285) ~[main/:?]
  1>  at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703) [main/:?]
  1>  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_221]
  1>  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_221]
  1>  at java.lang.Thread.run(Thread.java:748) [?:1.8.0_221]
  1> [2019-10-25T01:22:40,226][INFO ][o.e.n.Node               ] [testDataNodeRestartAfterShardSnapshotFailure] stopped
  1> [2019-10-25T01:22:40,227][INFO ][o.e.n.Node               ] [testDataNodeRestartAfterShardSnapshotFailure] closing ...
  1> [2019-10-25T01:22:40,229][WARN ][o.e.s.SnapshotShardsService] [node_td1] [test-repo:test-snap/8CMeKZucSfKdomiV2gX9Lw] [ShardSnapshotStatus[state=FAILED, nodeId=Kn1zpH90QPqXMtL1NqqVOg, reason=[test-idx/l-lv0KpjQX6MGCl5PTKXUw][[test-idx][0]] IndexShardSnapshotFailedException[Aborted]
  1>  at org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:1112)
  1>  at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:337)
  1>  at org.elasticsearch.snapshots.SnapshotShardsService.lambda$startNewShards$1(SnapshotShardsService.java:285)
  1>  at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703)
  1>  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
  1>  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
  1>  at java.lang.Thread.run(Thread.java:748)
  1> , generation=null]] failed to update snapshot state
  1> org.elasticsearch.transport.SendRequestTransportException: [node_td1][127.0.0.1:37309][internal:cluster/snapshot/update_snapshot_status]
  1>  at org.elasticsearch.transport.TransportService.sendRequestInternal(TransportService.java:704) ~[main/:?]
  1>  at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:602) ~[main/:?]
  1>  at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:577) ~[main/:?]
  1>  at org.elasticsearch.snapshots.SnapshotShardsService.lambda$sendSnapshotShardUpdate$2(SnapshotShardsService.java:471) ~[main/:?]
  1>  at org.elasticsearch.transport.TransportRequestDeduplicator.executeOnce(TransportRequestDeduplicator.java:52) ~[main/:?]
  1>  at org.elasticsearch.snapshots.SnapshotShardsService.sendSnapshotShardUpdate(SnapshotShardsService.java:457) ~[main/:?]
  1>  at org.elasticsearch.snapshots.SnapshotShardsService.notifyFailedSnapshotShard(SnapshotShardsService.java:451) ~[main/:?]
  1>  at org.elasticsearch.snapshots.SnapshotShardsService.access$400(SnapshotShardsService.java:91) ~[main/:?]
  1>  at org.elasticsearch.snapshots.SnapshotShardsService$1.onFailure(SnapshotShardsService.java:301) ~[main/:?]
  1>  at org.elasticsearch.action.ActionListener$5.onFailure(ActionListener.java:256) ~[main/:?]
  1>  at org.elasticsearch.repositories.blobstore.BlobStoreRepository.lambda$snapshotShard$32(BlobStoreRepository.java:1061) ~[main/:?]
  1>  at org.elasticsearch.action.ActionListener$1.onFailure(ActionListener.java:71) ~[main/:?]
  1>  at org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:1239) ~[main/:?]
  1>  at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:337) ~[main/:?]
  1>  at org.elasticsearch.snapshots.SnapshotShardsService.lambda$startNewShards$1(SnapshotShardsService.java:285) ~[main/:?]
  1>  at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703) ~[main/:?]
  1>  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_221]
  1>  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_221]
  1>  at java.lang.Thread.run(Thread.java:748) [?:1.8.0_221]
  1> Caused by: org.elasticsearch.transport.TransportException: TransportService is closed stopped can't send request
  1>  at org.elasticsearch.transport.TransportService.sendRequestInternal(TransportService.java:686) ~[main/:?]
  1>  ... 18 more
  1> [2019-10-25T01:22:40,232][INFO ][o.e.n.Node               ] [testDataNodeRestartAfterShardSnapshotFailure] closed
  1> [2019-10-25T01:22:40,240][INFO ][o.e.e.NodeEnvironment    ] [testDataNodeRestartAfterShardSnapshotFailure] using [1] data paths, mounts [[/ (/dev/sda1)]], net usable_space [307gb], net total_space [349.9gb], types [xfs]
  1> [2019-10-25T01:22:40,240][INFO ][o.e.e.NodeEnvironment    ] [testDataNodeRestartAfterShardSnapshotFailure] heap size [491mb], compressed ordinary object pointers [true]
  1> [2019-10-25T01:22:40,249][INFO ][o.e.n.Node               ] [testDataNodeRestartAfterShardSnapshotFailure] node name [node_td1], node ID [Kn1zpH90QPqXMtL1NqqVOg], cluster name [TEST-TEST_WORKER_VM=[435]-CLUSTER_SEED=[-2213907199717514849]-HASH=[17AAA92C130]-cluster]
  1> [2019-10-25T01:22:40,249][INFO ][o.e.n.Node               ] [testDataNodeRestartAfterShardSnapshotFailure] version[7.6.0], pid[217518], build[unknown/unknown/unknown/unknown], OS[Linux/4.14.35-1902.6.6.el7uek.x86_64/amd64], JVM[Oracle Corporation/Java HotSpot(TM) 64-Bit Server VM/1.8.0_221/25.221-b11]
  1> [2019-10-25T01:22:40,249][INFO ][o.e.n.Node               ] [testDataNodeRestartAfterShardSnapshotFailure] JVM home [/var/lib/jenkins/.java/oracle-8u221-linux/jre]
  1> [2019-10-25T01:22:40,249][INFO ][o.e.n.Node               ] [testDataNodeRestartAfterShardSnapshotFailure] JVM arguments [-Dfile.encoding=UTF8, -Dcompiler.java=12, -Des.scripting.update.ctx_in_params=false, -Des.transport.cname_in_publish_address=true, -Dgradle.dist.lib=/var/lib/jenkins/.gradle/wrapper/dists/gradle-5.6.2-all/9st6wgf78h16so49nn74lgtbb/gradle-5.6.2/lib, -Dgradle.user.home=/var/lib/jenkins/.gradle, -Dgradle.worker.jar=/var/lib/jenkins/.gradle/caches/5.6.2/workerMain/gradle-worker.jar, -Dio.netty.noKeySetOptimization=true, -Dio.netty.noUnsafe=true, -Dio.netty.recycler.maxCapacityPerThread=0, -Djava.awt.headless=true, -Djava.locale.providers=SPI,JRE, -Djava.security.manager=worker.org.gradle.process.internal.worker.child.BootstrapSecurityManager, -Djna.nosys=true, -Dorg.gradle.native=false, -Druntime.java=8, -Dtests.artifact=server, -Dtests.gradle=true, -Dtests.logger.level=WARN, -Dtests.security.manager=true, -Dtests.seed=14AC7937BAB45A6E, -Dtests.task=:server:integTest, -esa, -XX:+HeapDumpOnOutOfMemoryError, -XX:HeapDumpPath=/var/lib/jenkins/workspace/elastic+elasticsearch+7.x+multijob-unix-compatibility/os/oraclelinux-7&&immutable/server/build/heapdump, -Xms512m, -Xmx512m, -Dfile.encoding=UTF-8, -Djava.io.tmpdir=./temp, -Duser.country=US, -Duser.language=en, -Duser.variant, -ea]
  1> [2019-10-25T01:22:40,250][INFO ][o.e.p.PluginsService     ] [testDataNodeRestartAfterShardSnapshotFailure] no modules loaded
  1> [2019-10-25T01:22:40,250][INFO ][o.e.p.PluginsService     ] [testDataNodeRestartAfterShardSnapshotFailure] loaded plugin [org.elasticsearch.snapshots.DedicatedClusterSnapshotRestoreIT$BrokenSettingPlugin]
  1> [2019-10-25T01:22:40,250][INFO ][o.e.p.PluginsService     ] [testDataNodeRestartAfterShardSnapshotFailure] loaded plugin [org.elasticsearch.snapshots.DedicatedClusterSnapshotRestoreIT$TestCustomMetaDataPlugin]
  1> [2019-10-25T01:22:40,250][INFO ][o.e.p.PluginsService     ] [testDataNodeRestartAfterShardSnapshotFailure] loaded plugin [org.elasticsearch.snapshots.mockstore.MockRepository$Plugin]
  1> [2019-10-25T01:22:40,250][INFO ][o.e.p.PluginsService     ] [testDataNodeRestartAfterShardSnapshotFailure] loaded plugin [org.elasticsearch.test.ESIntegTestCase$AssertActionNamePlugin]
  1> [2019-10-25T01:22:40,250][INFO ][o.e.p.PluginsService     ] [testDataNodeRestartAfterShardSnapshotFailure] loaded plugin [org.elasticsearch.test.ESIntegTestCase$TestSeedPlugin]
  1> [2019-10-25T01:22:40,250][INFO ][o.e.p.PluginsService     ] [testDataNodeRestartAfterShardSnapshotFailure] loaded plugin [org.elasticsearch.test.MockHttpTransport$TestPlugin]
  1> [2019-10-25T01:22:40,250][INFO ][o.e.p.PluginsService     ] [testDataNodeRestartAfterShardSnapshotFailure] loaded plugin [org.elasticsearch.transport.nio.MockNioTransportPlugin]
  1> [2019-10-25T01:22:40,269][INFO ][o.e.d.DiscoveryModule    ] [testDataNodeRestartAfterShardSnapshotFailure] using discovery type [zen] and seed hosts providers [settings, file]
  1> [2019-10-25T01:22:40,290][INFO ][o.e.n.Node               ] [testDataNodeRestartAfterShardSnapshotFailure] initialized
  1> [2019-10-25T01:22:40,290][INFO ][o.e.n.Node               ] [testDataNodeRestartAfterShardSnapshotFailure] starting ...
  1> [2019-10-25T01:22:40,294][INFO ][o.e.t.TransportService   ] [testDataNodeRestartAfterShardSnapshotFailure] publish_address {127.0.0.1:38475}, bound_addresses {[::1]:35853}, {127.0.0.1:38475}
  1> [2019-10-25T01:22:40,301][INFO ][o.e.c.s.ClusterApplierService] [node_td2] removed {{node_td1}{Kn1zpH90QPqXMtL1NqqVOg}{DKlTH9pkRymZtUbRoe8_rA}{127.0.0.1}{127.0.0.1:37309}{di}}, term: 1, version: 12, reason: ApplyCommitRequest{term=1, version=12, sourceNode={node_tm0}{ztB71aT-Qfmpv14TI36OOg}{JRG4MuoYREqTFOGkDH3D2g}{127.0.0.1}{127.0.0.1:34059}{m}}
  1> [2019-10-25T01:22:40,305][INFO ][o.e.c.s.ClusterApplierService] [node_tm0] removed {{node_td1}{Kn1zpH90QPqXMtL1NqqVOg}{DKlTH9pkRymZtUbRoe8_rA}{127.0.0.1}{127.0.0.1:37309}{di}}, term: 1, version: 12, reason: Publication{term=1, version=12}
  1> [2019-10-25T01:22:40,309][INFO ][o.e.c.c.Coordinator      ] [testDataNodeRestartAfterShardSnapshotFailure] cluster UUID [6WUNaFxyS9e_GZgDQ7k3FQ]
  1> [2019-10-25T01:22:40,356][INFO ][o.e.c.s.MasterService    ] [node_tm0] node-join[{node_td1}{Kn1zpH90QPqXMtL1NqqVOg}{AdrMl80YTKu-MHo2vIUEhw}{127.0.0.1}{127.0.0.1:38475}{di} join existing leader], term: 1, version: 14, delta: added {{node_td1}{Kn1zpH90QPqXMtL1NqqVOg}{AdrMl80YTKu-MHo2vIUEhw}{127.0.0.1}{127.0.0.1:38475}{di}}
  1> [2019-10-25T01:22:40,359][INFO ][o.e.c.s.ClusterApplierService] [node_td2] added {{node_td1}{Kn1zpH90QPqXMtL1NqqVOg}{AdrMl80YTKu-MHo2vIUEhw}{127.0.0.1}{127.0.0.1:38475}{di}}, term: 1, version: 14, reason: ApplyCommitRequest{term=1, version=14, sourceNode={node_tm0}{ztB71aT-Qfmpv14TI36OOg}{JRG4MuoYREqTFOGkDH3D2g}{127.0.0.1}{127.0.0.1:34059}{m}}
  1> [2019-10-25T01:22:40,359][INFO ][o.e.c.s.ClusterApplierService] [node_td1] master node changed {previous [], current [{node_tm0}{ztB71aT-Qfmpv14TI36OOg}{JRG4MuoYREqTFOGkDH3D2g}{127.0.0.1}{127.0.0.1:34059}{m}]}, added {{node_td2}{mpwldoJOQTucDLY79bMP6w}{Ge9Glg2TTyCthw9_1wtfnQ}{127.0.0.1}{127.0.0.1:36515}{di},{node_tm0}{ztB71aT-Qfmpv14TI36OOg}{JRG4MuoYREqTFOGkDH3D2g}{127.0.0.1}{127.0.0.1:34059}{m}}, term: 1, version: 14, reason: ApplyCommitRequest{term=1, version=14, sourceNode={node_tm0}{ztB71aT-Qfmpv14TI36OOg}{JRG4MuoYREqTFOGkDH3D2g}{127.0.0.1}{127.0.0.1:34059}{m}}
  1> [2019-10-25T01:22:40,362][INFO ][o.e.s.m.MockRepository   ] [node_td1] starting mock repository with random prefix default
  1> [2019-10-25T01:22:40,442][WARN ][o.e.s.SnapshotShardsService] [node_td1] [[test-idx][0]][test-repo:test-snap/8CMeKZucSfKdomiV2gX9Lw] failed to snapshot shard
  1> org.elasticsearch.index.IndexNotFoundException: no such index [test-idx]
  1>  at org.elasticsearch.indices.IndicesService.indexServiceSafe(IndicesService.java:458) ~[main/:?]
  1>  at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:317) ~[main/:?]
  1>  at org.elasticsearch.snapshots.SnapshotShardsService.lambda$startNewShards$1(SnapshotShardsService.java:285) ~[main/:?]
  1>  at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703) [main/:?]
  1>  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_221]
  1>  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_221]
  1>  at java.lang.Thread.run(Thread.java:748) [?:1.8.0_221]
  1> [2019-10-25T01:22:40,444][INFO ][o.e.n.Node               ] [testDataNodeRestartAfterShardSnapshotFailure] started
  1> [2019-10-25T01:22:40,444][INFO ][o.e.c.s.ClusterApplierService] [node_tm0] added {{node_td1}{Kn1zpH90QPqXMtL1NqqVOg}{AdrMl80YTKu-MHo2vIUEhw}{127.0.0.1}{127.0.0.1:38475}{di}}, term: 1, version: 14, reason: Publication{term=1, version=14}
  1> [2019-10-25T01:22:40,447][INFO ][o.e.s.DedicatedClusterSnapshotRestoreIT] [testDataNodeRestartAfterShardSnapshotFailure] -->  wait for shard snapshot of first primary to show as failed
  1> [2019-10-25T01:22:40,632][INFO ][o.e.c.r.a.AllocationService] [node_tm0] Cluster health status changed from [RED] to [GREEN] (reason: [shards started [[test-idx][0]]]).
  1> [2019-10-25T01:23:46,034][INFO ][o.e.a.a.c.r.c.TransportCleanupRepositoryAction] [node_tm0] Running cleanup operations on repository [test-repo][-1]
  1> [2019-10-25T01:23:46,036][WARN ][o.e.a.a.c.r.c.TransportCleanupRepositoryAction] [node_tm0] Failed to run repository cleanup operations on [test-repo][-1]
  1> java.lang.IllegalStateException: Cannot cleanup [test-repo] - a snapshot is currently running
  1>  at org.elasticsearch.action.admin.cluster.repositories.cleanup.TransportCleanupRepositoryAction$2.execute(TransportCleanupRepositoryAction.java:187) ~[main/:?]
  1>  at org.elasticsearch.cluster.ClusterStateUpdateTask.execute(ClusterStateUpdateTask.java:47) ~[main/:?]
  1>  at org.elasticsearch.cluster.service.MasterService.executeTasks(MasterService.java:702) ~[main/:?]
  1>  at org.elasticsearch.cluster.service.MasterService.calculateTaskOutputs(MasterService.java:324) ~[main/:?]
  1>  at org.elasticsearch.cluster.service.MasterService.runTasks(MasterService.java:219) [main/:?]
  1>  at org.elasticsearch.cluster.service.MasterService.access$000(MasterService.java:73) [main/:?]
  1>  at org.elasticsearch.cluster.service.MasterService$Batcher.run(MasterService.java:151) [main/:?]
  1>  at org.elasticsearch.cluster.service.TaskBatcher.runIfNotProcessed(TaskBatcher.java:150) [main/:?]
  1>  at org.elasticsearch.cluster.service.TaskBatcher$BatchedTask.run(TaskBatcher.java:188) [main/:?]
  1>  at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:703) [main/:?]
  1>  at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.runAndClean(PrioritizedEsThreadPoolExecutor.java:252) [main/:?]
  1>  at org.elasticsearch.common.util.concurrent.PrioritizedEsThreadPoolExecutor$TieBreakingPrioritizedRunnable.run(PrioritizedEsThreadPoolExecutor.java:215) [main/:?]
  1>  at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) [?:1.8.0_221]
  1>  at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) [?:1.8.0_221]
  1>  at java.lang.Thread.run(Thread.java:748) [?:1.8.0_221]
  1> [2019-10-25T01:23:46,039][INFO ][o.e.s.DedicatedClusterSnapshotRestoreIT] [testDataNodeRestartAfterShardSnapshotFailure] [DedicatedClusterSnapshotRestoreIT#testDataNodeRestartAfterShardSnapshotFailure]: cleaning up after test
@DaveCTurner DaveCTurner added :Distributed Coordination/Snapshot/Restore Anything directly related to the `_snapshot/*` APIs >test-failure Triaged test failures from CI labels Oct 25, 2019
@elasticmachine
Collaborator

Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)

@DaveCTurner
Contributor Author

I worked out how to reproduce this:

diff --git a/server/src/main/java/org/elasticsearch/cluster/service/ClusterApplierService.java b/server/src/main/java/org/elasticsearch/cluster/service/ClusterApplierService.java
index f5bbe2d420b..644a349b989 100644
--- a/server/src/main/java/org/elasticsearch/cluster/service/ClusterApplierService.java
+++ b/server/src/main/java/org/elasticsearch/cluster/service/ClusterApplierService.java
@@ -466,6 +466,13 @@ public class ClusterApplierService extends AbstractLifecycleComponent implements
                     summary, newClusterState.term(), newClusterState.version(), task.source);
             }
         }
+        if (nodesDelta.removed()) {
+            try {
+                Thread.sleep(10000);
+            } catch (InterruptedException e) {
+                throw new AssertionError("unexpected", e);
+            }
+        }

         logger.trace("connecting to nodes of cluster state with version {}", newClusterState.version());
         try (Releasable ignored = stopWatch.timing("connecting to new nodes")) {

With a sufficiently long pause here, the master processes the node-left and node-join updates without processing any snapshot-related updates in between, which means it skips the updates that would fail the shard snapshot.
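To make the race concrete, here is a toy model of the master's update ordering (hypothetical names and a deliberately simplified state machine, not Elasticsearch code): if the snapshot-failure update runs between node-left and node-join it observes the missing node and fails the shard entry, but if the two membership updates are processed back-to-back the shard snapshot entry is never failed and is effectively leaked.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Toy model of the race: one shard snapshot entry plus a queue of pending
// master-service updates. All names here are illustrative, not ES internals.
public class SnapshotRaceSketch {
    enum ShardState { INIT, FAILED }

    static ShardState applyAll(Deque<String> updates) {
        ShardState shard = ShardState.INIT;
        boolean nodePresent = true;
        while (!updates.isEmpty()) {
            switch (updates.poll()) {
                case "node-left":
                    nodePresent = false;
                    break;
                case "node-join":
                    nodePresent = true;
                    break;
                case "fail-shards-on-missing-nodes":
                    // Only has an effect if it runs while the node is absent.
                    if (!nodePresent) shard = ShardState.FAILED;
                    break;
            }
        }
        return shard;
    }

    public static void main(String[] args) {
        // Expected ordering: the snapshot update runs between leave and join,
        // so the shard snapshot is failed.
        Deque<String> interleaved = new ArrayDeque<>();
        interleaved.add("node-left");
        interleaved.add("fail-shards-on-missing-nodes");
        interleaved.add("node-join");
        System.out.println("interleaved: " + applyAll(interleaved)); // interleaved: FAILED

        // Ordering seen in the CI failure: node-left and node-join are
        // processed back-to-back, so the shard entry stays INIT ("leaked").
        Deque<String> batched = new ArrayDeque<>();
        batched.add("node-left");
        batched.add("node-join");
        batched.add("fail-shards-on-missing-nodes");
        System.out.println("batched: " + applyAll(batched)); // batched: INIT
    }
}
```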

@original-brownbear
Member

original-brownbear commented Oct 25, 2019

@DaveCTurner Yup, that's how far I got today as well. Unfortunately, there are a number of related issues here, because the re-joining node may start the snapshot again if it's still in INIT state, depending on the timing. I'll look into how to clean this up tomorrow :)
Also, the snapshot status API is broken in this scenario: it fails to properly translate the failed shard in the cluster state into a response for a node that rejoined the cluster.

original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Oct 27, 2019
Fixes the shard snapshot status reporting for failed shards
in the corner case where the shard is failed by an exception
thrown in `SnapshotShardsService` and not in the repository.
We were missing the update on the `snapshotStatus` instance in
this case, which made the transport APIs that use this field
report back an incorrect status.
Fixed by moving the failure handling to the `SnapshotShardsService`
for all cases (which also simplifies the code; the exception
wrapping in the repository was pointless, as we only used the
exception trace upstream anyway).
Also added an assertion to another test that already explicitly
checks this failure situation (an exception in the `SnapshotShardsService`).

Closes elastic#48526
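The shape of the fix described in the commit message above can be sketched roughly as follows. These are hypothetical, heavily simplified classes; the real `SnapshotShardsService` and `BlobStoreRepository` are far more involved, and the master notification is only hinted at in a comment.

```java
// Simplified sketch: the service, not the repository, owns failure handling,
// so the local status instance is always updated before the master is
// notified, regardless of where the exception was thrown.
public class FailureHandlingSketch {
    static class ShardSnapshotStatus {
        String state = "STARTED";
        String failure;
    }

    interface Repository {
        // May throw for any reason; no longer wraps exceptions itself.
        void snapshotShard() throws Exception;
    }

    static ShardSnapshotStatus snapshot(Repository repo) {
        ShardSnapshotStatus status = new ShardSnapshotStatus();
        try {
            repo.snapshotShard();
            status.state = "SUCCESS";
        } catch (Exception e) {
            // The previously missing step: update the status instance here so
            // that status APIs report FAILED even when the exception was
            // thrown in the service rather than in the repository.
            status.state = "FAILED";
            status.failure = e.getMessage();
        }
        // ... send the final state to the master here (omitted) ...
        return status;
    }

    public static void main(String[] args) {
        ShardSnapshotStatus s = snapshot(() -> {
            throw new IllegalStateException("no such index [test-idx]");
        });
        System.out.println(s.state + ": " + s.failure);
    }
}
```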
@original-brownbear
Member

original-brownbear commented Oct 27, 2019

Status APIs struck again, it turns out :) -> fix incoming in #48556

The fact that we were missing the node-left event turns out to be irrelevant, since that build still had a proper failure on the primary node (from missing the index allocation after the restart). I don't think we want to change the logic here to make sure we always catch the node-left somehow ... in practice, if the node restarts and correctly picks up the snapshot for its shards, that's fine and not something to fail on after all :)

original-brownbear added a commit that referenced this issue Oct 29, 2019
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Oct 30, 2019
original-brownbear added a commit that referenced this issue Oct 30, 2019
original-brownbear added a commit to original-brownbear/elasticsearch that referenced this issue Oct 30, 2019
original-brownbear added a commit that referenced this issue Oct 30, 2019