
Wait for new master when failing shard #15748

Merged
merged 13 commits into elastic:master from jasontedor:shard-failure-no-master-retry on Jan 17, 2016
Conversation

jasontedor
Member

This commit handles the situation when we are failing a shard and either
no master is known, or the known master left while failing the shard. We
handle this situation by waiting for a new master to be reelected, and
then sending the shard failed request to the new master.

Relates #14252
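For illustration, the retry described above can be sketched as follows. This is only a sketch: waitForNewMasterAndRetry, sendShardFailed, the listener callbacks, and the warn log on cluster service close follow snippets and commit messages later in this thread, and may not match the merged code exactly.

// Sketch only: wait for the next cluster state (e.g. a new master being
// elected) and then resend the shard-failed request.
private void waitForNewMasterAndRetry(ClusterStateObserver observer, ShardRoutingEntry shardRoutingEntry, Listener listener) {
    observer.waitForNextChange(new ClusterStateObserver.Listener() {
        @Override
        public void onNewClusterState(ClusterState state) {
            // a new cluster state was published; retry against the (possibly new) master
            sendShardFailed(observer, shardRoutingEntry, listener);
        }

        @Override
        public void onClusterServiceClose() {
            // the local node is shutting down; stop retrying and report the failure
            logger.warn("node closed while waiting for new master to handle failed shard [{}]", shardRoutingEntry);
            listener.onShardFailedFailure(new NodeClosedException(clusterService.localNode()));
        }

        @Override
        public void onTimeout(TimeValue timeout) {
            // no timeout is registered while waiting for a new master (see the commits below)
            assert false : "no timeout while waiting for a new master";
        }
    });
}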

@jasontedor
Member Author

Note that this pull request contains refactorings from #15735 and #15736 that I extracted into separate pull requests.

@jasontedor jasontedor mentioned this pull request Jan 4, 2016
@jasontedor
Member Author

@bleskes This is ready for review now.

@bleskes
Contributor

bleskes commented Jan 8, 2016

Discussed with @jasontedor. Decided to try folding all channel-related retries (connection, master loss, and any unknown exception) into ShardStateAction, as opposed to dealing with that in TransportReplicationAction.
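Roughly, the agreed direction can be sketched like this (illustrative only; isMasterChannelException, waitForNewMasterAndRetry, and the listener are assumed names, not necessarily what ends up in ShardStateAction):

// Sketch: the transport response handler registered by ShardStateAction
// decides whether a failure is a master/channel problem worth retrying or a
// real failure to report back to the caller.
@Override
public void handleException(TransportException exp) {
    if (isMasterChannelException(exp)) {
        // connection failures, master loss, or a failed publish: wait for a
        // new master to be elected and resend the shard-failed request
        waitForNewMasterAndRetry(observer, shardRoutingEntry, listener);
    } else {
        // anything else is unexpected and is reported to the caller
        listener.onShardFailedFailure(exp);
    }
}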

@jasontedor
Member Author

@bleskes I pushed f49435c to implement what we have discussed.

// force a new cluster state to simulate a new master having been elected
clusterService.setState(ClusterState.builder(clusterService.state()));
transport.handleResponse(currentRequest.requestId, new NotMasterException("shard-failed-test"));
CapturingTransport.CapturedRequest[] retryRequests = transport.capturedRequests();
Contributor

We should probably have a clearAndGetCapturedRequests. We use this often.
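For illustration, such a helper could be as small as this (a hypothetical sketch, assuming CapturingTransport keeps its capturedRequests() accessor and has a way to clear the captured list):

// Hypothetical helper on CapturingTransport: hand back what was captured so
// far and reset the buffer so later assertions only see new requests.
public CapturedRequest[] clearAndGetCapturedRequests() {
    CapturedRequest[] requests = capturedRequests();
    clear();
    return requests;
}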

Member Author

I opened #15897.

@bleskes
Contributor

bleskes commented Jan 11, 2016

I think this looks great and is the right way to go. Left some comments. I'm doubtful that we need the timeout mechanism - it just adds complexity. Let's talk this one over.

Also - can we open follow-up issues to deal with shard started and to remove the retry logic from IndicesClusterService? (That will make that class simpler.)

@jasontedor
Member Author

Also - can we open follow-up issues to deal with shard started

I opened #15895.

and to remove the retry logic from IndicesClusterService? (That will make that class simpler.)

I opened #15896.

@jasontedor
Member Author

@bleskes I think this is ready for another review round.

@@ -92,8 +92,6 @@
*/
public abstract class TransportReplicationAction<Request extends ReplicationRequest, ReplicaRequest extends ReplicationRequest, Response extends ReplicationResponse> extends TransportAction<Request, Response> {

public static final String SHARD_FAILURE_TIMEOUT = "action.support.replication.shard.failure_timeout";
Contributor

Thank you.

@jasontedor
Member Author

@bleskes I pushed more commits.

@jasontedor jasontedor changed the title Wait for new master when failing shard Handle channel failures on shard state requests Jan 14, 2016
This commit handles the situation when we are failing a shard and either
no master is known, or the known master left while failing the shard. We
handle this situation by waiting for a new master to be reelected, and
then sending the shard failed request to the new master.
This commit adds a simulation of the master leaving after a shard
failure request has been sent. In this case, after a new cluster state
is published (simulating a new master having been elected), the request
to fail the shard should be retried.
This commit moves the handling of channel failures when failing a shard
to o.e.c.a.s.ShardStateAction. This means that shard failure requests
that time out or occur when there is no master or the master leaves after
the request is sent will now be retried from here. The listener for a
shard failed request will now only be notified upon successful
completion of the shard failed request, or when a catastrophic
non-channel failure occurs.
This commit removes the timeout retry mechanism from ShardStateAction
allowing it to instead be handled by the general master channel retry
mechanism. The idea is that if there is a network issue, the master will
miss a ping timeout causing the channel to be closed which will expose
itself via a NodeDisconnectedException. At this point, we can just wait
for a new master and retry, as with any other master channel exception.
This commit adds Discovery.FailedToCommitClusterStateException to the
list of channel failures that ShardStateAction handles and retries.
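Taken together, the classification these commits describe looks roughly like this (a sketch; only exception types named in this thread are listed, and the final list in the PR may differ):

// Sketch of the "master channel" classification from the commit messages above.
private static boolean isMasterChannelException(TransportException exp) {
    Throwable cause = exp.getCause();
    return cause instanceof NotMasterException
            || cause instanceof Discovery.FailedToCommitClusterStateException
            || cause instanceof NodeDisconnectedException;
}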
@jasontedor jasontedor changed the title Handle channel failures on shard state requests Wait for new master when failing shard Jan 14, 2016
observer.waitForNextChange(new ClusterStateObserver.Listener() {
@Override
public void onNewClusterState(ClusterState state) {
sendShardFailed(observer, shardRoutingEntry, listener);
Contributor

can we add a trace log here?

Member Author

I pushed fe39d11.
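For reference, the requested trace logging could be as simple as the following (a sketch; the actual message in fe39d11 may differ):

@Override
public void onNewClusterState(ClusterState state) {
    // trace the state that ended the wait before retrying
    if (logger.isTraceEnabled()) {
        logger.trace("new cluster state [{}] after waiting for master election to fail shard [{}]", state.prettyPrint(), shardRoutingEntry);
    }
    sendShardFailed(observer, shardRoutingEntry, listener);
}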

@bleskes
Contributor

bleskes commented Jan 15, 2016

Thanks @jasontedor. Looks great. I left some comments w.r.t. testing. A couple of things are missing in that area:

  1. A simple test for success - without a retry/exceptions.
  2. Validation of the message sent to the master - we learned from bitter experience that we should test that ... :)

@jasontedor
Member Author

Validation of the message sent to the master - we learned from bitter experience that we should test that

@bleskes Can you elaborate what you mean? I have some commits locally I was planning for another PR (checking if the shard is in the routing table, checking if the shard is a replica not being failed by the primary, checking if the shard is a primary not being failed by the owning node) but want to be sure we're thinking the same thing.

@bleskes
Contributor

bleskes commented Jan 15, 2016

I meant that we should check the message is the right class and that it contains the shard routing we gave it.
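In test terms, that validation could look something like this (a sketch building on the CapturingTransport snippet earlier in the thread; the constant and accessor names here are assumptions):

// Sketch: the captured request should target the shard-failed action, be of
// the expected request class, and carry the shard routing the test handed in.
CapturingTransport.CapturedRequest[] capturedRequests = transport.capturedRequests();
assertThat(capturedRequests.length, equalTo(1));
assertThat(capturedRequests[0].action, equalTo(ShardStateAction.SHARD_FAILED_ACTION_NAME));
assertThat(capturedRequests[0].request, instanceOf(ShardStateAction.ShardRoutingEntry.class));
ShardStateAction.ShardRoutingEntry entry = (ShardStateAction.ShardRoutingEntry) capturedRequests[0].request;
assertEquals(shardRouting, entry.getShardRouting());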


This commit tightens the tests in o.e.c.a.s.ShardStateActionTests:
 - adds a simple test for a success condition that validates the shard
   failed request is correct and sent to the correct place
 - removes redundant assertions from the no master and master left tests
 - adds an assertion that success is not falsely indicated in the case of
   an unhandled error
@jasontedor
Member Author

  1. A simple test for success - without a retry/exceptions.
  2. Validation of the message sent to the master - we learned from bitter experience that we should test that ... :)

@bleskes Pushed a new test in efb1426.

This commit adds a trace log on a cluster state update while waiting for
a new master, and changes the log level on cluster service close to the
warn level.
This commit enhances the master channel exception test in
o.e.c.a.s.ShardStateActionTests to test that retries loop as expected
when requests to the master repeatedly fail.
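Such a test could be shaped roughly like this (a sketch reusing names from the earlier simulation snippet; it assumes the capturing transport can be cleared between iterations and that randomIntBetween comes from the test infrastructure):

// Sketch: repeatedly fail the in-flight request with a master channel
// exception, publish a fresh cluster state to simulate a new master, and
// expect exactly one resent request per iteration.
int retries = randomIntBetween(1, 5);
for (int i = 0; i < retries; i++) {
    CapturingTransport.CapturedRequest[] requests = transport.capturedRequests();
    assertThat(requests.length, equalTo(1));
    transport.clear();
    // fail the current attempt...
    transport.handleResponse(requests[0].requestId, new NotMasterException("simulated"));
    // ...then publish a new cluster state so the action retries against the new master
    clusterService.setState(ClusterState.builder(clusterService.state()));
}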
@jasontedor
Member Author

@bleskes I think this pull request is ready for another retry loop. ;)

This commit adds a sanity assertion that the cause of a transport
exception when sending a shard failure is not null.
@Override
- public void onShardFailedNoMaster() {
+ public void onSuccess() {
Contributor

can we make sure onShardFailedFailure is not called?

Member Author

Addressed in 386d2ab.

@bleskes
Contributor

bleskes commented Jan 17, 2016

LGTM. Left one little comment that doesn't need another cycle. Thanks @jasontedor !

This commit adds some additional assertions that test success is not
falsely indicated by adding assertions that success / failure methods
are not incorrectly invoked in failure / success scenarios.
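For completeness, the kind of check that commit describes, expressed against a recording listener (a sketch; onSuccess and onShardFailedFailure are the callback names mentioned in this thread, but their exact signatures here are assumptions):

// Sketch: record which callbacks fire, drive the request to a successful
// response, then assert that only onSuccess was invoked.
AtomicBoolean successCalled = new AtomicBoolean();
AtomicBoolean failureCalled = new AtomicBoolean();
ShardStateAction.Listener listener = new ShardStateAction.Listener() {
    @Override
    public void onSuccess() {
        successCalled.set(true);
    }

    @Override
    public void onShardFailedFailure(Exception e) {
        failureCalled.set(true);
    }
};
// ... fail the shard with this listener and respond successfully on the transport ...
assertTrue(successCalled.get());
assertFalse(failureCalled.get());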
jasontedor added a commit that referenced this pull request Jan 17, 2016
Wait for new master when failing shard

Relates #14252
@jasontedor jasontedor merged commit 69b21fe into elastic:master Jan 17, 2016
@jasontedor jasontedor deleted the shard-failure-no-master-retry branch January 17, 2016 15:50
@jasontedor
Member Author

Thanks for another very thorough and helpful review @bleskes.

@clintongormley clintongormley added the :Distributed Indexing/Distributed label and removed the :Cluster label Feb 13, 2018
Labels
:Distributed Indexing/Distributed >enhancement v5.0.0-alpha1
3 participants