
Wait for new master when failing shard #15748

Merged
merged 13 commits into elastic:master from jasontedor:shard-failure-no-master-retry on Jan 17, 2016
Conversation

jasontedor
Member

This commit handles the situation when we are failing a shard and either
no master is known, or the known master left while failing the shard. We
handle this situation by waiting for a new master to be reelected, and
then sending the shard failed request to the new master.

Relates #14252
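For illustration, the retry described above can be sketched as follows. This is only a sketch: waitForNewMasterAndRetry, sendShardFailed, the listener callbacks, and the warn log on cluster service close follow snippets and commit messages later in this thread, and may not match the merged code exactly.

// Sketch only: wait for the next cluster state (e.g. a new master being
// elected) and then resend the shard-failed request.
private void waitForNewMasterAndRetry(ClusterStateObserver observer, ShardRoutingEntry shardRoutingEntry, Listener listener) {
    observer.waitForNextChange(new ClusterStateObserver.Listener() {
        @Override
        public void onNewClusterState(ClusterState state) {
            // a new cluster state was published; retry against the (possibly new) master
            sendShardFailed(observer, shardRoutingEntry, listener);
        }

        @Override
        public void onClusterServiceClose() {
            // the local node is shutting down; stop retrying and report the failure
            logger.warn("node closed while waiting for new master to handle failed shard [{}]", shardRoutingEntry);
            listener.onShardFailedFailure(new NodeClosedException(clusterService.localNode()));
        }

        @Override
        public void onTimeout(TimeValue timeout) {
            // no timeout is registered while waiting for a new master (see the commits below)
            assert false : "no timeout while waiting for a new master";
        }
    });
}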

@jasontedor
Member Author

Note that this pull request contains refactorings from #15735 and #15736 that I extracted into separate pull requests.

@jasontedor jasontedor mentioned this pull request Jan 4, 2016
@jasontedor
Member Author

@bleskes This is ready for review now.

@bleskes
Contributor

bleskes commented Jan 8, 2016

Discussed with @jasontedor. Decided to try folding all channel-related retries (connection, master loss, and any unknown exception) into ShardStateAction, as opposed to dealing with that in TransportReplicationAction.
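Roughly, the agreed direction can be sketched like this (illustrative only; isMasterChannelException, waitForNewMasterAndRetry, and the listener are assumed names, not necessarily what ends up in ShardStateAction):

// Sketch: the transport response handler registered by ShardStateAction
// decides whether a failure is a master/channel problem worth retrying or a
// real failure to report back to the caller.
@Override
public void handleException(TransportException exp) {
    if (isMasterChannelException(exp)) {
        // connection failures, master loss, or a failed publish: wait for a
        // new master to be elected and resend the shard-failed request
        waitForNewMasterAndRetry(observer, shardRoutingEntry, listener);
    } else {
        // anything else is unexpected and is reported to the caller
        listener.onShardFailedFailure(exp);
    }
}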

@jasontedor
Member Author

@bleskes I pushed f49435c to implement what we have discussed.

// force a new cluster state to simulate a new master having been elected
clusterService.setState(ClusterState.builder(clusterService.state()));
transport.handleResponse(currentRequest.requestId, new NotMasterException("shard-failed-test"));
CapturingTransport.CapturedRequest[] retryRequests = transport.capturedRequests();
Contributor

We should probably have a clearAndGetCapturedRequests. We use this often.
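For illustration, such a helper could be as small as this (a hypothetical sketch, assuming CapturingTransport keeps its capturedRequests() accessor and has a way to clear the captured list):

// Hypothetical helper on CapturingTransport: hand back what was captured so
// far and reset the buffer so later assertions only see new requests.
public CapturedRequest[] clearAndGetCapturedRequests() {
    CapturedRequest[] requests = capturedRequests();
    clear();
    return requests;
}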

Member Author

I opened #15897.

@bleskes
Contributor

bleskes commented Jan 11, 2016

I think this looks great and is the right way to go. Left some comments. I'm doubtful that we need the timeout mechanism - it just adds complexity. Let's talk this one over.

Also - can we open follow-up issues to deal with shard started and to remove the retry logic from IndicesClusterService? (That will make that class simpler.)

@jasontedor
Member Author

Also - can we open follow-up issues to deal with shard started

I opened #15895.

and to remove the retry logic from IndicesClusterService? (That will make that class simpler.)

I opened #15896.

@jasontedor
Member Author

@bleskes I think this is ready for another review round.

@@ -92,8 +92,6 @@
*/
public abstract class TransportReplicationAction<Request extends ReplicationRequest, ReplicaRequest extends ReplicationRequest, Response extends ReplicationResponse> extends TransportAction<Request, Response> {

public static final String SHARD_FAILURE_TIMEOUT = "action.support.replication.shard.failure_timeout";
Contributor

Thank you.

@jasontedor
Member Author

@bleskes I pushed more commits.

@jasontedor jasontedor changed the title Wait for new master when failing shard Handle channel failures on shard state requests Jan 14, 2016
This commit handles the situation when we are failing a shard and either
no master is known, or the known master left while failing the shard. We
handle this situation by waiting for a new master to be reelected, and
then sending the shard failed request to the new master.
This commit adds a simulation of the master leaving after a shard
failure request has been sent. In this case, after a new cluster state
is published (simulating a new master having been elected), the request
to fail the shard should be retried.
This commit moves the handling of channel failures when failing a shard
to o.e.c.a.s.ShardStateAction. This means that shard failure requests
that time out or occur when there is no master or the master leaves after
the request is sent will now be retried from here. The listener for a
shard failed request will now only be notified upon successful
completion of the shard failed request, or when a catastrophic
non-channel failure occurs.
This commit removes the timeout retry mechanism from ShardStateAction
allowing it to instead be handled by the general master channel retry
mechanism. The idea is that if there is a network issue, the master will
miss a ping timeout causing the channel to be closed which will expose
itself via a NodeDisconnectedException. At this point, we can just wait
for a new master and retry, as with any other master channel exception.
This commit adds Discovery.FailedToCommitClusterStateException to the
list of channel failures that ShardStateAction handles and retries.
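Taken together, the classification these commits describe looks roughly like this (a sketch; only exception types named in this thread are listed, and the final list in the PR may differ):

// Sketch of the "master channel" classification from the commit messages above.
private static boolean isMasterChannelException(TransportException exp) {
    Throwable cause = exp.getCause();
    return cause instanceof NotMasterException
            || cause instanceof Discovery.FailedToCommitClusterStateException
            || cause instanceof NodeDisconnectedException;
}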
@jasontedor jasontedor changed the title Handle channel failures on shard state requests Wait for new master when failing shard Jan 14, 2016
observer.waitForNextChange(new ClusterStateObserver.Listener() {
@Override
public void onNewClusterState(ClusterState state) {
sendShardFailed(observer, shardRoutingEntry, listener);
Contributor

can we add a trace log here?

Member Author

I pushed fe39d11.
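For reference, the requested trace logging could be as simple as the following (a sketch; the actual message in fe39d11 may differ):

@Override
public void onNewClusterState(ClusterState state) {
    // trace the state that ended the wait before retrying
    if (logger.isTraceEnabled()) {
        logger.trace("new cluster state [{}] after waiting for master election to fail shard [{}]", state.prettyPrint(), shardRoutingEntry);
    }
    sendShardFailed(observer, shardRoutingEntry, listener);
}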

@bleskes
Contributor

bleskes commented Jan 15, 2016

Thanks @jasontedor. Looks great. I left some comments w.r.t. testing. A couple of things are missing in that area:

  1. A simple test for success - without a retry/exceptions.
  2. Validation of the message sent to the master - we learned from bitter experience that we should test that ... :)

@jasontedor
Member Author

Validation of the message sent to the master - we learned from bitter experience that we should test that

@bleskes Can you elaborate what you mean? I have some commits locally I was planning for another PR (checking if the shard is in the routing table, checking if the shard is a replica not being failed by the primary, checking if the shard is a primary not being failed by the owning node) but want to be sure we're thinking the same thing.

@bleskes
Contributor

bleskes commented Jan 15, 2016

I meant that we should check the message is the right class and that it contains the shard routing we gave it.
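In test terms, that validation could look something like this (a sketch building on the CapturingTransport snippet earlier in the thread; the constant and accessor names here are assumptions):

// Sketch: the captured request should target the shard-failed action, be of
// the expected request class, and carry the shard routing the test handed in.
CapturingTransport.CapturedRequest[] capturedRequests = transport.capturedRequests();
assertThat(capturedRequests.length, equalTo(1));
assertThat(capturedRequests[0].action, equalTo(ShardStateAction.SHARD_FAILED_ACTION_NAME));
assertThat(capturedRequests[0].request, instanceOf(ShardStateAction.ShardRoutingEntry.class));
ShardStateAction.ShardRoutingEntry entry = (ShardStateAction.ShardRoutingEntry) capturedRequests[0].request;
assertEquals(shardRouting, entry.getShardRouting());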


This commit tightens the tests in o.e.c.a.s.ShardStateActionTests:
 - adds a simple test for a success condition that validates the shard
   failed request is correct and sent to the correct place
 - removes redundant assertions from the no master and master left tests
 - adds an assertion that success is not falsely indicated in the case of
   an unhandled error
@jasontedor
Member Author

  1. A simple test for success - without a retry/exceptions.
  2. Validation of the message sent to the master - we learned from bitter experience that we should test that ... :)

@bleskes Pushed a new test in efb1426.

This commit adds a trace log on a cluster state update while waiting for
a new master, and changes the log level on cluster service close to the
warn level.
This commit enhances the master channel exception test in
o.e.c.a.s.ShardStateActionTests to test that retries loop as expected
when requests to the master repeatedly fail.
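Such a test could be shaped roughly like this (a sketch reusing names from the earlier simulation snippet; it assumes the capturing transport can be cleared between iterations and that randomIntBetween comes from the test infrastructure):

// Sketch: repeatedly fail the in-flight request with a master channel
// exception, publish a fresh cluster state to simulate a new master, and
// expect exactly one resent request per iteration.
int retries = randomIntBetween(1, 5);
for (int i = 0; i < retries; i++) {
    CapturingTransport.CapturedRequest[] requests = transport.capturedRequests();
    assertThat(requests.length, equalTo(1));
    transport.clear();
    // fail the current attempt...
    transport.handleResponse(requests[0].requestId, new NotMasterException("simulated"));
    // ...then publish a new cluster state so the action retries against the new master
    clusterService.setState(ClusterState.builder(clusterService.state()));
}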
@jasontedor
Member Author

@bleskes I think this pull request is ready for another retry loop. ;)

This commit adds a sanity assertion that the cause of a transport
exception when sending a shard failure is not null.
@Override
- public void onShardFailedNoMaster() {
+ public void onSuccess() {
Contributor

can we make sure onShardFailedFailure is not called?

Member Author

Addressed in 386d2ab.

@bleskes
Contributor

bleskes commented Jan 17, 2016

LGTM. Left one little comment that doesn't need another cycle. Thanks @jasontedor !

This commit adds some additional assertions that test success is not
falsely indicated by adding assertions that success / failure methods
are not incorrectly invoked in failure / success scenarios.
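For completeness, the kind of check that commit describes, expressed against a recording listener (a sketch; onSuccess and onShardFailedFailure are the callback names mentioned in this thread, but their exact signatures here are assumptions):

// Sketch: record which callbacks fire, drive the request to a successful
// response, then assert that only onSuccess was invoked.
AtomicBoolean successCalled = new AtomicBoolean();
AtomicBoolean failureCalled = new AtomicBoolean();
ShardStateAction.Listener listener = new ShardStateAction.Listener() {
    @Override
    public void onSuccess() {
        successCalled.set(true);
    }

    @Override
    public void onShardFailedFailure(Exception e) {
        failureCalled.set(true);
    }
};
// ... fail the shard with this listener and respond successfully on the transport ...
assertTrue(successCalled.get());
assertFalse(failureCalled.get());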
jasontedor added a commit that referenced this pull request Jan 17, 2016
Wait for new master when failing shard

Relates #14252
@jasontedor jasontedor merged commit 69b21fe into elastic:master Jan 17, 2016
@jasontedor jasontedor deleted the shard-failure-no-master-retry branch January 17, 2016 15:50
@jasontedor
Member Author

Thanks for another very thorough and helpful review @bleskes.

@clintongormley clintongormley added the :Distributed Indexing/Distributed label and removed the :Cluster label Feb 13, 2018
Labels
:Distributed Indexing/Distributed >enhancement v5.0.0-alpha1
3 participants