Wait for new master when failing shard #15748
Conversation
@bleskes This is ready for review now.
Discussed with @jasontedor. Decided to try folding all channel-related retries (connection, master loss, and any unknown exception) into ShardStateAction, as opposed to dealing with that in TransportReplicationAction.
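For illustration, here is a rough sketch of what folding those retries into ShardStateAction could look like. It is only a sketch: helper and callback names such as sendShardFailed, waitForNewMasterAndRetry, isMasterChannelException, onSuccess, and onShardFailedFailure are assumptions for the purpose of the example, not necessarily the final code.

```java
// Sketch only: names and signatures are assumptions, not the final implementation.
private void sendShardFailed(ClusterStateObserver observer, ShardRoutingEntry shardRoutingEntry, Listener listener) {
    final DiscoveryNode masterNode = observer.observedState().nodes().masterNode();
    if (masterNode == null) {
        // no known master: wait for a new one to be elected, then resend
        waitForNewMasterAndRetry(observer, shardRoutingEntry, listener);
        return;
    }
    transportService.sendRequest(masterNode, SHARD_FAILED_ACTION_NAME, shardRoutingEntry,
        new EmptyTransportResponseHandler(ThreadPool.Names.SAME) {
            @Override
            public void handleResponse(TransportResponse.Empty response) {
                listener.onSuccess();
            }

            @Override
            public void handleException(TransportException exp) {
                if (isMasterChannelException(exp)) {
                    // connection loss, master loss, or publish failure: wait for a new master and retry
                    waitForNewMasterAndRetry(observer, shardRoutingEntry, listener);
                } else {
                    // catastrophic, non-channel failure: give up and notify the listener
                    listener.onShardFailedFailure(masterNode, exp);
                }
            }
        });
}
```

The ClusterStateObserver snippet shown further down in the diff is essentially the waitForNewMasterAndRetry piece of this flow.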
// force a new cluster state to simulate a new master having been elected
clusterService.setState(ClusterState.builder(clusterService.state()));
transport.handleResponse(currentRequest.requestId, new NotMasterException("shard-failed-test"));
CapturingTransport.CapturedRequest[] retryRequests = transport.capturedRequests();
we should probably have a clearAndGetCapturedRequests. We use this often.
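A hypothetical sketch of such a helper; the exact method name and whether CapturingTransport exposes a clear() method are assumptions here, not the actual API.

```java
// Hypothetical helper for CapturingTransport -- not the actual API.
public CapturedRequest[] clearAndGetCapturedRequests() {
    CapturedRequest[] requests = capturedRequests(); // snapshot everything captured so far
    clear();                                         // reset so later assertions only see new requests
    return requests;
}
```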
I opened #15897.
I think this looks great and is the right way to go. Left some comments. I'm doubtful we need the timeout mechanism - it just adds complexity. Let's talk this one over. Also - can we open follow-up issues to deal with shard started and remove the retry logic from IndicesClusterService? (That will make that class simpler.)
@bleskes I think this is ready for another, final review round.
@@ -92,8 +92,6 @@
 */
public abstract class TransportReplicationAction<Request extends ReplicationRequest, ReplicaRequest extends ReplicationRequest, Response extends ReplicationResponse> extends TransportAction<Request, Response> {

public static final String SHARD_FAILURE_TIMEOUT = "action.support.replication.shard.failure_timeout";
Thank you.
@bleskes I pushed more commits.
This commit handles the situation when we are failing a shard and either no master is known, or the known master left while failing the shard. We handle this situation by waiting for a new master to be reelected, and then sending the shard failed request to the new master.
This commit adds a simulation of the master leaving after a shard failure request has been sent. In this case, after a new cluster state is published (simulating a new master having been elected), the request to fail the shard should be retried.
This commit moves the handling of channel failures when failing a shard to o.e.c.a.s.ShardStateAction. This means that shard failure requests that time out, that are sent when there is no known master, or whose target master leaves after the request is sent will now be retried from here. The listener for a shard failed request will now only be notified upon successful completion of the shard failed request, or when a catastrophic non-channel failure occurs.
This commit removes the timeout retry mechanism from ShardStateAction, allowing it to instead be handled by the general master channel retry mechanism. The idea is that if there is a network issue, a ping to the master will time out, causing the channel to be closed, which surfaces as a NodeDisconnectedException. At this point, we can just wait for a new master and retry, as with any other master channel exception.
This commit adds Discovery.FailedToCommitClusterStateException to the list of channel failures that ShardStateAction handles and retries.
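Taken together, these commits make ShardStateAction treat a specific set of exceptions as "master channel" failures that warrant waiting for a new master and retrying. A sketch of what that classification might look like, with the constant and helper names assumed for illustration and ExceptionsHelper.unwrap used to look through wrapped transport exceptions:

```java
// Sketch of the channel-failure classification described above; names are assumptions.
private static final Class<?>[] MASTER_CHANNEL_EXCEPTIONS = new Class<?>[]{
    NotMasterException.class,                            // the node we sent the request to is no longer master
    NodeDisconnectedException.class,                     // the channel to the master was closed (e.g. after a ping timeout)
    Discovery.FailedToCommitClusterStateException.class  // the master could not publish the resulting cluster state
};

private static boolean isMasterChannelException(TransportException exp) {
    // retry (by waiting for a new master) only if the cause is one of the known channel failures
    return ExceptionsHelper.unwrap(exp, MASTER_CHANNEL_EXCEPTIONS) != null;
}
```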
observer.waitForNextChange(new ClusterStateObserver.Listener() {
    @Override
    public void onNewClusterState(ClusterState state) {
        sendShardFailed(observer, shardRoutingEntry, listener);
can we add a trace log here?
I pushed fe39d11.
Thanks @jasontedor. Looks great. I left some comments w.r.t. testing. A couple of things I miss in that area:
@bleskes Can you elaborate on what you mean?
I meant that we should check the message is the right class and that it contains the shard routing we gave it.
This commit tightens the tests in o.e.c.a.s.ShardStateActionTests:
- adds a simple test for a success condition that validates the shard failed request is correct and sent to the correct place
- removes redundant assertions from the no master and master left tests
- adds an assertion that success is not falsely indicated in the case of an unhandled error
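As an illustration of the first point, the success-path assertions might look roughly like the following; the request class name ShardRoutingEntry and its accessor are assumptions based on the discussion above, not confirmed API.

```java
// Sketch of the success-path assertions; class and accessor names are assumptions.
CapturingTransport.CapturedRequest[] capturedRequests = transport.capturedRequests();
assertEquals(1, capturedRequests.length);
// the request should be sent to the current master
assertEquals(clusterService.state().nodes().masterNode(), capturedRequests[0].node);
// and carry the shard routing we asked to fail
assertThat(capturedRequests[0].request, instanceOf(ShardStateAction.ShardRoutingEntry.class));
ShardStateAction.ShardRoutingEntry entry = (ShardStateAction.ShardRoutingEntry) capturedRequests[0].request;
assertEquals(shardRouting, entry.getShardRouting());
```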
This commit adds a trace log on a cluster state update while waiting for a new master, and changes the log level on cluster service close to the warn level.
This commit enhances the master channel exception test in o.e.c.a.s.ShardStateActionTests to test that retries loop as expected when requests to the master repeatedly fail.
@bleskes I think this pull request is ready for another retry loop. ;)
This commit adds a sanity assertion that the cause of a transport exception when sending a shard failure is not null.
@Override
public void onShardFailedNoMaster() {
public void onSuccess() {
can we make sure onShardFailedFailure is not called?
Addressed in 386d2ab.
LGTM. Left one little comment that doesn't need another cycle. Thanks @jasontedor!
This commit adds additional assertions that success is not falsely indicated: the success and failure callbacks are now asserted not to be invoked in the failure and success scenarios, respectively.
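A sketch of the kind of cross-check meant here: record which callback fired and assert that the opposite one did not. The listener method signatures are assumptions based on the names used earlier in this discussion.

```java
// Sketch only: listener method names and signatures are assumptions.
final AtomicBoolean successCalled = new AtomicBoolean();
final AtomicReference<Throwable> failureCalled = new AtomicReference<>();

ShardStateAction.Listener listener = new ShardStateAction.Listener() {
    @Override
    public void onSuccess() {
        successCalled.set(true);
    }

    @Override
    public void onShardFailedFailure(DiscoveryNode master, TransportException e) {
        failureCalled.set(e);
    }
};

// drive the success scenario (the master acknowledges the shard-failed request), then:
assertTrue(successCalled.get());
assertNull("failure callback must not fire in the success scenario", failureCalled.get());
```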
Wait for new master when failing shard
Relates #14252
Thanks for another very thorough and helpful review @bleskes.