Fix Overly Aggressive Request DeDuplication #51270
Conversation
On master failover we have to resend all the shard failed messages, but the transport requests remain the same in the eyes of `equals`. If the master failover is registered and the requests to the new master are sent before all the callbacks have executed and the requests to the old master have been removed from the deduplicator, then the requests to the new master will incorrectly fail and the snapshot gets stuck. Closes elastic#51253
Pinging @elastic/es-distributed (:Distributed/Snapshot/Restore)
This was caused by actually fixing the deduplicator in #50788.
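To make the failure mode concrete, here is a minimal, self-contained sketch of an `equals`-keyed deduplicator; it is not the actual `TransportRequestDeduplicator`, and all names and signatures are illustrative only. While a request is still tracked, an equal re-send is silently dropped, which is what happens to the messages re-sent to the new master:

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.BiConsumer;

// Minimal sketch of an equals-keyed request deduplicator. Not the real Elasticsearch
// class; it only illustrates why a re-send of an equal request is dropped while the
// original request is still tracked.
final class RequestDeduplicatorSketch<T> {

    private final Map<T, Boolean> inFlight = new ConcurrentHashMap<>();

    /**
     * Runs {@code send} only if no equal request is currently tracked. The request stays
     * tracked until the completion callback handed to {@code send} is invoked.
     */
    void executeOnce(T request, BiConsumer<T, Runnable> send) {
        if (inFlight.putIfAbsent(request, Boolean.TRUE) == null) {
            send.accept(request, () -> inFlight.remove(request));
        }
        // else: an equal request is still in flight, so this invocation is a no-op --
        // the re-send to the new master would be swallowed here.
    }

    /** Forget all tracked requests, e.g. after a master failover. */
    void clear() {
        inFlight.clear();
    }
}
```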
 * Remove all tracked requests from this instance so that the first time {@link #executeOnce} is invoked with any request it triggers
 * an actual request execution. Use this e.g. for requests to master that need to be sent again on master failover.
 */
public void clear() {
I know there's just one use case for this now, but just dropping all deduplication seems to be the right approach to master failovers in general if we ever want to reuse this for e.g. put mapping calls or so.
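As a usage illustration (still using the hypothetical `RequestDeduplicatorSketch` from above, with `System.out` standing in for the transport call), clearing the tracker on failover is what lets the second `executeOnce` with an equal request actually run:

```java
// Illustrative failover flow using the RequestDeduplicatorSketch defined above.
public class FailoverFlowSketch {
    public static void main(String[] args) {
        RequestDeduplicatorSketch<String> dedup = new RequestDeduplicatorSketch<>();

        // First attempt: sent to the old master; its completion callback has not run yet.
        dedup.executeOnce("shard-0-failed", (req, done) ->
            System.out.println("sending " + req + " to old master"));

        // Master failover is observed before the old request is removed from the tracker.
        // Without clear(), the next call would be silently dropped and the snapshot would hang.
        dedup.clear();

        // The re-send to the new master now actually executes.
        dedup.executeOnce("shard-0-failed", (req, done) ->
            System.out.println("re-sending " + req + " to new master"));
    }
}
```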
LGTM.
Thanks Nhat!
@@ -358,6 +358,9 @@ private void syncShardStatsOnNewMaster(ClusterChangedEvent event) {
            return;
        }

        // Clear request deduplicator since we need to send all requests that were potentially not handled by the previous
        // master again
        remoteFailedRequestDeduplicator.clear();
I don't understand why this is needed. `UpdateSnapshotStatusAction` is a `TransportMasterNodeAction` and should resend the request in case of master failover.
I think currently we're only retrying on `FailedToCommitClusterStateException` and `NotMasterException`, but not on things like the node closed exception. That's why we have the manual retry/re-send of these messages in `SnapshotShardsService` in the first place, don't we?
I'm confused by this stack trace. It shows that the node that's sending the request is closed, no? Why would we want to retry sending a request from a node that's shut down?
Sorry, I pasted the wrong trace:
1> [2020-01-21T11:49:53,151][WARN ][o.e.s.SnapshotShardsService] [node_td3] [test-repo:test-snap/v_2yOVZsQHu6gk_aFsmkKg] [ShardSnapshotStatus[state=SUCCESS, nodeId=8MquPimwRTCtgnobCaxJlw, reason=null, generation=T-25oH4kSg62xXEtINmviQ]] failed to update snapshot state
1> org.elasticsearch.transport.RemoteTransportException: [node_tm0][127.0.0.1:41909][internal:cluster/snapshot/update_snapshot_status]
1> Caused by: org.elasticsearch.node.NodeClosedException: node closed {node_tm0}{VGisb-JmTiOkIc_vJP8FhA}{Ucpv57LoTEy1J_cuDT1L8A}{127.0.0.1}{127.0.0.1:41909}{im}
1> at org.elasticsearch.action.support.master.TransportMasterNodeAction$AsyncSingleAction$2.onClusterServiceClose(TransportMasterNodeAction.java:213) ~[main/:?]
1> at org.elasticsearch.cluster.ClusterStateObserver$ContextPreservingListener.onClusterServiceClose(ClusterStateObserver.java:318) ~[main/:?]
1> at org.elasticsearch.cluster.ClusterStateObserver$ObserverClusterStateListener.onClose(ClusterStateObserver.java:237) ~[main/:?]
1> at org.elasticsearch.cluster.service.ClusterApplierService.doStop(ClusterApplierService.java:186) ~[main/:?]
1> at org.elasticsearch.common.component.AbstractLifecycleComponent.stop(AbstractLifecycleComponent.java:79) ~[main/:?]
1> at org.elasticsearch.cluster.service.ClusterService.doStop(ClusterService.java:96) ~[main/:?]
1> at org.elasticsearch.common.component.AbstractLifecycleComponent.stop(AbstractLifecycleComponent.java:79) ~[main/:?]
1> at org.elasticsearch.node.Node.stop(Node.java:805) ~[main/:?]
1> at org.elasticsearch.node.Node.close(Node.java:829) ~[main/:?]
1> at org.elasticsearch.test.InternalTestCluster$NodeAndClient.close(InternalTestCluster.java:954) ~[framework-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
1> at org.elasticsearch.test.InternalTestCluster.stopNodesAndClients(InternalTestCluster.java:1573) ~[framework-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
1> at org.elasticsearch.test.InternalTestCluster.stopNodesAndClient(InternalTestCluster.java:1563) ~[framework-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
1> at org.elasticsearch.test.InternalTestCluster.stopCurrentMasterNode(InternalTestCluster.java:1493) ~[framework-8.0.0-SNAPSHOT.jar:8.0.0-SNAPSHOT]
1> at org.elasticsearch.snapshots.DedicatedClusterSnapshotRestoreIT.testMasterShutdownDuringSnapshot(DedicatedClusterSnapshotRestoreIT.java:849) ~[test/:?]
1> at jdk.internal.reflect.NativeMethodAccessorImpl.invoke0(Native Method) ~[?:?]
1> at jdk.internal.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62) ~[?:?]
1> at jdk.internal.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43) ~[?:?]
1> at java.lang.reflect.Method.invoke(Method.java:566) ~[?:?]
1> at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750) ~[randomizedtesting-runner-2.7.4.jar:?]
1> at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:938) ~[randomizedtesting-runner-2.7.4.jar:?]
1> at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:974) ~[randomizedtesting-runner-2.7.4.jar:?]
1> at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988) ~[randomizedtesting-runner-2.7.4.jar:?]
We have the same for the remote end as well. I'm not 100% sure we can get that exception outside of a test (because of the shutdown order), but as you can see, at least in the test the requests are clearly not retried when the master node `tm_0` restarts and just go to the listener's `onFailure`. We could improve the master action to retry this case as well, but as long as we have the fail-safe retry on master failover in the `SnapshotShardsService` already, we should make that actually work?
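For reference, a minimal dependency-free sketch of the retry behaviour being discussed; exception types are matched by simple name here so the snippet compiles on its own, whereas the real `TransportMasterNodeAction` uses `instanceof` checks on the unwrapped cause, and all names are illustrative:

```java
import java.util.Set;

// Sketch of the current retry decision: only master-level coordination failures lead to
// waiting for the next cluster state and re-sending; anything else, including a
// NodeClosedException coming back from a master that is shutting down, goes straight to
// the listener's onFailure.
final class MasterRetrySketch {

    private static final Set<String> RETRYABLE = Set.of(
        "NotMasterException",
        "FailedToCommitClusterStateException");

    static boolean retryOnNextClusterState(Throwable failure) {
        Throwable cause = failure.getCause() != null ? failure.getCause() : failure;
        return RETRYABLE.contains(cause.getClass().getSimpleName());
    }
}
```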
This looks like a bug in `TransportMasterNodeAction`, then. In the case where a node is shut down, we first shut down the cluster service and only then the transport service. This means that a node that has sent a `MasterNodeRequest` to a master node that is in the process of shutting down can receive a `TransportException` with the cause being a `NodeClosedException`, for which we don't trigger a retry (we only do that for a `ConnectTransportException`). Can we fix that instead of this hack here (which also fixes this situation for any other master-level request)? Long term, the logic in `syncShardStatsOnNewMaster` can go away. I think it's mainly still here because the `ShardSnapshotStatus` action was not a `TransportMasterNodeAction` in some older versions.
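If a fix went in that direction, the sketch above might roughly change to also treat the master's `NodeClosedException` as retryable (replacing `retryOnNextClusterState` in the earlier sketch; purely illustrative, not the actual `TransportMasterNodeAction` code):

```java
// Possible shape of the suggested fix (illustrative only): a TransportException whose
// cause is the remote master's NodeClosedException is also treated as "wait for a new
// master and retry" instead of going straight to onFailure.
static boolean retryOnNextClusterState(Throwable failure) {
    Throwable cause = failure.getCause() != null ? failure.getCause() : failure;
    String name = cause.getClass().getSimpleName();
    return name.equals("NotMasterException")
        || name.equals("FailedToCommitClusterStateException")
        || name.equals("NodeClosedException"); // the case from the stack trace above
}
```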
@ywelsch sounds good, I'll look into fixing the `TransportMasterNodeAction` and removing the retry in the snapshot shards service shortly :) Still, so long as we have `syncShardStatsOnNewMaster`, I feel like we should have this hack in place, as without it that method doesn't really make sense?