Fix TransportMasterNodeAction not Retrying NodeClosedException #51325
Conversation
Added node closed exception to the retryable remote exceptions.
Pinging @elastic/es-distributed (:Distributed/Network)
@@ -178,7 +178,7 @@ protected void doStart(ClusterState clusterState) {
     @Override
     public void handleException(final TransportException exp) {
         Throwable cause = exp.unwrapCause();
-        if (cause instanceof ConnectTransportException) {
+        if (cause instanceof ConnectTransportException || cause instanceof NodeClosedException) {
Can you write an integration test that shows that this solves an issue? Perhaps a test that restarts the current master while concurrently sending some TransportMasterNodeReadAction (e.g. cluster health).
I'm surprised that we have not seen this issue in any of our integration tests, where we sometimes restart nodes (but perhaps not while concurrently issuing master-level requests).
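For readers following along, a minimal sketch of the kind of test being suggested here (this is not the test actually added in 3378d3d; the class and method names are made up, and ESIntegTestCase cluster-bootstrapping details are elided):

```java
import org.elasticsearch.test.ESIntegTestCase;
import org.elasticsearch.test.InternalTestCluster;

import java.util.concurrent.atomic.AtomicBoolean;
import java.util.concurrent.atomic.AtomicReference;

public class MasterRestartRetryIT extends ESIntegTestCase {

    public void testClusterHealthWhileMasterRestarts() throws Exception {
        internalCluster().startMasterOnlyNodes(3);
        final String dataNode = internalCluster().startDataOnlyNode();
        final String masterName = internalCluster().getMasterName();

        final AtomicBoolean stop = new AtomicBoolean();
        final AtomicReference<Exception> failure = new AtomicReference<>();
        final Thread healthRequests = new Thread(() -> {
            // keep firing a master-level read action (cluster health) while the
            // master restarts; with the fix these should be retried on the new
            // master instead of surfacing a NodeClosedException to the client
            while (stop.get() == false) {
                try {
                    client(dataNode).admin().cluster().prepareHealth().get();
                } catch (Exception e) {
                    failure.set(e);
                }
            }
        });
        healthRequests.start();

        internalCluster().restartNode(masterName, new InternalTestCluster.RestartCallback());

        stop.set(true);
        healthRequests.join();
        assertNull("master-level request failed during master restart", failure.get());
    }
}
```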
There we go: 3378d3d
Bit of a dirty test but reliably reproduces the issue without any other hacks :)
I've researched this a little and the problem seems to be isolated to 7.x+. When shutting down a master there is a short window during which:
- We will get a cluster state without a master node from org.elasticsearch.cluster.coordination.Coordinator#clusterStateWithNoMasterBlock
- That then causes the master action on the current master node to fail with NotMasterException and add the cluster state observer for the retry, waiting for the new master. That new master never comes; instead the wait is shut down with the NodeClosedException when the cluster service is shut down
- The client receives NodeClosedException back from the master

=> this fix still seems fine: if we retry on NodeClosedException, because we can interpret it as the master node going away, we're good IMO.
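For context, the wait that gets shut down here is the cluster-state observer registered by the retry path in TransportMasterNodeAction. Roughly paraphrased (not a verbatim copy of the production code; observer, listener, clusterService and statePredicate come from the surrounding class), it looks like the snippet below, which is why the waiting client ends up seeing a NodeClosedException:

```java
// rough paraphrase of TransportMasterNodeAction's retry(): wait for a cluster
// state with a (new) master and re-dispatch, but if the local cluster service
// is closed while waiting, fail the request with a NodeClosedException
observer.waitForNextChange(new ClusterStateObserver.Listener() {
    @Override
    public void onNewClusterState(ClusterState state) {
        doStart(state); // a new master showed up, try again
    }

    @Override
    public void onClusterServiceClose() {
        // the node is shutting down: this is the NodeClosedException the client receives
        listener.onFailure(new NodeClosedException(clusterService.localNode()));
    }

    @Override
    public void onTimeout(TimeValue timeout) {
        listener.onFailure(new MasterNotDiscoveredException());
    }
}, statePredicate);
```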
> the problem seems to be isolated to 7.x+

Does that also include 7.0, 7.1, ...? I would like to understand why this is not an issue in earlier versions.
For snapshots this only started showing up because I inadvertently disabled the snapshot shards service's own retry in #50788. I can also reproduce this failure back to 7.1, though interestingly enough it seems to be a lot less likely in 7.1 than in 7.6 (maybe that's due to some locking we removed from the internal test cluster operations, but I can't tell right now).
I pushed 646823d to slow things down a little more and make the test reproduce more reliably. Without this change it takes a pretty large number of iterations for it to fail on 7.1.
I've marked this as a bug, as it prevents proper failover behavior in case of node restarts.
…shot-shard-status-updates
Thanks Armin, left a single comment, otherwise looking good.
@@ -178,7 +178,7 @@ protected void doStart(ClusterState clusterState) {
     @Override
     public void handleException(final TransportException exp) {
         Throwable cause = exp.unwrapCause();
-        if (cause instanceof ConnectTransportException) {
+        if (cause instanceof ConnectTransportException || cause instanceof NodeClosedException) {
I think we should separate out the cases where the NodeClosedException originates on the local node (for instance the case where the transport stopped, which is also relayed as a NodeClosedException). There should be no need to retry those.
Adding a check that the NodeClosedException is wrapped in a RemoteTransportException would ensure we only retry in true client cases.
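A minimal sketch of what that stricter check could look like, assuming the handler context from the diff above (retry, masterChangePredicate and listener come from the surrounding production code and are not shown here):

```java
@Override
public void handleException(final TransportException exp) {
    Throwable cause = exp.unwrapCause();
    // retry on connection problems, and on NodeClosedException only when it was
    // thrown on the remote master (i.e. arrived wrapped in a RemoteTransportException);
    // a locally generated NodeClosedException means this node itself is shutting
    // down, so retrying would be pointless
    if (cause instanceof ConnectTransportException
            || (exp instanceof RemoteTransportException && cause instanceof NodeClosedException)) {
        retry(cause, masterChangePredicate);
    } else {
        listener.onFailure(exp);
    }
}
```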
Thanks Henning, I pushed 1d07a26 :)
AFAICS, TransportException is not an ElasticsearchWrapperException (in contrast to RemoteTransportException), and therefore unwrapCause will not yield the NodeClosedException in case of a local transport exception.
This means that the code as before would have been fine, I think.
I think we can run into a SendRequestTransportException (which is an ElasticsearchWrapperException) wrapping a NodeClosedException?
^^ seems to have happened here https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+pull-request-1/14196/consoleText
1> [2020-01-21T11:51:16,833][WARN ][o.e.s.SnapshotShardsService] [node_td4] [test-repo:test-snap/SQinm62PQkmhonsriaGPUw] [ShardSnapshotStatus[state=FAILED, nodeId=ULGA-lGmRfiN3aM53N3qqg, reason=[test-idx/6AougmeaSJCvxG4mFvfRZA][[test-idx][11]] org.elasticsearch.index.snapshots.IndexShardSnapshotFailedException: Aborted
1> at org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:1501)
1> at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:341)
1> at org.elasticsearch.snapshots.SnapshotShardsService.lambda$startNewShards$1(SnapshotShardsService.java:287)
1> at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:629)
1> at java.base/java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
1> at java.base/java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
1> at java.base/java.lang.Thread.run(Thread.java:834)
1> , generation=null]] failed to update snapshot state
1> org.elasticsearch.transport.SendRequestTransportException: [node_td4][127.0.0.1:32974][internal:cluster/snapshot/update_snapshot_status]
1> at org.elasticsearch.transport.TransportService.sendRequestInternal(TransportService.java:644) ~[main/:?]
1> at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:545) ~[main/:?]
1> at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:520) ~[main/:?]
1> at org.elasticsearch.snapshots.SnapshotShardsService.lambda$sendSnapshotShardUpdate$2(SnapshotShardsService.java:492) ~[main/:?]
1> at org.elasticsearch.transport.TransportRequestDeduplicator.executeOnce(TransportRequestDeduplicator.java:52) ~[main/:?]
1> at org.elasticsearch.snapshots.SnapshotShardsService.sendSnapshotShardUpdate(SnapshotShardsService.java:478) ~[main/:?]
1> at org.elasticsearch.snapshots.SnapshotShardsService.notifyFailedSnapshotShard(SnapshotShardsService.java:472) ~[main/:?]
1> at org.elasticsearch.snapshots.SnapshotShardsService$1.onFailure(SnapshotShardsService.java:305) ~[main/:?]
1> at org.elasticsearch.action.ActionListener$5.onFailure(ActionListener.java:256) ~[main/:?]
1> at org.elasticsearch.repositories.blobstore.BlobStoreRepository.snapshotShard(BlobStoreRepository.java:1624) ~[main/:?]
1> at org.elasticsearch.snapshots.SnapshotShardsService.snapshot(SnapshotShardsService.java:341) ~[main/:?]
1> at org.elasticsearch.snapshots.SnapshotShardsService.lambda$startNewShards$1(SnapshotShardsService.java:287) ~[main/:?]
1> at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingRunnable.run(ThreadContext.java:629) ~[main/:?]
1> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128) [?:?]
1> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628) [?:?]
1> at java.lang.Thread.run(Thread.java:834) [?:?]
1> Caused by: org.elasticsearch.node.NodeClosedException: node closed {node_td4}{ULGA-lGmRfiN3aM53N3qqg}{MhSHeUC3SG2Qv9JTAFNk9Q}{127.0.0.1}{127.0.0.1:32974}{di}
1> at org.elasticsearch.transport.TransportService.sendRequestInternal(TransportService.java:626) ~[main/:?]
1> ... 15 more
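To make the wrapping behavior discussed above concrete, here is a hedged, self-contained sketch (not part of the PR; constructor choices are simplified, run with -ea to enable the asserts) showing which transport exceptions unwrapCause actually unwraps:

```java
import java.net.InetAddress;

import org.elasticsearch.Version;
import org.elasticsearch.cluster.node.DiscoveryNode;
import org.elasticsearch.common.transport.TransportAddress;
import org.elasticsearch.node.NodeClosedException;
import org.elasticsearch.transport.RemoteTransportException;
import org.elasticsearch.transport.SendRequestTransportException;
import org.elasticsearch.transport.TransportException;

public class UnwrapCauseDemo {
    public static void main(String[] args) {
        DiscoveryNode node = new DiscoveryNode(
            "node_td4", new TransportAddress(InetAddress.getLoopbackAddress(), 9300), Version.CURRENT);

        // a plain TransportException is not an ElasticsearchWrapperException,
        // so unwrapCause() returns the exception itself
        TransportException local = new TransportException("transport stopped");
        assert local.unwrapCause() == local;

        // RemoteTransportException is a wrapper, so the NodeClosedException thrown
        // on the remote master is what the handler sees after unwrapping
        RemoteTransportException remote =
            new RemoteTransportException("master shutting down", new NodeClosedException(node));
        assert remote.unwrapCause() instanceof NodeClosedException;

        // SendRequestTransportException is also a wrapper, which is how a locally
        // generated NodeClosedException (node closed while sending) can surface
        // through unwrapCause() as well, as in the CI log above
        SendRequestTransportException send = new SendRequestTransportException(
            node, "internal:cluster/snapshot/update_snapshot_status", new NodeClosedException(node));
        assert send.unwrapCause() instanceof NodeClosedException;
    }
}
```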
alright, the tests and you win
LGTM
Thanks Yannick + Henning!
…ic#51325) Added node closed exception to the retryable remote exceptions as it's possible to run into this exception instead of a connect exception when the master node is just shutting down but still responding to requests.
…ic#51325) Added node closed exception to the retryable remote exceptions as it's possible to run into this exception instead of a connect exception when the master node is just shutting down but still responding to requests.
… (#51437) Added node closed exception to the retryable remote exceptions as it's possible to run into this exception instead of a connect exception when the master node is just shutting down but still responding to requests.
… (#51438) Added node closed exception to the retryable remote exceptions as it's possible to run into this exception instead of a connect exception when the master node is just shutting down but still responding to requests.
Added node closed exception to the retryable remote exceptions.
Follow-up to the discussion in #51270 (review).
I'd still like to keep the workaround from #51270 though, just in case we missed a spot. Otherwise any exception on the failed master that arrives with a delay and is not retried will deadlock the snapshot.
I'll open a follow-up to deal with the possibility of this happening due to the circuit breaker killing the status update on master.