
MlDistributedFailureIT.testCloseUnassignedJobAndDatafeed fails with NodeNotConnectedException #43670

Closed
alpar-t opened this issue Jun 27, 2019 · 6 comments · Fixed by #49191
Labels
:ml (Machine learning), >test-failure (Triaged test failures from CI), v8.0.0-alpha1

Comments

@alpar-t
Contributor

alpar-t commented Jun 27, 2019

Example build failure

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+intake/370/console
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=oraclelinux-6/87/console

And quite a few PR checks.

https://scans.gradle.com/s/bsbkz6io7ysno/tests/lf2lfu4ufazso-jxctggmo7ue4i

Reproduction line

Does not reproduce locally.

./gradlew :x-pack:plugin:ml:internalClusterTest --tests "org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT.testCloseUnassignedJobAndDatafeed" -Dtests.seed=8DE3FEE00F9B4146 -Dtests.security.manager=true -Dtests.locale=bem-ZM -Dtests.timezone=Australia/Sydney -Dcompiler.java=12 -Druntime.java=12

Example relevant log:

org.elasticsearch.action.FailedNodeException: Failed node [itGQaU1qT_mC05foNcu0qA]
at org.elasticsearch.action.support.tasks.TransportTasksAction$AsyncAction.onFailure(TransportTasksAction.java:308)
at org.elasticsearch.action.support.tasks.TransportTasksAction$AsyncAction$1.handleException(TransportTasksAction.java:280)
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:535)
at org.elasticsearch.action.support.tasks.TransportTasksAction$AsyncAction.start(TransportTasksAction.java:264)
at org.elasticsearch.action.support.tasks.TransportTasksAction.doExecute(TransportTasksAction.java:96)
at org.elasticsearch.xpack.ml.action.TransportStopDatafeedAction.normalStopDatafeed(TransportStopDatafeedAction.java:175)
at org.elasticsearch.xpack.ml.action.TransportStopDatafeedAction.lambda$doExecute$0(TransportStopDatafeedAction.java:130)
at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:62)
at org.elasticsearch.xpack.ml.datafeed.persistence.DatafeedConfigProvider.lambda$expandDatafeedIds$3(DatafeedConfigProvider.java:387)
at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:62)
at org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:43)
at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:68)
at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:64)
at org.elasticsearch.action.search.AbstractSearchAsyncAction.sendSearchResponse(AbstractSearchAsyncAction.java:300)
at org.elasticsearch.action.search.FetchSearchPhase$3.run(FetchSearchPhase.java:213)
at org.elasticsearch.action.search.AbstractSearchAsyncAction.executePhase(AbstractSearchAsyncAction.java:171)
at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:165)
at org.elasticsearch.action.search.ExpandSearchPhase.run(ExpandSearchPhase.java:119)
at org.elasticsearch.action.search.AbstractSearchAsyncAction.executePhase(AbstractSearchAsyncAction.java:171)
at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:165)
at org.elasticsearch.action.search.FetchSearchPhase.moveToNextPhase(FetchSearchPhase.java:206)
at org.elasticsearch.action.search.FetchSearchPhase.lambda$innerRun$2(FetchSearchPhase.java:104)
at org.elasticsearch.action.search.FetchSearchPhase.innerRun(FetchSearchPhase.java:110)
at org.elasticsearch.action.search.FetchSearchPhase$1.doRun(FetchSearchPhase.java:86)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:44)
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:757)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.lang.Thread.run(Thread.java:835)
Caused by: org.elasticsearch.transport.NodeNotConnectedException: [node_t2][127.0.0.1:45727] Node not connected
at org.elasticsearch.transport.ConnectionManager.getConnection(ConnectionManager.java:151)
at org.elasticsearch.transport.TransportService.getConnection(TransportService.java:559)
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:531)
at org.elasticsearch.action.support.tasks.TransportTasksAction$AsyncAction.start(TransportTasksAction.java:264)
at org.elasticsearch.action.support.tasks.TransportTasksAction.doExecute(TransportTasksAction.java:96)
at org.elasticsearch.xpack.ml.action.TransportStopDatafeedAction.normalStopDatafeed(TransportStopDatafeedAction.java:175)
at org.elasticsearch.xpack.ml.action.TransportStopDatafeedAction.lambda$doExecute$0(TransportStopDatafeedAction.java:130)
at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:62)
at org.elasticsearch.xpack.ml.datafeed.persistence.DatafeedConfigProvider.lambda$expandDatafeedIds$3(DatafeedConfigProvider.java:387)
at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:62)
at org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:43)
at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:68)
at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:64)
at org.elasticsearch.action.search.AbstractSearchAsyncAction.sendSearchResponse(AbstractSearchAsyncAction.java:300)
at org.elasticsearch.action.search.FetchSearchPhase$3.run(FetchSearchPhase.java:213)
at org.elasticsearch.action.search.AbstractSearchAsyncAction.executePhase(AbstractSearchAsyncAction.java:171)
at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:165)
at org.elasticsearch.action.search.ExpandSearchPhase.run(ExpandSearchPhase.java:119)
at org.elasticsearch.action.search.AbstractSearchAsyncAction.executePhase(AbstractSearchAsyncAction.java:171)
at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:165)
at org.elasticsearch.action.search.FetchSearchPhase.moveToNextPhase(FetchSearchPhase.java:206)
at org.elasticsearch.action.search.FetchSearchPhase.lambda$innerRun$2(FetchSearchPhase.java:104)
at org.elasticsearch.action.search.FetchSearchPhase.innerRun(FetchSearchPhase.java:110)
at org.elasticsearch.action.search.FetchSearchPhase$1.doRun(FetchSearchPhase.java:86)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:44)
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:757)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.lang.Thread.run(Thread.java:835)
alpar-t added the >test-failure (Triaged test failures from CI), :ml (Machine learning), and v8.0.0 labels on Jun 27, 2019
@elasticmachine
Collaborator

Pinging @elastic/ml-core

alpar-t added a commit that referenced this issue Jun 27, 2019
@alpar-t
Contributor Author

alpar-t commented Jun 27, 2019

Muted in f6bc4b1

@droberts195
Contributor

There's a race condition in TransportStopDatafeedAction.normalStopDatafeed(): the datafeed can become unassigned after the loop that handles assigned and unassigned datafeeds differently, but before the request is redirected to the node the datafeed is assigned to. We should probably catch the FailedNodeException and, if the cause is NodeNotConnectedException, retry the stop.
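
A self-contained sketch of that retry is below. It is purely illustrative: the two exception classes are re-declared locally as stand-ins for Elasticsearch's FailedNodeException and NodeNotConnectedException, and StopAction/stopWithRetry are hypothetical names, not the real TransportStopDatafeedAction code.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class StopDatafeedRetrySketch {

    // Hypothetical stand-in for org.elasticsearch.action.FailedNodeException.
    static class FailedNodeException extends RuntimeException {
        FailedNodeException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    // Hypothetical stand-in for org.elasticsearch.transport.NodeNotConnectedException.
    static class NodeNotConnectedException extends RuntimeException {
        NodeNotConnectedException(String message) {
            super(message);
        }
    }

    interface StopAction {
        void stop() throws FailedNodeException;
    }

    // Retry the stop only when the failure is rooted in a node disconnect;
    // any other failure is rethrown unchanged.
    static void stopWithRetry(StopAction action, int maxAttempts) {
        for (int attempt = 1; ; attempt++) {
            try {
                action.stop();
                return; // stop succeeded
            } catch (FailedNodeException e) {
                boolean nodeGone = e.getCause() instanceof NodeNotConnectedException;
                if (nodeGone && attempt < maxAttempts) {
                    // The datafeed's node left the cluster between looking up the
                    // task and redirecting the request; try again, by which time
                    // the task should show up as unassigned.
                    continue;
                }
                throw e;
            }
        }
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        StopAction flaky = () -> {
            if (calls.incrementAndGet() == 1) {
                throw new FailedNodeException("Failed node [node_t2]",
                        new NodeNotConnectedException("[node_t2] Node not connected"));
            }
            // the second attempt succeeds
        };
        stopWithRetry(flaky, 3);
        System.out.println("stopped after " + calls.get() + " attempt(s)");
    }
}
```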

The fact that the test has only had this problem on master suggests to me that something has changed in the last few days which makes the assignment status of persistent tasks slower to update in the cluster state when a node leaves the cluster than it was until recently.

/cc @elastic/es-distributed (not because I think you should fix this - that's definitely for the ML team - but just in case you have an idea why the cluster state update handlers that react to a node leaving the cluster might now take longer to complete than they used to)

@ywelsch
Contributor

ywelsch commented Jun 27, 2019

> /cc @elastic/es-distributed (not because I think you should fix this - that's definitely for the ML team - but just in case you have an idea why the cluster state update handlers that react to a node leaving the cluster might now take longer to complete than they used to)

I'm not aware of any recent change affecting this. Perhaps it's just the timing on some CI machines that's a little different. We're talking about 150ms here.

1> [2019-06-26T16:53:53,999][INFO ][o.e.c.s.MasterService    ] [node_tmd1] node-left[{node_t2}{Acy_1w7ETxqZVT9j6lrRQg}{moXD7ZzHTYu_52dDIa3kaw}{127.0.0.1}{127.0.0.1:47796}{di}{ml.machine_memory=63058345984, ml.max_open_jobs=20, xpack.installed=true} disconnected], term: 1, version: 28, reason: removed {{node_t2}{Acy_1w7ETxqZVT9j6lrRQg}{moXD7ZzHTYu_52dDIa3kaw}{127.0.0.1}{127.0.0.1:47796}{di}{ml.machine_memory=63058345984, ml.max_open_jobs=20, xpack.installed=true},}
  1> [2019-06-26T16:53:54,064][INFO ][o.e.n.Node               ] [testCloseUnassignedJobAndDatafeed] stopped
  1> [2019-06-26T16:53:54,065][INFO ][o.e.n.Node               ] [testCloseUnassignedJobAndDatafeed] closing ...
  1> [2019-06-26T16:53:54,067][INFO ][o.e.n.Node               ] [testCloseUnassignedJobAndDatafeed] closed
  1> [2019-06-26T16:53:54,083][INFO ][o.e.x.m.i.MlDistributedFailureIT] [testCloseUnassignedJobAndDatafeed] [MlDistributedFailureIT#testCloseUnassignedJobAndDatafeed]: Cleaning up datafeeds and jobs after test
  1> [2019-06-26T16:53:54,085][INFO ][o.e.x.m.i.MlDistributedFailureIT] [testCloseUnassignedJobAndDatafeed] Closing all datafeeds (using _all)
  1> [2019-06-26T16:53:54,154][INFO ][o.e.c.s.ClusterApplierService] [node_tmd1] removed {{node_t2}{Acy_1w7ETxqZVT9j6lrRQg}{moXD7ZzHTYu_52dDIa3kaw}{127.0.0.1}{127.0.0.1:47796}{di}{ml.machine_memory=63058345984, ml.max_open_jobs=20, xpack.installed=true},}, term: 1, version: 28, reason: Publication{term=1, version=28}

@ebadyano
Contributor

Looks like another failure with NodeNotConnectedException https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+intake/843/console

@droberts195
Contributor

> There's a race condition in TransportStopDatafeedAction.normalStopDatafeed(): the datafeed can become unassigned after the loop that handles assigned and unassigned datafeeds differently, but before the request is redirected to the node the datafeed is assigned to. We should probably catch the FailedNodeException and, if the cause is NodeNotConnectedException, retry the stop.

I will try to fix this in the change to fix #48931 as it's in the same part of the code.

droberts195 added a commit to droberts195/elasticsearch that referenced this issue Nov 15, 2019
The following edge cases were fixed:

1. A request to force-stop a stopping datafeed is no longer
   ignored.  Force-stop is an important recovery mechanism
   if normal stop doesn't work for some reason, and needs
   to operate on a datafeed in any state other than stopped.
2. If the node that a datafeed is running on is removed from
   the cluster during a normal stop then the stop request is
   retried (and will likely succeed on this retry by simply
   cancelling the persistent task for the affected datafeed).
3. If there are multiple simultaneous force-stop requests for
   the same datafeed we no longer fail the one that is
   processed second.  The previous behaviour was wrong as
   stopping a stopped datafeed is not an error, so stopping
   a datafeed twice simultaneously should not be either.

Fixes elastic#43670
Fixes elastic#48931
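
Edge case 3 amounts to making force-stop idempotent. A minimal sketch under that reading, using an in-memory map as a hypothetical stand-in for the persistent-task registry (none of these names are real Elasticsearch classes):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ForceStopIdempotenceSketch {

    // Hypothetical stand-in for the persistent-task registry, keyed by datafeed id.
    private final Map<String, Object> persistentTasks = new ConcurrentHashMap<>();

    void startDatafeed(String datafeedId) {
        persistentTasks.put(datafeedId, new Object());
    }

    // Force-stop removes the persistent task.  A task that is already gone means
    // the datafeed is already stopped, which is the desired end state, so the
    // second of two racing force-stop requests also reports success.
    boolean forceStop(String datafeedId) {
        Object removed = persistentTasks.remove(datafeedId);
        if (removed == null) {
            // Another force-stop (or a normal stop) got there first.  Stopping a
            // stopped datafeed is not an error, so neither is this.
            return true;
        }
        // Task removed; in the real code this is where the persistent task would
        // be cancelled.
        return true;
    }

    public static void main(String[] args) {
        ForceStopIdempotenceSketch registry = new ForceStopIdempotenceSketch();
        registry.startDatafeed("datafeed-1");
        System.out.println("first force-stop:  " + registry.forceStop("datafeed-1"));
        System.out.println("second force-stop: " + registry.forceStop("datafeed-1"));
    }
}
```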
droberts195 added a commit that referenced this issue Nov 19, 2019
The following edge cases were fixed:

1. A request to force-stop a stopping datafeed is no longer
   ignored.  Force-stop is an important recovery mechanism
   if normal stop doesn't work for some reason, and needs
   to operate on a datafeed in any state other than stopped.
2. If the node that a datafeed is running on is removed from
   the cluster during a normal stop then the stop request is
   retried (and will likely succeed on this retry by simply
   cancelling the persistent task for the affected datafeed).
3. If there are multiple simultaneous force-stop requests for
   the same datafeed we no longer fail the one that is
   processed second.  The previous behaviour was wrong as
   stopping a stopped datafeed is not an error, so stopping
   a datafeed twice simultaneously should not be either.

Fixes #43670
Fixes #48931