
MlDistributedFailureIT.testCloseUnassignedJobAndDatafeed fails with NodeNotConnectedException #43670

Closed
alpar-t opened this issue Jun 27, 2019 · 6 comments · Fixed by #49191
Labels
:ml (Machine learning), >test-failure (Triaged test failures from CI), v8.0.0-alpha1

Comments

@alpar-t
Contributor

alpar-t commented Jun 27, 2019

Example build failure

https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+intake/370/console
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=oraclelinux-6/87/console

And quite a few PR checks.

https://scans.gradle.com/s/bsbkz6io7ysno/tests/lf2lfu4ufazso-jxctggmo7ue4i

Reproduction line

Does not reproduce locally.

./gradlew :x-pack:plugin:ml:internalClusterTest --tests "org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT.testCloseUnassignedJobAndDatafeed" -Dtests.seed=8DE3FEE00F9B4146 -Dtests.security.manager=true -Dtests.locale=bem-ZM -Dtests.timezone=Australia/Sydney -Dcompiler.java=12 -Druntime.java=12

Example relevant log:

org.elasticsearch.action.FailedNodeException: Failed node [itGQaU1qT_mC05foNcu0qA]
at org.elasticsearch.action.support.tasks.TransportTasksAction$AsyncAction.onFailure(TransportTasksAction.java:308)
at org.elasticsearch.action.support.tasks.TransportTasksAction$AsyncAction$1.handleException(TransportTasksAction.java:280)
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:535)
at org.elasticsearch.action.support.tasks.TransportTasksAction$AsyncAction.start(TransportTasksAction.java:264)
at org.elasticsearch.action.support.tasks.TransportTasksAction.doExecute(TransportTasksAction.java:96)
at org.elasticsearch.xpack.ml.action.TransportStopDatafeedAction.normalStopDatafeed(TransportStopDatafeedAction.java:175)
at org.elasticsearch.xpack.ml.action.TransportStopDatafeedAction.lambda$doExecute$0(TransportStopDatafeedAction.java:130)
at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:62)
at org.elasticsearch.xpack.ml.datafeed.persistence.DatafeedConfigProvider.lambda$expandDatafeedIds$3(DatafeedConfigProvider.java:387)
at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:62)
at org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:43)
at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:68)
at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:64)
at org.elasticsearch.action.search.AbstractSearchAsyncAction.sendSearchResponse(AbstractSearchAsyncAction.java:300)
at org.elasticsearch.action.search.FetchSearchPhase$3.run(FetchSearchPhase.java:213)
at org.elasticsearch.action.search.AbstractSearchAsyncAction.executePhase(AbstractSearchAsyncAction.java:171)
at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:165)
at org.elasticsearch.action.search.ExpandSearchPhase.run(ExpandSearchPhase.java:119)
at org.elasticsearch.action.search.AbstractSearchAsyncAction.executePhase(AbstractSearchAsyncAction.java:171)
at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:165)
at org.elasticsearch.action.search.FetchSearchPhase.moveToNextPhase(FetchSearchPhase.java:206)
at org.elasticsearch.action.search.FetchSearchPhase.lambda$innerRun$2(FetchSearchPhase.java:104)
at org.elasticsearch.action.search.FetchSearchPhase.innerRun(FetchSearchPhase.java:110)
at org.elasticsearch.action.search.FetchSearchPhase$1.doRun(FetchSearchPhase.java:86)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:44)
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:757)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.lang.Thread.run(Thread.java:835)
Caused by: org.elasticsearch.transport.NodeNotConnectedException: [node_t2][127.0.0.1:45727] Node not connected
at org.elasticsearch.transport.ConnectionManager.getConnection(ConnectionManager.java:151)
at org.elasticsearch.transport.TransportService.getConnection(TransportService.java:559)
at org.elasticsearch.transport.TransportService.sendRequest(TransportService.java:531)
at org.elasticsearch.action.support.tasks.TransportTasksAction$AsyncAction.start(TransportTasksAction.java:264)
at org.elasticsearch.action.support.tasks.TransportTasksAction.doExecute(TransportTasksAction.java:96)
at org.elasticsearch.xpack.ml.action.TransportStopDatafeedAction.normalStopDatafeed(TransportStopDatafeedAction.java:175)
at org.elasticsearch.xpack.ml.action.TransportStopDatafeedAction.lambda$doExecute$0(TransportStopDatafeedAction.java:130)
at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:62)
at org.elasticsearch.xpack.ml.datafeed.persistence.DatafeedConfigProvider.lambda$expandDatafeedIds$3(DatafeedConfigProvider.java:387)
at org.elasticsearch.action.ActionListener$1.onResponse(ActionListener.java:62)
at org.elasticsearch.action.support.ContextPreservingActionListener.onResponse(ContextPreservingActionListener.java:43)
at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:68)
at org.elasticsearch.action.support.TransportAction$1.onResponse(TransportAction.java:64)
at org.elasticsearch.action.search.AbstractSearchAsyncAction.sendSearchResponse(AbstractSearchAsyncAction.java:300)
at org.elasticsearch.action.search.FetchSearchPhase$3.run(FetchSearchPhase.java:213)
at org.elasticsearch.action.search.AbstractSearchAsyncAction.executePhase(AbstractSearchAsyncAction.java:171)
at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:165)
at org.elasticsearch.action.search.ExpandSearchPhase.run(ExpandSearchPhase.java:119)
at org.elasticsearch.action.search.AbstractSearchAsyncAction.executePhase(AbstractSearchAsyncAction.java:171)
at org.elasticsearch.action.search.AbstractSearchAsyncAction.executeNextPhase(AbstractSearchAsyncAction.java:165)
at org.elasticsearch.action.search.FetchSearchPhase.moveToNextPhase(FetchSearchPhase.java:206)
at org.elasticsearch.action.search.FetchSearchPhase.lambda$innerRun$2(FetchSearchPhase.java:104)
at org.elasticsearch.action.search.FetchSearchPhase.innerRun(FetchSearchPhase.java:110)
at org.elasticsearch.action.search.FetchSearchPhase$1.doRun(FetchSearchPhase.java:86)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at org.elasticsearch.common.util.concurrent.TimedRunnable.doRun(TimedRunnable.java:44)
at org.elasticsearch.common.util.concurrent.ThreadContext$ContextPreservingAbstractRunnable.doRun(ThreadContext.java:757)
at org.elasticsearch.common.util.concurrent.AbstractRunnable.run(AbstractRunnable.java:37)
at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1128)
at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:628)
at java.lang.Thread.run(Thread.java:835)
alpar-t added the >test-failure (Triaged test failures from CI), :ml (Machine learning), and v8.0.0 labels on Jun 27, 2019
@elasticmachine
Collaborator

Pinging @elastic/ml-core

alpar-t added a commit that referenced this issue Jun 27, 2019
@alpar-t
Contributor Author

alpar-t commented Jun 27, 2019

Muted in f6bc4b1

@droberts195
Contributor

There's a race condition in TransportStopDatafeedAction.normalStopDatafeed(): the datafeed can become unassigned after the loop that handles assigned and unassigned datafeeds differently, but before the request is redirected to the node the datafeed is assigned to. We should probably catch the FailedNodeException and, if the cause is NodeNotConnectedException, retry the stop.
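
A self-contained sketch of that retry is below. It is purely illustrative: the two exception classes are re-declared locally as stand-ins for Elasticsearch's FailedNodeException and NodeNotConnectedException, and StopAction/stopWithRetry are hypothetical names, not the real TransportStopDatafeedAction code.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class StopDatafeedRetrySketch {

    // Hypothetical stand-in for org.elasticsearch.action.FailedNodeException.
    static class FailedNodeException extends RuntimeException {
        FailedNodeException(String message, Throwable cause) {
            super(message, cause);
        }
    }

    // Hypothetical stand-in for org.elasticsearch.transport.NodeNotConnectedException.
    static class NodeNotConnectedException extends RuntimeException {
        NodeNotConnectedException(String message) {
            super(message);
        }
    }

    interface StopAction {
        void stop() throws FailedNodeException;
    }

    // Retry the stop only when the failure is rooted in a node disconnect;
    // any other failure is rethrown unchanged.
    static void stopWithRetry(StopAction action, int maxAttempts) {
        for (int attempt = 1; ; attempt++) {
            try {
                action.stop();
                return; // stop succeeded
            } catch (FailedNodeException e) {
                boolean nodeGone = e.getCause() instanceof NodeNotConnectedException;
                if (nodeGone && attempt < maxAttempts) {
                    // The datafeed's node left the cluster between looking up the
                    // task and redirecting the request; try again, by which time
                    // the task should show up as unassigned.
                    continue;
                }
                throw e;
            }
        }
    }

    public static void main(String[] args) {
        AtomicInteger calls = new AtomicInteger();
        StopAction flaky = () -> {
            if (calls.incrementAndGet() == 1) {
                throw new FailedNodeException("Failed node [node_t2]",
                        new NodeNotConnectedException("[node_t2] Node not connected"));
            }
            // the second attempt succeeds
        };
        stopWithRetry(flaky, 3);
        System.out.println("stopped after " + calls.get() + " attempt(s)");
    }
}
```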

The fact that the test has only had this problem on master suggests to me that something has changed in the last few days which makes the assignment status of persistent tasks slower to update in the cluster state when a node leaves the cluster than it was until recently.

/cc @elastic/es-distributed (not because I think you should fix this - that's definitely for the ML team - but just in case you have an idea why the cluster state update handlers that react to a node leaving the cluster might now take longer to complete than they used to)

@ywelsch
Contributor

ywelsch commented Jun 27, 2019

> /cc @elastic/es-distributed (not because I think you should fix this - that's definitely for the ML team - but just in case you have an idea why the cluster state update handlers that react to a node leaving the cluster might now take longer to complete than they used to)

I'm not aware of any recent change affecting this. Perhaps it's just the timing on some CI machines that's a little different. We're talking about 150ms here.

1> [2019-06-26T16:53:53,999][INFO ][o.e.c.s.MasterService    ] [node_tmd1] node-left[{node_t2}{Acy_1w7ETxqZVT9j6lrRQg}{moXD7ZzHTYu_52dDIa3kaw}{127.0.0.1}{127.0.0.1:47796}{di}{ml.machine_memory=63058345984, ml.max_open_jobs=20, xpack.installed=true} disconnected], term: 1, version: 28, reason: removed {{node_t2}{Acy_1w7ETxqZVT9j6lrRQg}{moXD7ZzHTYu_52dDIa3kaw}{127.0.0.1}{127.0.0.1:47796}{di}{ml.machine_memory=63058345984, ml.max_open_jobs=20, xpack.installed=true},}
  1> [2019-06-26T16:53:54,064][INFO ][o.e.n.Node               ] [testCloseUnassignedJobAndDatafeed] stopped
  1> [2019-06-26T16:53:54,065][INFO ][o.e.n.Node               ] [testCloseUnassignedJobAndDatafeed] closing ...
  1> [2019-06-26T16:53:54,067][INFO ][o.e.n.Node               ] [testCloseUnassignedJobAndDatafeed] closed
  1> [2019-06-26T16:53:54,083][INFO ][o.e.x.m.i.MlDistributedFailureIT] [testCloseUnassignedJobAndDatafeed] [MlDistributedFailureIT#testCloseUnassignedJobAndDatafeed]: Cleaning up datafeeds and jobs after test
  1> [2019-06-26T16:53:54,085][INFO ][o.e.x.m.i.MlDistributedFailureIT] [testCloseUnassignedJobAndDatafeed] Closing all datafeeds (using _all)
  1> [2019-06-26T16:53:54,154][INFO ][o.e.c.s.ClusterApplierService] [node_tmd1] removed {{node_t2}{Acy_1w7ETxqZVT9j6lrRQg}{moXD7ZzHTYu_52dDIa3kaw}{127.0.0.1}{127.0.0.1:47796}{di}{ml.machine_memory=63058345984, ml.max_open_jobs=20, xpack.installed=true},}, term: 1, version: 28, reason: Publication{term=1, version=28}

@ebadyano
Contributor

Looks like another failure with NodeNotConnectedException https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+intake/843/console

@droberts195
Contributor

> There's a race condition in TransportStopDatafeedAction.normalStopDatafeed(): the datafeed can become unassigned after the loop that handles assigned and unassigned datafeeds differently, but before the request is redirected to the node the datafeed is assigned to. We should probably catch the FailedNodeException and, if the cause is NodeNotConnectedException, retry the stop.

I will try to fix this in the change to fix #48931 as it's in the same part of the code.

droberts195 added a commit to droberts195/elasticsearch that referenced this issue Nov 15, 2019
The following edge cases were fixed:

1. A request to force-stop a stopping datafeed is no longer
   ignored.  Force-stop is an important recovery mechanism
   if normal stop doesn't work for some reason, and needs
   to operate on a datafeed in any state other than stopped.
2. If the node that a datafeed is running on is removed from
   the cluster during a normal stop then the stop request is
   retried (and will likely succeed on this retry by simply
   cancelling the persistent task for the affected datafeed).
3. If there are multiple simultaneous force-stop requests for
   the same datafeed we no longer fail the one that is
   processed second.  The previous behaviour was wrong as
   stopping a stopped datafeed is not an error, so stopping
   a datafeed twice simultaneously should not be either.

Fixes elastic#43670
Fixes elastic#48931
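
Edge case 3 amounts to making force-stop idempotent. A minimal sketch under that reading, using an in-memory map as a hypothetical stand-in for the persistent-task registry (none of these names are real Elasticsearch classes):

```java
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

public class ForceStopIdempotenceSketch {

    // Hypothetical stand-in for the persistent-task registry, keyed by datafeed id.
    private final Map<String, Object> persistentTasks = new ConcurrentHashMap<>();

    void startDatafeed(String datafeedId) {
        persistentTasks.put(datafeedId, new Object());
    }

    // Force-stop removes the persistent task.  A task that is already gone means
    // the datafeed is already stopped, which is the desired end state, so the
    // second of two racing force-stop requests also reports success.
    boolean forceStop(String datafeedId) {
        Object removed = persistentTasks.remove(datafeedId);
        if (removed == null) {
            // Another force-stop (or a normal stop) got there first.  Stopping a
            // stopped datafeed is not an error, so neither is this.
            return true;
        }
        // Task removed; in the real code this is where the persistent task would
        // be cancelled.
        return true;
    }

    public static void main(String[] args) {
        ForceStopIdempotenceSketch registry = new ForceStopIdempotenceSketch();
        registry.startDatafeed("datafeed-1");
        System.out.println("first force-stop:  " + registry.forceStop("datafeed-1"));
        System.out.println("second force-stop: " + registry.forceStop("datafeed-1"));
    }
}
```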
droberts195 added a commit that referenced this issue Nov 19, 2019
The following edge cases were fixed:

1. A request to force-stop a stopping datafeed is no longer
   ignored.  Force-stop is an important recovery mechanism
   if normal stop doesn't work for some reason, and needs
   to operate on a datafeed in any state other than stopped.
2. If the node that a datafeed is running on is removed from
   the cluster during a normal stop then the stop request is
   retried (and will likely succeed on this retry by simply
   cancelling the persistent task for the affected datafeed).
3. If there are multiple simultaneous force-stop requests for
   the same datafeed we no longer fail the one that is
   processed second.  The previous behaviour was wrong as
   stopping a stopped datafeed is not an error, so stopping
   a datafeed twice simultaneously should not be either.

Fixes #43670
Fixes #48931