MlDistributedFailureIT.testCloseUnassignedJobAndDatafeed fails with NodeNotConnectedException #43670
Comments
Pinging @elastic/ml-core
Muted in f6bc4b1
There's a race condition in

The fact that the test has only had this problem on master suggests to me that something changed in the last few days that makes the assignment status of persistent tasks slower to update in the cluster state when a node leaves the cluster.

/cc @elastic/es-distributed (not because I think you should fix this, as that's definitely for the ML team, but just in case you have an idea why the cluster state update handlers that react to a node leaving the cluster might now take longer to complete than they used to)
I'm not aware of any recent change affecting this. Perhaps it's just the timing on some CI machines that's a little different. We're talking about 150ms here.
Looks like another failure with
I will try to fix this in the change to fix #48931 as it's in the same part of the code.
The following edge cases were fixed:

1. A request to force-stop a stopping datafeed is no longer ignored. Force-stop is an important recovery mechanism if normal stop doesn't work for some reason, and needs to operate on a datafeed in any state other than stopped.
2. If the node that a datafeed is running on is removed from the cluster during a normal stop then the stop request is retried (and will likely succeed on this retry by simply cancelling the persistent task for the affected datafeed).
3. If there are multiple simultaneous force-stop requests for the same datafeed we no longer fail the one that is processed second. The previous behaviour was wrong as stopping a stopped datafeed is not an error, so stopping a datafeed twice simultaneously should not be either.

Fixes #43670
Fixes #48931
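The three edge cases described in this commit message can be illustrated with a small, self-contained Java model. This is only a sketch and not the Elasticsearch implementation: `DatafeedStopSketch`, `NodeClient`, `stopOnNode`, the `assigned` set and the local `NodeNotConnectedException` class are all hypothetical stand-ins, and the model assumes the task's assignment has been cleared by the time the stop is retried.

```java
import java.util.Map;
import java.util.Set;
import java.util.concurrent.ConcurrentHashMap;

// Hypothetical model of the stop semantics described above; not Elasticsearch code.
class DatafeedStopSketch {
    enum State { STARTED, STOPPING, STOPPED }

    // Local stand-in for the exception named in the issue title.
    static class NodeNotConnectedException extends RuntimeException {
        NodeNotConnectedException(String message) { super(message); }
    }

    // Hypothetical transport to the node that runs the datafeed.
    interface NodeClient {
        void stopOnNode(String datafeedId);
    }

    private final Map<String, State> states = new ConcurrentHashMap<>();
    private final Set<String> assigned = ConcurrentHashMap.newKeySet();
    private final NodeClient client;

    DatafeedStopSketch(NodeClient client) { this.client = client; }

    void start(String id) {
        states.put(id, State.STARTED);
        assigned.add(id);
    }

    // A datafeed with no entry is treated as stopped.
    State stateOf(String id) { return states.getOrDefault(id, State.STOPPED); }

    // Edge cases 1 and 3: force-stop acts on any state, including STOPPING,
    // and is idempotent, so a second simultaneous request is not an error.
    void forceStop(String id) {
        assigned.remove(id);
        states.remove(id);
    }

    // Edge case 2: if the node disappears during a normal stop, retry.
    // Assumption: by the retry the task is unassigned, so it is simply cancelled.
    boolean normalStop(String id, int maxAttempts) {
        if (stateOf(id) == State.STOPPED) {
            return true; // stopping a stopped datafeed is not an error
        }
        states.put(id, State.STOPPING);
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (!assigned.contains(id)) {
                states.remove(id); // unassigned: cancel the task directly
                return true;
            }
            try {
                client.stopOnNode(id);
                assigned.remove(id);
                states.remove(id);
                return true;
            } catch (NodeNotConnectedException e) {
                assigned.remove(id); // node left the cluster; retry
            }
        }
        return false;
    }

    public static void main(String[] args) {
        DatafeedStopSketch sketch = new DatafeedStopSketch(id -> {
            throw new NodeNotConnectedException("node left the cluster");
        });
        sketch.start("df-1");
        System.out.println("normal stop succeeded: " + sketch.normalStop("df-1", 2));
        sketch.forceStop("df-1");
        sketch.forceStop("df-1"); // second force-stop is a harmless no-op
        System.out.println("final state: " + sketch.stateOf("df-1"));
    }
}
```

The model keeps stop operations idempotent by treating "no entry" as stopped, which is one simple way to make a repeated or concurrent stop succeed rather than fail.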
Example build failure
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+intake/370/console
https://elasticsearch-ci.elastic.co/job/elastic+elasticsearch+master+multijob-unix-compatibility/os=oraclelinux-6/87/console
And quite a few PR checks.
https://scans.gradle.com/s/bsbkz6io7ysno/tests/lf2lfu4ufazso-jxctggmo7ue4i
Reproduction line
does not reproduce locally
Example relevant log: