[CI] MlDistributedFailureIT testClusterWithTwoMlNodes_RunsDatafeed_GivenOriginalNodeGoesDown failing #77655
Pinging @elastic/ml-core (Team:ML)
This one fails pretty consistently on my PRs and locally. Not on command or anything, but enough to cause me trouble. Relates #77655
Looking at the failure history, there was one failure on 6th September in a PR build for #73324, which is not currently merged. Then there were loads of failures on 13th September until the test was muted. This makes me wonder if there's something in Lucene 9 that was also in #64292, because #64292 was merged yesterday. On latest master with the …
So it does seem that something in #64292 is causing this failure. @mayya-sharipova I will try to narrow it down a bit more before passing it over, but just a heads-up that there's a bug somewhere. Also cc @romseygeek, as you got exactly the same error on #73324 - I don't know if Lucene 9 contains something in common with #64292, or if you saw a completely different problem that resulted in the same net effect on ML's searches.
I'm getting occasional off-by-one errors in translog replays on the Lucene 9 branch as well, which @dnhatn has been looking at. I wonder if it's a problem with search-after in combination with the point-sort optimization?
Or, what @droberts195 said in the comment above, of course... |
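For concreteness, the interaction suspected here would be exercised by a pattern roughly like the one below. This is a minimal, self-contained Lucene sketch, not code from the test or from Elasticsearch - the field name, document count, and page size are all made up. It pages with searchAfter over a long sort field that is indexed both as points and as doc values, which is the shape that allows the points-based sort optimization to skip non-competitive documents:

```java
import org.apache.lucene.document.Document;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.NumericDocValuesField;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.Sort;
import org.apache.lucene.search.SortField;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.ByteBuffersDirectory;
import org.apache.lucene.store.Directory;

public class SearchAfterPagingSketch {
    public static void main(String[] args) throws Exception {
        try (Directory dir = new ByteBuffersDirectory()) {
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig())) {
                for (long i = 0; i < 80_000; i++) {
                    Document doc = new Document();
                    // Indexing the sort field as points *and* doc values is what
                    // makes the points-based sort optimization applicable.
                    doc.add(new LongPoint("ts", i));
                    doc.add(new NumericDocValuesField("ts", i));
                    writer.addDocument(doc);
                }
            }
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                Sort sort = new Sort(new SortField("ts", SortField.Type.LONG));
                long total = 0;
                ScoreDoc after = null;
                while (true) {
                    // Sorted results are FieldDocs, so the last hit of each page
                    // can be fed straight back in as the "after" cursor.
                    TopDocs page = searcher.searchAfter(after, new MatchAllDocsQuery(), 1_000, sort);
                    if (page.scoreDocs.length == 0) {
                        break;
                    }
                    total += page.scoreDocs.length;
                    after = page.scoreDocs[page.scoreDocs.length - 1];
                }
                // A correct run collects every document; a wrongly skipped
                // competitive hit would show up here as total < maxDoc.
                System.out.println("collected " + total + " of " + reader.maxDoc() + " docs");
            }
        }
    }
}
```

If the optimization skips a document it should have kept, the symptom is a page coming back short, so the final total falls below maxDoc() - consistent with the missing hits seen in the failing test.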
@mayya-sharipova in the test that goes wrong we are doing a scroll that should find 80000 documents. droberts195@2943351 is a commit that prints debug output showing how many hits each continuation of the scroll found. On its own it's not very useful, but it might be a starting point for enabling further debug inside the scroll code to show what's going wrong. Running 20 iterations with seed …
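For anyone reproducing this without checking out that commit, the debug it adds amounts to something like the sketch below. This is not the commit's actual code - it is a standalone approximation against the 7.x high-level REST client, with a hypothetical index name and page size - that logs how many hits each scroll continuation returns:

```java
import org.elasticsearch.action.search.ClearScrollRequest;
import org.elasticsearch.action.search.SearchRequest;
import org.elasticsearch.action.search.SearchResponse;
import org.elasticsearch.action.search.SearchScrollRequest;
import org.elasticsearch.client.RequestOptions;
import org.elasticsearch.client.RestHighLevelClient;
import org.elasticsearch.common.unit.TimeValue; // org.elasticsearch.core.TimeValue on more recent versions
import org.elasticsearch.search.builder.SearchSourceBuilder;

import java.io.IOException;

public class ScrollDebugSketch {
    /** Scrolls through a (hypothetical) index and logs the size of every scroll page. */
    static long countScrollHits(RestHighLevelClient client) throws IOException {
        SearchRequest search = new SearchRequest("some-index")
            .scroll(TimeValue.timeValueMinutes(1))
            .source(new SearchSourceBuilder().size(1_000));
        SearchResponse response = client.search(search, RequestOptions.DEFAULT);
        long total = 0;
        int page = 0;
        while (response.getHits().getHits().length > 0) {
            total += response.getHits().getHits().length;
            // These per-page counts are the interesting numbers: with the bug,
            // they stop adding up to the expected total for the index.
            System.out.println("scroll page " + (++page) + " returned "
                + response.getHits().getHits().length + " hits, running total " + total);
            SearchScrollRequest next = new SearchScrollRequest(response.getScrollId())
                .scroll(TimeValue.timeValueMinutes(1));
            response = client.scroll(next, RequestOptions.DEFAULT);
        }
        // Release the scroll context once the index is exhausted.
        ClearScrollRequest clear = new ClearScrollRequest();
        clear.addScrollId(response.getScrollId());
        client.clearScroll(clear, RequestOptions.DEFAULT);
        return total;
    }
}
```

The repeated iterations can be driven with the randomized-testing -Dtests.iters flag alongside the -Dtests.seed shown in the reproduction line below.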
@droberts195 Thanks for the detailed investigations and reporting. We will investigate from here. |
There's a bug in Lucene and I've proposed a fix. @mayya-sharipova I think we need to disable the sort optimization until we have a new Lucene snapshot with that fix? |
This is also causing failures in Kibana functional tests, which leads to PR delays. |
Re-enable testClusterWithTwoMlNodes_RunsDatafeed_GivenOriginalNodeGoesDown after Lucene 8.10 upgrade. Closes #77655
Build scan:
https://gradle-enterprise.elastic.co/s/puafo5n7kasvi/tests/:x-pack:plugin:ml:internalClusterTest/org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT/testClusterWithTwoMlNodes_RunsDatafeed_GivenOriginalNodeGoesDown
Reproduction line:
./gradlew ':x-pack:plugin:ml:internalClusterTest' --tests "org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT.testClusterWithTwoMlNodes_RunsDatafeed_GivenOriginalNodeGoesDown" -Dtests.seed=4EF400169D3386F -Dtests.locale=hu-HU -Dtests.timezone=Asia/Muscat -Druntime.java=8
Applicable branches:
master, 7.x
Reproduces locally?:
Nope
Failure history:
https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT&tests.test=testClusterWithTwoMlNodes_RunsDatafeed_GivenOriginalNodeGoesDown
Failure excerpt: