[CI] MlDistributedFailureIT testClusterWithTwoMlNodes_RunsDatafeed_GivenOriginalNodeGoesDown failing #77655

Closed
nik9000 opened this issue Sep 13, 2021 · 11 comments
Labels
:ml Machine learning Team:ML Meta label for the ML team >test-failure Triaged test failures from CI

Comments

@nik9000
Member

nik9000 commented Sep 13, 2021

Build scan:
https://gradle-enterprise.elastic.co/s/puafo5n7kasvi/tests/:x-pack:plugin:ml:internalClusterTest/org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT/testClusterWithTwoMlNodes_RunsDatafeed_GivenOriginalNodeGoesDown

Reproduction line:
./gradlew ':x-pack:plugin:ml:internalClusterTest' --tests "org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT.testClusterWithTwoMlNodes_RunsDatafeed_GivenOriginalNodeGoesDown" -Dtests.seed=4EF400169D3386F -Dtests.locale=hu-HU -Dtests.timezone=Asia/Muscat -Druntime.java=8

Applicable branches:
master, 7.x

Reproduces locally?:
Nope

Failure history:
https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT&tests.test=testClusterWithTwoMlNodes_RunsDatafeed_GivenOriginalNodeGoesDown

Failure excerpt:

java.lang.AssertionError: 
Expected: <80000L>
     but: was <79999L>

  at __randomizedtesting.SeedInfo.seed([4EF400169D3386F:A3A27413B0768C84]:0)
  at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
  at org.junit.Assert.assertThat(Assert.java:956)
  at org.junit.Assert.assertThat(Assert.java:923)
  at org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT.lambda$testClusterWithTwoMlNodes_RunsDatafeed_GivenOriginalNodeGoesDown$17(MlDistributedFailureIT.java:541)
  at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:1039)
  at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:1012)
  at org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT.testClusterWithTwoMlNodes_RunsDatafeed_GivenOriginalNodeGoesDown(MlDistributedFailureIT.java:539)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(NativeMethodAccessorImpl.java:-2)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:938)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:974)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
  at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
  at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
  at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
  at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
  at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
  at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
  at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
  at java.lang.Thread.run(Thread.java:748)

@nik9000 nik9000 added :ml Machine learning >test-failure Triaged test failures from CI labels Sep 13, 2021
@elasticmachine elasticmachine added the Team:ML Meta label for the ML team label Sep 13, 2021
@elasticmachine
Collaborator

Pinging @elastic/ml-core (Team:ML)

nik9000 added a commit that referenced this issue Sep 13, 2021
This one fails pretty consistently on my PRs and locally. Not on command
or anything, but enough to cause me trouble.

Relates #77655
nik9000 added a commit that referenced this issue Sep 13, 2021
This one fails pretty consistently on my PRs and locally. Not on command
or anything, but enough to cause me trouble.

Relates #77655
@nik9000
Member Author

nik9000 commented Sep 13, 2021

I've @AwaitsFixed this on master and 7.x.

@droberts195
Contributor

Looking at the failure history, there was one failure on 6th September in a PR build for #73324, which is not currently merged. Then loads of failures on 13th September until it was muted.

This makes me wonder if there's something in Lucene 9 that was also in #64292, because #64292 was merged yesterday.

On latest master with the AwaitsFix removed I get 3 failures out of 20 for:

./gradlew ':x-pack:plugin:ml:internalClusterTest' --tests "org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT.testClusterWithTwoMlNodes_RunsDatafeed_GivenOriginalNodeGoesDown" -Dtests.iters=20 -Dtests.seed=D582F949E13E4E4D

After a git revert 1b56e8b I get 0 failures out of 20 for that same command.

So it does seem that something in #64292 is causing this failure. @mayya-sharipova I will try to narrow it down a bit more before passing it over, but just a heads-up that there's a bug somewhere. Also cc @romseygeek as you got exactly the same error on #73324 - I don't know if Lucene 9 contains something in common with #64292 or if you saw a completely different problem that resulted in the same net effect on ML's searches.

@romseygeek
Contributor

I'm also getting occasional off-by-one errors in translog replays on the Lucene 9 branch, which @dnhatn has been looking at. I wonder if it's a problem with search-after in combination with the point-sort optimization?

@romseygeek
Contributor

> I wonder if it's a problem with search-after in combination with the point-sort optimization?

Or, what @droberts195 said in the comment above, of course...
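
For readers unfamiliar with why paging can come up one document short, here is a minimal, self-contained sketch of search-after style pagination over an in-memory list. It is purely illustrative and is not the Lucene bug discussed in this thread: the `Doc` class, `countAll`, and `sampleDocs` are hypothetical names invented for the example, and no Elasticsearch or Lucene API is used. It only shows how a cursor that ignores ties on the sort value can silently drop a hit at a page boundary.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class SearchAfterSketch {
    /** Hypothetical document: a sort value plus a unique tiebreaker. */
    static final class Doc {
        final long timestamp;
        final long id;

        Doc(long timestamp, long id) {
            this.timestamp = timestamp;
            this.id = id;
        }
    }

    /** Total order: timestamp first, then the unique id. */
    static final Comparator<Doc> ORDER =
        Comparator.<Doc>comparingLong(d -> d.timestamp).thenComparingLong(d -> d.id);

    /**
     * Pages through docs (already sorted by ORDER) and returns the total number of hits seen.
     * With useTiebreaker=false the cursor compares timestamps only, so documents that share
     * the last timestamp of a page are skipped on the next page.
     */
    static int countAll(List<Doc> sortedDocs, int pageSize, boolean useTiebreaker) {
        int total = 0;
        Doc cursor = null; // last hit of the previous page, null before the first page
        while (true) {
            int pageHits = 0;
            Doc last = cursor;
            for (Doc d : sortedDocs) {
                if (pageHits == pageSize) {
                    break;
                }
                boolean after = cursor == null
                    || (useTiebreaker ? ORDER.compare(d, cursor) > 0 : d.timestamp > cursor.timestamp);
                if (after) {
                    pageHits++;
                    last = d;
                }
            }
            if (pageHits == 0) {
                return total; // an empty page ends the iteration
            }
            total += pageHits;
            cursor = last;
        }
    }

    /** Ten docs; the two docs with timestamp 3 straddle the boundary of a 4-hit page. */
    static List<Doc> sampleDocs() {
        long[] timestamps = {0, 1, 2, 3, 3, 4, 5, 6, 7, 8};
        List<Doc> docs = new ArrayList<>();
        for (int i = 0; i < timestamps.length; i++) {
            docs.add(new Doc(timestamps[i], i));
        }
        return docs;
    }

    public static void main(String[] args) {
        System.out.println("with tiebreaker: " + countAll(sampleDocs(), 4, true));  // 10
        System.out.println("timestamp only:  " + countAll(sampleDocs(), 4, false)); // 9
    }
}
```

With a page size of 4, the timestamp-only cursor drops the second document with timestamp 3, so the count is 9 rather than 10. That is the same shape of symptom as 79999 versus 80000, though the actual cause here may be different.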

@droberts195
Contributor

@mayya-sharipova in the test that goes wrong we are doing a scroll that should find 80000 documents.

droberts195@2943351 is a commit that prints debug showing how many hits each continuation of the scroll found. On its own it's not very useful, but it might be a starting point for enabling further debug inside the scroll code to show what's going wrong.

Running 20 iterations with seed D582F949E13E4E4D seems to be a pretty reliable way to get a failure, at least on my machine:

./gradlew ':x-pack:plugin:ml:internalClusterTest' --tests "org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT.testClusterWithTwoMlNodes_RunsDatafeed_GivenOriginalNodeGoesDown" -Dtests.iters=20 -Dtests.seed=D582F949E13E4E4D
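
As a rough idea of what per-continuation hit counting looks like, here is a minimal sketch that drains a paged source and records how many hits each page returned. It is not the code from the debug commit mentioned above: `inMemoryScroll` and `countHits` are hypothetical helpers, and the scroll is simulated with a `Supplier` instead of a real Elasticsearch client.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.Supplier;

public class ScrollCountSketch {
    /**
     * Hypothetical stand-in for a scroll: each call returns the next page of hits,
     * and an empty list once all documents have been returned.
     */
    static Supplier<List<Long>> inMemoryScroll(int totalDocs, int pageSize) {
        int[] next = {0}; // position of the next document to return
        return () -> {
            List<Long> page = new ArrayList<>();
            while (page.size() < pageSize && next[0] < totalDocs) {
                page.add((long) next[0]++);
            }
            return page;
        };
    }

    /** Drains the scroll, appending each page's hit count to perPageCounts, and returns the total. */
    static long countHits(Supplier<List<Long>> scroll, List<Integer> perPageCounts) {
        long total = 0;
        List<Long> page;
        while (!(page = scroll.get()).isEmpty()) {
            perPageCounts.add(page.size());
            total += page.size();
        }
        return total;
    }

    public static void main(String[] args) {
        List<Integer> perPage = new ArrayList<>();
        long total = countHits(inMemoryScroll(80000, 1000), perPage);
        for (int i = 0; i < perPage.size(); i++) {
            System.out.println("continuation " + i + ": " + perPage.get(i) + " hits");
        }
        System.out.println("total hits: " + total);
    }
}
```

In a failing run, a per-page list like this would show which continuation came back short, which narrows down where to add further debug.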

@mayya-sharipova
Contributor

@droberts195 Thanks for the detailed investigations and reporting. We will investigate from here.

@dnhatn
Member

dnhatn commented Sep 15, 2021

There's a bug in Lucene and I've proposed a fix. @mayya-sharipova I think we need to disable the sort optimization until we have a new Lucene snapshot with that fix?

@pheyos
Member

pheyos commented Sep 15, 2021

This is also causing failures in Kibana functional tests, which leads to PR delays.
@dnhatn @mayya-sharipova any chance we can get the temporary fix in quickly?

@dnhatn
Member

dnhatn commented Sep 16, 2021

@pheyos We are working to upgrade Lucene to include the fix: #77801.

@dnhatn
Member

dnhatn commented Sep 16, 2021

@pheyos I've merged #77801. I think we are good now.

mayya-sharipova added a commit that referenced this issue Sep 16, 2021
Re-enable testClusterWithTwoMlNodes_RunsDatafeed_GivenOriginalNodeGoesDown
 after Lucene 8.10 upgrade

Closes #77655