[CI] MlDistributedFailureIT testClusterWithTwoMlNodes_RunsDatafeed_GivenOriginalNodeGoesDown failing #77655

nik9000 · 2021-09-13T18:23:17Z

Build scan:
https://gradle-enterprise.elastic.co/s/puafo5n7kasvi/tests/:x-pack:plugin:ml:internalClusterTest/org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT/testClusterWithTwoMlNodes_RunsDatafeed_GivenOriginalNodeGoesDown

Reproduction line:
./gradlew ':x-pack:plugin:ml:internalClusterTest' --tests "org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT.testClusterWithTwoMlNodes_RunsDatafeed_GivenOriginalNodeGoesDown" -Dtests.seed=4EF400169D3386F -Dtests.locale=hu-HU -Dtests.timezone=Asia/Muscat -Druntime.java=8

Applicable branches:
master, 7.x

Reproduces locally?:
Nope

Failure history:
https://gradle-enterprise.elastic.co/scans/tests?tests.container=org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT&tests.test=testClusterWithTwoMlNodes_RunsDatafeed_GivenOriginalNodeGoesDown

Failure excerpt:

java.lang.AssertionError: 
Expected: <80000L>
     but: was <79999L>

  at __randomizedtesting.SeedInfo.seed([4EF400169D3386F:A3A27413B0768C84]:0)
  at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
  at org.junit.Assert.assertThat(Assert.java:956)
  at org.junit.Assert.assertThat(Assert.java:923)
  at org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT.lambda$testClusterWithTwoMlNodes_RunsDatafeed_GivenOriginalNodeGoesDown$17(MlDistributedFailureIT.java:541)
  at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:1039)
  at org.elasticsearch.test.ESTestCase.assertBusy(ESTestCase.java:1012)
  at org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT.testClusterWithTwoMlNodes_RunsDatafeed_GivenOriginalNodeGoesDown(MlDistributedFailureIT.java:539)
  at sun.reflect.NativeMethodAccessorImpl.invoke0(NativeMethodAccessorImpl.java:-2)
  at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
  at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
  at java.lang.reflect.Method.invoke(Method.java:498)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.invoke(RandomizedRunner.java:1750)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$8.evaluate(RandomizedRunner.java:938)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$9.evaluate(RandomizedRunner.java:974)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$10.evaluate(RandomizedRunner.java:988)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:49)
  at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
  at org.apache.lucene.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:48)
  at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
  at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
  at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
  at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
  at org.apache.lucene.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:45)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:41)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at org.apache.lucene.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
  at org.apache.lucene.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:47)
  at org.apache.lucene.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:64)
  at org.apache.lucene.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:54)
  at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
  at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
  at java.lang.Thread.run(Thread.java:748)


elasticmachine · 2021-09-13T18:23:19Z

Pinging @elastic/ml-core (Team:ML)


nik9000 · 2021-09-13T18:47:42Z

I've @AwaitsFixed this on master and 7.x.

droberts195 · 2021-09-14T11:56:33Z

Looking at the failure history, there was one failure on 6th September in a PR build for #73324, which is not currently merged. Then loads of failures on 13th September until it was muted.

This makes me wonder if there's something in Lucene 9 that was also in #64292, because #64292 was merged yesterday.

On latest master with the AwaitsFix removed I get 3 failures out of 20 for:

./gradlew ':x-pack:plugin:ml:internalClusterTest' --tests "org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT.testClusterWithTwoMlNodes_RunsDatafeed_GivenOriginalNodeGoesDown" -Dtests.iters=20 -Dtests.seed=D582F949E13E4E4D

After a git revert 1b56e8b I get 0 failures out of 20 for that same command.

So it does seem that something in #64292 is causing this failure. @mayya-sharipova I will try to narrow it down a bit more before passing it over, but just a heads-up that there's a bug somewhere. Also cc @romseygeek as you got exactly the same error on #73324 - I don't know if Lucene 9 contains something in common with #64292 or if you saw a completely different problem that resulted in the same net effect on ML's searches.
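As a rough sanity check on the numbers above (3 failures out of 20 before the revert, 0 out of 20 after), the sketch below estimates how likely 20 clean runs would be if the underlying failure rate were still 15%. It assumes each run fails independently with the same probability, which is a simplification, and the class and method names are purely illustrative.

```java
import java.util.Locale;

public class FlakyRateSketch {

    /**
     * Probability of seeing zero failures in `runs` independent runs
     * when each run fails with probability `failureRate`.
     */
    public static double probAllPass(double failureRate, int runs) {
        return Math.pow(1.0 - failureRate, runs);
    }

    public static void main(String[] args) {
        // Observed before the revert: 3 failures out of 20 iterations.
        double observedRate = 3.0 / 20.0;
        double p = probAllPass(observedRate, 20);
        System.out.println(String.format(Locale.ROOT,
            "P(0 failures in 20 runs at rate %.2f) = %.4f", observedRate, p));
    }
}
```

Under those assumptions the chance of 20 clean runs by luck alone is a little under 4%, so the revert result is suggestive rather than conclusive, which fits the plan to narrow it down further.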

romseygeek · 2021-09-14T12:27:44Z

I'm also getting occasional off-by-one errors in translog replays on the Lucene 9 branch, which @dnhatn has been looking at. I wonder if it's a problem with search-after in combination with the point-sort optimization?

romseygeek · 2021-09-14T12:28:29Z

> I wonder if it's a problem with search-after in combination with the point-sort optimization?

Or, what @droberts195 said in the comment above, of course...

droberts195 · 2021-09-14T13:55:41Z

@mayya-sharipova in the test that goes wrong we are doing a scroll that should find 80000 documents.

droberts195@2943351 is a commit that prints debug output showing how many hits each continuation of the scroll found. On its own it's not very useful, but it might be a starting point for enabling further debug inside the scroll code to show what's going wrong.

Running 20 iterations with seed D582F949E13E4E4D seems to be a pretty reliable way to get a failure, at least on my machine:

./gradlew ':x-pack:plugin:ml:internalClusterTest' --tests "org.elasticsearch.xpack.ml.integration.MlDistributedFailureIT.testClusterWithTwoMlNodes_RunsDatafeed_GivenOriginalNodeGoesDown" -Dtests.iters=20 -Dtests.seed=D582F949E13E4E4D
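For readers who want to see the shape of that per-continuation counting without the test harness, here is a minimal in-memory sketch. It does not use the Elasticsearch client or the code from the debug commit; `correctPage` and `lossyPage` are hypothetical stand-ins for a scroll continuation, and the lossy variant drops one document on purpose purely to reproduce the 79999-versus-80000 symptom. It is not a model of the actual bug.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BiFunction;

public class ScrollCountSketch {

    /** Hypothetical pager: up to pageSize values greater than `after`, from documents numbered 1..total. */
    public static List<Long> correctPage(long after, int pageSize, long total) {
        List<Long> page = new ArrayList<>();
        for (long v = after + 1; v <= total && page.size() < pageSize; v++) {
            page.add(v);
        }
        return page;
    }

    /** Deliberately broken pager that silently skips the value `dropped` (an artificial fault). */
    public static List<Long> lossyPage(long after, int pageSize, long total, long dropped) {
        List<Long> page = new ArrayList<>();
        for (long v = after + 1; v <= total && page.size() < pageSize; v++) {
            if (v != dropped) {
                page.add(v);
            }
        }
        return page;
    }

    /** Pages until an empty page is returned and records how many hits each continuation produced. */
    public static List<Integer> hitsPerPage(BiFunction<Long, Integer, List<Long>> pager, int pageSize) {
        List<Integer> counts = new ArrayList<>();
        long after = 0; // cursor: last value seen so far
        while (true) {
            List<Long> page = pager.apply(after, pageSize);
            if (page.isEmpty()) {
                break;
            }
            counts.add(page.size());
            after = page.get(page.size() - 1);
        }
        return counts;
    }

    /** Sums the per-page hit counts. */
    public static long sum(List<Integer> counts) {
        long total = 0;
        for (int c : counts) {
            total += c;
        }
        return total;
    }

    public static void main(String[] args) {
        final long total = 80_000L;
        final int pageSize = 1_000;
        List<Integer> good = hitsPerPage((a, n) -> correctPage(a, n, total), pageSize);
        List<Integer> bad = hitsPerPage((a, n) -> lossyPage(a, n, total, 41_000L), pageSize);
        System.out.println("correct pager: pages=" + good.size() + " total=" + sum(good));
        System.out.println("lossy pager:   pages=" + bad.size() + " total=" + sum(bad));
    }
}
```

In this sketch the lossy pager still returns full pages until the very end, so the per-page counts alone only reveal a short final page. Pinpointing which document went missing would need logging of the sort values at each page boundary, which is the kind of further debug suggested above.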

mayya-sharipova · 2021-09-14T14:44:49Z

@droberts195 Thanks for the detailed investigations and reporting. We will investigate from here.

dnhatn · 2021-09-15T02:44:52Z

There's a bug in Lucene and I've proposed a fix. @mayya-sharipova I think we need to disable the sort optimization until we have a new Lucene snapshot with that fix?

pheyos · 2021-09-15T14:42:40Z

This is also causing failures in Kibana functional tests, which leads to PR delays.
@dnhatn @mayya-sharipova any chance we can get the temporary fix in quickly?

dnhatn · 2021-09-16T03:19:23Z

@pheyos We are working to upgrade Lucene to include the fix: #77801.

dnhatn · 2021-09-16T16:24:47Z

@pheyos I've merged #77801. I think we are good now.


nik9000 added :ml Machine learning >test-failure Triaged test failures from CI labels Sep 13, 2021

elasticmachine added the Team:ML Meta label for the ML team label Sep 13, 2021

nik9000 added a commit that referenced this issue Sep 13, 2021

AwaitsFix ml distributed failure test

36b2fd9

This one fails pretty consistently on my PRs and locally. Not on command or anything, but enough to cause me trouble. Relates #77655

nik9000 added a commit that referenced this issue Sep 13, 2021

AwaitsFix ml distributed failure test

ff17d81

This one fails pretty consistently on my PRs and locally. Not on command or anything, but enough to cause me trouble. Relates #77655

pheyos mentioned this issue Sep 16, 2021

Failing test: X-Pack API Integration Tests.x-pack/test/api_integration/apis/ml/results/get_anomalies_table_data·ts - apis Machine Learning ResultsService GetAnomaliesTableData should fetch anomalies table data elastic/kibana#112417

Closed

mayya-sharipova closed this as completed in c495842 Sep 16, 2021

mayya-sharipova added a commit that referenced this issue Sep 16, 2021

Reenable MlDistributedFailureIT

398e762

Reenable testClusterWithTwoMlNodes_RunsDatafeed_GivenOriginalNodeGoesDown after Lucene 8.10 upgrade Closes #77655
