[BUG] org.opensearch.cluster.coordination.AwarenessAttributeDecommissionIT.testConcurrentDecommissionAction #12197

Closed
harishbhakuni opened this issue Feb 6, 2024 · 6 comments · Fixed by #14372
Labels: bug (Something isn't working), Cluster Manager, flaky-test (Random test failure that succeeds on second run)

@harishbhakuni
Contributor

Describe the bug

Test Case [org.opensearch.cluster.coordination.AwarenessAttributeDecommissionIT.testConcurrentDecommissionAction](https://build.ci.opensearch.org/job/gradle-check/33392/testReport/junit/org.opensearch.cluster.coordination/AwarenessAttributeDecommissionIT/testConcurrentDecommissionAction_4/) is flaky:

java.lang.AssertionError: ClusterHealthResponse has timed out - returned: [{"cluster_name":"TEST-TEST_WORKER_VM=[700]-CLUSTER_SEED=[-4643523745686193170]-HASH=[324040314CA]-cluster","status":"green","timed_out":true,"number_of_nodes":4,"number_of_data_nodes":2,"discovered_master":true,"discovered_cluster_manager":true,"active_primary_shards":0,"active_shards":0,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":0,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":100.0}]
Expected: is <false>
     but: was <true>
	at __randomizedtesting.SeedInfo.seed([8BCFB0523EE66AB:24FF53C38A04342B]:0)
	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
	at org.opensearch.test.hamcrest.OpenSearchAssertions.assertNoTimeout(OpenSearchAssertions.java:121)
	at org.opensearch.test.OpenSearchIntegTestCase.ensureClusterSizeConsistency(OpenSearchIntegTestCase.java:1034)
	at org.opensearch.test.OpenSearchTestClusterRule.afterInternal(OpenSearchTestClusterRule.java:319)
	at org.opensearch.test.OpenSearchTestClusterRule.after(OpenSearchTestClusterRule.java:188)
	at org.opensearch.test.OpenSearchTestClusterRule$1.evaluate(OpenSearchTestClusterRule.java:374)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.junit.rules.RunRules.evaluate(RunRules.java:20)
	at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
	at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
	at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
	at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
	at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
	at org.junit.rules.RunRules.evaluate(RunRules.java:20)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
	at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
	at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
	at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
	at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
	at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
	at org.junit.rules.RunRules.evaluate(RunRules.java:20)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at java.base/java.lang.Thread.run(Thread.java:1583)

Related component

Other

To Reproduce

./gradlew ':server:internalClusterTest' --tests "org.opensearch.cluster.coordination.AwarenessAttributeDecommissionIT.testConcurrentDecommissionAction" -Dtests.seed=8BCFB0523EE66AB -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=es-EC -Dtests.timezone=Etc/GMT0 -Druntime.java=21

Expected behavior

The test should always pass.

@harishbhakuni added the bug and untriaged labels Feb 6, 2024
@github-actions bot added the Other label Feb 6, 2024
@peternied added the Cluster Manager and flaky-test labels and removed the untriaged label Feb 7, 2024
@peternied
Member

[Triage - attendees 1 2 3]
@harishbhakuni Thanks for filing, looking forward to a PR to resolve this.

@peternied
Member

@gauravruhela Over the past 30 days, this test has affected a substantial number of pull requests, including #12464, #12462 (repeated), #12394 (repeated), #12382 (repeated), #12375 (repeated), #12301 (repeated), #12273, #12271 (repeated), #12267 (repeated), #12260 (repeated), #12200, #12193 (repeated), #12163 (repeated), #12151 (repeated), #12133, and #12111.

Please prioritize fixing this test or disabling the test case until it can be fixed.

@rwali-aws

Assigning to @imRishN based on discussion with @gargharsh3134

@reta mentioned this issue Jun 10, 2024
@dblock
Member

dblock commented Jun 13, 2024

@imRishN Are you still looking into this?

andrross added a commit to andrross/OpenSearch that referenced this issue Jun 14, 2024
The problem is that this test would decommission one of six nodes. The
tear down logic of the test would attempt to assert on the health of the
cluster by randomly selecting a node and requesting the cluster health.
If this random check happened to select the node that was
decommissioned, then the test would fail. The fix is to recommission
the node at the end of the test.

Also, the "recommission node and assert cluster health" logic was used
in multiple places and could be refactored out to a helper method.

Resolves opensearch-project#14290
Resolves opensearch-project#12197

Signed-off-by: Andrew Ross <andrross@amazon.com>
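
For readers who want to see the failure mode from the commit message in isolation, here is a minimal sketch. It is a toy model, not OpenSearch code: `ToyCluster`, `Node`, `healthTimedOut`, and `recommissionAndAssertHealthy` are hypothetical names invented for illustration, and the rule "a health request sent to a decommissioned node times out" is a modeling assumption taken from the description above. Rather than picking a node at random, the sketch enumerates every possible pick so the result is deterministic.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy model of the teardown problem; none of these types are OpenSearch classes. */
class DecommissionTeardownSketch {

    /** Hypothetical stand-in for a cluster node. */
    static final class Node {
        final String name;
        boolean decommissioned;

        Node(String name) {
            this.name = name;
        }
    }

    /** Hypothetical stand-in for the test cluster. */
    static final class ToyCluster {
        final List<Node> nodes = new ArrayList<>();

        ToyCluster(int size) {
            for (int i = 0; i < size; i++) {
                nodes.add(new Node("node-" + i));
            }
        }

        void decommission(int index) {
            nodes.get(index).decommissioned = true;
        }

        /** Modeling assumption: a health request to a decommissioned node times out. */
        boolean healthTimedOut(Node node) {
            return node.decommissioned;
        }

        /** Shared helper in the spirit of the refactoring: recommission, then verify health. */
        void recommissionAndAssertHealthy() {
            for (Node node : nodes) {
                node.decommissioned = false;
            }
            for (Node node : nodes) {
                if (healthTimedOut(node)) {
                    throw new AssertionError("health check timed out on " + node.name);
                }
            }
        }
    }

    /** Counts how many of the possible teardown picks would hit a timed-out health check. */
    static int failingPicks(boolean recommissionFirst) {
        ToyCluster cluster = new ToyCluster(6);
        cluster.decommission(0); // the test body decommissions one of six nodes
        if (recommissionFirst) {
            cluster.recommissionAndAssertHealthy(); // the fix: undo it before teardown
        }
        int failing = 0;
        for (Node picked : cluster.nodes) { // teardown may pick any one of these
            if (cluster.healthTimedOut(picked)) {
                failing++;
            }
        }
        return failing;
    }

    public static void main(String[] args) {
        System.out.println("failing picks without recommission: " + failingPicks(false)); // 1
        System.out.println("failing picks with recommission:    " + failingPicks(true));  // 0
    }
}
```

In this model exactly one of the six possible picks fails when the node is left decommissioned, which is consistent with a test that only fails intermittently. Recommissioning before teardown removes the failing pick entirely.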
andrross added a commit to andrross/OpenSearch that referenced this issue Jun 15, 2024
@reta closed this as completed in 0d38d14 Jun 15, 2024
opensearch-trigger-bot bot pushed a commit that referenced this issue Jun 15, 2024
…#14372)

(cherry picked from commit 0d38d14)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
reta pushed a commit that referenced this issue Jun 15, 2024
…#14372) (#14376)

(cherry picked from commit 0d38d14)

Signed-off-by: Andrew Ross <andrross@amazon.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
harshavamsi pushed a commit to harshavamsi/OpenSearch that referenced this issue Jul 12, 2024
…opensearch-project#14372)

kkewwei pushed a commit to kkewwei/OpenSearch that referenced this issue Jul 24, 2024
…opensearch-project#14372) (opensearch-project#14376)

(cherry picked from commit 0d38d14)

Signed-off-by: Andrew Ross <andrross@amazon.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Signed-off-by: kkewwei <kkewwei@163.com>
wdongyu pushed a commit to wdongyu/OpenSearch that referenced this issue Aug 22, 2024
…opensearch-project#14372)
