[BUG] org.opensearch.cluster.coordination.AwarenessAttributeDecommissionIT.testConcurrentDecommissionAction #12197

harishbhakuni · 2024-02-06T20:59:46Z

Describe the bug

Test Case [org.opensearch.cluster.coordination.AwarenessAttributeDecommissionIT.testConcurrentDecommissionAction](https://build.ci.opensearch.org/job/gradle-check/33392/testReport/junit/org.opensearch.cluster.coordination/AwarenessAttributeDecommissionIT/testConcurrentDecommissionAction_4/) is flaky:

java.lang.AssertionError: ClusterHealthResponse has timed out - returned: [{"cluster_name":"TEST-TEST_WORKER_VM=[700]-CLUSTER_SEED=[-4643523745686193170]-HASH=[324040314CA]-cluster","status":"green","timed_out":true,"number_of_nodes":4,"number_of_data_nodes":2,"discovered_master":true,"discovered_cluster_manager":true,"active_primary_shards":0,"active_shards":0,"relocating_shards":0,"initializing_shards":0,"unassigned_shards":0,"delayed_unassigned_shards":0,"number_of_pending_tasks":0,"number_of_in_flight_fetch":0,"task_max_waiting_in_queue_millis":0,"active_shards_percent_as_number":100.0}]
Expected: is <false>
     but: was <true>
at __randomizedtesting.SeedInfo.seed([8BCFB0523EE66AB:24FF53C38A04342B]:0)
	at org.hamcrest.MatcherAssert.assertThat(MatcherAssert.java:18)
	at org.opensearch.test.hamcrest.OpenSearchAssertions.assertNoTimeout(OpenSearchAssertions.java:121)
	at org.opensearch.test.OpenSearchIntegTestCase.ensureClusterSizeConsistency(OpenSearchIntegTestCase.java:1034)
	at org.opensearch.test.OpenSearchTestClusterRule.afterInternal(OpenSearchTestClusterRule.java:319)
	at org.opensearch.test.OpenSearchTestClusterRule.after(OpenSearchTestClusterRule.java:188)
	at org.opensearch.test.OpenSearchTestClusterRule$1.evaluate(OpenSearchTestClusterRule.java:374)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.junit.rules.RunRules.evaluate(RunRules.java:20)
	at org.apache.lucene.tests.util.TestRuleSetupTeardownChained$1.evaluate(TestRuleSetupTeardownChained.java:48)
	at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
	at org.apache.lucene.tests.util.TestRuleThreadAndTestName$1.evaluate(TestRuleThreadAndTestName.java:45)
	at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
	at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
	at org.junit.rules.RunRules.evaluate(RunRules.java:20)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl.forkTimeoutingTask(ThreadLeakControl.java:817)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$3.evaluate(ThreadLeakControl.java:468)
	at com.carrotsearch.randomizedtesting.RandomizedRunner.runSingleTest(RandomizedRunner.java:947)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$5.evaluate(RandomizedRunner.java:832)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$6.evaluate(RandomizedRunner.java:883)
	at com.carrotsearch.randomizedtesting.RandomizedRunner$7.evaluate(RandomizedRunner.java:894)
	at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.tests.util.TestRuleStoreClassName$1.evaluate(TestRuleStoreClassName.java:38)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.NoShadowingOrOverridesOnMethodsRule$1.evaluate(NoShadowingOrOverridesOnMethodsRule.java:40)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at org.apache.lucene.tests.util.TestRuleAssertionsRequired$1.evaluate(TestRuleAssertionsRequired.java:53)
	at org.apache.lucene.tests.util.AbstractBeforeAfterRule$1.evaluate(AbstractBeforeAfterRule.java:43)
	at org.apache.lucene.tests.util.TestRuleMarkFailure$1.evaluate(TestRuleMarkFailure.java:44)
	at org.apache.lucene.tests.util.TestRuleIgnoreAfterMaxFailures$1.evaluate(TestRuleIgnoreAfterMaxFailures.java:60)
	at org.apache.lucene.tests.util.TestRuleIgnoreTestSuites$1.evaluate(TestRuleIgnoreTestSuites.java:47)
	at org.junit.rules.RunRules.evaluate(RunRules.java:20)
	at com.carrotsearch.randomizedtesting.rules.StatementAdapter.evaluate(StatementAdapter.java:36)
	at com.carrotsearch.randomizedtesting.ThreadLeakControl$StatementRunner.run(ThreadLeakControl.java:368)
	at java.base/java.lang.Thread.run(Thread.java:1583)

Related component

Other

To Reproduce

./gradlew ':server:internalClusterTest' --tests "org.opensearch.cluster.coordination.AwarenessAttributeDecommissionIT.testConcurrentDecommissionAction" -Dtests.seed=8BCFB0523EE66AB -Dtests.security.manager=true -Dtests.jvm.argline="-XX:TieredStopAtLevel=1 -XX:ReservedCodeCacheSize=64m" -Dtests.locale=es-EC -Dtests.timezone=Etc/GMT0 -Druntime.java=21

Expected behavior

The test should always pass.

peternied · 2024-02-07T16:15:57Z

[Triage - attendees 1 2 3]
@harishbhakuni Thanks for filing, looking forward to a PR to resolve this.

peternied · 2024-02-08T23:11:28Z

Impacting [BUG] org.opensearch.cluster.coordination.AwarenessAttributeDecommissionIT.testConcurrentDecommissionAction #12197 Logs https://build.ci.opensearch.org/job/gradle-check/33572/testReport/

peternied · 2024-02-29T22:49:36Z

@gauravruhela Over the past 30 days, this test has adversely affected a substantial number of pull requests (PRs), notably including [#12464, #12462 (repeated), #12394 (repeated), #12382 (repeated), #12375 (repeated), #12301 (repeated), #12273, #12271 (repeated), #12267 (repeated), #12260 (repeated), #12200, #12193 (repeated), #12163 (repeated), #12151 (repeated), #12133, and #12111].

Please prioritize fixing this test or disabling the test case until it can be fixed.

rwali-aws · 2024-06-05T10:59:42Z

Assigning to @imRishN based on discussion with @gargharsh3134

dblock · 2024-06-13T15:42:46Z

@imRishN Are you still looking into this?

The problem is that this test would decommission one of six nodes. The tear down logic of the test would attempt to assert on the health of the cluster by randomly selecting a node and requesting the cluster health. If this random check happened to select the node that was decommissioned, then the test would fail.

The fix is to recommission the node at the end of the test. Also, the "recommission node and assert cluster health" logic was used in multiple places and could be refactored out to a helper method.

Resolves opensearch-project#14290
Resolves opensearch-project#12197

Signed-off-by: Andrew Ross <andrross@amazon.com>
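
The failure mode described in this commit message can be illustrated with a small toy model. This is only a sketch under stated assumptions: `ToyCluster`, `recommissionAndAssertHealthy`, and `countTeardownFailures` are hypothetical names invented for illustration, and none of this is the OpenSearch test framework or the code changed in the actual fix. It only models the idea that a health request routed to a decommissioned node times out, and that recommissioning before teardown removes the dependence on which node is picked.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Random;
import java.util.Set;

/** Toy model only: none of these types are OpenSearch classes. */
public class DecommissionTeardownSketch {

    /** Hypothetical stand-in for a test cluster that tracks decommissioned nodes. */
    static final class ToyCluster {
        private final List<String> nodes = new ArrayList<>();
        private final Set<String> decommissioned = new HashSet<>();

        ToyCluster(int size) {
            for (int i = 0; i < size; i++) {
                nodes.add("node-" + i);
            }
        }

        void decommission(String node) {
            decommissioned.add(node);
        }

        void recommissionAll() {
            decommissioned.clear();
        }

        /** Assumption: a health request sent to a decommissioned node times out. */
        boolean healthTimedOut(String node) {
            return decommissioned.contains(node);
        }

        /** Models teardown: ask one randomly chosen node for the cluster health. */
        boolean teardownHealthCheckPasses(Random random) {
            String picked = nodes.get(random.nextInt(nodes.size()));
            return !healthTimedOut(picked);
        }
    }

    /** Hypothetical helper in the spirit of "recommission node and assert cluster health". */
    static void recommissionAndAssertHealthy(ToyCluster cluster) {
        cluster.recommissionAll();
        for (String node : cluster.nodes) {
            if (cluster.healthTimedOut(node)) {
                throw new AssertionError("health request timed out on " + node);
            }
        }
    }

    /** Runs the toy test once per seed and counts how many teardowns fail. */
    static int countTeardownFailures(boolean recommissionFirst, int runs) {
        int failures = 0;
        for (int seed = 0; seed < runs; seed++) {
            ToyCluster cluster = new ToyCluster(6);
            cluster.decommission("node-0");
            if (recommissionFirst) {
                recommissionAndAssertHealthy(cluster);
            }
            if (!cluster.teardownHealthCheckPasses(new Random(seed))) {
                failures++;
            }
        }
        return failures;
    }

    public static void main(String[] args) {
        System.out.println("failures without recommission: " + countTeardownFailures(false, 600));
        System.out.println("failures with recommission:    " + countTeardownFailures(true, 600));
    }
}
```

In this model, the run without recommissioning fails only for the seeds that happen to pick the decommissioned node, which is the intermittent pattern reported here. With the helper called first, the outcome no longer depends on the seed.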

harishbhakuni added the bug and untriaged labels Feb 6, 2024

github-actions bot added the Other label Feb 6, 2024

peternied added the Cluster Manager and flaky-test labels and removed the untriaged label Feb 7, 2024

andrross removed the Other label Feb 21, 2024

rwali-aws unassigned gargharsh3134 Jun 5, 2024

andrross mentioned this issue Jun 15, 2024

Fix AwarenessAttributeDecommissionIT.testConcurrentDecommissionAction #14372

Merged

1 task

reta closed this as completed in #14372 Jun 15, 2024

reta closed this as completed in 0d38d14 Jun 15, 2024
