Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix flaky test cases for DiskThresholdDeciderIT #5952

Merged
merged 1 commit into from
Jan 27, 2023

Conversation

RS146BIJAY
Copy link
Contributor

@RS146BIJAY RS146BIJAY commented Jan 20, 2023

Signed-off-by: Rishav Sagar rissag@amazon.com

Description

Test cases added for guardrail for applying index create block when all nodes are breaching high disk watermark were flaky. Reallocation on index created during integ test run was causing some race conditions scenarios. This PR fixes these test cases by disabling relocation on the indices created during integ tests.

Issues Resolved

#5956

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@RS146BIJAY RS146BIJAY force-pushed the index-block-guardrails branch from 561f670 to a60d924 Compare January 20, 2023 16:25
CHANGELOG.md Outdated Show resolved Hide resolved
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.indices.replication.SegmentReplicationRelocationIT.testMultipleShards
      1 org.opensearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT.testIndexCreateBlockIsRemovedWhenAnyNodesNotExceedHighWatermark

@RS146BIJAY RS146BIJAY force-pushed the index-block-guardrails branch from a60d924 to 62db94a Compare January 20, 2023 18:25
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.indices.replication.SegmentReplicationIT.testDeleteOperations
      1 org.opensearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT.testIndexCreateBlockWhenAllNodesExceededHighWatermark

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT.testIndexCreateBlockWhenAllNodesExceededHighWatermark

@RS146BIJAY RS146BIJAY force-pushed the index-block-guardrails branch 2 times, most recently from 03617e4 to 34e47b6 Compare January 21, 2023 19:44
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@RS146BIJAY RS146BIJAY force-pushed the index-block-guardrails branch from 34e47b6 to 9172666 Compare January 21, 2023 20:16
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT.testIndexCreateBlockIsRemovedWhenAnyNodesNotExceedHighWatermark
      1 org.opensearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT.classMethod

@RS146BIJAY RS146BIJAY force-pushed the index-block-guardrails branch 3 times, most recently from 6fc213c to 0784c8d Compare January 22, 2023 10:04
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.indices.replication.SegmentReplicationIT.testDeleteOperations
      1 org.opensearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT.testIndexCreateBlockWithAReadOnlyBlock
      1 org.opensearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT.classMethod

@RS146BIJAY RS146BIJAY force-pushed the index-block-guardrails branch from 0784c8d to 58f5e15 Compare January 22, 2023 11:18
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT.testIndexCreateBlockWhenAllNodesExceededHighWatermark
      1 org.opensearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT.classMethod

@RS146BIJAY RS146BIJAY force-pushed the index-block-guardrails branch from 58f5e15 to 7a173e4 Compare January 22, 2023 12:57
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@RS146BIJAY RS146BIJAY force-pushed the index-block-guardrails branch 2 times, most recently from f5d79b1 to 55a4ff4 Compare January 22, 2023 13:51
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.indices.replication.SegmentReplicationRelocationIT.testCancellation
      1 org.opensearch.cluster.allocation.AwarenessAllocationIT.testThreeZoneOneReplicaWithForceZoneValueAndLoadAwareness

@RS146BIJAY RS146BIJAY force-pushed the index-block-guardrails branch 2 times, most recently from 4451ab7 to 6cf33f0 Compare January 25, 2023 21:12
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.client.ReindexIT.testDeleteByQueryTask
      1 org.opensearch.action.admin.cluster.node.tasks.ResourceAwareTasksTests.testTaskResourceTrackingDuringTaskCancellation

@RS146BIJAY
Copy link
Contributor Author

Is there any way to make this deterministic? Reducing the refresh frequency may have made the race condition less likely, but has it actually been fixed?

Made few portion of test cases synchronous to make these test cases deterministic. Tested it out with 50 iterations run with no failure.

@RS146BIJAY
Copy link
Contributor Author

@RS146BIJAY Thanks for contributing and fixing these flaky tests. Can you please try rerunning the test class for some more (20+ ideally) iterations? You can use the guidelines here. This will help you iteratively run the tests with the fixes with a single comand.

Ran the DiskThresholdDeciderIt successfully with 50 iterations on local.

@dblock dblock requested a review from andrross January 25, 2023 23:37
Comment on lines 160 to 162
public void invokeListeners(final ClusterInfo clusterInfo) {
listeners.forEach(listener -> listener.accept(clusterInfo));
}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This doesn't quite seem right, architecturally. Being able to manually invoke listeners for an arbitrary ClusterInfo seems like it violates encapsulation. Also the fact that you're maintaining a parallel list of listeners will potentially make this brittle to changes in the behavior of InternalClusterInfoService. Is there no other way to accomplish what you need?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like synchronising refresh function inside InternalClusterInfoService will not be required as it refreshes the ClusterInfo in a blocking fashion. Simply disabling auto refresh (auto refresh was making test cases flaky) of InternalClusterInfoService and explicitly calling refresh after index on node is populated fixed the issue for me (Ignore previous errors as I missed disabling auto refresh on cluster manager node). I have validated test cases with more than 50 iterations on local.

Let me know if you see anything else can cause issue.

@RS146BIJAY RS146BIJAY force-pushed the index-block-guardrails branch from 6cf33f0 to e288d9e Compare January 26, 2023 21:25
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@RS146BIJAY RS146BIJAY force-pushed the index-block-guardrails branch from e288d9e to 01cd9c2 Compare January 27, 2023 10:54
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

@RS146BIJAY RS146BIJAY force-pushed the index-block-guardrails branch 2 times, most recently from 4c6e3c4 to 639b08e Compare January 27, 2023 11:45
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT.testIndexCreateBlockWithAReadOnlyBlock

@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.cluster.routing.allocation.decider.DiskThresholdDeciderIT.testIndexCreateBlockIsRemovedWhenAnyNodesNotExceedHighWatermark

@RS146BIJAY RS146BIJAY force-pushed the index-block-guardrails branch from 639b08e to dc87309 Compare January 27, 2023 16:57
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

  • RESULT: UNSTABLE ❕
  • TEST FAILURES:
      1 org.opensearch.cluster.ClusterHealthIT.testHealthOnClusterManagerFailover

Signed-off-by: Rishav Sagar <rissag@amazon.com>
@RS146BIJAY RS146BIJAY force-pushed the index-block-guardrails branch from dc87309 to a72eea4 Compare January 27, 2023 18:46
@github-actions
Copy link
Contributor

Gradle Check (Jenkins) Run Completed with:

Copy link
Member

@andrross andrross left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for working through this @RS146BIJAY!

@andrross andrross merged commit f3ed0d6 into opensearch-project:main Jan 27, 2023
@andrross andrross added the backport 2.x Backport to 2.x branch label Feb 3, 2023
opensearch-trigger-bot bot pushed a commit that referenced this pull request Feb 3, 2023
Signed-off-by: Rishav Sagar <rissag@amazon.com>
Co-authored-by: Rishav Sagar <rissag@amazon.com>
(cherry picked from commit f3ed0d6)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
andrross pushed a commit that referenced this pull request Feb 3, 2023
(cherry picked from commit f3ed0d6)

Signed-off-by: Rishav Sagar <rissag@amazon.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: Rishav Sagar <rissag@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x Backport to 2.x branch skip-changelog
Projects
None yet
Development

Successfully merging this pull request may close these issues.

7 participants