Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Correctly calculate doc count error at the slice level for concurrent segment search #11732

Merged
merged 1 commit into from
Jan 11, 2024

Conversation

jed326
Copy link
Collaborator

@jed326 jed326 commented Jan 3, 2024

Description

This PR fixes how doc count error is calculated for terms aggregations for conurrent segment search and sets the slice_size heuristic introduced in #11585 to be equal to the shard_size.

The crux of this change is the introduction of hasSliceLevelDocCountError, to indicate to the coordinator whether or not the slice_size is the reason for why the error exists. This affects how the top level aggregation error is then calculated in the single shard (and single slice) scenarios.

Related Issues

Resolves #11680
Resolves #11702

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

@github-actions github-actions bot added enhancement Enhancement or improvement to existing feature or request Search:Performance labels Jan 3, 2024
Copy link
Contributor

github-actions bot commented Jan 3, 2024

❌ Gradle check result for 1f31849: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@jed326 jed326 force-pushed the concurrent-doc-count-error branch from 1f31849 to 4b4aacd Compare January 3, 2024 23:16
Copy link
Contributor

github-actions bot commented Jan 4, 2024

❌ Gradle check result for 4b4aacd: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@jed326 jed326 force-pushed the concurrent-doc-count-error branch 2 times, most recently from 17873d6 to 1500ef2 Compare January 9, 2024 00:15
@github-actions github-actions bot added bug Something isn't working flaky-test Random test failure that succeeds on second run Search:Aggregations labels Jan 9, 2024
@jed326 jed326 force-pushed the concurrent-doc-count-error branch from 1500ef2 to eda7d13 Compare January 9, 2024 00:19
@jed326
Copy link
Collaborator Author

jed326 commented Jan 9, 2024

@sohami @reta could you please help review? Thanks!

@jed326 jed326 force-pushed the concurrent-doc-count-error branch from abbc40f to 7e64bf8 Compare January 10, 2024 23:09
Copy link
Contributor

❌ Gradle check result for 7e64bf8: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@jed326
Copy link
Collaborator Author

jed326 commented Jan 10, 2024

@sohami @reta I updated the PR and also added some write ups related to the big issues we're seeing here. Both are referenced by code comments as well. See:

Taking a small step back, I do think the way that these terms aggregators are currently written is quite fragile and each piece of information seems tacked on to the previous. doc_count_error in particular looks like it was implemented as an afterthought and I do think there is a lot of room to simplify the code here. I will try to open a separate issue with some of those thoughts later.

Copy link
Contributor

✅ Gradle check result for 662ab41: SUCCESS

@jed326
Copy link
Collaborator Author

jed326 commented Jan 11, 2024

Created a new docs issue to add a lot of the details discussed here to the opensearch docs. The terms and sig terms docs are very lacking currently.

@jed326 jed326 force-pushed the concurrent-doc-count-error branch from 662ab41 to 7571351 Compare January 11, 2024 02:03
Copy link
Contributor

❌ Gradle check result for 7571351: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@jed326
Copy link
Collaborator Author

jed326 commented Jan 11, 2024

❌ Gradle check result for 7571351: FAILURE

org.opensearch.search.aggregations.bucket.TermsDocCountErrorIT.testDoubleValueFieldSingleShard {p0={"search.concurrent_segment_search.enabled":"true"}}
org.opensearch.search.aggregations.bucket.TermsDocCountErrorIT.testStringValueFieldSingleShard {p0={"search.concurrent_segment_search.enabled":"true"}}
org.opensearch.search.aggregations.bucket.TermsDocCountErrorIT.testLongValueFieldSingleShard {p0={"search.concurrent_segment_search.enabled":"true"}}

Looks like there are a few more tests I need to fix actually. I'm going to just force merge all of the shards in this test case to 1 segment since the purpose of this test class is to run assertions on the doc count error and we know that there will be a difference in doc count error when we use concurrent search. Will add a comment explaining such.

… segment search. Change slice_size heuristic to be equal to shard_size.

Signed-off-by: Jay Deng <jayd0104@gmail.com>
@jed326 jed326 force-pushed the concurrent-doc-count-error branch from 7571351 to 71a7472 Compare January 11, 2024 03:04
Copy link
Contributor

❌ Gradle check result for 71a7472: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@sohami sohami added the backport 2.x Backport to 2.x branch label Jan 11, 2024
Copy link
Contributor

❕ Gradle check result for 71a7472: UNSTABLE

  • TEST FAILURES:
      1 org.opensearch.remotestore.RemoteIndexPrimaryRelocationIT.testPrimaryRelocationWhileIndexing

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

@sohami sohami merged commit b042688 into opensearch-project:main Jan 11, 2024
32 of 33 checks passed
opensearch-trigger-bot bot pushed a commit that referenced this pull request Jan 11, 2024
… segment search. Change slice_size heuristic to be equal to shard_size. (#11732)

Signed-off-by: Jay Deng <jayd0104@gmail.com>
(cherry picked from commit b042688)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
@reta reta added v3.0.0 Issues and PRs related to version 3.0.0 v2.12.0 Issues and PRs related to version 2.12.0 labels Jan 11, 2024
reta pushed a commit that referenced this pull request Jan 11, 2024
… segment search. Change slice_size heuristic to be equal to shard_size. (#11732) (#11859)

(cherry picked from commit b042688)

Signed-off-by: Jay Deng <jayd0104@gmail.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
@jed326 jed326 deleted the concurrent-doc-count-error branch March 12, 2024 22:13
rayshrey pushed a commit to rayshrey/OpenSearch that referenced this pull request Mar 18, 2024
… segment search. Change slice_size heuristic to be equal to shard_size. (opensearch-project#11732)

Signed-off-by: Jay Deng <jayd0104@gmail.com>
shiv0408 pushed a commit to Gaurav614/OpenSearch that referenced this pull request Apr 25, 2024
… segment search. Change slice_size heuristic to be equal to shard_size. (opensearch-project#11732)

Signed-off-by: Jay Deng <jayd0104@gmail.com>
Signed-off-by: Shivansh Arora <hishiv@amazon.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
backport 2.x Backport to 2.x branch bug Something isn't working enhancement Enhancement or improvement to existing feature or request flaky-test Random test failure that succeeds on second run Search:Aggregations Search:Performance v2.12.0 Issues and PRs related to version 2.12.0 v3.0.0 Issues and PRs related to version 3.0.0
Projects
Status: No status
3 participants