-
Notifications
You must be signed in to change notification settings - Fork 2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Star Tree] [Search] Keyword & Numeric Terms Aggregation #17165
base: main
Are you sure you want to change the base?
Conversation
@msfroh @bharath-techie Rebased from main again since #17239 is merged. |
...ava/org/opensearch/search/aggregations/bucket/terms/GlobalOrdinalsStringTermsAggregator.java
Outdated
Show resolved
Hide resolved
...ava/org/opensearch/search/aggregations/bucket/terms/GlobalOrdinalsStringTermsAggregator.java
Outdated
Show resolved
Hide resolved
...ava/org/opensearch/search/aggregations/bucket/terms/GlobalOrdinalsStringTermsAggregator.java
Show resolved
Hide resolved
...ava/org/opensearch/search/aggregations/bucket/terms/GlobalOrdinalsStringTermsAggregator.java
Outdated
Show resolved
Hide resolved
) throws IOException { | ||
assert parent == null; | ||
StarTreeValues starTreeValues = StarTreeQueryHelper.getStarTreeValues(ctx, starTree); | ||
return new StarTreeBucketCollector( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is pretty similar to the one in GlobalOrdinalsStringTermsAggregator
. Is there opportunity to pull the common logic into an abstract base class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Talking in terms of Date Histogram, Keyword Terms, Numeric Terms, Range Aggregations collectively.
Actually it was possible to refactor it to a common utility. However, there were subtle differences in the aggregators on which it was implemented:
valuesIterator
being instanceofSortedSetStarTreeValuesIterator
orSortedNumericStarTreeValuesIterator
in different aggregators. This can be hidden behind a common abstract class but still isn't very neat with later things.- Different valuesIterators required different handling - date being passed to rounding utilities, for range being searched in different ranges, keyword values being passed to globalOperators, numeric terms processed with correct data type conversions (as per collectionStrategy used.)
- Bucket collection strategy differing in range aggregation
Based in the differences in all the aggregations, it will be possible to abstract it out by using multiple generics and consumers (or relevant biconsumers, triconusmers) - however, it will make the code highly unreadable given the code logic is similar but not same.
Instead, I'd abstract out certain small pieces, like getDocCountIterator()
, updateBucket()
utilities instead. Thoughts?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Abstracted out common utilities. Code looks cleaner now - check it out!
// don't defer when StarTreeContext is set, don't defer when collectMode == SubAggCollectionMode.BREADTH_FIRST | ||
// this boolean condition can be further simplified but affects readability. | ||
return (context.getQueryShardContext().getStarTreeQueryContext() == null || collectMode != SubAggCollectionMode.BREADTH_FIRST) | ||
&& collectMode == SubAggCollectionMode.BREADTH_FIRST | ||
&& !aggsUsedForSorting.contains(aggregator); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could this be:
if (context.getQueryShardContext().getStarTreeQueryContext() == null) {
return false;
} else {
return collectMode == SubAggCollectionMode.BREADTH_FIRST && !aggsUsedForSorting.contains(aggregator);
}
That is, make it even more complicated for readability. 😁
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So basically we do not want to set up deferred collector when:
(context.getQueryShardContext().getStarTreeQueryContext() != null
&& collectMode == SubAggCollectionMode.BREADTH_FIRST
).
So the queries we were resolving via star-tree resolved to BREADTH_FIRST
mode only, but just to tighten up the criteria and not accidentally resolve DEPTH_FIRST
mode, I just negated the above criteria leading to (context.getQueryShardContext().getStarTreeQueryContext() == null || collectMode != SubAggCollectionMode.BREADTH_FIRST)
.
This can be an alternative to keep the logic intact:
if (context.getQueryShardContext().getStarTreeQueryContext() == null) {
// this will be the non-tree criteria
return collectMode == SubAggCollectionMode.BREADTH_FIRST && !aggsUsedForSorting.contains(aggregator);
} else {
// this will be star-tree criteria - return false (don't defer) for BREADTH_FIRST
return collectMode != SubAggCollectionMode.BREADTH_FIRST;
}
@sandeshkr419 can we revisit #17165 (comment) |
❌ Gradle check result for c8cf571: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Signed-off-by: Sandesh Kumar <sandeshkr419@gmail.com>
Signed-off-by: Sandesh Kumar <sandeshkr419@gmail.com>
Signed-off-by: Sandesh Kumar <sandeshkr419@gmail.com>
Signed-off-by: Sandesh Kumar <sandeshkr419@gmail.com>
Signed-off-by: Sandesh Kumar <sandeshkr419@gmail.com>
Signed-off-by: Sandesh Kumar <sandeshkr419@gmail.com>
Signed-off-by: Sandesh Kumar <sandeshkr419@gmail.com>
Following up on #17165 (comment) @msfroh @GBH I have added support for keyword/numeric terms aggregation without any sub-aggregations as well. Keyword Aggregations(non) Low Cardinality CaseTo validate the performance, I indexed Based on this experimentation, I have moved up the star-tree pre-computation on higher priority than term frequency optimization. If the high cardinality field is at top in the ordered dimensions, star-tree will perform better than the current numbers, but as part of this experiment I wanted to see the worse performing scenario, which still outperforms term frequency optimization. Priority for High Cardinality Cases:Star Tree Optimization > Terms Enum Optimization > Default LeafCollector/collect() based aggregation Low Cardinality CaseTo validate the performance, I indexed For aggregations with any terms queries, the term frequency optimization does not kicks in and so star-tree performs better. Priority for Low Cardinality Cases:Terms Enum Optimization > Star Tree Optimization > Default LeafCollector/collect() based aggregation Numeric AggregationsThis was a simple 10-12x improvement when using star-tree(~65ms) compared to default flow (~450ms), and since there was no other optimization in place, I simply added this case handling. Benchmarking SetupIndexed The performance numbers in comparison are of 99th percentile service time. Note: Terms Enum Optimization talked about in keyword terms is only applicable for cases when the query is a match all with no deleted documents in segments. |
❌ Gradle check result for 7a0a94a: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
❕ Gradle check result for 7a0a94a: UNSTABLE Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure. |
Signed-off-by: Sandesh Kumar <sandeshkr419@gmail.com>
❌ Gradle check result for f5509c5: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
❌ Gradle check result for f5509c5: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
❌ Gradle check result for f5509c5: FAILURE Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change? |
Description
Fix for timestamp field to be fetched from request/valueSource instead of hard-coded value(this change is already merged Fix date hardcoding in date aggregator #17239)Related Issues
Resolves #16551
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.