[Star Tree] [Search] Keyword & Numeric Terms Aggregation #17165

sandeshkr419 · 2025-01-28T19:06:47Z

Description

Adds star-tree support for keyword and numeric terms aggregations.
Terms aggregations are supported with or without metric sub-aggregations.
Terms aggregations work with the queries that were already supported for filtering (this PR does not change which queries can be added to a request).
~~Fix for timestamp field to be fetched from request/valueSource instead of hard-coded value~~ (this change is already merged in "Fix date hardcoding in date aggregator" #17239)

Related Issues

Resolves #16551

Check List

Functionality includes testing.
API changes companion pull request created, if applicable.
Public documentation issue/PR created, if applicable.

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

sandeshkr419 · 2025-02-04T20:52:36Z

@msfroh @bharath-techie Rebased from main again since #17239 is merged.

...ava/org/opensearch/search/aggregations/bucket/terms/GlobalOrdinalsStringTermsAggregator.java

msfroh · 2025-02-07T00:41:41Z

...er/src/main/java/org/opensearch/search/aggregations/bucket/terms/NumericTermsAggregator.java

+    ) throws IOException {
+        assert parent == null;
+        StarTreeValues starTreeValues = StarTreeQueryHelper.getStarTreeValues(ctx, starTree);
+        return new StarTreeBucketCollector(


This is pretty similar to the one in GlobalOrdinalsStringTermsAggregator. Is there opportunity to pull the common logic into an abstract base class?

Speaking of the Date Histogram, Keyword Terms, Numeric Terms, and Range aggregations collectively:

It would be possible to refactor this into a common utility. However, there are subtle differences between the aggregators that implement it:

valuesIterator is an instanceof SortedSetStarTreeValuesIterator in some aggregators and SortedNumericStarTreeValuesIterator in others. This can be hidden behind a common abstract class, but that still does not fit neatly with the points below.

Different valuesIterators require different handling: date values are passed to rounding utilities, range values are searched against the different ranges, keyword values are passed to globalOperators, and numeric terms are processed with the correct data type conversions (as per the collectionStrategy used).

The bucket collection strategy differs in the range aggregation.

Given the differences across all these aggregations, abstracting the logic out would require multiple generics and consumers (or the relevant biconsumers and tri-consumers). That would make the code highly unreadable, since the logic is similar but not the same.

Instead, I'd abstract out certain small pieces, such as getDocCountIterator() and updateBucket() utilities. Thoughts?

Abstracted out common utilities. Code looks cleaner now - check it out!
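To illustrate the "small shared pieces" approach discussed above, here is a minimal sketch. It is not the code from this PR: the class, the method signatures, and the use of a plain map for buckets are all assumptions made for illustration, and only the name updateBucket() is borrowed from the comment. The point is that the bucket update can be shared while each aggregator keeps its own value handling.

```java
import java.util.Map;
import java.util.TreeMap;
import java.util.function.LongUnaryOperator;

// Hypothetical illustration only; none of these signatures come from OpenSearch.
public class SharedBucketHelpers {

    // Shared helper: add a pre-aggregated doc count to one bucket.
    static void updateBucket(Map<Long, Long> buckets, long bucketKey, long docCount) {
        buckets.merge(bucketKey, docCount, Long::sum);
    }

    // Keyword-style path: segment ordinals are mapped through an operator
    // (standing in for a global ordinal mapping) before bucketing.
    static Map<Long, Long> collectKeyword(long[] segmentOrds, long[] docCounts, LongUnaryOperator globalOperator) {
        Map<Long, Long> buckets = new TreeMap<>();
        for (int i = 0; i < segmentOrds.length; i++) {
            updateBucket(buckets, globalOperator.applyAsLong(segmentOrds[i]), docCounts[i]);
        }
        return buckets;
    }

    // Numeric-style path: values are used directly as bucket keys here;
    // a real aggregator would apply its own data type conversion first.
    static Map<Long, Long> collectNumeric(long[] values, long[] docCounts) {
        Map<Long, Long> buckets = new TreeMap<>();
        for (int i = 0; i < values.length; i++) {
            updateBucket(buckets, values[i], docCounts[i]);
        }
        return buckets;
    }

    public static void main(String[] args) {
        System.out.println(collectKeyword(new long[] {0, 1, 0}, new long[] {5, 3, 2}, ord -> ord + 10));
        System.out.println(collectNumeric(new long[] {200, 404, 200}, new long[] {4, 1, 6}));
    }
}
```

Each path keeps its own iteration and conversion logic, so no generics or consumer plumbing is needed to share the common step.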

msfroh · 2025-02-07T00:45:00Z

server/src/main/java/org/opensearch/search/aggregations/bucket/terms/TermsAggregator.java

+        // don't defer when StarTreeContext is set, don't defer when collectMode == SubAggCollectionMode.BREADTH_FIRST
+        // this boolean condition can be further simplified but affects readability.
+        return (context.getQueryShardContext().getStarTreeQueryContext() == null || collectMode != SubAggCollectionMode.BREADTH_FIRST)
+            && collectMode == SubAggCollectionMode.BREADTH_FIRST
+            && !aggsUsedForSorting.contains(aggregator);


Could this be:

    if (context.getQueryShardContext().getStarTreeQueryContext() == null) {
        return false;
    } else {
        return collectMode == SubAggCollectionMode.BREADTH_FIRST && !aggsUsedForSorting.contains(aggregator);
    }

That is, make it even more complicated for readability. 😁

So basically we do not want to set up deferred collector when:
(context.getQueryShardContext().getStarTreeQueryContext() != null && collectMode == SubAggCollectionMode.BREADTH_FIRST).

So the queries we were resolving via star-tree resolved to BREADTH_FIRST mode only, but just to tighten up the criteria and not accidentally resolve DEPTH_FIRST mode, I just negated the above criteria leading to (context.getQueryShardContext().getStarTreeQueryContext() == null || collectMode != SubAggCollectionMode.BREADTH_FIRST).

This can be an alternative to keep the logic intact:

    if (context.getQueryShardContext().getStarTreeQueryContext() == null) {
        // this will be the non-tree criteria
        return collectMode == SubAggCollectionMode.BREADTH_FIRST && !aggsUsedForSorting.contains(aggregator);
    } else {
        // this will be star-tree criteria - return false (don't defer) for BREADTH_FIRST
        return collectMode != SubAggCollectionMode.BREADTH_FIRST;
    }
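As a quick check on how far the condition in the diff can be simplified, the sketch below enumerates every input combination. It is a standalone model, not OpenSearch code: the star-tree query context is reduced to a boolean (starTreeContextIsNull), aggsUsedForSorting.contains(aggregator) is reduced to usedForSorting, and the enum is a local stand-in for SubAggCollectionMode.

```java
// Standalone model of the deferral condition; all names here are stand-ins.
public class DeferCheck {

    enum SubAggCollectionMode { DEPTH_FIRST, BREADTH_FIRST }

    // The condition as written in the diff above.
    static boolean asWritten(boolean starTreeContextIsNull, SubAggCollectionMode collectMode, boolean usedForSorting) {
        return (starTreeContextIsNull || collectMode != SubAggCollectionMode.BREADTH_FIRST)
            && collectMode == SubAggCollectionMode.BREADTH_FIRST
            && !usedForSorting;
    }

    // Simplified form: defer only without a star-tree context, in BREADTH_FIRST mode,
    // and when the aggregator is not used for sorting.
    static boolean simplified(boolean starTreeContextIsNull, SubAggCollectionMode collectMode, boolean usedForSorting) {
        return starTreeContextIsNull
            && collectMode == SubAggCollectionMode.BREADTH_FIRST
            && !usedForSorting;
    }

    // Compare both forms over all 2 x 2 x 2 input combinations.
    static boolean equivalentForAllInputs() {
        boolean[] flags = { true, false };
        for (boolean ctxIsNull : flags) {
            for (SubAggCollectionMode mode : SubAggCollectionMode.values()) {
                for (boolean sorted : flags) {
                    if (asWritten(ctxIsNull, mode, sorted) != simplified(ctxIsNull, mode, sorted)) {
                        return false;
                    }
                }
            }
        }
        return true;
    }

    public static void main(String[] args) {
        System.out.println(equivalentForAllInputs()); // prints "true"
    }
}
```

Under this model, the written condition never defers when a star-tree context is present, regardless of collect mode.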

bharath-techie · 2025-02-08T06:29:01Z

@sandeshkr419 can we revisit #17165 (comment) and #17165 (comment)?
I think we can enable numeric terms count even without benchmarks, since there are no existing optimizations for it; we can maybe try global aggs term count after benchmarks.

github-actions · 2025-02-12T05:06:03Z

❌ Gradle check result for c8cf571: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

Signed-off-by: Sandesh Kumar <sandeshkr419@gmail.com>

sandeshkr419 · 2025-02-17T22:01:51Z

Following up on #17165 (comment)

@msfroh @GBH I have added support for keyword/numeric terms aggregation without any sub-aggregations as well.

Keyword Aggregations

High (Non-Low) Cardinality Case

To validate performance, I indexed clientip as a keyword field in the last position of the ordered dimensions, deliberately forcing worse star-tree performance. My hunch was that the term frequency optimization in keyword terms would always outperform star-tree traversal, but benchmarking and profiling showed the opposite. The terms dictionary lookup was not faster than star-tree traversal: star-tree (~150 ms) was 2x faster than the term frequency optimization (~300 ms).
The non-optimized default flow was the slowest, as expected (~400 ms).

Based on this experiment, I have given star-tree pre-computation higher priority than the term frequency optimization. If the high-cardinality field is first in the ordered dimensions, star-tree will perform better than these numbers. For this experiment I wanted to measure the worse-performing scenario, which still outperforms the term frequency optimization.

Priority for High Cardinality Cases:

Star Tree Optimization > Terms Enum Optimization > Default LeafCollector/collect() based aggregation

Low Cardinality Case

To validate performance, I indexed status as a keyword field in the first position of the ordered dimensions to get the best performance out of star-tree. Surprisingly, the term frequency optimization (~15 ms) outperformed star-tree's best case (~60 ms) here.

For aggregations with any terms queries, the term frequency optimization does not kick in, so star-tree performs better.

Priority for Low Cardinality Cases:

Terms Enum Optimization > Star Tree Optimization > Default LeafCollector/collect() based aggregation
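The two priority orders above can be sketched as a single selection function. This is a minimal illustration, not the PR's implementation: the enum, the method, and the three boolean inputs are hypothetical, and how a field is classified as low cardinality is not specified in this comment, so it is passed in as a flag.

```java
// Hypothetical sketch of the priority orders described above.
public class PrecomputeOrder {

    enum Strategy { STAR_TREE, TERMS_ENUM, DEFAULT_COLLECT }

    // starTreeApplicable: the request can be resolved via star-tree.
    // termsEnumApplicable: the terms enum (term frequency) optimization can kick in.
    // lowCardinality: the field is treated as low cardinality (criterion assumed, not from the source).
    static Strategy choose(boolean starTreeApplicable, boolean termsEnumApplicable, boolean lowCardinality) {
        if (lowCardinality) {
            // Low cardinality: Terms Enum > Star Tree > Default
            if (termsEnumApplicable) return Strategy.TERMS_ENUM;
            if (starTreeApplicable) return Strategy.STAR_TREE;
        } else {
            // High cardinality: Star Tree > Terms Enum > Default
            if (starTreeApplicable) return Strategy.STAR_TREE;
            if (termsEnumApplicable) return Strategy.TERMS_ENUM;
        }
        return Strategy.DEFAULT_COLLECT;
    }

    public static void main(String[] args) {
        System.out.println(choose(true, true, false));
        System.out.println(choose(true, true, true));
        System.out.println(choose(true, false, true));
    }
}
```

When the terms enum optimization is not applicable, for example because the request carries a query, the low-cardinality branch falls through to star-tree, matching the observation above.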

Numeric Aggregations

This was a straightforward 10-12x improvement with star-tree (~65 ms) compared to the default flow (~450 ms). Since no other optimization was in place, I simply added handling for this case.

Benchmarking Setup

Indexed http_logs on a Mac (M1 Pro chip, 32 GB RAM) with an 8 GB configured heap.
Ordered dimensions in star-tree: status (numeric), size (numeric), clientip (keyword), @timestamp (date)

The performance numbers compared are 99th percentile service times.

Note: The Terms Enum Optimization discussed for keyword terms applies only when the query is a match-all and the segments have no deleted documents.

github-actions · 2025-02-17T22:36:03Z

❌ Gradle check result for 7a0a94a: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2025-02-18T00:07:22Z

❕ Gradle check result for 7a0a94a: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

Signed-off-by: Sandesh Kumar <sandeshkr419@gmail.com>

github-actions · 2025-02-18T20:47:39Z

❌ Gradle check result for f5509c5: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2025-02-18T21:40:45Z

❌ Gradle check result for f5509c5: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions · 2025-02-18T23:00:41Z

❌ Gradle check result for f5509c5: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

github-actions bot added enhancement Enhancement or improvement to existing feature or request Search:Aggregations labels Jan 28, 2025

sandeshkr419 added backport 2.x Backport to 2.x branch v2.19.0 Issues and PRs related to version 2.19.0 and removed enhancement Enhancement or improvement to existing feature or request Search:Aggregations labels Jan 28, 2025

opensearch-ci-bot mentioned this pull request Feb 5, 2025

[AUTOCUT] Gradle Check Flaky Test Report for MinimumClusterManagerNodesIT #14289

Open

msfroh reviewed Feb 7, 2025

opensearch-ci-bot mentioned this pull request Feb 7, 2025

[AUTOCUT] Gradle Check Flaky Test Report for ShuffleForcedMergePolicyTests #17294

Open

sandeshkr419 force-pushed the k1 branch from e1bebaa to c8cf571 Compare February 12, 2025 04:29

opensearch-ci-bot mentioned this pull request Feb 11, 2025

[AUTOCUT] Gradle Check Flaky Test Report for MetadataCreateIndexServiceTests #17291

Open

sandeshkr419 added 7 commits February 17, 2025 13:44

keyword, numeric terms aggregation

4e9c30a

Signed-off-by: Sandesh Kumar <sandeshkr419@gmail.com>

fix data field parsing; use advanceExact()

af8659a

Signed-off-by: Sandesh Kumar <sandeshkr419@gmail.com>

apply aggreagtion pre-computation unifying changes

bdeac1e

Signed-off-by: Sandesh Kumar <sandeshkr419@gmail.com>

append only index fix in test

a4c8822

Signed-off-by: Sandesh Kumar <sandeshkr419@gmail.com>

code refactoring

6f29706

Signed-off-by: Sandesh Kumar <sandeshkr419@gmail.com>

sub-aggs empty case

e5243f5

Signed-off-by: Sandesh Kumar <sandeshkr419@gmail.com>

include keyword/numeric terms without any metric aggregations as well

7a0a94a

Signed-off-by: Sandesh Kumar <sandeshkr419@gmail.com>

sandeshkr419 force-pushed the k1 branch from c8cf571 to 7a0a94a Compare February 17, 2025 21:44

sandeshkr419 closed this Feb 17, 2025

sandeshkr419 reopened this Feb 17, 2025

opensearch-ci-bot mentioned this pull request Feb 18, 2025

[AUTOCUT] Gradle Check Flaky Test Report for DedicatedClusterSnapshotRestoreIT #15806

Open

add low cardinality case for star tree precompute

f5509c5

Signed-off-by: Sandesh Kumar <sandeshkr419@gmail.com>

sandeshkr419 closed this Feb 18, 2025

sandeshkr419 reopened this Feb 18, 2025

sandeshkr419 closed this Feb 18, 2025

sandeshkr419 reopened this Feb 18, 2025
