
Optimize 2 keyword multi-terms aggregation #13929

Draft · wants to merge 1 commit into base: main

Conversation

@sandeshkr419 (Contributor) commented Jun 3, 2024

Description

Optimize the multi-terms aggregation for the case when:

  1. The aggregation is over exactly 2 keyword fields
  2. There are no deleted documents in the segment (code checks to be added)
  3. There is no _doc_count field (code checks to be added)
  4. [needs testing] whether queries with a filter are supported with this optimization

The optimization changes how buckets are collected for a segment. In the current code, for every document we read the values of the aggregated fields, compute the composite key, and then update the bucket count. The optimization instead reads the postings enums directly: each composite key is built only once per pair of terms, and the document count for that bucket is obtained by intersecting the two postings lists.
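A minimal sketch of the idea, using Lucene's Terms/TermsEnum/PostingsEnum APIs. The guard conditions, field names, and the intersect/collectBucket helpers are illustrative assumptions, not the PR's actual code:

```java
import java.io.IOException;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

class TwoKeywordMultiTermsSketch {

    // Hypothetical per-segment collection for a 2-keyword multi_terms aggregation.
    void collectSegment(LeafReader reader) throws IOException {
        // Preconditions from the description: no deleted docs and (assumed check)
        // no _doc_count field; otherwise fall back to the per-document collector.
        if (reader.getLiveDocs() != null || reader.terms("_doc_count") != null) {
            return;
        }
        Terms termsA = reader.terms("process.name"); // example field names
        Terms termsB = reader.terms("cloud.region");
        if (termsA == null || termsB == null) {
            return;
        }
        TermsEnum teA = termsA.iterator();
        for (BytesRef a = teA.next(); a != null; a = teA.next()) {
            TermsEnum teB = termsB.iterator();
            for (BytesRef b = teB.next(); b != null; b = teB.next()) {
                // Fresh postings enums per pair: the intersection exhausts them.
                PostingsEnum p1 = teA.postings(null, PostingsEnum.NONE);
                PostingsEnum p2 = teB.postings(null, PostingsEnum.NONE);
                long count = intersect(p1, p2);
                if (count > 0) {
                    // Composite key is built once per term pair, not per document.
                    collectBucket(a, b, count);
                }
            }
        }
    }

    // Leapfrog intersection; same shape as the loop quoted in the review
    // thread below, with both enums advanced to their first doc up front.
    long intersect(PostingsEnum p1, PostingsEnum p2) throws IOException {
        long count = 0;
        int d1 = p1.nextDoc();
        int d2 = p2.nextDoc();
        while (d1 != PostingsEnum.NO_MORE_DOCS && d2 != PostingsEnum.NO_MORE_DOCS) {
            if (d1 == d2) {
                count++;
                d1 = p1.nextDoc();
                d2 = p2.nextDoc();
            } else if (d1 < d2) {
                d1 = p1.advance(d2);
            } else {
                d2 = p2.advance(d1);
            }
        }
        return count;
    }

    void collectBucket(BytesRef a, BytesRef b, long count) {
        // Placeholder: hand (a, b, count) to the aggregation framework.
    }
}
```

Each (termA, termB) pair costs one postings intersection, so the work grows with the product of the two cardinalities; the benchmarks below show exactly that trade-off.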

Related Issues

Resolves #13120

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • API changes companion pull request created.
  • Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

github-actions bot added the Search:Aggregations, Search:Performance, and v2.15.0 labels on Jun 3, 2024
github-actions bot commented Jun 3, 2024

❌ Gradle check result for bbd49c6: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@sandeshkr419 (Contributor, Author) commented Jun 12, 2024

For a POC, I ran the query below against the big5 workload and saw a ~50% reduction in service time.

Multi-term query used:
curl -X POST "$dom/big5/_search?pretty" -H 'Content-Type: application/json' -d '{
  "size": 0,
  "aggs": {
    "important_terms": {
      "multi_terms": {
        "terms": [
          { "field": "process.name" },
          { "field": "cloud.region" }
        ]
      }
    }
  }
}'

Benchmark with big5 workload:

Total docs in index: 116,000,000 (1.16×10^8)

| Field Name | Field Cardinality | Cardinality as fraction of total docs |
| --- | --- | --- |
| process.name | 7 | 0 |
| agent.ephemeral_id | 10 | 0 |
| agent.id | 10 | 0 |
| cloud.region | 25 | 0 |
| log.file.path | 26571 | 0.00023 |
| agent.name | 26832 | 0.00023 |
| host.name | 26832 | 0.00023 |
| aws.cloudwatch.log_stream | 26836 | 0.00023 |
| event.id | 180250 | 0.00155 |
| aws.cloudwatch.ingestion_time | 2917359 | 0.02515 |
| event.ingested | 6768110 | 0.05835 |
| field_i | field_j | time taken with change | time taken in old code |
| --- | --- | --- | --- |
| process.name | agent.ephemeral_id | 17943 | 54133 |
| process.name | agent.id | 17914 | 51424 |
| process.name | cloud.region | 24313 | 47968 |
| process.name | log.file.path | 148481 | 150401 |
| process.name | agent.name | 148247 | 116546 |
| process.name | host.name | 148298 | 118020 |
| process.name | aws.cloudwatch.log_stream | 147485 | 118093 |
| process.name | event.id | 661574 | 145909 |
| process.name | aws.cloudwatch.ingestion_time | null | null |
| process.name | event.ingested | null | null |
| agent.ephemeral_id | process.name | 18112 | 51884 |
| agent.ephemeral_id | agent.id | 19763 | 54124 |
| agent.ephemeral_id | cloud.region | 30548 | 58646 |
| agent.ephemeral_id | log.file.path | 175867 | 156336 |
| agent.ephemeral_id | agent.name | 175747 | 125895 |
| agent.ephemeral_id | host.name | 175711 | 125865 |
| agent.ephemeral_id | aws.cloudwatch.log_stream | 175695 | 125361 |
| agent.ephemeral_id | event.id | 872176 | 156505 |
| agent.ephemeral_id | aws.cloudwatch.ingestion_time | null | null |
| agent.ephemeral_id | event.ingested | null | null |
| agent.id | process.name | 18098 | 51229 |
| agent.id | agent.ephemeral_id | 19793 | 53031 |
| agent.id | cloud.region | 30561 | 57829 |
| agent.id | log.file.path | 176061 | 153235 |
| agent.id | agent.name | 175966 | 125073 |
| agent.id | host.name | 175608 | 125487 |
| agent.id | aws.cloudwatch.log_stream | 176056 | 125651 |
| agent.id | event.id | 877700 | 155753 |
| agent.id | aws.cloudwatch.ingestion_time | null | null |
| agent.id | event.ingested | null | null |
| cloud.region | process.name | 24713 | 51343 |
| cloud.region | agent.ephemeral_id | 31079 | 57751 |
| cloud.region | agent.id | 31080 | 57976 |
| cloud.region | log.file.path | 254592 | 143288 |
| cloud.region | agent.name | 254184 | 116291 |
| cloud.region | host.name | 255058 | 116715 |
| cloud.region | aws.cloudwatch.log_stream | 254723 | 117257 |
| cloud.region | event.id | 1897213 | 161301 |
| cloud.region | aws.cloudwatch.ingestion_time | null | null |
| cloud.region | event.ingested | null | null |
| log.file.path | process.name | 148742 | 157187 |
| log.file.path | agent.ephemeral_id | 175875 | 157447 |
| log.file.path | agent.id | 175854 | 154243 |
| log.file.path | cloud.region | 256221 | 143427 |
| log.file.path | agent.name | null | 199356 |
| log.file.path | host.name | null | 198765 |
| log.file.path | aws.cloudwatch.log_stream | null | 199760 |

null - indicates the request timed out


I also benchmarked this against the eventdata workload, since query times in the big5 workload above were too high and I wanted a smaller dataset to establish the gains. Unfortunately, the change does not appear to improve results there; it may actually worsen performance.

Total docs in index: 20,000,000 (2×10^7)

| field_i | cardinality of field_i (x) | field_j | cardinality of field_j (y) | x * y | time taken in old code | time taken with change |
| --- | --- | --- | --- | --- | --- | --- |
| httpversion | 1 | verb | 5 | 5 | 3563 | 3628 |
| httpversion | 1 | useragent.os_name | 42 | 42 | 4365 | 4787 |
| httpversion | 1 | geoip.country_name | 190 | 190 | 7857 | 9140 |
| httpversion | 1 | useragent.os | 203 | 203 | 6136 | 6953 |
| httpversion | 1 | useragent.name | 208 | 208 | 5814 | 6586 |
| verb | 5 | httpversion | 1 | 5 | 3649 | 3442 |
| verb | 5 | useragent.os_name | 42 | 210 | 4551 | 5047 |
| verb | 5 | geoip.country_name | 190 | 950 | 8212 | 9450 |
| verb | 5 | useragent.os | 203 | 1015 | 6532 | 7260 |
| verb | 5 | useragent.name | 208 | 1040 | 5957 | 6730 |
| useragent.os_name | 42 | httpversion | 1 | 42 | 4395 | 4905 |
| useragent.os_name | 42 | verb | 5 | 210 | 4645 | 5003 |
| useragent.os_name | 42 | geoip.country_name | 190 | 7980 | 9413 | 11010 |
| useragent.os_name | 42 | useragent.os | 203 | 8526 | 7457 | 8613 |
| useragent.os_name | 42 | useragent.name | 208 | 8736 | 7034 | 8227 |
| geoip.country_name | 190 | httpversion | 1 | 190 | 8201 | 9329 |
| geoip.country_name | 190 | verb | 5 | 950 | 8341 | 9477 |
| geoip.country_name | 190 | useragent.os_name | 42 | 7980 | 9218 | 10865 |
| geoip.country_name | 190 | useragent.os | 203 | 38570 | 11527 | 13095 |
| geoip.country_name | 190 | useragent.name | 208 | 39520 | 10623 | 12593 |
| useragent.os | 203 | httpversion | 1 | 203 | 6448 | 7058 |
| useragent.os | 203 | verb | 5 | 1015 | 6731 | 7294 |
| useragent.os | 203 | useragent.os_name | 42 | 8526 | 7391 | 8692 |
| useragent.os | 203 | geoip.country_name | 190 | 38570 | 11351 | 13005 |
| useragent.os | 203 | useragent.name | 208 | 42224 | 8876 | 10156 |
| useragent.name | 208 | httpversion | 1 | 208 | 5975 | 6651 |
| useragent.name | 208 | verb | 5 | 1040 | 6155 | 6818 |
| useragent.name | 208 | useragent.os_name | 42 | 8736 | 6934 | 8149 |
| useragent.name | 208 | geoip.country_name | 190 | 39520 | 10807 | 12583 |
| useragent.name | 208 | useragent.os | 203 | 42224 | 8827 | 10220 |

Comment on lines +280 to +292
while (postings1.docID() != PostingsEnum.NO_MORE_DOCS && postings2.docID() != PostingsEnum.NO_MORE_DOCS) {
    // Count intersecting docs to get the number of docs in each bucket
    if (postings1.docID() == postings2.docID()) {
        bucketCount++;
        postings1.nextDoc();
        postings2.nextDoc();
    } else if (postings1.docID() < postings2.docID()) {
        postings1.advance(postings2.docID());
    } else {
        postings2.advance(postings1.docID());
    }
}
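One thing to note about this loop (our observation, assuming it is entered with freshly created enums): a new PostingsEnum starts at docID -1, so both enums would compare equal before the first nextDoc() call and count a spurious match. Presumably the surrounding code advances both enums to their first document before entering the loop.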
@rishabhmaurya (Contributor) commented Jul 8, 2024

A project member replied:

Agreed. The complexity of intersection logic is highly dependent on the documents in the posting lists. With larger datasets and higher cardinality, the leapfrogging method for intersection evaluation would require more frequent iterations over these lists, which can be expensive.
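A rough cost model (our own estimate, not from the thread) makes this concrete. With single-valued fields and no skipping, each postings list of field 1 is traversed once per term of field 2 and vice versa, so the total work per segment is bounded by

$$W_{\text{new}} \;\le\; C_2 \sum_{a \in T_1} |P_a| \;+\; C_1 \sum_{b \in T_2} |P_b| \;\approx\; N\,(C_1 + C_2)$$

doc advances, versus $W_{\text{old}} \approx N$ for a single per-document pass, where $N$ is the segment's doc count, $C_1, C_2$ are the field cardinalities, and $|P_t|$ is the postings length of term $t$. Skip lists reduce the constant, but the growth with cardinality remains, which matches the tables above: low-cardinality pairs such as process.name × cloud.region ($C_1 + C_2 = 32$) get faster, while pairs involving event.id ($C \approx 1.8 \times 10^5$) get slower or time out.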

@opensearch-trigger-bot (Contributor) commented:

This PR is stalled because it has been open for 30 days with no activity.
