
Optimize 2 keyword multi-terms aggregation #13929

Draft · wants to merge 1 commit into base: main

Conversation

@sandeshkr419 (Contributor) commented Jun 3, 2024

Description

Optimize the multi-terms aggregation for the case when:

  1. The aggregation is over exactly 2 keyword fields
  2. There are no deleted documents in the segment (code checks to be added)
  3. There is no _doc_count field (code checks to be added)
  4. [needs testing] whether queries with a filter are supported with this optimization

The optimization changes how buckets are collected for a segment. In the current code, for every document we read the values of the aggregated fields, compute the composite key, and then update the bucket count. The optimization instead reads the postings enums directly: each composite key is built only once per pair of terms, and the document count for that bucket is obtained by intersecting the two postings lists.
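A minimal sketch of the idea, using Lucene's Terms/TermsEnum/PostingsEnum APIs. The guard conditions, field names, and the intersect/collectBucket helpers are illustrative assumptions, not the PR's actual code:

```java
import java.io.IOException;
import org.apache.lucene.index.LeafReader;
import org.apache.lucene.index.PostingsEnum;
import org.apache.lucene.index.Terms;
import org.apache.lucene.index.TermsEnum;
import org.apache.lucene.util.BytesRef;

class TwoKeywordMultiTermsSketch {

    // Hypothetical per-segment collection for a 2-keyword multi_terms aggregation.
    void collectSegment(LeafReader reader) throws IOException {
        // Preconditions from the description: no deleted docs and (assumed check)
        // no _doc_count field; otherwise fall back to the per-document collector.
        if (reader.getLiveDocs() != null || reader.terms("_doc_count") != null) {
            return;
        }
        Terms termsA = reader.terms("process.name"); // example field names
        Terms termsB = reader.terms("cloud.region");
        if (termsA == null || termsB == null) {
            return;
        }
        TermsEnum teA = termsA.iterator();
        for (BytesRef a = teA.next(); a != null; a = teA.next()) {
            TermsEnum teB = termsB.iterator();
            for (BytesRef b = teB.next(); b != null; b = teB.next()) {
                // Fresh postings enums per pair: the intersection exhausts them.
                PostingsEnum p1 = teA.postings(null, PostingsEnum.NONE);
                PostingsEnum p2 = teB.postings(null, PostingsEnum.NONE);
                long count = intersect(p1, p2);
                if (count > 0) {
                    // Composite key is built once per term pair, not per document.
                    collectBucket(a, b, count);
                }
            }
        }
    }

    // Leapfrog intersection; same shape as the loop quoted in the review
    // thread below, with both enums advanced to their first doc up front.
    long intersect(PostingsEnum p1, PostingsEnum p2) throws IOException {
        long count = 0;
        int d1 = p1.nextDoc();
        int d2 = p2.nextDoc();
        while (d1 != PostingsEnum.NO_MORE_DOCS && d2 != PostingsEnum.NO_MORE_DOCS) {
            if (d1 == d2) {
                count++;
                d1 = p1.nextDoc();
                d2 = p2.nextDoc();
            } else if (d1 < d2) {
                d1 = p1.advance(d2);
            } else {
                d2 = p2.advance(d1);
            }
        }
        return count;
    }

    void collectBucket(BytesRef a, BytesRef b, long count) {
        // Placeholder: hand (a, b, count) to the aggregation framework.
    }
}
```

Each (termA, termB) pair costs one postings intersection, so the work grows with the product of the two cardinalities; the benchmarks below show exactly that trade-off.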

Related Issues

Resolves #13120

Check List

  • New functionality includes testing.
    • All tests pass
  • New functionality has been documented.
    • New functionality has javadoc added
  • API changes companion pull request created.
  • Failing checks are inspected and point to the corresponding known issue(s) (See: Troubleshooting Failing Builds)
  • Commits are signed per the DCO using --signoff
  • Commit changes are listed out in CHANGELOG.md file (See: Changelog)
  • Public documentation issue/PR created

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

github-actions bot added the Search:Aggregations, Search:Performance, and v2.15.0 labels on Jun 3, 2024
github-actions bot commented Jun 3, 2024

❌ Gradle check result for bbd49c6: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@sandeshkr419 (Contributor, Author) commented Jun 12, 2024

For a POC, I ran the query below against the big5 workload and saw a ~50% reduction in service time.

Multi-term query used:
curl -X POST "$dom/big5/_search?pretty" -H 'Content-Type: application/json' -d '{
  "size": 0,
  "aggs": {
    "important_terms": {
      "multi_terms": {
        "terms": [
          { "field": "process.name" },
          { "field": "cloud.region" }
        ]
      }
    }
  }
}'

Benchmark with big5 workload:

Total docs in index: 116,000,000 (1.16×10^8)

| Field Name | Field Cardinality | Cardinality as fraction of total docs |
| --- | --- | --- |
| process.name | 7 | 0 |
| agent.ephemeral_id | 10 | 0 |
| agent.id | 10 | 0 |
| cloud.region | 25 | 0 |
| log.file.path | 26571 | 0.00023 |
| agent.name | 26832 | 0.00023 |
| host.name | 26832 | 0.00023 |
| aws.cloudwatch.log_stream | 26836 | 0.00023 |
| event.id | 180250 | 0.00155 |
| aws.cloudwatch.ingestion_time | 2917359 | 0.02515 |
| event.ingested | 6768110 | 0.05835 |
| field_i | field_j | time taken with change | time taken in old code |
| --- | --- | --- | --- |
| process.name | agent.ephemeral_id | 17943 | 54133 |
| process.name | agent.id | 17914 | 51424 |
| process.name | cloud.region | 24313 | 47968 |
| process.name | log.file.path | 148481 | 150401 |
| process.name | agent.name | 148247 | 116546 |
| process.name | host.name | 148298 | 118020 |
| process.name | aws.cloudwatch.log_stream | 147485 | 118093 |
| process.name | event.id | 661574 | 145909 |
| process.name | aws.cloudwatch.ingestion_time | null | null |
| process.name | event.ingested | null | null |
| agent.ephemeral_id | process.name | 18112 | 51884 |
| agent.ephemeral_id | agent.id | 19763 | 54124 |
| agent.ephemeral_id | cloud.region | 30548 | 58646 |
| agent.ephemeral_id | log.file.path | 175867 | 156336 |
| agent.ephemeral_id | agent.name | 175747 | 125895 |
| agent.ephemeral_id | host.name | 175711 | 125865 |
| agent.ephemeral_id | aws.cloudwatch.log_stream | 175695 | 125361 |
| agent.ephemeral_id | event.id | 872176 | 156505 |
| agent.ephemeral_id | aws.cloudwatch.ingestion_time | null | null |
| agent.ephemeral_id | event.ingested | null | null |
| agent.id | process.name | 18098 | 51229 |
| agent.id | agent.ephemeral_id | 19793 | 53031 |
| agent.id | cloud.region | 30561 | 57829 |
| agent.id | log.file.path | 176061 | 153235 |
| agent.id | agent.name | 175966 | 125073 |
| agent.id | host.name | 175608 | 125487 |
| agent.id | aws.cloudwatch.log_stream | 176056 | 125651 |
| agent.id | event.id | 877700 | 155753 |
| agent.id | aws.cloudwatch.ingestion_time | null | null |
| agent.id | event.ingested | null | null |
| cloud.region | process.name | 24713 | 51343 |
| cloud.region | agent.ephemeral_id | 31079 | 57751 |
| cloud.region | agent.id | 31080 | 57976 |
| cloud.region | log.file.path | 254592 | 143288 |
| cloud.region | agent.name | 254184 | 116291 |
| cloud.region | host.name | 255058 | 116715 |
| cloud.region | aws.cloudwatch.log_stream | 254723 | 117257 |
| cloud.region | event.id | 1897213 | 161301 |
| cloud.region | aws.cloudwatch.ingestion_time | null | null |
| cloud.region | event.ingested | null | null |
| log.file.path | process.name | 148742 | 157187 |
| log.file.path | agent.ephemeral_id | 175875 | 157447 |
| log.file.path | agent.id | 175854 | 154243 |
| log.file.path | cloud.region | 256221 | 143427 |
| log.file.path | agent.name | null | 199356 |
| log.file.path | host.name | null | 198765 |
| log.file.path | aws.cloudwatch.log_stream | null | 199760 |

null - indicates the request timed out


I also benchmarked this against the eventdata workload, since query times in the big5 workload above were too high and I wanted a smaller dataset to establish the gains. Unfortunately, the change does not appear to improve results there; it may actually worsen performance.

Total docs in index: 20,000,000 (2×10^7)

| field_i | cardinality of field_i (x) | field_j | cardinality of field_j (y) | x * y | time taken in old code | time taken with change |
| --- | --- | --- | --- | --- | --- | --- |
| httpversion | 1 | verb | 5 | 5 | 3563 | 3628 |
| httpversion | 1 | useragent.os_name | 42 | 42 | 4365 | 4787 |
| httpversion | 1 | geoip.country_name | 190 | 190 | 7857 | 9140 |
| httpversion | 1 | useragent.os | 203 | 203 | 6136 | 6953 |
| httpversion | 1 | useragent.name | 208 | 208 | 5814 | 6586 |
| verb | 5 | httpversion | 1 | 5 | 3649 | 3442 |
| verb | 5 | useragent.os_name | 42 | 210 | 4551 | 5047 |
| verb | 5 | geoip.country_name | 190 | 950 | 8212 | 9450 |
| verb | 5 | useragent.os | 203 | 1015 | 6532 | 7260 |
| verb | 5 | useragent.name | 208 | 1040 | 5957 | 6730 |
| useragent.os_name | 42 | httpversion | 1 | 42 | 4395 | 4905 |
| useragent.os_name | 42 | verb | 5 | 210 | 4645 | 5003 |
| useragent.os_name | 42 | geoip.country_name | 190 | 7980 | 9413 | 11010 |
| useragent.os_name | 42 | useragent.os | 203 | 8526 | 7457 | 8613 |
| useragent.os_name | 42 | useragent.name | 208 | 8736 | 7034 | 8227 |
| geoip.country_name | 190 | httpversion | 1 | 190 | 8201 | 9329 |
| geoip.country_name | 190 | verb | 5 | 950 | 8341 | 9477 |
| geoip.country_name | 190 | useragent.os_name | 42 | 7980 | 9218 | 10865 |
| geoip.country_name | 190 | useragent.os | 203 | 38570 | 11527 | 13095 |
| geoip.country_name | 190 | useragent.name | 208 | 39520 | 10623 | 12593 |
| useragent.os | 203 | httpversion | 1 | 203 | 6448 | 7058 |
| useragent.os | 203 | verb | 5 | 1015 | 6731 | 7294 |
| useragent.os | 203 | useragent.os_name | 42 | 8526 | 7391 | 8692 |
| useragent.os | 203 | geoip.country_name | 190 | 38570 | 11351 | 13005 |
| useragent.os | 203 | useragent.name | 208 | 42224 | 8876 | 10156 |
| useragent.name | 208 | httpversion | 1 | 208 | 5975 | 6651 |
| useragent.name | 208 | verb | 5 | 1040 | 6155 | 6818 |
| useragent.name | 208 | useragent.os_name | 42 | 8736 | 6934 | 8149 |
| useragent.name | 208 | geoip.country_name | 190 | 39520 | 10807 | 12583 |
| useragent.name | 208 | useragent.os | 203 | 42224 | 8827 | 10220 |

Comment on lines +280 to +292
while (postings1.docID() != PostingsEnum.NO_MORE_DOCS && postings2.docID() != PostingsEnum.NO_MORE_DOCS) {
    // Count intersecting docs to get the number of docs in each bucket
    if (postings1.docID() == postings2.docID()) {
        bucketCount++;
        postings1.nextDoc();
        postings2.nextDoc();
    } else if (postings1.docID() < postings2.docID()) {
        postings1.advance(postings2.docID());
    } else {
        postings2.advance(postings1.docID());
    }
}
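One thing to note about this loop (our observation, assuming it is entered with freshly created enums): a new PostingsEnum starts at docID -1, so both enums would compare equal before the first nextDoc() call and count a spurious match. Presumably the surrounding code advances both enums to their first document before entering the loop.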
@rishabhmaurya (Contributor) commented Jul 8, 2024

A project member replied:

Agreed. The complexity of intersection logic is highly dependent on the documents in the posting lists. With larger datasets and higher cardinality, the leapfrogging method for intersection evaluation would require more frequent iterations over these lists, which can be expensive.
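A rough cost model (our own estimate, not from the thread) makes this concrete. With single-valued fields and no skipping, each postings list of field 1 is traversed once per term of field 2 and vice versa, so the total work per segment is bounded by

$$W_{\text{new}} \;\le\; C_2 \sum_{a \in T_1} |P_a| \;+\; C_1 \sum_{b \in T_2} |P_b| \;\approx\; N\,(C_1 + C_2)$$

doc advances, versus $W_{\text{old}} \approx N$ for a single per-document pass, where $N$ is the segment's doc count, $C_1, C_2$ are the field cardinalities, and $|P_t|$ is the postings length of term $t$. Skip lists reduce the constant, but the growth with cardinality remains, which matches the tables above: low-cardinality pairs such as process.name × cloud.region ($C_1 + C_2 = 32$) get faster, while pairs involving event.id ($C \approx 1.8 \times 10^5$) get slower or time out.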

@opensearch-trigger-bot (Contributor) commented:

This PR is stalled because it has been open for 30 days with no activity.
