
Latency improvements to Multi Term Aggregations #14993

Open · wants to merge 17 commits into main

Conversation

@expani commented Jul 29, 2024

Description

This PR introduces the following improvements:

  • Reduces the latency of multi term aggregation queries by 7-10 seconds on the workloads tested below
  • Reduces the memory footprint of multi term aggregation queries by cutting down its allocations

Testing was done on a c5.9xlarge instance with 20 GB of JVM heap and the store type set to mmapfs, so that EBS latency does not skew the results. This was verified by running lsof on the OS pid to confirm that all index files are memory-mapped (the type column shows mem in that case).

The numbers below are averaged over 20 iterations of each type of query, with and without the changes.

Workload | Field1 | Field2 | Without Changes | With Changes
big5 | agent_name | host_name | 236 secs | 226 secs
big5 | process_name | agent_id | 45 secs | 38 secs
nyc_taxi | store_and_fwd_flag | payment_type | 53 secs | 44 secs

Sample Aggregation Query
curl -k -H 'Content-Type: application/json' https://localhost:9200/nyc_taxis/_search -u 'admin:xxx' -d '{
  "aggs": {
    "flag_and_payment_type": {
      "multi_terms": {
        "terms": [{
          "field": "store_and_fwd_flag"
        }, {
          "field": "payment_type"
        }]
      }
    }
  }
}'

Multi term aggregation goes through all the docs emitted by the collector of the filter query (a MatchAllDocsQuery if no filter is present).

For every document emitted by the collector, it generates the cartesian product of the values of all the fields present in the aggregation here, as sketched below.
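
For illustration only (this is not the actual MultiTermsAggregator code), here is a minimal sketch of how the per-field values of one document expand into composite keys via a cartesian product; the sample values in main are hypothetical:

import java.util.ArrayList;
import java.util.List;

public class CompositeKeyExample {

    // Expands the per-field value lists of one document into every composite key.
    // e.g. [[N, Y], [1, 2]] -> [N, 1], [N, 2], [Y, 1], [Y, 2]
    static List<List<Object>> cartesianProduct(List<List<Object>> perFieldValues) {
        List<List<Object>> keys = new ArrayList<>();
        keys.add(new ArrayList<>());
        for (List<Object> fieldValues : perFieldValues) {
            List<List<Object>> expanded = new ArrayList<>(keys.size() * fieldValues.size());
            for (List<Object> prefix : keys) {
                for (Object value : fieldValues) {
                    List<Object> key = new ArrayList<>(prefix); // one fresh key per combination
                    key.add(value);
                    expanded.add(key);
                }
            }
            keys = expanded;
        }
        return keys;
    }

    public static void main(String[] args) {
        // Hypothetical per-document values for store_and_fwd_flag and payment_type.
        List<Object> flagValues = List.of("N", "Y");
        List<Object> paymentValues = List.of(1, 2);
        List<List<Object>> doc = List.of(flagValues, paymentValues);
        cartesianProduct(doc).forEach(System.out::println);
    }
}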

A deep copy is generated for every composite key here, which is eventually copied again (only the first time) while adding it to the bucket. This PR refactors the code to remove the need for a deep copy of every composite key.

We also perform a deep copy of the field values retrieved by Lucene here. This is only essential for fields with multiple values in a document and can be avoided for fields with a single value in a document; a sketch of that idea follows.
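
A hedged sketch of that single-value shortcut, not the PR's actual code, assuming OpenSearch's SortedBinaryDocValues API (advanceExact / docValueCount / nextValue) and Lucene's BytesRef.deepCopyOf; the class and method names here are made up for illustration:

import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

import org.apache.lucene.util.BytesRef;
import org.opensearch.index.fielddata.SortedBinaryDocValues;

class SingleValueCopySketch {

    // nextValue() returns a BytesRef backed by a buffer the iterator reuses, so
    // multi-valued fields need a deep copy per value. A single value can be handed
    // out directly as long as it is fully consumed before advancing to the next doc.
    static List<BytesRef> valuesForDoc(SortedBinaryDocValues docValues, int doc) throws IOException {
        List<BytesRef> values = new ArrayList<>();
        if (docValues.advanceExact(doc)) {
            int count = docValues.docValueCount();
            if (count == 1) {
                values.add(docValues.nextValue());                        // no copy for single-valued fields
            } else {
                for (int i = 0; i < count; i++) {
                    values.add(BytesRef.deepCopyOf(docValues.nextValue())); // copy each reused buffer
                }
            }
        }
        return values;
    }
}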

Allocation Profiling

For the Big5 benchmark fields process_name and agent_id (LOW CARDINALITY)

Deep copies of the composite key, and of values for single-valued fields, take around 25% of the overall allocations for a multi term aggregation query.

[allocation profiling screenshot]

Collecting all the composite keys for every document in a list here also takes around 9% of the overall allocations.

[allocation profiling screenshot]

For the Big5 benchmark fields agent_name and host_name (HIGH CARDINALITY)

19% of overall allocations are spent in the deep copy of the composite key.

[allocation profiling screenshot]

Collecting all composite keys takes around 9%, same as before.

Also, the for-each loop that goes over the field values of a document here contributes 17% of overall allocations because it creates a new Iterator every time. Changed it to use a regular indexed for loop; see the sketch after the screenshot below.

[allocation profiling screenshot]
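
To make the iterator point concrete, here is a minimal, self-contained illustration (not the aggregator's actual hot loop): a for-each over a List allocates an Iterator on every pass, while an index-based loop over a RandomAccess list such as ArrayList does not.

import java.util.List;

public class LoopAllocationExample {

    // The enhanced for loop desugars to values.iterator(), allocating a new Iterator per call.
    static long sumForEach(List<Long> values) {
        long sum = 0;
        for (long value : values) {
            sum += value;
        }
        return sum;
    }

    // An index-based loop avoids the Iterator allocation entirely.
    static long sumIndexed(List<Long> values) {
        long sum = 0;
        for (int i = 0; i < values.size(); i++) {
            sum += values.get(i);
        }
        return sum;
    }

    public static void main(String[] args) {
        List<Long> values = List.of(1L, 2L, 3L);
        System.out.println(sumForEach(values) + " == " + sumIndexed(values));
    }
}

The JIT can sometimes eliminate such short-lived iterators via escape analysis, but the allocation profile above shows that did not happen on this hot path, which runs for every collected document.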

Testing

  • I ensured that the aggregation results were the same with and without my changes for queries on different fields of the Big5 dataset.
  • Testing was done for concurrent search using different field combinations and the results were the same.

Will go through the existing integration tests and UTs for multi term aggregations to make sure no corner cases are left uncovered.

Signed-off-by: expani <anijainc@amazon.com>
github-actions bot (Contributor) commented
❌ Gradle check result for fb84412: FAILURE

Please examine the workflow log, locate, and copy-paste the failure(s) below, then iterate to green. Is the failure a flaky test unrelated to your change?

@bowenlan-amzn added the v2.17.0 and backport 2.x (Backport to 2.x branch) labels on Jul 31, 2024
@bowenlan-amzn (Member) left a comment


Looks good! Like how you document the profiling results to explain this!

@sandeshkr419 (Contributor) commented Jul 31, 2024

Thanks @expani for the code changes and the detailed explanation on the PR - I have some minor comments, mainly on refactoring. Please add relevant comments/javadocs to help future developers understand the minor optimizations and the utilities for the various low-level operations.

Did we check performance on any data which has multi-valued fields as well? If not, let us propose a change in OSB for multi-valued fields in some workloads, in case we don't have any such available workloads.

Also, let us iterate CI to green and check if we have solid code coverage as well.

@expani (Author) commented Jul 31, 2024

Thanks for taking the time to review @bowenlan-amzn and @sandeshkr419

I will add the required javadocs and comments as suggested.
I have explained the reasoning behind the code structure; let me know if you feel otherwise.

Did we check performance on any data which has multi-valued fields as well? If not, let us propose a change in OSB for multi-valued fields in some workloads, in case we don't have any such available workloads.

Checked with a few multi-valued fields, but only for correctness and not from a performance perspective.
Will check with the OSB team for any such existing workloads, as I need them for other possible optimisations as well.

The current CI seems to be failing due to 2 tests that have been reported as flaky by multiple other folks.

> Task :server:internalClusterTest

Tests with failures:
 - org.opensearch.action.admin.indices.create.RemoteSplitIndexIT.classMethod
 - org.opensearch.remotestore.RemoteStoreStatsIT.testDownloadStatsCorrectnessSinglePrimarySingleReplica

Will check on the coverage of the existing tests and add any if required.

github-actions bot (Contributor) commented Sep 6, 2024

❌ Gradle check result for 73b9bf5: FAILURE


@expani (Author) commented Sep 6, 2024

Creating this comment to record all the build failures in the PR after rebasing with mainline; all of these tests execute just fine locally:

  1. RemoteStoreMigrationSettingsUpdateIT
    The cluster takes longer to become green, causing the test to time out while waiting for the cluster to become green.
./gradlew :server:internalClusterTest --tests "org.opensearch.remotemigration.RemoteStoreMigrationSettingsUpdateIT.*"

https://github.com/opensearch-project/OpenSearch/actions/runs/10735985172/job/29774489951?pr=14993

github-actions bot (Contributor) commented Sep 6, 2024

❕ Gradle check result for 25617bd: UNSTABLE

Please review all flaky tests that succeeded after retry and create an issue if one does not already exist to track the flaky failure.

github-actions bot (Contributor) commented
❌ Gradle check result for 6645e73: FAILURE


github-actions bot (Contributor) commented
❌ Gradle check result for 4b11dcb: FAILURE


@expani expani closed this Sep 10, 2024
@expani expani reopened this Sep 10, 2024
github-actions bot (Contributor) commented
✅ Gradle check result for 4b11dcb: SUCCESS

@sandeshkr419 (Contributor) commented
@expani Can you please add a changelog entry - that should help in closing this out.

github-actions bot (Contributor) commented
✅ Gradle check result for 4b11dcb: SUCCESS

Labels
backport 2.x (Backport to 2.x branch) · Performance (This is for any performance related enhancements or bugs) · v2.18.0 (Issues and PRs related to version 2.18.0)
Projects
Status: In Progress

6 participants