Add benchmark support for vector radial search #546

Merged
merged 3 commits into from
Aug 2, 2024

junqiu-lei
Member

@junqiu-lei junqiu-lei commented Jun 4, 2024

Description

Vector radial search was introduced in the k-NN plugin in OpenSearch 2.14. This PR adds support for running benchmarks against the radial search API.

A companion PR, opensearch-project/opensearch-benchmark-workloads#309, makes the corresponding changes in opensearch-benchmark-workloads.
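
For context, a radial search query replaces `k` with a distance or score threshold. The sketch below builds the request body for each variant. `build_radial_query` is a hypothetical helper written for illustration and is not code from this PR; `max_distance` and `min_score` are the k-NN query parameters that radial search uses.

```python
from typing import Any, Dict, List, Optional


def build_radial_query(
    field: str,
    vector: List[float],
    max_distance: Optional[float] = None,
    min_score: Optional[float] = None,
) -> Dict[str, Any]:
    """Build a k-NN radial search body (hypothetical helper, not from this PR)."""
    # Exactly one of the two thresholds must be supplied.
    if (max_distance is None) == (min_score is None):
        raise ValueError("specify exactly one of max_distance or min_score")

    knn_clause: Dict[str, Any] = {"vector": vector}
    if max_distance is not None:
        knn_clause["max_distance"] = max_distance
    else:
        knn_clause["min_score"] = min_score

    return {"query": {"knn": {field: knn_clause}}}


# Field name and threshold mirror the example parameters in this PR.
body = build_radial_query("target_field", [0.1, 0.2, 0.3], max_distance=-160.0)
```

The helper rejects requests that set both thresholds or neither, since a radial query is defined by a single threshold.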

Testing

  • New functionality includes testing

Example workload parameters and results from a local run:

{
    "target_index_name": "target_index",
    "target_field_name": "target_field",
    "target_index_body": "indices/faiss-index.json",
    "target_index_primary_shards": 3,
    "target_index_dimension": 768,
    "target_index_space_type": "innerproduct",
    
    "target_index_bulk_size": 100,
    "target_index_bulk_index_data_set_format": "hdf5",
    "target_index_bulk_index_data_set_path": "/Users/junqiu/dataset/documents-1m-threshold-innerproduct-160.hdf5",
    "target_index_bulk_indexing_clients": 10,
    
    "target_index_max_num_segments": 1,
    "target_index_force_merge_timeout": 600.0,
    "hnsw_ef_search": 256,
    "hnsw_ef_construction": 256,
    "target_index_num_vectors": 1000000,
    "query_max_distance": -160.0,
    "query_body": {
         "docvalue_fields" : ["_id"],
         "stored_fields" : "_none_"
    },

    "query_data_set_format": "hdf5",
    "query_data_set_path":"/Users/junqiu/dataset/documents-1m-threshold-innerproduct-160.hdf5",
    "query_count": 100
}

|                                                         Metric |         Task |       Value |   Unit |
|---------------------------------------------------------------:|-------------:|------------:|-------:|
|                     Cumulative indexing time of primary shards |              |     83.6451 |    min |
|             Min cumulative indexing time across primary shards |              | 0.000416667 |    min |
|          Median cumulative indexing time across primary shards |              |   0.0165833 |    min |
|             Max cumulative indexing time across primary shards |              |     28.2067 |    min |
|            Cumulative indexing throttle time of primary shards |              |           0 |    min |
|    Min cumulative indexing throttle time across primary shards |              |           0 |    min |
| Median cumulative indexing throttle time across primary shards |              |           0 |    min |
|    Max cumulative indexing throttle time across primary shards |              |           0 |    min |
|                        Cumulative merge time of primary shards |              |     93.4468 |    min |
|                       Cumulative merge count of primary shards |              |          94 |        |
|                Min cumulative merge time across primary shards |              |           0 |    min |
|             Median cumulative merge time across primary shards |              |           0 |    min |
|                Max cumulative merge time across primary shards |              |     32.9337 |    min |
|               Cumulative merge throttle time of primary shards |              |     3.82742 |    min |
|       Min cumulative merge throttle time across primary shards |              |           0 |    min |
|    Median cumulative merge throttle time across primary shards |              |           0 |    min |
|       Max cumulative merge throttle time across primary shards |              |     1.39933 |    min |
|                      Cumulative refresh time of primary shards |              |     4.67383 |    min |
|                     Cumulative refresh count of primary shards |              |         211 |        |
|              Min cumulative refresh time across primary shards |              | 0.000866667 |    min |
|           Median cumulative refresh time across primary shards |              |      0.0031 |    min |
|              Max cumulative refresh time across primary shards |              |     1.66205 |    min |
|                        Cumulative flush time of primary shards |              |     3.79628 |    min |
|                       Cumulative flush count of primary shards |              |          42 |        |
|                Min cumulative flush time across primary shards |              |           0 |    min |
|             Median cumulative flush time across primary shards |              |  0.00796667 |    min |
|                Max cumulative flush time across primary shards |              |     1.29665 |    min |
|                                        Total Young Gen GC time |              |           0 |      s |
|                                       Total Young Gen GC count |              |           0 |        |
|                                          Total Old Gen GC time |              |           0 |      s |
|                                         Total Old Gen GC count |              |           0 |        |
|                                                     Store size |              |     17.0813 |     GB |
|                                                  Translog size |              | 3.58559e-07 |     GB |
|                                         Heap used for segments |              |           0 |     MB |
|                                       Heap used for doc values |              |           0 |     MB |
|                                            Heap used for terms |              |           0 |     MB |
|                                            Heap used for norms |              |           0 |     MB |
|                                           Heap used for points |              |           0 |     MB |
|                                    Heap used for stored fields |              |           0 |     MB |
|                                                  Segment count |              |          23 |        |
|                                                 Min Throughput | prod-queries |       66.96 |  ops/s |
|                                                Mean Throughput | prod-queries |       66.96 |  ops/s |
|                                              Median Throughput | prod-queries |       66.96 |  ops/s |
|                                                 Max Throughput | prod-queries |       66.96 |  ops/s |
|                                        50th percentile latency | prod-queries |     5.82227 |     ms |
|                                        90th percentile latency | prod-queries |     13.9878 |     ms |
|                                        99th percentile latency | prod-queries |     63.4047 |     ms |
|                                       100th percentile latency | prod-queries |     90.4701 |     ms |
|                                   50th percentile service time | prod-queries |     5.82227 |     ms |
|                                   90th percentile service time | prod-queries |     13.9878 |     ms |
|                                   99th percentile service time | prod-queries |     63.4047 |     ms |
|                                  100th percentile service time | prod-queries |     90.4701 |     ms |
|                                                     error rate | prod-queries |           0 |      % |
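
The workload parameters above pair `target_index_space_type: innerproduct` with a negative `query_max_distance`. A minimal sketch of why the threshold is negative follows, assuming the distance for the inner product space is defined as the negated inner product, so that a larger inner product means a smaller distance. That convention, the helper names, and the toy vectors are assumptions for illustration and are not taken from this PR.

```python
from typing import List, Sequence


def inner_product_distance(a: Sequence[float], b: Sequence[float]) -> float:
    # Assumption: distance is the negated inner product.
    return -sum(x * y for x, y in zip(a, b))


def radial_ground_truth(
    query: Sequence[float],
    docs: Sequence[Sequence[float]],
    max_distance: float,
) -> List[int]:
    """Brute-force indices of docs whose distance to the query is within the threshold."""
    return [
        i
        for i, doc in enumerate(docs)
        if inner_product_distance(query, doc) <= max_distance
    ]


# Toy data: inner products with the query are 180, 18, and 200.
docs = [[10.0, 10.0], [1.0, 1.0], [20.0, 0.0]]
query = [10.0, 8.0]
matches = radial_ground_truth(query, docs, max_distance=-160.0)
```

Under this assumption, `max_distance = -160.0` selects the documents whose inner product with the query is at least 160, which here are the first and third documents.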

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Collaborator

@gkamat gkamat left a comment

Hi @junqiu-lei, a large portion of the diffs pertain to formatting changes unrelated to the changed/new functionality. Can you submit only the latter in this PR?

We want to be cautious about changing formatting guidelines. Suggestions are welcome, but they need to be discussed separately and applied globally. Thanks for understanding.

@junqiu-lei
Member Author

> Hi @junqiu-lei, a large portion of the diffs pertain to formatting changes unrelated to the changed/new functionality. Can you submit only the latter in this PR?
>
> We want to be cautious about changing formatting guidelines. Suggestions are welcome, but they need to be discussed separately and applied globally. Thanks for understanding.

@gkamat Thanks for the feedback. Yes, I just updated the PR.

@junqiu-lei junqiu-lei force-pushed the radial-1 branch 3 times, most recently from 1b22e1f to f706f39 Compare June 25, 2024 05:29
@junqiu-lei junqiu-lei requested review from gkamat and VijayanB July 15, 2024 19:10
@VijayanB
Member

@junqiu-lei Overall looks good. Added minor comments. You might also need a follow-up PR once #581 is merged.

@junqiu-lei junqiu-lei requested review from IanHoang and VijayanB July 19, 2024 01:02
finnroblin added a commit to finnroblin/opensearch-benchmark that referenced this pull request Jul 22, 2024
Member

@VijayanB VijayanB left a comment

LGTM overall. Just added a few minor comments.

@junqiu-lei junqiu-lei force-pushed the radial-1 branch 2 times, most recently from 6d244d9 to 398a893 Compare July 23, 2024 19:37
osbenchmark/worker_coordinator/runner.py (outdated, resolved)
osbenchmark/workload/params.py (outdated, resolved)
Comment on lines +1124 to +1128
if self.query_type == self.KNN_QUERY_TYPE:
    return Context.NEIGHBORS
if self.query_type == self.MIN_SCORE_QUERY_TYPE:
    return Context.MIN_SCORE_NEIGHBORS
if self.query_type == self.MAX_DISTANCE_QUERY_TYPE:
    return Context.MAX_DISTANCE_NEIGHBORS
Collaborator

Using elif will be more appropriate.

Member Author

I originally used elif, but @VijayanB and I agreed to switch to the current separate checks because they read more cleanly. There is a validation step before this.

Collaborator

The effect is the same in this case, but using elif makes it more readable, especially since the cases are mutually exclusive. I'll leave it to you.
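
To make the comparison in this thread concrete, here is a minimal sketch of both styles side by side. The `Context` member names and the `*_QUERY_TYPE` constant names come from the snippet under review; their string values, the standalone functions, and the final `ValueError` are placeholders assumed for illustration, not the PR's actual definitions.

```python
from enum import Enum


class Context(Enum):
    # Member names follow the reviewed snippet; the values are placeholders.
    NEIGHBORS = "neighbors"
    MIN_SCORE_NEIGHBORS = "min_score_neighbors"
    MAX_DISTANCE_NEIGHBORS = "max_distance_neighbors"


# Placeholder values for the query type constants.
KNN_QUERY_TYPE = "knn"
MIN_SCORE_QUERY_TYPE = "min_score"
MAX_DISTANCE_QUERY_TYPE = "max_distance"


def context_with_ifs(query_type: str) -> Context:
    # Separate checks: each branch returns, so later checks never run after a match.
    if query_type == KNN_QUERY_TYPE:
        return Context.NEIGHBORS
    if query_type == MIN_SCORE_QUERY_TYPE:
        return Context.MIN_SCORE_NEIGHBORS
    if query_type == MAX_DISTANCE_QUERY_TYPE:
        return Context.MAX_DISTANCE_NEIGHBORS
    raise ValueError(f"unsupported query type: {query_type}")


def context_with_elif(query_type: str) -> Context:
    # elif chain: states explicitly that the cases are mutually exclusive.
    if query_type == KNN_QUERY_TYPE:
        context = Context.NEIGHBORS
    elif query_type == MIN_SCORE_QUERY_TYPE:
        context = Context.MIN_SCORE_NEIGHBORS
    elif query_type == MAX_DISTANCE_QUERY_TYPE:
        context = Context.MAX_DISTANCE_NEIGHBORS
    else:
        raise ValueError(f"unsupported query type: {query_type}")
    return context
```

Because every branch of the first version returns immediately, the two functions behave identically for all inputs, so the choice between them is purely about readability.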

osbenchmark/workload/params.py (outdated, resolved)
osbenchmark/workload/params.py (outdated, resolved)
query.update({
    "k": self.k,
})
Collaborator

Instead of using query.update() here:

query["k"] = self.k

Member Author

I'm following the same pattern used in this function.

Collaborator

Again, this is for better readability, especially if there is only a single key being updated. I won't insist, though.
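
As a small illustration of the two styles discussed in this thread, the sketch below sets the same key both ways on plain dictionaries. The dictionary contents and the value of `k` are made up for the example and do not come from the PR.

```python
k = 10  # illustrative value only

# Style used in the PR: update() with a one-entry dict.
query_a = {"vector": [0.1, 0.2]}
query_a.update({
    "k": k,
})

# Style suggested in review: direct item assignment.
query_b = {"vector": [0.1, 0.2]}
query_b["k"] = k

# Both approaches produce the same dictionary.
same_result = query_a == query_b
```

`update()` tends to pay off when several keys are set at once, while item assignment is the more direct form for a single key.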

Signed-off-by: Junqiu Lei <junqiu@amazon.com>
Signed-off-by: Junqiu Lei <junqiu@amazon.com>
Signed-off-by: Junqiu Lei <junqiu@amazon.com>
@gkamat gkamat merged commit 1eb5171 into opensearch-project:main Aug 2, 2024
8 checks passed
finnroblin added a commit to finnroblin/opensearch-benchmark that referenced this pull request Aug 20, 2024

4 participants