Add vectorsearch training workload #333

finnroblin · 2024-06-24T23:51:07Z

Description

Adds the train-test test procedure to the vectorsearch workload to benchmark k-NN methods that require training, such as Faiss IVF. Please see issue #332 for context.

This PR adds a schedule that trains k-NN models using the train-knn-model operation proposed in OSB PR 556, and it depends on the operation runners in that PR. It also requires an additional index in the vectorsearch workload.json to hold the training data.
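For readers unfamiliar with the training step, here is a minimal sketch of the kind of request a train-knn-model operation would ultimately issue against the k-NN plugin's Train API. It only builds the request and sends nothing. The model, index, and field names and the `nlist` value are hypothetical placeholders, not values taken from this PR's param files, and the actual runner in OSB PR 556 may assemble the request differently.

```python
import json


def build_train_request(model_id, training_index, training_field, dimension,
                        space_type="l2", nlist=128):
    """Return (http_method, path, body) for a k-NN Train API call.

    Nothing is sent over the network; the caller decides how to issue it.
    """
    path = f"/_plugins/_knn/models/{model_id}/_train"
    body = {
        "training_index": training_index,  # index that holds the training vectors
        "training_field": training_field,  # knn_vector field inside that index
        "dimension": dimension,
        "method": {
            "name": "ivf",
            "engine": "faiss",
            "space_type": space_type,
            "parameters": {"nlist": nlist},  # illustrative value (assumption)
        },
    }
    return "POST", path, body


method, path, body = build_train_request(
    model_id="example-model",      # hypothetical name
    training_index="train_index",  # hypothetical name
    training_field="train_field",  # hypothetical name
    dimension=128,                 # matches the sift-128 dataset
)
print(method, path)
print(json.dumps(body, indent=2))
```

The separate training index is why the schedule bulk-loads and refreshes that index before the training task runs.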

The train-test workload on my branch works on the faiss-sift-128 dataset without breaking backwards compatibility with other vectorsearch workloads. Please feel free to clone my forks (OSB, OSB Workload) to investigate workload behavior, because the OSB workloads framework has no unit tests.

Issues Resolved

Closes #332

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.

Sample Output:

> export PARAMS=opensearch-benchmark-workloads/vectorsearch/params/train/train-faiss-sift-128-l2-sq.json
> opensearch-benchmark execute-test --target-hosts $ENDPOINT \
    --workload-path /Users/finnrobl/Code/opensearch-benchmark-workloads/vectorsearch --workload-params $PARAMS \
    --pipeline benchmark-only \
    --kill-running-processes \
    --test-procedure train-test

   ____                  _____                      __       ____                  __                         __
  / __ \____  ___  ____ / ___/___  ____ ___________/ /_     / __ )___  ____  _____/ /_  ____ ___  ____ ______/ /__
 / / / / __ \/ _ \/ __ \\__ \/ _ \/ __ `/ ___/ ___/ __ \   / __  / _ \/ __ \/ ___/ __ \/ __ `__ \/ __ `/ ___/ //_/
/ /_/ / /_/ /  __/ / / /__/ /  __/ /_/ / /  / /__/ / / /  / /_/ /  __/ / / / /__/ / / / / / / / / /_/ / /  / ,<
\____/ .___/\___/_/ /_/____/\___/\__,_/_/   \___/_/ /_/  /_____/\___/_/ /_/\___/_/ /_/_/ /_/ /_/\__,_/_/  /_/|_|
    /_/

[INFO] [Test Execution ID]: c9954f19-26a8-48bb-9f18-b0b6605aab76
[INFO] Executing test with workload [vectorsearch], test_procedure [train-test] and provision_config_instance ['external'] with version [3.0.0-SNAPSHOT].

Running delete-train-index                                                     [100% done]
Running create-train-index                                                     [100% done]
Running custom-vector-bulk-train                                               [100% done]
Running refresh-train-index                                                    [100% done]
Running delete-target-index                                                    [100% done]
Running create-target-index                                                    [100% done]
Running custom-vector-bulk                                                     [100% done]
Running refresh-target-index                                                   [100% done]
Running delete-model                                                           [100% done]
Running train-knn-model                                                        [100% done]
Running warmup-indices                                                         [100% done]
Running prod-queries                                                           [100% done]

------------------------------------------------------
    _______             __   _____
   / ____(_)___  ____ _/ /  / ___/_________  ________
  / /_  / / __ \/ __ `/ /   \__ \/ ___/ __ \/ ___/ _ \
 / __/ / / / / / /_/ / /   ___/ / /__/ /_/ / /  /  __/
/_/   /_/_/ /_/\__,_/_/   /____/\___/\____/_/   \___/
------------------------------------------------------
            
|                                                         Metric |                     Task |      Value |   Unit |
|---------------------------------------------------------------:|-------------------------:|-----------:|-------:|
|                     Cumulative indexing time of primary shards |                          |    15.1107 |    min |
|             Min cumulative indexing time across primary shards |                          | 0.00368333 |    min |
|          Median cumulative indexing time across primary shards |                          |    7.55535 |    min |
|             Max cumulative indexing time across primary shards |                          |     15.107 |    min |
|            Cumulative indexing throttle time of primary shards |                          |          0 |    min |
|    Min cumulative indexing throttle time across primary shards |                          |          0 |    min |
| Median cumulative indexing throttle time across primary shards |                          |          0 |    min |
|    Max cumulative indexing throttle time across primary shards |                          |          0 |    min |
|                        Cumulative merge time of primary shards |                          |    3.43095 |    min |
|                       Cumulative merge count of primary shards |                          |         16 |        |
|                Min cumulative merge time across primary shards |                          |          0 |    min |
|             Median cumulative merge time across primary shards |                          |    1.71548 |    min |
|                Max cumulative merge time across primary shards |                          |    3.43095 |    min |
|               Cumulative merge throttle time of primary shards |                          |   0.505767 |    min |
|       Min cumulative merge throttle time across primary shards |                          |          0 |    min |
|    Median cumulative merge throttle time across primary shards |                          |   0.252883 |    min |
|       Max cumulative merge throttle time across primary shards |                          |   0.505767 |    min |
|                      Cumulative refresh time of primary shards |                          |     0.4279 |    min |
|                     Cumulative refresh count of primary shards |                          |         34 |        |
|              Min cumulative refresh time across primary shards |                          |    0.00125 |    min |
|           Median cumulative refresh time across primary shards |                          |    0.21395 |    min |
|              Max cumulative refresh time across primary shards |                          |    0.42665 |    min |
|                        Cumulative flush time of primary shards |                          |    0.03595 |    min |
|                       Cumulative flush count of primary shards |                          |          1 |        |
|                Min cumulative flush time across primary shards |                          |          0 |    min |
|             Median cumulative flush time across primary shards |                          |   0.017975 |    min |
|                Max cumulative flush time across primary shards |                          |    0.03595 |    min |
|                                        Total Young Gen GC time |                          |      4.022 |      s |
|                                       Total Young Gen GC count |                          |       2405 |        |
|                                          Total Old Gen GC time |                          |          0 |      s |
|                                         Total Old Gen GC count |                          |          0 |        |
|                                                     Store size |                          |    1.98823 |     GB |
|                                                  Translog size |                          |   0.174298 |     GB |
|                                         Heap used for segments |                          |          0 |     MB |
|                                       Heap used for doc values |                          |          0 |     MB |
|                                            Heap used for terms |                          |          0 |     MB |
|                                            Heap used for norms |                          |          0 |     MB |
|                                           Heap used for points |                          |          0 |     MB |
|                                    Heap used for stored fields |                          |          0 |     MB |
|                                                  Segment count |                          |         36 |        |
|                                                 Min Throughput | custom-vector-bulk-train |    18383.6 | docs/s |
|                                                Mean Throughput | custom-vector-bulk-train |    18383.6 | docs/s |
|                                              Median Throughput | custom-vector-bulk-train |    18383.6 | docs/s |
|                                                 Max Throughput | custom-vector-bulk-train |    18383.6 | docs/s |
|                                        50th percentile latency | custom-vector-bulk-train |    43.5641 |     ms |
|                                        90th percentile latency | custom-vector-bulk-train |    46.3634 |     ms |
|                                       100th percentile latency | custom-vector-bulk-train |      46.49 |     ms |
|                                   50th percentile service time | custom-vector-bulk-train |    43.5641 |     ms |
|                                   90th percentile service time | custom-vector-bulk-train |    46.3634 |     ms |
|                                  100th percentile service time | custom-vector-bulk-train |      46.49 |     ms |
|                                                     error rate | custom-vector-bulk-train |          0 |      % |
|                                                 Min Throughput |       custom-vector-bulk |    8894.83 | docs/s |
|                                                Mean Throughput |       custom-vector-bulk |    11858.3 | docs/s |
|                                              Median Throughput |       custom-vector-bulk |    10465.9 | docs/s |
|                                                 Max Throughput |       custom-vector-bulk |    30396.6 | docs/s |
|                                        50th percentile latency |       custom-vector-bulk |    101.675 |     ms |
|                                        90th percentile latency |       custom-vector-bulk |    137.139 |     ms |
|                                        99th percentile latency |       custom-vector-bulk |    277.051 |     ms |
|                                      99.9th percentile latency |       custom-vector-bulk |    2109.04 |     ms |
|                                     99.99th percentile latency |       custom-vector-bulk |    2827.03 |     ms |
|                                       100th percentile latency |       custom-vector-bulk |    2890.82 |     ms |
|                                   50th percentile service time |       custom-vector-bulk |    101.609 |     ms |
|                                   90th percentile service time |       custom-vector-bulk |    137.125 |     ms |
|                                   99th percentile service time |       custom-vector-bulk |    277.253 |     ms |
|                                 99.9th percentile service time |       custom-vector-bulk |    2109.04 |     ms |
|                                99.99th percentile service time |       custom-vector-bulk |    2827.03 |     ms |
|                                  100th percentile service time |       custom-vector-bulk |    2890.82 |     ms |
|                                                     error rate |       custom-vector-bulk |          0 |      % |
|                                                 Min Throughput |             delete-model |       84.7 |  ops/s |
|                                                Mean Throughput |             delete-model |       84.7 |  ops/s |
|                                              Median Throughput |             delete-model |       84.7 |  ops/s |
|                                                 Max Throughput |             delete-model |       84.7 |  ops/s |
|                                       100th percentile latency |             delete-model |    11.6162 |     ms |
|                                  100th percentile service time |             delete-model |    11.6162 |     ms |
|                                                     error rate |             delete-model |          0 |      % |
|                                                 Min Throughput |          train-knn-model |        1.1 |  ops/s |
|                                                Mean Throughput |          train-knn-model |        1.1 |  ops/s |
|                                              Median Throughput |          train-knn-model |        1.1 |  ops/s |
|                                                 Max Throughput |          train-knn-model |        1.1 |  ops/s |
|                                       100th percentile latency |          train-knn-model |    909.219 |     ms |
|                                  100th percentile service time |          train-knn-model |    909.219 |     ms |
|                                                     error rate |          train-knn-model |          0 |      % |
|                                                 Min Throughput |           warmup-indices |       3.39 |  ops/s |
|                                                Mean Throughput |           warmup-indices |       3.39 |  ops/s |
|                                              Median Throughput |           warmup-indices |       3.39 |  ops/s |
|                                                 Max Throughput |           warmup-indices |       3.39 |  ops/s |
|                                       100th percentile latency |           warmup-indices |    294.256 |     ms |
|                                  100th percentile service time |           warmup-indices |    294.256 |     ms |
|                                                     error rate |           warmup-indices |          0 |      % |
|                                                 Min Throughput |             prod-queries |      56.65 |  ops/s |
|                                                Mean Throughput |             prod-queries |      56.65 |  ops/s |
|                                              Median Throughput |             prod-queries |      56.65 |  ops/s |
|                                                 Max Throughput |             prod-queries |      56.65 |  ops/s |
|                                        50th percentile latency |             prod-queries |    8.57323 |     ms |
|                                        90th percentile latency |             prod-queries |    11.1135 |     ms |
|                                        99th percentile latency |             prod-queries |     116.16 |     ms |
|                                       100th percentile latency |             prod-queries |    215.067 |     ms |
|                                   50th percentile service time |             prod-queries |    8.57323 |     ms |
|                                   90th percentile service time |             prod-queries |    11.1135 |     ms |
|                                   99th percentile service time |             prod-queries |     116.16 |     ms |
|                                  100th percentile service time |             prod-queries |    215.067 |     ms |
|                                                     error rate |             prod-queries |          0 |      % |
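To compare training runs, it can help to pull individual rows out of a summary like the one above. The following is a rough sketch, assuming the summary has been saved as text in the same pipe-delimited, four-column layout. `parse_summary` is a hypothetical helper, not part of OSB, and the sample rows are copied from the output above.

```python
def parse_summary(table_text):
    """Parse a pipe-delimited OSB summary table into a list of row dicts."""
    rows = []
    for line in table_text.splitlines():
        line = line.strip()
        if not line.startswith("|"):
            continue  # skip banners and blank lines
        cells = [c.strip() for c in line.strip("|").split("|")]
        if len(cells) != 4:
            continue
        # Skip the header row and the |---:| separator row.
        if cells[0] == "Metric" or set(cells[0]) <= set("-:"):
            continue
        metric, task, value, unit = cells
        rows.append({"metric": metric, "task": task,
                     "value": float(value), "unit": unit})
    return rows


sample = """
|                   Metric |            Task |   Value |  Unit |
|-------------------------:|----------------:|--------:|------:|
|               Store size |                 | 1.98823 |    GB |
| 100th percentile latency | train-knn-model | 909.219 |    ms |
|          Mean Throughput | train-knn-model |     1.1 | ops/s |
"""

rows = parse_summary(sample)
train = {r["metric"]: r["value"] for r in rows if r["task"] == "train-knn-model"}
print(train)
```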

Signed-off-by: Finn Roblin <finnrobl@amazon.com>

VijayanB · 2024-06-25T18:39:16Z

vectorsearch/params/train/train-faiss-sift-128-l2-pq.json

+    "target_index_num_vectors": 1000,
+


Can we remove "target_index_num_vectors" from the param file?

finnroblin · 2024-06-25T20:54:48Z

vectorsearch/workload.json

        }
    ],
    "corpora": [
    {
      "name": "cohere",
      "base-url": "https://dbyiw3u3rf9yr.cloudfront.net/corpora/vectorsearch/cohere-wikipedia-22-12-en-embeddings",
+      "target-index": "{{ target_index_name }}",


Calling out here that the target-index param is not used anywhere in the workload, but it is required for OSB validation to pass. I'm not sure what the right solution is, so I opened an issue about it.

IanHoang

LGTM

IanHoang

@finnroblin Overall, LGTM. As per best practices specified in the README, please provide a sample summary output of train-test in the PR description.

IanHoang

LGTM

* Add vectorsearch training workload
  Signed-off-by: Finn Roblin <finnrobl@amazon.com>
* Addressed Vijay feedback and ignores error if model DNE
  Signed-off-by: Finn Roblin <finnrobl@amazon.com>
* Added documentation to VS readme
  Signed-off-by: Finn Roblin <finnrobl@amazon.com>
---------
Signed-off-by: Finn Roblin <finnrobl@amazon.com>
(cherry picked from commit 29d9715)
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

* Add vectorsearch training workload
* Addressed Vijay feedback and ignores error if model DNE
* Added documentation to VS readme
---------
(cherry picked from commit 29d9715)
Signed-off-by: Finn Roblin <finnrobl@amazon.com>
Signed-off-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>
Co-authored-by: github-actions[bot] <github-actions[bot]@users.noreply.github.com>

Add vectorsearch training workload

0febc09

Signed-off-by: Finn Roblin <finnrobl@amazon.com>

finnroblin requested review from IanHoang, gkamat, beaioun, cgchinmay, rishabh6788 and VijayanB as code owners June 24, 2024 23:51

finnroblin mentioned this pull request Jun 24, 2024

[FEATURE] Add Train Model KNN Workload #332

Closed

2 tasks

VijayanB reviewed Jun 25, 2024

finnroblin commented Jun 25, 2024

Addressed Vijay feedback and ignores error if model DNE

f12d1c9

Signed-off-by: Finn Roblin <finnrobl@amazon.com>

finnroblin requested a review from VijayanB June 26, 2024 21:30

IanHoang approved these changes Jul 2, 2024

IanHoang added the backport 1, backport 2, and backport 3 labels Jul 2, 2024

IanHoang reviewed Jul 2, 2024

Added documentation to VS readme

0ea55c6

Signed-off-by: Finn Roblin <finnrobl@amazon.com>

IanHoang approved these changes Jul 18, 2024

IanHoang removed the backport 1 label Jul 18, 2024

IanHoang merged commit 29d9715 into opensearch-project:main Jul 18, 2024
2 checks passed

This was referenced Jul 18, 2024

[Backport 2] Add vectorsearch training workload #345

Merged

[Backport 3] Add vectorsearch training workload #346

Merged
