add nb test cases for vector search jsonapi #512

Merged 4 commits on Sep 7, 2023
50 changes: 50 additions & 0 deletions nosqlbench/http-jsonapi-vector-crud.md
@@ -0,0 +1,50 @@
# JSON API Vector CRUD

## Description

The JSON API Vector CRUD workflow targets Stargate's JSON API using vectors from an external dataset.
The [dataset](#dataset) is mandatory and must contain one vector per row; each vector is used as the input for write, read, update and delete operations.
Use this workflow to test Stargate vector performance with your own embeddings or any other realistic vector dataset.

In contrast to other workflows, this one is not split into ramp-up and main phases. Instead, after a one-time schema step that sets up the namespace and collection, there is only the main phase with 4 different load types (write, read, update and delete).

## Named Scenarios

### default

The default scenario for `http-jsonapi-vector-crud.yaml` runs each load type of the main phase sequentially: write, read, update and delete. Set the number of cycles for each of them with `write-cycles`, `read-cycles`, `update-cycles` and `delete-cycles`. The default value for all 4 cycles variables is the number of documents to process (see [Workload Parameters](#workload-parameters)).
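
The cycle defaults come from nested templates in the workload file, `TEMPLATE(write-cycles,TEMPLATE(docscount,500))` and its siblings. A minimal Python sketch of that fallback logic follows; the function name `resolve_cycles` is hypothetical and only illustrates the lookup order, it is not part of NoSQLBench.

```python
PHASES = ("write", "read", "update", "delete")


def resolve_cycles(params: dict) -> dict:
    """Illustrate TEMPLATE(<phase>-cycles,TEMPLATE(docscount,500)).

    A phase-specific value wins; otherwise docscount is used;
    otherwise the built-in default of 500 applies.
    """
    docscount = int(params.get("docscount", 500))
    return {phase: int(params.get(f"{phase}-cycles", docscount)) for phase in PHASES}


if __name__ == "__main__":
    # Equivalent to passing docscount=1000 read-cycles=250 on the command line.
    print(resolve_cycles({"docscount": "1000", "read-cycles": "250"}))
    # prints {'write': 1000, 'read': 250, 'update': 1000, 'delete': 1000}
```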

Note that error handling is set to `errors=timer,warn`, which means that in case of HTTP errors the scenario is not stopped.
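
Each op in the workload declares an `ok-body` regular expression that the response body is checked against. The sketch below shows how two of those patterns behave, assuming the whole body must match the pattern (an assumption about the driver, suggested by the leading and trailing `.*`). The response bodies are hypothetical samples, not captured output.

```python
import re

# Patterns copied from the write and read ops of the workload file.
OK_BODY = {
    "write-insert-one-vector": r'.*\"insertedIds\":\[.*\].*',
    "find-one-by-vector-projection": r'.*\"data\".*',
}


def is_ok(op_name: str, body: str) -> bool:
    """Return True when the response body fully matches the op's ok-body pattern."""
    return re.fullmatch(OK_BODY[op_name], body, flags=re.DOTALL) is not None


if __name__ == "__main__":
    # Hypothetical response bodies, for illustration only.
    print(is_ok("write-insert-one-vector", '{"status":{"insertedIds":["0"]}}'))
    print(is_ok("write-insert-one-vector", '{"errors":[{"message":"hypothetical failure"}]}'))
```

With `errors=timer,warn`, a mismatch like the second sample would be timed and logged as a warning instead of stopping the run.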

## Dataset

### Vector Sample

The vector size is 1536 in the nosqlbench file (1536 is the standard size of an OpenAI embedding vector).
A sample dataset is in [vector dataset](vector-dataset.txt).

> To test a different vector size, change both the [http-jsonapi-vector-crud create-collection op](http-jsonapi-vector-crud.yaml) and the [vector dataset](vector-dataset.txt).
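
A dataset for a different vector size can be generated and validated with a short script. The sketch below assumes each line of the dataset file is a JSON array of numbers, which is inferred from the workload substituting the line directly as the `$vector` value. The function names are hypothetical, and the random values are placeholders, not realistic embeddings.

```python
import json
import random


def write_vector_dataset(path: str, rows: int, size: int = 1536, seed: int = 0) -> None:
    """Write `rows` lines, each a JSON array of `size` pseudo-random floats."""
    rng = random.Random(seed)
    with open(path, "w", encoding="utf-8") as fh:
        for _ in range(rows):
            vector = [round(rng.uniform(-1.0, 1.0), 6) for _ in range(size)]
            fh.write(json.dumps(vector) + "\n")


def check_vector_dataset(path: str, size: int = 1536) -> int:
    """Verify every non-empty line is a JSON array of `size` items; return the row count."""
    count = 0
    with open(path, encoding="utf-8") as fh:
        for line_no, line in enumerate(fh, start=1):
            if not line.strip():
                continue
            vector = json.loads(line)
            if not isinstance(vector, list) or len(vector) != size:
                raise ValueError(f"line {line_no}: expected a JSON array of {size} numbers")
            count += 1
    return count
```

For example, `write_vector_dataset("my-vectors.txt", rows=1000, size=768)` followed by `check_vector_dataset("my-vectors.txt", size=768)` would produce and verify a 768-dimension file; the `size` in the create-collection op must then be set to the same value.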

## Workload Parameters

- `docscount` - the number of documents to process in each step of a scenario (default: `500`)
- `dataset` - the file to read the vectors from, one per line (default: `vector-dataset.txt`; note that if the number of vectors in the file is smaller than the `docscount` parameter, the vectors will be reused)
- `connections` - number of HTTP2 connections to be shared between the threads (default: `20`)
- `write-cycles`, `read-cycles`, `update-cycles`, `delete-cycles` - the number of cycles to run for each phase (default: `docscount`)
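
The relationship between `docscount` and the cycle counts follows from the `seq_key` binding, `Mod(<<docscount:500>>); ToString()`, which produces the document `_id`. A small sketch of that mapping, with a hypothetical function name:

```python
def seq_key(cycle: int, docscount: int = 500) -> str:
    """Mirror the seq_key binding: the cycle number modulo docscount, as a string."""
    return str(cycle % docscount)


if __name__ == "__main__":
    # With write-cycles larger than docscount, document ids wrap around.
    ids = {seq_key(cycle, docscount=500) for cycle in range(1200)}
    print(len(ids))  # prints 500
```

So setting `write-cycles` above `docscount` does not create more distinct `_id` values; later cycles reuse ids that were already generated.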

## Sample Command

### Against AstraDB

> Comment out the `create-namespace` op in the [nosqlbench yaml file](http-jsonapi-vector-crud.yaml) first.

```
nb5 -v http-jsonapi-vector-crud docscount=1000 threads=20 jsonapi_host=Your-AstraDB-Host auth_token=Your-AstraDB-Token jsonapi_port=443 protocol=https path_prefix=/api/json namespace=Your-Keyspace
```

### Against Local JSON API

```
nb5 -v http-jsonapi-vector-crud jsonapi_host=localhost docscount=1000 threads=20
```
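
To see which endpoint a given set of parameters resolves to, the op `uri` template from the workload file can be mirrored in a few lines. This is only an illustrative sketch: `build_uri` is a hypothetical helper, its defaults are the ones declared in the workload file, and a single host is assumed instead of a weighted host list.

```python
def build_uri(
    jsonapi_host: str = "jsonapi",
    protocol: str = "http",
    jsonapi_port: int = 8181,
    path_prefix: str = "",
    namespace: str = "jsonapi_vector_crud_namespace",
    collection: str = "jsonapi_vector_crud_collection",
) -> str:
    """Mirror <<protocol>>://{host}:<<jsonapi_port>><<path_prefix>>/v1/<<namespace>>/<<collection>>."""
    return f"{protocol}://{jsonapi_host}:{jsonapi_port}{path_prefix}/v1/{namespace}/{collection}"


if __name__ == "__main__":
    # Local JSON API command from above.
    print(build_uri(jsonapi_host="localhost"))
    # AstraDB command from above, with its placeholder values left as-is.
    print(build_uri("Your-AstraDB-Host", "https", 443, "/api/json", "Your-Keyspace"))
```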

197 changes: 197 additions & 0 deletions nosqlbench/http-jsonapi-vector-crud.yaml
@@ -0,0 +1,197 @@
min_version: "5.17.3"

# Example command line
# Against AstraDB
# nb5 -v http-jsonapi-vector-crud docscount=1000 threads=20 jsonapi_host=Your-AstraDB-Host auth_token=Your-AstraDB-Token jsonapi_port=443 protocol=https path_prefix=/api/json namespace=Your-Keyspace
# Against local JSON API
# nb5 -v http-jsonapi-vector-crud jsonapi_host=localhost docscount=1000 threads=20

description: >2
  This workload emulates vector CRUD operations for the Stargate JSON API.
  It requires a dataset file (default: vector-dataset.txt) that contains vectors of size 1536.
  1536 is the standard vector size produced by OpenAI embeddings, so the benchmark uses this size.


scenarios:
  default:
    schema: run driver=http tags==block:schema threads==1 cycles==UNDEF
    write: run driver=http tags==name:"write.*" cycles===TEMPLATE(write-cycles,TEMPLATE(docscount,500)) threads=auto errors=timer,warn
    read: run driver=http tags==name:"read.*" cycles===TEMPLATE(read-cycles,TEMPLATE(docscount,500)) threads=auto errors=timer,warn
    update: run driver=http tags==name:"update.*" cycles===TEMPLATE(update-cycles,TEMPLATE(docscount,500)) threads=auto errors=timer,warn
    delete: run driver=http tags==name:"delete.*" cycles===TEMPLATE(delete-cycles,TEMPLATE(docscount,500)) threads=auto errors=timer,warn

bindings:
  # To enable an optional weighted set of hosts in place of a load balancer
  # Examples
  #   single host: jsonapi_host=host1
  #   multiple hosts: jsonapi_host=host1,host2,host3
  #   multiple weighted hosts: jsonapi_host=host1:3,host2:7
  weighted_hosts: WeightedStrings('<<jsonapi_host:jsonapi>>')

  # spread into different spaces to use multiple connections
  space: HashRange(1,<<connections:20>>); ToString();

  # http request id
  request_id: ToHashedUUID(); ToString();

  # autogenerate auth token to use on API calls using configured uri/uid/password, unless one is provided
  token: Discard(); Token('<<auth_token:>>','<<uri:http://localhost:8081/v1/auth>>', '<<uid:cassandra>>', '<<pswd:cassandra>>');

  seq_key: Mod(<<docscount:500>>); ToString() -> String
  random_key: Uniform(0,<<docscount:500>>); ToString() -> String
  vector_json: HashedLineToString('<<dataset:vector-dataset.txt>>');

blocks:
  schema:
    ops:
      create-namespace:
        method: POST
        uri: <<protocol:http>>://{weighted_hosts}:<<jsonapi_port:8181>><<path_prefix:>>/v1
        Accept: "application/json"
        X-Cassandra-Request-Id: "{request_id}"
        X-Cassandra-Token: "{token}"
        Content-Type: "application/json"
        ok-body: ".*\"ok\":1.*"
        body: >2
          {
            "createNamespace": {
              "name": "<<namespace:jsonapi_vector_crud_namespace>>"
            }
          }

      delete-collection:
        method: POST
        uri: <<protocol:http>>://{weighted_hosts}:<<jsonapi_port:8181>><<path_prefix:>>/v1/<<namespace:jsonapi_vector_crud_namespace>>
        Accept: "application/json"
        X-Cassandra-Request-Id: "{request_id}"
        X-Cassandra-Token: "{token}"
        Content-Type: "application/json"
        ok-body: ".*\"ok\":1.*"
        body: >2
          {
            "deleteCollection": {
              "name": "<<collection:jsonapi_vector_crud_collection>>"
            }
          }

      create-collection:
        method: POST
        uri: <<protocol:http>>://{weighted_hosts}:<<jsonapi_port:8181>><<path_prefix:>>/v1/<<namespace:jsonapi_vector_crud_namespace>>
        Accept: "application/json"
        X-Cassandra-Request-Id: "{request_id}"
        X-Cassandra-Token: "{token}"
        Content-Type: "application/json"
        ok-body: ".*\"ok\":1.*"
        # vector must be enabled when creating the collection
        body: >2
          {
            "createCollection": {
              "name": "<<collection:jsonapi_vector_crud_collection>>",
              "options": {
                "vector": {
                  "size": 1536
                }
              }
            }
          }

  write:
    ops:
      write-insert-one-vector:
        params:
          ratio: 5
        space: "{space}"
        method: POST
        uri: <<protocol:http>>://{weighted_hosts}:<<jsonapi_port:8181>><<path_prefix:>>/v1/<<namespace:jsonapi_vector_crud_namespace>>/<<collection:jsonapi_vector_crud_collection>>
        Accept: "application/json"
        X-Cassandra-Request-Id: "{request_id}"
        X-Cassandra-Token: "{token}"
        Content-Type: "application/json"
        ok-body: '.*\"insertedIds\":\[.*\].*'
        body: >2
          {
            "insertOne" : {
              "document" : {
                "_id" : "{seq_key}",
                "$vector" : {vector_json}
              }
            }
          }
  read:
    ops:
      find-one-by-vector-projection:
        space: "{space}"
        method: POST
        uri: <<protocol:http>>://{weighted_hosts}:<<jsonapi_port:8181>><<path_prefix:>>/v1/<<namespace:jsonapi_vector_crud_namespace>>/<<collection:jsonapi_vector_crud_collection>>
        Accept: "application/json"
        X-Cassandra-Request-Id: "{request_id}"
        X-Cassandra-Token: "{token}"
        Content-Type: "application/json"
        ok-body: ".*\"data\".*"
        body: >2
          {
            "findOne": {
              "sort" : {"$vector" : {vector_json}},
              "projection" : {"$vector" : 1}
            }
          }

      find-by-vector-projection:
        space: "{space}"
        method: POST
        uri: <<protocol:http>>://{weighted_hosts}:<<jsonapi_port:8181>><<path_prefix:>>/v1/<<namespace:jsonapi_vector_crud_namespace>>/<<collection:jsonapi_vector_crud_collection>>
        Accept: "application/json"
        X-Cassandra-Request-Id: "{request_id}"
        X-Cassandra-Token: "{token}"
        Content-Type: "application/json"
        ok-body: ".*\"data\".*"
        body: >2
          {
            "find": {
              "sort" : {"$vector" : {vector_json}},
              "projection" : {"$vector" : 1, "$similarity" : 1},
              "options" : {
                "limit" : 10
              }
            }
          }


  update:
    ops:
      find-one-update-vector:
        space: "{space}"
        method: POST
        uri: <<protocol:http>>://{weighted_hosts}:<<jsonapi_port:8181>><<path_prefix:>>/v1/<<namespace:jsonapi_vector_crud_namespace>>/<<collection:jsonapi_vector_crud_collection>>
        Accept: "application/json"
        X-Cassandra-Request-Id: "{request_id}"
        X-Cassandra-Token: "{token}"
        Content-Type: "application/json"
        ok-body: ".*\"data\".*"
        body: >2
          {
            "findOneAndUpdate": {
              "sort" : {"$vector" : {vector_json}},
              "update" : {"$set" : {"status" : "active"}},
              "options" : {"returnDocument" : "after"}
            }
          }

  delete:
    ops:
      delete-document:
        space: "{space}"
        method: POST
        uri: <<protocol:http>>://{weighted_hosts}:<<jsonapi_port:8181>><<path_prefix:>>/v1/<<namespace:jsonapi_vector_crud_namespace>>/<<collection:jsonapi_vector_crud_collection>>
        Accept: "application/json"
        X-Cassandra-Request-Id: "{request_id}"
        X-Cassandra-Token: "{token}"
        Content-Type: "application/json"
        # [01] accepts a deletedCount of 0 or 1; the earlier [0,1] class also matched a literal comma
        ok-body: ".*\"deletedCount\":[01].*"
        body: >2
          {
            "findOneAndDelete": {
              "sort" : {"$vector" : {vector_json}}
            }
          }
