Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

use-case: Add adaptive retrieval use case #61

Merged
merged 2 commits into from
Mar 7, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
48 changes: 25 additions & 23 deletions .vitepress/config.mts
Original file line number Diff line number Diff line change
Expand Up @@ -41,7 +41,7 @@ export default defineConfig({
nav: [
{ text: 'Home', link: '/' },
{ text: 'Docs', link: '/getting-started/overview' },
{text: 'Tutorial', link: '/tutorial/'},
{text: 'Reference', link: '/reference/'},
{ text: 'Blog', link: 'https://blog.pgvecto.rs' },
],

Expand All @@ -51,6 +51,17 @@ export default defineConfig({
},

sidebar: {
'/reference/': [
{
text: 'Reference',
collapsed: false,
items: [
{ text: 'Schema', link: '/reference/schema' },
{ text: 'Indexing Options', link: '/reference/indexing_options' },
{ text: 'Search Options', link: '/reference/search_options' },
],
},
],
'/': [
{
text: 'Getting Started',
Expand All @@ -72,6 +83,16 @@ export default defineConfig({
{ text: 'Compatibility', link: '/usage/compatibility' },
]
},
{
text: 'Use Cases',
collapsed: false,
items: [
{ text: 'Hybrid search', link: '/use-case/hybrid-search' },
{ text: 'Adaptive retrieval', link: '/use-case/adaptive-retrieval' },
{ text: 'Image search', link: '/use-case/image-search' },
{ text: 'Multi tenancy', link: '/use-case/multi-tenancy' },
],
},
{
text: 'Integration',
collapsed: false,
Expand Down Expand Up @@ -100,18 +121,9 @@ export default defineConfig({
{ text: `Benchmarks`, link: '/faqs/benchmark' },
]
},
{
text: 'Reference',
collapsed: false,
items: [
{ text: 'Schema', link: '/reference/schema' },
{ text: 'Indexing Options', link: '/reference/indexing_options' },
{ text: 'Search Options', link: '/reference/search_options' },
]
},
{
text: 'Community',
collapsed: false,
collapsed: true,
items: [
{ text: 'Contributing to pgvecto.rs', link: '/community/contributing' },
{ text: 'Roadmap', link: '/community/roadmap' },
Expand All @@ -120,23 +132,13 @@ export default defineConfig({
},
{
text: 'Developers',
collapsed: false,
collapsed: true,
items: [
{ text: 'Development tutorial', link: '/developers/development' },
],
},
],
'/tutorial/': [
{
text: 'Use Cases',
collapsed: false,
items: [
{ text: 'Hybrid Search', link: '/tutorial/hybrid-search' },
{ text: 'Image Search', link: '/tutorial/image-search' },
{ text: 'Multi Tenancy', link: '/tutorial/multi-tenancy' },
],
},
]

},

socialLinks: [
Expand Down
8 changes: 4 additions & 4 deletions src/faqs/benchmark.md
Original file line number Diff line number Diff line change
Expand Up @@ -24,7 +24,7 @@ When the `vbase` mode is enabled, `pgvecto.rs` can achieve over **2x** more resp

## Shortening embeddings (a.k.a [Matryoshka Embedding](https://aniketrege.github.io/blog/2024/mrl/))

OpenAI [latest embedding models](https://openai.com/blog/new-embedding-models-and-api-updates) supports shortening embeddings, where developers could simply “remove some numbers from the end of a sequence” and still maintain a valid representation for text.
OpenAI [latest embedding models](https://openai.com/blog/new-embedding-models-and-api-updates) support shortening embeddings, where developers could simply “remove some numbers from the end of a sequence” and still maintain a valid representation for text.

> This enables very flexible usage. For example, when using a vector data store that only supports embeddings up to 1024 dimensions long, developers can now still use our best embedding model text-embedding-3-large and specify a value of 1024 for the dimensions API parameter, which will shorten the embedding down from 3072 dimensions, trading off some accuracy in exchange for the smaller vector size.

Expand All @@ -34,16 +34,16 @@ These are the performance benchmarks for the shortened embeddings. The [dbpedia-

**First Pass** in the benchmark shows that the accuracy of the shortened embeddings is not as good as the full embeddings. 256D HNSW index only achieves ~65% accuracy, while 1024D HNSW index achieves ~85% accuracy.

However, the accuracy of the shortened embeddings surpasses 95% when adaptive retrieval is employed. Additionally, adopting this approach leads to significant improvements in terms of RPS (Requests Per Second) and latency.
However, the accuracy of the shortened embeddings surpasses 95% when [adaptive retrieval](/use-case/adaptive-retrieval) is employed. Additionally, adopting this approach leads to significant improvements in terms of RPS (Requests Per Second) and latency.

## Binary vectors

`pgvecto.rs` supports [binary vectors](/usage/vector-types.html#bvector-binary-vector). The `bvector` type is optimized for storage and performance. It uses a bit-packed representation to store the binary vector. The distance calculation is also optimized for binary vectors.

Here are some performance benchmarks for the `bvector` type. We use the [dbpedia-entities-openai3-text-embedding-3-large-3072-1M](https://huggingface.co/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-3072-1M) dataset for the benchmark. The VM is n2-standard-8 (8 vCPUs, 32 GB memory) on Google Cloud.

We upsert 1M binary vectors into the table and then run a KNN query for each embedding. It only takes about 600MB memory to index 1M binary vectors.
We upsert 1M binary vectors into the table and then run a KNN query for each embedding. It only takes about 600MB memory to index 1M binary vectors, while the `vector` type takes about 18GB memory to index the same number of vectors.

![bvector](./images/bvector.png)

We can see that the `bvector`'s accuracy is not as good as the `vector` type, but it exceeds 95% if we adopt adaptive retrieval.
We can see that the `bvector`'s accuracy is not as good as the `vector` type, but it exceeds 95% if we adopt [adaptive retrieval](/use-case/adaptive-retrieval).
7 changes: 7 additions & 0 deletions src/reference/index.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,7 @@
# Reference

The page shows the reference.

- [Indexing options](/reference/indexing_options)
- [Search options](/reference/search_options)
- [Schema](/reference/schema)
Binary file modified src/usage/images/bvector.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
2 changes: 1 addition & 1 deletion src/usage/indexing.md
Original file line number Diff line number Diff line change
Expand Up @@ -126,4 +126,4 @@ Smaller [`m`](/reference/indexing_options.html#options-for-table-hnsw)) values a

The [`ef_construction`](/reference/indexing_options.html#options-for-table-hnsw) parameter determines the dynamic candidate list size when adding new nodes; increasing this value may enhance recall but could extend index construction time.

The HNSW index is resource-intensive, requiring additional RAM and an adjustment to the `maintenance_work_mem` setting for larger datasets. If you're seeking performance that's markedly faster than IVF, with a high recall rate and scalability that matches dataset size, HNSW is an excellent option.
The HNSW index is resource-intensive, requiring additional RAM and an adjustment to the `maintenance_work_mem` setting for larger datasets. If you're seeking performance that's markedly faster than IVF, with a high recall rate and scalability that matches dataset size, HNSW is an excellent option.
4 changes: 2 additions & 2 deletions src/usage/vector-types.md
Original file line number Diff line number Diff line change
Expand Up @@ -38,8 +38,8 @@ The `bvector` type is optimized for storage and performance. It uses a bit-packe

Here are some performance benchmarks for the `bvector` type. We use the [dbpedia-entities-openai3-text-embedding-3-large-3072-1M](https://huggingface.co/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-3072-1M) dataset for the benchmark. The VM is n2-standard-8 (8 vCPUs, 32 GB memory) on Google Cloud.

We upsert 1M binary vectors into the table and then run a KNN query for each embedding. It only takes about 600MB memory to index 1M binary vectors.
We upsert 1M binary vectors into the table and then run a KNN query for each embedding. It only takes about 600MB memory to index 1M binary vectors, while the `vector` type takes about 18GB memory to index the same number of vectors.

![bvector](./images/bvector.png)

We can see that the `bvector`'s accuracy is not as good as the `vector` type, but it exceeds 95% if we adopt adaptive retrieval.
We can see that the `bvector`'s accuracy is not as good as the `vector` type, but it exceeds 95% if we adopt [adaptive retrieval](/use-case/adaptive-retrieval).
246 changes: 246 additions & 0 deletions src/use-case/adaptive-retrieval.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,246 @@
# Adaptive retrieval and shortening embeddings

OpenAI [latest embedding models](https://openai.com/blog/new-embedding-models-and-api-updates) support shortening embeddings, where developers could simply “remove some numbers from the end of a sequence” and still maintain a valid representation for text.

> This enables very flexible usage. For example, when using a vector data store that only supports embeddings up to 1024 dimensions long, developers can now still use our best embedding model text-embedding-3-large and specify a value of 1024 for the dimensions API parameter, which will shorten the embedding down from 3072 dimensions, trading off some accuracy in exchange for the smaller vector size.

## Shortening embeddings

OpenAI's new embedding model `text-embedding-3-large` produces embeddings with 3072 dimensions by default.

![](./adaptive-retrieval/shortening-embedding.svg)

But you could safely remove some numbers from the end of the sequence and still maintain a valid representation for text. For example, you could shorten the embeddings to 1024 dimensions.

::: details

You may need normalization to ensure that the shortened embeddings compatible with some distance calculation e.g. dot poroduct. OpenAI's API will help you on this if you are calling `text-embedding-3-large` to generate a lower dimension embedding directly, instead of truncating the original embeddings on your own.

```python
embedding_3072_dim_from_openai = [-0.00611, -0.00269, ...]
embedding_1024_dim_from_openai = [-0.01520, -0.03839, ...]

embedding_1024_dim_truncated_from_3072_dim = [-0.00611, -0.00269, ...]
embedding_1024_dim_normalized = [-0.01520, -0.03839, ...]
```

:::

There are lots of benefits to using shortening embeddings:

- **Reduced memory usage**: Shortening embeddings can reduce the memory usage to store the embeddings.
- **Lower dimensions**: Shortening embeddings can reduce the dimensions of the embeddings. Some vector data stores only support embeddings up to a certain number of dimensions (e.g. 2000).
- **Faster indexing and search**: Smaller embeddings can be indexed and searched faster than larger embeddings.
- **Trade-off performance and cost**: Shortening embeddings can trade off some accuracy in exchange for the smaller vector size.

However, the accuracy of the shortened embeddings is not as good as the full embeddings.

## Trade-off performance and cost

We have performed an experiment to evaluate the effectiveness of reduced embeddings compared to the complete embeddings. The benchmark was conducted using the [dbpedia-entities-openai3-text-embedding-3-large-3072-1M](https://huggingface.co/datasets/Qdrant/dbpedia-entities-openai3-text-embedding-3-large-3072-1M) dataset. The experiment was carried out on a Google Cloud VM with specifications of `n2-standard-8`, which includes 8 vCPUs and 32 GB of memory.

We have established a table and populated it with 1 million embeddings, each consisting of 3072 dimensions.

```sql
CREATE TABLE openai3072 (
id bigserial PRIMARY KEY,
text_embedding_3_large_3072_embedding vector(3072),
text_embedding_3_large_3072_embedding bvector(3072)
);
```

::: details

The code for inserting 1M embeddings into the table:

```python
from pgvecto_rs.sdk import PGVectoRs, Record
from pgvecto_rs.psycopg import register_vector
import psycopg

conn = psycopg.connect(conninfo='postgresql://postgres:mysecretpassword@localhost:5433/postgres', autocommit=True)

import json
import os

import numpy as np
import random
from datasets import load_dataset
from datasets.exceptions import DatasetNotFoundError
from tqdm import tqdm

MODEL_NAME, DIMENSIONS = "text-embedding-3-large", 3072
DATASET_NAME = f"Qdrant/dbpedia-entities-openai3-{MODEL_NAME}-{DIMENSIONS}-1M"
collection_name = f"dbpedia-{MODEL_NAME}-{DIMENSIONS}"
embedding_column_name = f"{MODEL_NAME}-{DIMENSIONS}-embedding"

dataset = load_dataset(
DATASET_NAME,
streaming=False,
split="train",
)

bs = 1000
for i, record in tqdm(enumerate(dataset)):
if i % bs == 0:
points = []
points.append({
"embedding": record[embedding_column_name]
})
if i % bs == bs - 1:
batch_points = f", ".join([f"('{p['embedding']}')" for p in points])
conn.execute(f"INSERT INTO openai3072 (text_embedding_3_large_3072_embedding) VALUES %s" % (batch_points))
print(f"Inserted {i} records")
```

:::

Furthermore, we have created indexes for embeddings with dimensions of 256, 1024 and 3072, and performed an ANN (Approximate Nearest Neighbors) query for each dimension. There is no need to create the 256-dimensional column, as we can use the first 256 dimensions of the 3072-dimensional embeddings.

```sql
-- Create 256-dim index
CREATE INDEX openai_vector_index_256 ON public.openai3072 USING vectors (((text_embedding_3_large_3072_embedding[0:256])::vector(256)) vector_l2_ops);

-- Create 1024-dim index
CREATE INDEX openai_vector_index_1024 on openai3072 using vectors((text_embedding_3_large_3072_embedding[0:1024]::vector(1024)) vector_l2_ops);

-- Create 3072-dim index
CREATE INDEX openai_vector_index on openai3072 using vectors(text_embedding_3_large_3072_embedding vector_l2_ops);
```

Additionally, we have constructed a [binary vector index](/usage/vector-types.html#bvector-binary-vector) for the 3072-dimensional embeddings and conducted an ANN query using this index as well.

```sql
CREATE INDEX openai_vector_index_bvector ON public.openai3072 USING vectors (text_embedding_3_large_3072_bvector bvector_l2_ops);
```

We run the KNN query with different limits (5, 10, 50, 100) for the number of neighbors to retrieve.

::: details

The QPS (Queries Per Second) and the average query time were measured for each query with the help of `pgbench`.

```bash
export PGHOST="localhost"
export PGUSER="postgres"
export PGPASSWORD="mysecretpassword"
export PGDATABASE="postgres"
export PGPORT=5433
export VECTOR_DIM=3072
TEST_TIME=300
for clients in 2 4 8 16 32; do
pgbench -n -T "${TEST_TIME}" \
--file=query.sql \
-c "${clients}" -j "${clients}"
done
```

The `query.sql` file contains the following SQL queries:

```sql
\set random_id random(1, 1000000)

select id from openai3072 order by text_embedding_3_large_3072_embedding <-> (select text_embedding_3_large_3072_embedding from openai3072 where id = :random_id) limit 100;
```

:::

The results of the experiment are shown here:

![](./adaptive-retrieval/first-pass.png)

As anticipated, the 3072-dimensional embeddings exhibit the highest accuracy, while the 1024-dimensional embeddings demonstrate a slightly lower accuracy at around 85%. The 256-dimensional embeddings yield the lowest accuracy, approximately 65%. On the other hand, the binary vector index achieves an accuracy of about 80%.

Regarding the Requests Per Second (RPS) metric, the binary vector index showcases the most efficient performance, followed by the 256-dimensional embeddings, the 1024-dimensional embeddings, and finally, the 3072-dimensional embeddings.

![](./adaptive-retrieval/memusage.png)

::: details

![table](./adaptive-retrieval/first-pass-tab.png)

:::

The memory usage of the indexes is an important aspect to take into account. Indexing 1 million 3072-dimensional binary vectors requires only around 600MB of memory. In contrast, the 3072-dimensional vector index consumes approximately 18GB of memory. This represents a significant difference in memory usage, as the 3072-dimensional vector index utilizes approximately **30x more memory than the binary vector index**.

## Improve the accuracy via adaptive retrieval

Lower dimension and binary vector indexes trade off accuracy for advantages such as higher RPS and lower memory usage compared to full dimension indexes. Adaptive retrieval techniques can be used to combine the strengths of both approaches, achieving an optimized solution that balances accuracy, memory usage, and query speed.

The logic behind adaptive retrieval is really simple. Let's take `get top 100 candidates` as an example. We can perform the following steps:

1. **Query the lower dimensional or binary vector index** first to retrieve 200 candidates from the 1 million embeddings. This is a fast operation.
2. **Rerank the candidates using a KNN query** to retrieve the top 100 candidates. K-NN is well-suited for situations where smaller sets and precise similarity search are necessary, making it an excellent choice for reranking in this context.

![](./adaptive-retrieval/adaptive-retrieval.svg)

The reranking step is a bit slower than the initial query, but it is still much faster than querying the full dimension index. It could be done in the pgvecto.rs:

```sql
CREATE OR REPLACE FUNCTION match_documents_adaptive(
query_embedding vector(3072),
match_count int
)
RETURNS SETOF openai3072
LANGUAGE SQL
AS $$
WITH shortlist AS (
SELECT *
FROM openai3072
ORDER BY (text_embedding_3_large_3072_embedding[0:256])::vector(256) <-> (query_embedding[0:256])::vector(256)
LIMIT match_count * 2
)
SELECT *
FROM shortlist
ORDER BY text_embedding_3_large_3072_embedding <-> query_embedding
LIMIT match_count;
$$;
```

The `match_documents_adaptive` function accepts a query embedding and a match count as input parameters. It first retrieves `match_count * 2` candidates from the 1 million embeddings using the 256-dimensional index. Then, it reranks the candidates using a KNN query to retrieve the top `match_count` candidates.

The function for binary vector and 1024-dimensional embeddings can be implemented in a similar manner.

We conduct the benchmark again with the adaptive retrieval technique. We tag the adaptive retrieval technique as `Adaptive Retrieval` in the following figure.

![adaptive retrieval benchmark](./adaptive-retrieval/adaptive-retrieval-bench.png)

Adding a reranking step significantly improves the accuracy of the lower dimensional and binary vector indexes.

The binary vector index with adaptive retrieval achieves an accuracy of around 95%, with only a slight decrease in RPS compared to the binary vector index without adaptive retrieval. Remarkably, the memory usage of the binary vector index with adaptive retrieval is only around 600MB, which is the same as the binary vector index without adaptive retrieval.

![](../usage/images/bvector.png)

::: details

![table](./adaptive-retrieval/adaptive-retrieval-tab.png)

:::

## Conclusion

Shortening embeddings can reduce the memory usage to store the embeddings, and speed up indexing and search. However, the accuracy of the shortened embeddings is not as good as the full embeddings. That's why we need to use adaptive retrieval techniques to improve the accuracy of the lower dimensional indexes.

It seems that binary vector indexes with adaptive retrieval are the best choice for the `text-embedding-3-large` model. You could also consider 1024-dimensional indexes with adaptive retrieval if you need a higher accuracy (e.g. 99%).

PS: All the code and data used in this experiment are available in the [pgvecto.rs-matryoshka-embeddings-benchmark](https://github.com/tensorchord/pgvecto.rs-matryoshka-embeddings-benchmark).

<style>
code {
white-space: pre-wrap !important;
counter-reset: step;
counter-increment: step 0;
min-width: calc(100% - 40px) !important;
}

.line::before {
content: counter(step);
counter-increment: step;
width: 2ch;
margin-right: 36px;
margin-left: calc(-36px - 2ch);
display: inline-block;
text-align: right;
color: var(--vp-code-line-number-color);
}
</style>
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
3 changes: 3 additions & 0 deletions src/use-case/adaptive-retrieval/adaptive-retrieval.svg
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added src/use-case/adaptive-retrieval/first-pass.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Binary file added src/use-case/adaptive-retrieval/memusage.png
Loading
Sorry, something went wrong. Reload?
Sorry, we cannot display this file.
Sorry, this file is invalid so it cannot be displayed.
Loading