[New Feature] support for pgvector extension #16166

saurik · 2023-02-21T04:41:26Z

Jira Link: DB-5722

Description

I'm enjoying working with AI, and I also enjoyed working with Yugabyte... but, I'd love to use them together! Someone recently poked the pgvector maintainer to enable support for vectors with enough dimensions to index embedding generated by OpenAI pgvector/pgvector#52 which was latched onto by Supabase supabase/postgres#472 to build an intelligent contextual search for their documentation (see various references below). Is it possible for Yugabyte to support this extension, or would this essentially require rewriting the concept of the extension from scratch?

https://supabase.com/blog/openai-embeddings-postgres-vector

https://www.youtube.com/watch?v=Yhtjd7yGGGA

https://news.ycombinator.com/item?id=34684593

Phase 1 -

[YSQL] Add support for pgvector (without Indexing) #18686

Phase 2 -

[YSQL] Add Index support for vector/embedding data type #18688

jaki · 2023-04-22T01:57:57Z

It might be doable.

For table (not-index), supported ops are only standard equalities/inequalities. So for storage, it could be straight serialized, and it would automatically sort correctly if range partitioned. Could also support hash partitioning, in which case you don't get inequalities pushed down but could still apply those filters on PG side.

For indexes, there are multiple types, but they aren't too different.

index build: PG logic to create kmeans clusters, find centers, insert those centers. Defaults to 100 clusters, but this is user-configurable (at create time).
scan: read all index items (default 100 because of above), then probe the top x clusters whose center matches search criteria (default x=1, user-configurable at run time), then sort using the vectors within those top x clusters
insert: finds the appropriate cluster and adds to that
delete: PG does vacuum, unlike YB, which sends explicit deletes to the index. It seems the centers aren't ever recalculated (I could be wrong), so after some time, there could be greater inaccuracies (the indexes are already known to be approximations, so this isn't a huge deal).

YB could implement the indexes by storing in a format where the key is the closest center to that vector. For example, if a scan will probe 2 clusters, it can lookup distinct centers, find the top 2 closest to what the query wants, then do another lookup on the index filtering on center IN (first-closest, second-closest), then use the matching pk references to do lookups on the main table where you have access to the vector. Details on the scanning strategy could be thought through more: for example, is it possible to push down the LIMIT to docdb when doing the main table scan?

It is also possible to INCLUDE the vector column in case you want to be able to do an index only scan.

What happens when deletes happen such that a cluster no longer has any vectors? Then the cluster will be permanently missing from the index. That might be undesirable, so one way to avoid that is to pin a row into the index for each center: not clear what mechanism can do that.

More visual example:

CREATE TABLE t (i int PRIMARY KEY, v vector(2));
INSERT INTO t VALUES (1, '[-5,-1]'), (2, '[-5,1]'), (3, '[5,-1]'), (4, '[5,1]');
CREATE INDEX ON t USING ivfflat (v vector_l2_ops) WITH (lists = 2); -- 2 clusters

Index centers should be [-5,0] and [5,0]. Table should contain

hash=yb_hash_code(1), hash_keys=(1), nonkeys=('[-5,-1]')
hash=yb_hash_code(2), hash_keys=(2), nonkeys=('[-5,1]')
hash=yb_hash_code(3), hash_keys=(3), nonkeys=('[5,-1]')
hash=yb_hash_code(4), hash_keys=(4), nonkeys=('[5,1]')

where [-5,-1] is formatted like 4-byte-float(-5) || 4-byte-float(-1) where || is concatenation. Since vector has n-dims in the definition, it should be known in the metadata.

Index should contain

hash=yb_hash_code('[-5,0]'), hash_keys=('[-5,0]'), range_keys=(<pk_ref_to_row_1>)
hash=yb_hash_code('[-5,0]'), hash_keys=('[-5,0]'), range_keys=(<pk_ref_to_row_2>)
hash=yb_hash_code('[5,0]'), hash_keys=('[5,0]'), range_keys=(<pk_ref_to_row_3>)
hash=yb_hash_code('[5,0]'), hash_keys=('[5,0]'), range_keys=(<pk_ref_to_row_4>)

SET ivfflat.probes = 1;
SELECT * FROM t ORDER BY v <-> '[4,1]' LIMIT 1;

Use <-> '[4,1]' to put centers into heap. Heap will be ordered [5,0] then [-5,0].

Next, pop <probe> number of items off heap. In this case, only [5,0].

Then, lookup matches in the index for [5,0]. The pk references could be used to look up the main table row. However, that is wasteful to do for all rows because of the LIMIT clause. I'm not sure what specific optimizations could take place here, but at the very least, the information is all present to figure out the closest match, which is [5,1], so row (4, '[5,1]') is returned.

xingh · 2023-05-10T18:29:17Z

👍

saurik added the kind/new-feature This is a request for a completely new feature label Feb 21, 2023

sushantrmishra added area/ysql Yugabyte SQL (YSQL) pg-compatibility Label for issues that result in differences b/w YSQL and Pg semantics labels Mar 3, 2023

yugabyte-ci added the priority/medium Medium priority issue label Mar 3, 2023

jaki mentioned this issue Apr 22, 2023

build: support compiling with postgres pgvector/pgvector#98

Closed

This was referenced Jun 27, 2023

Feature Request: Support for pgvector in YugabyteDB #17960

Closed

[YSQL] CREATE ACCESS METHOD is not supported yet #10693

Open

sushantrmishra added roadmap-tracking-issue This issue tracks a major roadmap item, and usually appears in the roadmap list. and removed roadmap-tracking-issue This issue tracks a major roadmap item, and usually appears in the roadmap list. labels Aug 15, 2023

premkumr added the current-roadmap label Oct 8, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[New Feature] support for pgvector extension #16166

[New Feature] support for pgvector extension #16166

saurik commented Feb 21, 2023 •

edited by sushantrmishra

Loading

jaki commented Apr 22, 2023

xingh commented May 10, 2023

[New Feature] support for pgvector extension #16166

[New Feature] support for pgvector extension #16166

Comments

saurik commented Feb 21, 2023 • edited by sushantrmishra Loading

Description

Phase 1 -

Phase 2 -

jaki commented Apr 22, 2023

xingh commented May 10, 2023

saurik commented Feb 21, 2023 •

edited by sushantrmishra

Loading