Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[New Feature] support for pgvector extension #16166

Open
saurik opened this issue Feb 21, 2023 · 2 comments
Open

[New Feature] support for pgvector extension #16166

saurik opened this issue Feb 21, 2023 · 2 comments
Labels
area/ysql Yugabyte SQL (YSQL) current-roadmap kind/new-feature This is a request for a completely new feature pg-compatibility Label for issues that result in differences b/w YSQL and Pg semantics priority/medium Medium priority issue roadmap-tracking-issue This issue tracks a major roadmap item, and usually appears in the roadmap list.

Comments

@saurik
Copy link

saurik commented Feb 21, 2023

Jira Link: DB-5722

Description

I'm enjoying working with AI, and I also enjoyed working with Yugabyte... but, I'd love to use them together! Someone recently poked the pgvector maintainer to enable support for vectors with enough dimensions to index embedding generated by OpenAI pgvector/pgvector#52 which was latched onto by Supabase supabase/postgres#472 to build an intelligent contextual search for their documentation (see various references below). Is it possible for Yugabyte to support this extension, or would this essentially require rewriting the concept of the extension from scratch?

https://supabase.com/blog/openai-embeddings-postgres-vector

https://www.youtube.com/watch?v=Yhtjd7yGGGA

https://news.ycombinator.com/item?id=34684593

Phase 1 -

Phase 2 -

@saurik saurik added the kind/new-feature This is a request for a completely new feature label Feb 21, 2023
@sushantrmishra sushantrmishra added area/ysql Yugabyte SQL (YSQL) pg-compatibility Label for issues that result in differences b/w YSQL and Pg semantics labels Mar 3, 2023
@yugabyte-ci yugabyte-ci added the priority/medium Medium priority issue label Mar 3, 2023
@jaki
Copy link
Contributor

jaki commented Apr 22, 2023

It might be doable.

For table (not-index), supported ops are only standard equalities/inequalities. So for storage, it could be straight serialized, and it would automatically sort correctly if range partitioned. Could also support hash partitioning, in which case you don't get inequalities pushed down but could still apply those filters on PG side.

For indexes, there are multiple types, but they aren't too different.

  • index build: PG logic to create kmeans clusters, find centers, insert those centers. Defaults to 100 clusters, but this is user-configurable (at create time).
  • scan: read all index items (default 100 because of above), then probe the top x clusters whose center matches search criteria (default x=1, user-configurable at run time), then sort using the vectors within those top x clusters
  • insert: finds the appropriate cluster and adds to that
  • delete: PG does vacuum, unlike YB, which sends explicit deletes to the index. It seems the centers aren't ever recalculated (I could be wrong), so after some time, there could be greater inaccuracies (the indexes are already known to be approximations, so this isn't a huge deal).

YB could implement the indexes by storing in a format where the key is the closest center to that vector. For example, if a scan will probe 2 clusters, it can lookup distinct centers, find the top 2 closest to what the query wants, then do another lookup on the index filtering on center IN (first-closest, second-closest), then use the matching pk references to do lookups on the main table where you have access to the vector. Details on the scanning strategy could be thought through more: for example, is it possible to push down the LIMIT to docdb when doing the main table scan?

It is also possible to INCLUDE the vector column in case you want to be able to do an index only scan.

What happens when deletes happen such that a cluster no longer has any vectors? Then the cluster will be permanently missing from the index. That might be undesirable, so one way to avoid that is to pin a row into the index for each center: not clear what mechanism can do that.

More visual example:

CREATE TABLE t (i int PRIMARY KEY, v vector(2));
INSERT INTO t VALUES (1, '[-5,-1]'), (2, '[-5,1]'), (3, '[5,-1]'), (4, '[5,1]');
CREATE INDEX ON t USING ivfflat (v vector_l2_ops) WITH (lists = 2); -- 2 clusters

Index centers should be [-5,0] and [5,0]. Table should contain

hash=yb_hash_code(1), hash_keys=(1), nonkeys=('[-5,-1]')
hash=yb_hash_code(2), hash_keys=(2), nonkeys=('[-5,1]')
hash=yb_hash_code(3), hash_keys=(3), nonkeys=('[5,-1]')
hash=yb_hash_code(4), hash_keys=(4), nonkeys=('[5,1]')

where [-5,-1] is formatted like 4-byte-float(-5) || 4-byte-float(-1) where || is concatenation. Since vector has n-dims in the definition, it should be known in the metadata.

Index should contain

hash=yb_hash_code('[-5,0]'), hash_keys=('[-5,0]'), range_keys=(<pk_ref_to_row_1>)
hash=yb_hash_code('[-5,0]'), hash_keys=('[-5,0]'), range_keys=(<pk_ref_to_row_2>)
hash=yb_hash_code('[5,0]'), hash_keys=('[5,0]'), range_keys=(<pk_ref_to_row_3>)
hash=yb_hash_code('[5,0]'), hash_keys=('[5,0]'), range_keys=(<pk_ref_to_row_4>)
SET ivfflat.probes = 1;
SELECT * FROM t ORDER BY v <-> '[4,1]' LIMIT 1;

Use <-> '[4,1]' to put centers into heap. Heap will be ordered [5,0] then [-5,0].

Next, pop <probe> number of items off heap. In this case, only [5,0].

Then, lookup matches in the index for [5,0]. The pk references could be used to look up the main table row. However, that is wasteful to do for all rows because of the LIMIT clause. I'm not sure what specific optimizations could take place here, but at the very least, the information is all present to figure out the closest match, which is [5,1], so row (4, '[5,1]') is returned.

@xingh
Copy link

xingh commented May 10, 2023

👍

@sushantrmishra sushantrmishra added roadmap-tracking-issue This issue tracks a major roadmap item, and usually appears in the roadmap list. and removed roadmap-tracking-issue This issue tracks a major roadmap item, and usually appears in the roadmap list. labels Aug 15, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/ysql Yugabyte SQL (YSQL) current-roadmap kind/new-feature This is a request for a completely new feature pg-compatibility Label for issues that result in differences b/w YSQL and Pg semantics priority/medium Medium priority issue roadmap-tracking-issue This issue tracks a major roadmap item, and usually appears in the roadmap list.
Projects
None yet
Development

No branches or pull requests

6 participants