-
Notifications
You must be signed in to change notification settings - Fork 73
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Sparse vectors: 1 representation, 2 use cases #587
Comments
Thanks for your suggestion! We're aware of the sparse index here and have some prototype at #552. For the weighted keyword search, we have tried https://github.com/tensorchord/pg_bestmatch.rs. Does it work for you? We're also thinking of a better way to directly build text index for bm25 search |
Thanks for aswering, i will take a look at I think that part of storing sparse vectors are solved. However i think, the query part must be rethinked. I honestly think that the native
I think this type of API is great and can solve the previous animal keywors example, and also instead of 'english' we can add custom sparse encoders (TFIDF, BM25, SPLADE, CUSTOM)
And for compute the distances, instead of the tsvector SELECT text
FROM documents
ORDER BY sparse_keywords <#> to_svector_query('simple', 'Dog & Cat)
LIMIT 50; UpdatePostgreSQL
I think it would be abstraction that sparse vectors become the missing |
An implementation of bm25 would be huge for my application. What tasks remain to be done? |
@jbohnslav We don't have an ETA for this. The main reason is that the algorithm is too new and we're not sure what's it's actual behavior with different data like BM25 and how to setup the config param. For bm25, have you tried paradedb? They claim to have full support of bm25 |
In my opinion, sparse vectors can solve two different problems:
In both cases, we can have a sparse vector representation:
{'What': 0.10430284, 'is': 0.10090457, 'BM': 0.2635918, '25': 0.3382988, '?': 0.052101523}
{"Dog": 0.4, "Cat": 0.3, "Giant panda": 0.1, "Komodo dragon": 0.05}
However, the use cases are different:
Text Search
For text search, sparse vectors like BM25 are a good representation. For example, if I look for the query:
BM25 vs SPLADE
This will be the tokens:
["BM", "25", "vs", "SPLADE"]
It is acceptable to return documents that contain a subset of these tokens, such as:
{'What': 0.10430284, 'is': 0.10090457, 'BM': 0.2635918, '25': 0.3382988, '?': 0.052101523}
Weighted-Keywords Search
Now imagine that I have a collection of documents (images or texts), each annotated with animal keywords and their corresponding probabilities:
DOC A: {"Dog": 0.43}
DOC B: {"Cat": 0.21}
DOC C: {"Dog": 0.65, "Cat": 0.11}
DOC D: {"Giant panda": 0.1}
DOC E: {"Dog": 0.33, "Cat": 0.66}
If I perform a query with the keyword
["Dog"]
, I want the following results (sorted by inner product distance):DOC C: {"Dog": 0.65, "Cat": 0.11}
DOC A: {"Dog": 0.43}
DOC E: {"Dog": 0.33, "Cat": 0.66}
However, if I query with the keywords
["Dog", "Cat"]
, it is not acceptable to return documents with only a subset of these keywords because I want documents containing all the keywords (similar to the PostgreSQL@>
operator). The sorting/ranking should then be done by inner product distance:DOC C: {"Dog": 0.65, "Cat": 0.11}
DOC E: {"Dog": 0.33, "Cat": 0.66}
Final Notes
The plain keywords search in PostgreSQL can be achieved with the following syntax and operators:
This type of data can also be accelerated with an Inverted Index in PostgreSQL: GIN.
My final question is whether we can achieve this Weighted-keywords search in
pgvector
; perhaps with new operators like&&
and@>
.Some other vector databases, like Milvus, already include Inverted Indexes for dealing with sparse vectors:
Source
The text was updated successfully, but these errors were encountered: