Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Sparse vectors: 1 representation, 2 use cases #587

Open
javiabellan opened this issue Sep 3, 2024 · 4 comments
Open

Sparse vectors: 1 representation, 2 use cases #587

javiabellan opened this issue Sep 3, 2024 · 4 comments
Labels
type/question 🙋 Further information is requested

Comments

@javiabellan
Copy link

javiabellan commented Sep 3, 2024

In my opinion, sparse vectors can solve two different problems:

  1. Text search (TFIDF, BM25, SPLADE, etc.)
  2. Weighted-keywords search

In both cases, we can have a sparse vector representation:

  1. Text: {'What': 0.10430284, 'is': 0.10090457, 'BM': 0.2635918, '25': 0.3382988, '?': 0.052101523}
  2. Weighted-keywords: {"Dog": 0.4, "Cat": 0.3, "Giant panda": 0.1, "Komodo dragon": 0.05}

However, the use cases are different:

Text Search

For text search, sparse vectors like BM25 are a good representation. For example, if I look for the query:

BM25 vs SPLADE

This will be the tokens:

["BM", "25", "vs", "SPLADE"]

It is acceptable to return documents that contain a subset of these tokens, such as:

{'What': 0.10430284, 'is': 0.10090457, 'BM': 0.2635918, '25': 0.3382988, '?': 0.052101523}

Weighted-Keywords Search

Now imagine that I have a collection of documents (images or texts), each annotated with animal keywords and their corresponding probabilities:

  • DOC A: {"Dog": 0.43}
  • DOC B: {"Cat": 0.21}
  • DOC C: {"Dog": 0.65, "Cat": 0.11}
  • DOC D: {"Giant panda": 0.1}
  • DOC E: {"Dog": 0.33, "Cat": 0.66}

If I perform a query with the keyword ["Dog"], I want the following results (sorted by inner product distance):

  1. DOC C: {"Dog": 0.65, "Cat": 0.11}
  2. DOC A: {"Dog": 0.43}
  3. DOC E: {"Dog": 0.33, "Cat": 0.66}

However, if I query with the keywords ["Dog", "Cat"], it is not acceptable to return documents with only a subset of these keywords because I want documents containing all the keywords (similar to the PostgreSQL @> operator). The sorting/ranking should then be done by inner product distance:

  1. DOC C: {"Dog": 0.65, "Cat": 0.11}
  2. DOC E: {"Dog": 0.33, "Cat": 0.66}

Final Notes

The plain keywords search in PostgreSQL can be achieved with the following syntax and operators:

WHERE query_keywords && ARRAY['keyword1', 'keyword2', 'keyword3'];  ---- OR between keywords
WHERE query_keywords @> ARRAY['keyword1', 'keyword2', 'keyword3'];  ---- AND between keywords

This type of data can also be accelerated with an Inverted Index in PostgreSQL: GIN.

My final question is whether we can achieve this Weighted-keywords search in pgvector; perhaps with new operators like && and @>.

Some other vector databases, like Milvus, already include Inverted Indexes for dealing with sparse vectors:

Milvus Sparse Vector Index Example

Source

@gaocegege gaocegege added the type/question 🙋 Further information is requested label Sep 5, 2024
@VoVAllen
Copy link
Member

VoVAllen commented Sep 5, 2024

Thanks for your suggestion! We're aware of the sparse index here and have some prototype at #552. For the weighted keyword search, we have tried https://github.com/tensorchord/pg_bestmatch.rs. Does it work for you? We're also thinking of a better way to directly build text index for bm25 search

@javiabellan
Copy link
Author

javiabellan commented Sep 5, 2024

Thanks for aswering, i will take a look at pg_bestmatch.rs.

I think that part of storing sparse vectors are solved. However i think, the query part must be rethinked.

I honestly think that the native to_tsquery API (link1 link2) is great and can be replicated for sparse vectors (maybe to_svector_query). The to_tsquery API offers great flexibiliy (boolean operators):

  • SELECT to_tsquery('english', 'The & Fat & Rats');
  • SELECT to_tsquery('english', 'Fat | Rats');
  • SELECT to_tsquery('simple', 'Fat | Rats');

I think this type of API is great and can solve the previous animal keywors example, and also instead of 'english' we can add custom sparse encoders (TFIDF, BM25, SPLADE, CUSTOM)

  • SELECT to_svector_query('tfidf', 'We begin, as always, with the text.');
  • SELECT to_svector_query('bm25', 'We begin, as always, with the text.');
  • SELECT to_svector_query('simple', 'Dog & Cat);
  • SELECT to_svector_query('simple', 'Dog | Cat');

And for compute the distances, instead of the tsvector @@ operator, we colud use the existing sparse vector operators like neg inner prod <#>

SELECT text
FROM documents
ORDER BY sparse_keywords <#> to_svector_query('simple', 'Dog & Cat)
LIMIT 50;

Update

PostgreSQL ts_vector already can handle a token/keyword weight. However this is a discrete weight of only 4 different options (A,B,C or D).

Lexemes that have positions can further be labeled with a weight, which can be A, B, C, or D. D is the default.
Weights are typically used to reflect document structure, for example by marking title words differently from body words.
source

I think it would be abstraction that sparse vectors become the missing ts_vector with decimal (floating point) weigths, while keeping all the ts_query capabilitites.

@jbohnslav
Copy link

An implementation of bm25 would be huge for my application. What tasks remain to be done?

@VoVAllen
Copy link
Member

VoVAllen commented Oct 6, 2024

@jbohnslav We don't have an ETA for this. The main reason is that the algorithm is too new and we're not sure what's it's actual behavior with different data like BM25 and how to setup the config param. For bm25, have you tried paradedb? They claim to have full support of bm25

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/question 🙋 Further information is requested
Projects
None yet
Development

No branches or pull requests

4 participants