Introduce vector field, vector query and rescoring based on them #31615
Comments
Pinging @elastic/es-search-aggs |
If we want to enforce this, then it might be easier to have a rescorer rather than a query (today we only have one rescore implementation: |
Hi, commenting here on @mayya-sharipova 's invitation. Our use case is that we'd want to use ES to search for sentences that have similar meaning to the sentence in the query, based on each sentence having an embedding. Vectors would be dense. Dimensionality would be 100-300 most of the time presumably. Cosine similarity would be my starting point for computing the similarity of embeddings. |
Do you mean only one
|
@james-daily Thanks for your feedback, James. Sorry, for a single value per document, we meant a single value per field, so it would be possible to have several |
Have you considered Manhattan distance as a cheaper alternative in terms of processing? Although it will not deliver the same results, it can be comparable in terms of ranking vectors while delivering higher throughput than Euclidean/cosine. |
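For reference, a minimal Python sketch of the Manhattan and Euclidean distances mentioned in this comment. The function names and the toy vectors are made up for illustration and are not part of any Elasticsearch API; the sketch says nothing about relative throughput, and agreement of the two rankings on this toy data does not imply agreement in general.

```python
import math

def manhattan(a, b):
    """L1 distance: sum of absolute coordinate differences."""
    return sum(abs(x - y) for x, y in zip(a, b))

def euclidean(a, b):
    """L2 distance: square root of the sum of squared differences."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

# Hypothetical query and documents, chosen only to have something to rank.
query = [0.5, 10.0, 10.0]
docs = {"1": [0.5, 10.0, 6.0], "2": [0.5, 10.0, 10.0]}

# Smaller distance means more similar, so sort ascending.
ranked_l1 = sorted(docs, key=lambda doc_id: manhattan(query, docs[doc_id]))
ranked_l2 = sorted(docs, key=lambda doc_id: euclidean(query, docs[doc_id]))
print(ranked_l1, ranked_l2)
```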
In case it’s useful, here’s another datapoint from @gangeli, who also expressed interest in the feature:
|
@djp-search thanks for a suggestion, we will study Manhattan distance @jtibshirani thanks for another use-case |
1. Dense vector

PUT dindex
{ "mappings": { "_doc": { "properties": { "my_vector": { "type": "dense_vector" }, "my_text" : { "type" : "keyword" } } } } }

PUT dindex/_doc/1
{ "my_text" : "text1", "my_vector" : [ 0.5, 10, 6 ] }

PUT dindex/_doc/2
{ "my_text" : "text2", "my_vector" : [ 0.5, 10, 10] }

GET dindex/_search
{ "query" : { "vector" : { "field" : "my_vector", "query_vector": [ 0.5, 10, 10] } } }

Result:
....
"hits": [
  { "_index": "dindex", "_type": "_doc", "_id": "2", "_score": 1.0000001, "_source": { "my_text": "text1", "my_vector": [ 0.5, 10, 10 ] } },
  { "_index": "dindex", "_type": "_doc", "_id": "1", "_score": 0.97016037, "_source": { "my_text": "text1", "my_vector": [ 0.5, 10, 6 ] } }
]

2. Sparse vector

PUT sindex
{ "mappings": { "_doc": { "properties": { "my_vector": { "type": "sparse_vector" }, "my_text" : { "type" : "keyword" } } } } }

PUT sindex/_doc/1
{ "my_text" : "text1", "my_vector" : {"1": 0.5, "99": -0.5, "5": 1} }

PUT sindex/_doc/2
{ "my_text" : "text2", "my_vector" : {"103": 0.5, "4": -0.5, "5": 1} }

GET sindex/_search
{ "query" : { "vector" : { "field" : "my_vector", "query_vector": {"99": -0.5, "1": 0.5, "5": 1} } } }

Result:
"hits": [
  { "_index": "sindex", "_type": "_doc", "_id": "1", "_score": 0.99999994, "_source": { "my_text": "text1", "my_vector": { "1": 0.5, "99": -0.5, "5": 1 } } },
  { "_index": "sindex", "_type": "_doc", "_id": "2", "_score": 0.6666666, "_source": { "my_text": "text2", "my_vector": { "103": 0.5, "4": -0.5, "5": 1 } } }
]

Search with filter:

GET sindex/_search
{ "query": { "bool": { "must" : { "match": { "my_text": "text2" } }, "should" : { "vector" : { "field" : "my_vector", "query_vector": {"99": -0.5, "1": 0.5, "5": 1} } } } } }

Result:
"hits": [
  { "_index": "sindex", "_type": "_doc", "_id": "2", "_score": 0.6931472, "_source": { "my_text": "text2", "my_vector": { "103": 0.5, "4": -0.5, "5": 1 } } }
]

3. Implementation details

3.1 Dense Vector
- BinaryDocValuesField
- byte array ->
  - integer (number of dimensions)
  - array of integers (encoded array of float values)

3.2 Sparse Vector
- BinaryDocValuesField
- byte array ->
  - integer (number of dimensions)
  - array of integers (encoded array of float values)
  - array of integers (array of integer dimensions)

Relates to elastic#31615
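The byte layouts listed under "Implementation details" could be sketched in Python roughly as follows. This is an illustration only: the helper names are hypothetical, and the big-endian byte order, 32-bit widths, and use of IEEE 754 float bits are assumptions rather than a description of the actual Elasticsearch encoding.

```python
import struct

def encode_dense(values):
    """int32 count, followed by each value as 4 bytes of float32 bits."""
    n = len(values)
    return struct.pack(f">i{n}f", n, *values)

def decode_dense(data):
    (n,) = struct.unpack_from(">i", data, 0)
    return list(struct.unpack_from(f">{n}f", data, 4))

def encode_sparse(vector):
    """int32 count, then float32 values, then int32 dimensions.

    `vector` maps dimension names such as "99" to values; pairs are
    sorted by dimension so the two arrays stay aligned (an assumption).
    """
    pairs = sorted((int(dim), float(val)) for dim, val in vector.items())
    n = len(pairs)
    values = [val for _, val in pairs]
    dims = [dim for dim, _ in pairs]
    return struct.pack(f">i{n}f{n}i", n, *values, *dims)

def decode_sparse(data):
    (n,) = struct.unpack_from(">i", data, 0)
    values = struct.unpack_from(f">{n}f", data, 4)
    dims = struct.unpack_from(f">{n}i", data, 4 + 4 * n)
    return dict(zip(dims, values))

dense_bytes = encode_dense([0.5, 10, 6])
sparse_bytes = encode_sparse({"1": 0.5, "99": -0.5, "5": 1})
print(decode_dense(dense_bytes), decode_sparse(sparse_bytes))
```

The values used here come from the example documents in the commit message; all of them are exactly representable as 32-bit floats, so the round trip is lossless for this data.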
Are there plans to use this to control matching as well? Such as filter in/out based on proximity (maybe some kind of distance) to a point being queried? Then it would be applicable outside a rescoring context |
@softwaredoug We are still debating if we should use this field for matching, as it may make queries slow. For now the plan is to introduce two functions |
We've been discussing this a bit in Relevant Search slack. I'm hoping we can use this field for matching too.
|
Hey guys, awesome job. By the way, has this feature been added in 7.0-alpha2? I'm testing dense vector rescoring but I couldn't find the right way to query...
and I got:
|
Introduce painless functions of cosineSimilarity and dotProduct distance measures for dense and sparse vector fields.

```js
{
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "cosineSimilarity(params.queryVector, doc['my_dense_vector'].value)",
        "params": {
          "queryVector": [4, 3.4, -1.2]
        }
      }
    }
  }
}
```

```js
{
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "cosineSimilaritySparse(params.queryVector, doc['my_sparse_vector'].value)",
        "params": {
          "queryVector": {"2": -0.5, "10" : 111.3, "50": -13.0, "113": 14.8, "4545": -156.0}
        }
      }
    }
  }
}
```

Closes elastic#31615
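A client could assemble the request bodies from this commit message programmatically. The Python sketch below only builds and prints the JSON; the helper name `script_score_body` is hypothetical, and the script source is copied from the commit message as-is, so it may not match the syntax of any released version.

```python
import json

def script_score_body(field, query_vector, function="cosineSimilarity"):
    """Build a script_score request body in the shape shown in the commit message."""
    return {
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    # Script source as written in the commit message (an assumption
                    # for any released version; check the official documentation).
                    "source": f"{function}(params.queryVector, doc['{field}'].value)",
                    "params": {"queryVector": query_vector},
                },
            }
        }
    }

dense_body = script_score_body("my_dense_vector", [4, 3.4, -1.2])
sparse_body = script_score_body(
    "my_sparse_vector",
    {"2": -0.5, "10": 111.3, "50": -13.0, "113": 14.8, "4545": -156.0},
    function="cosineSimilaritySparse",
)
print(json.dumps(dense_body, indent=2))
```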
@ailurus1991 Yes, you are right, currently there is no way to query vector fields. |
@mayya-sharipova wow I see, great work! |
@mayya-sharipova I need help. I just installed the new Elastic, created an index, and tried the mapping from your example:
and I get an error:
Thank you in advance for your reply! |
@psyapathy What version of Elasticsearch have you installed? Indexing of vectors is available from v7.0.0-beta1, but querying them will be available only from v7.1. |
@mayya-sharipova Thank you for the reply! |
@mayya-sharipova Hi Mayya, I've installed ES 7.1 and indexed documents with a dense vector mapping successfully, but I couldn't find the right way to query in the documentation. Could you give me a hint? |
@ailurus1991 Sorry, this is a deficiency of our documentation. The scoring is available only from 7.2 |
@mayya-sharipova I just set up the version 7.2, but both the functions are not there. I can see that branch 7.x has these functions. Is there a way I can manually add these functions? |
@prem6667 Sorry, we have decided to make these functions available starting from 7.3. |
@mayya-sharipova Is this feature published in 7.3? I couldn't find it. |
I believe it's mentioned here:
https://www.elastic.co/blog/elasticsearch-7-3-0-released
see "Built-in vector similarity functions for document script scoring"
|
Thank you for this clear presentation, and thank you to the participants for the exchange. We want to index several documents with several sentences each: which of the two data structures I present is the most suitable, and what would be a good mapping? <PUT my_index_1/_doc/1 or |
@adouib this seems like a good question for our discuss forums, would you be able to create a discuss post and we can continue the conversation there? We usually try to keep GitHub focused on development efforts like bug reports and feature requests. |
@mayya-sharipova I do not quite understand why we need to |
@dragon-warrior-nyc Please refer to our official documentation. The details on this PR are potential implementations we have considered that may not be relevant any more. "Vectors are encoded as binaries" means that vectors are encoded as Lucene BinaryDocValues. |
@mayya-sharipova got it and thanks for the explanation! |
@adouib Hi, did we get a resolution for this, please? |
Anything new here? |
Introduce a new field of type vector on which vector calculations can be done during the rescoring phase.
Indexing
Allow only a single value per document
Allow to index both dense and sparse vectors?
Dense form:
Sparse form (represented as list of dimension names and values for corresponding dimensions):
Query and Rescoring
Introduce a special type of vector query:
This query can only be used in the rescoring context.
This query produces a score for every document in the rescoring context in the following way:
- If a document doesn't have a vector value for field, 0 value will be returned
- If a document does have a vector value for field: doc_vector, the cosine similarity between doc_vector and query_vector is calculated:
dotProduct(doc_vector, query_vector) / (sqrt(dotProduct(doc_vector, doc_vector)) * sqrt(dotProduct(query_vector, query_vector)))
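A minimal Python sketch of the scoring rule described above, reading the square-root terms in the formula as each vector's Euclidean norm, sqrt(dotProduct(v, v)). The function names are illustrative, documents are modelled as plain dicts, and returning 0 for a zero-length vector is an assumption that the proposal does not address.

```python
import math

def dot_product(a, b):
    return sum(x * y for x, y in zip(a, b))

def vector_score(doc, field, query_vector):
    """Cosine similarity between doc[field] and query_vector; 0 if the field is missing."""
    doc_vector = doc.get(field)
    if doc_vector is None:
        return 0.0
    norms = math.sqrt(dot_product(doc_vector, doc_vector)) * math.sqrt(
        dot_product(query_vector, query_vector)
    )
    if norms == 0:
        # Assumption: treat an all-zero vector like a missing value.
        return 0.0
    return dot_product(doc_vector, query_vector) / norms

query = [0.5, 10, 10]
score_with_vector = vector_score({"my_vector": [0.5, 10, 6]}, "my_vector", query)
score_without_vector = vector_score({"my_text": "text1"}, "my_vector", query)
print(score_with_vector, score_without_vector)
```

With the dense example vectors used elsewhere in this thread ([0.5, 10, 6] against the query [0.5, 10, 10]), this rule gives approximately 0.97016, in line with the `_score` reported there.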
Internal encoding
Encoding of vectors:
Internally both dense and sparse vectors are encoded as sorted hash?
Thus dense array is transformed:
[4, 12] -> {0: 4, 1: 12}
Keys are sorted, so we can iterate over them instead of calculating hash
What should be values in vectors?
Vectors are encoded as binaries.
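The sorted-key idea above can be sketched in Python as follows: both dense and sparse inputs become lists of (dimension, value) pairs sorted by dimension, and a dot product is then a single merge pass over two such lists with no hash lookups. The function names are hypothetical and this is not the actual internal implementation.

```python
def to_sorted_pairs(vector):
    """Convert a dense list or a sparse dict into (dimension, value) pairs sorted by dimension."""
    if isinstance(vector, dict):
        return sorted((int(dim), float(val)) for dim, val in vector.items())
    # Dense array: the position is the dimension, e.g. [4, 12] -> {0: 4, 1: 12}.
    return [(i, float(val)) for i, val in enumerate(vector)]

def sorted_dot_product(a, b):
    """Dot product of two sorted pair lists using a two-pointer merge."""
    i = j = 0
    total = 0.0
    while i < len(a) and j < len(b):
        dim_a, dim_b = a[i][0], b[j][0]
        if dim_a == dim_b:
            total += a[i][1] * b[j][1]
            i += 1
            j += 1
        elif dim_a < dim_b:
            i += 1  # dimension only present in a contributes nothing
        else:
            j += 1  # dimension only present in b contributes nothing
    return total

query = to_sorted_pairs({"99": -0.5, "1": 0.5, "5": 1})
doc_1 = to_sorted_pairs({"1": 0.5, "99": -0.5, "5": 1})
doc_2 = to_sorted_pairs({"103": 0.5, "4": -0.5, "5": 1})
print(sorted_dot_product(doc_1, query), sorted_dot_product(doc_2, query))
```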