Introduce vector field, vector query and rescoring based on them #31615

mayya-sharipova · 2018-06-27T14:53:36Z

Introduce a new field of type vector on which vector calculations can be done during rescoring phase

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "my_feature": {
          "type": "vector"
        }
      }
    }
  }
}

Indexing

Allow only a single value per document
Allow to index both dense and sparse vectors?

Dense form:

PUT my_index/_doc/1
{
  "my_feature":   [11.5, 10.4, 23.0]
}

Sparse form (represented as list of dimension names and values for corresponding dimensions):

PUT my_index/_doc/1
{
  "my_feature": {"1": 11.5, "5": 10.5,  "101": 23.0}
}

Query and Rescoring

Introduce a special type of vector query:

"vector" : {
   "field" : "my_feature",
    "query_vector": {"1": 3, "5": 10.5,  "101": 12}
}

This query can only be used in the rescoring context.
This query produces a score for every document in the rescoring context in the following way:

If a document doesn't have a vector value for field, a score of 0 is returned.
If a document does have a vector value for field (doc_vector), the cosine similarity between doc_vector and query_vector is calculated:
dotProduct(doc_vector, query_vector) / (sqrt(dotProduct(doc_vector, doc_vector)) * sqrt(dotProduct(query_vector, query_vector)))
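
To make the scoring rule concrete, here is a minimal Python sketch of cosine similarity over sparse vectors given as dimension-to-value maps. The function name and the handling of an all-zero vector are assumptions made for illustration; this is not the Elasticsearch implementation.

```python
import math

def cosine_similarity(doc_vector, query_vector):
    """Cosine similarity of two sparse vectors given as {dimension: value} dicts."""
    if not doc_vector:
        # The proposal scores a document without a vector value as 0.
        return 0.0
    # Only dimensions present in both vectors contribute to the dot product.
    dot = sum(value * query_vector.get(dim, 0.0) for dim, value in doc_vector.items())
    doc_norm = math.sqrt(sum(v * v for v in doc_vector.values()))
    query_norm = math.sqrt(sum(v * v for v in query_vector.values()))
    if doc_norm == 0.0 or query_norm == 0.0:
        return 0.0  # assumption: an all-zero vector also scores 0
    return dot / (doc_norm * query_norm)

# Vectors taken from the indexing and query examples in this proposal.
doc = {"1": 11.5, "5": 10.5, "101": 23.0}
query = {"1": 3, "5": 10.5, "101": 12}
score = cosine_similarity(doc, query)
```

The denominator is the product of the two vector lengths, so the score depends only on the angle between the vectors, not on their magnitudes.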

POST /_search
{
   "query" : {"<user-query>"},
   "rescore" : {
      "window_size" : 50,
      "query" : {
         "rescore_query" : {
            "vector" : {
               "field" : "my_feature",
               "query_vector": {"1": 3, "5": 10.5,  "101": 12}
            }
         }
      }
   }
}

Internal encoding

Encoding of vectors:
Internally both dense and sparse vectors are encoded as sorted hash?
Thus dense array is transformed:
[4, 12] -> {0: 4, 1: 12}
Keys are sorted, so we can iterate over them instead of calculating hash
What should be values in vectors?
- floats?
- smaller than floats? (loses some precision, but reduces index size)
Vectors are encoded as binaries.
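
The sorted-key idea can be sketched in Python as follows. Both forms are normalized to (dimension, value) pairs sorted by dimension, and the dot product walks the two lists in step instead of doing hash lookups. The helper names and the choice to parse dimension names as integers are assumptions for illustration only.

```python
def to_sorted_pairs(vector):
    """Normalize a dense list or a sparse {dim: value} dict to (dim, value) pairs sorted by dim."""
    if isinstance(vector, dict):
        # assumption: sparse dimension names are integer strings
        return sorted((int(dim), float(value)) for dim, value in vector.items())
    # A dense array [4, 12] becomes [(0, 4.0), (1, 12.0)].
    return [(dim, float(value)) for dim, value in enumerate(vector)]

def dot_product(a, b):
    """Dot product of two sorted (dim, value) lists using a two-pointer merge."""
    i = j = 0
    total = 0.0
    while i < len(a) and j < len(b):
        if a[i][0] == b[j][0]:
            total += a[i][1] * b[j][1]
            i += 1
            j += 1
        elif a[i][0] < b[j][0]:
            i += 1  # dimension only present in a
        else:
            j += 1  # dimension only present in b
    return total

dense = to_sorted_pairs([4, 12])
sparse = to_sorted_pairs({"1": 0.5, "0": 2})
result = dot_product(dense, sparse)
```

Because both lists are sorted, each element is visited at most once, so the cost is linear in the number of stored dimensions.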


elasticmachine · 2018-06-27T14:54:01Z

Pinging @elastic/es-search-aggs

jpountz · 2018-06-27T15:16:24Z

> This query can only be used in the rescoring context.

If we want to enforce this, then it might be easier to have a rescorer rather than a query (today we only have one rescore implementation: QueryRescorer, but we can add more of them, see eg. https://github.com/elastic/elasticsearch/tree/master/plugins/examples/rescore). We might also want to give it a more explicit name like cosine_similarity?

etienne1985 · 2018-07-03T04:23:33Z

Hi, commenting here on @mayya-sharipova 's invitation. Our use case is that we'd want to use ES to search for sentences that have similar meaning to the sentence in the query, based on each sentence having an embedding. Vectors would be dense. Dimensionality would be 100-300 most of the time presumably. Cosine similarity would be my starting point for computing the similarity of embeddings.

james-daily · 2018-07-03T15:00:33Z

> Allow only a single value per document

Do you mean only one vector field per document or only one value for each field? It would be useful to allow more than one vector field per document for testing different embeddings, dimensionalities, etc. Something like:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "GloVe": {
          "type": "vector"   
      },
        "word2vec": {
          "type": "vector"   
      }
    }
  }
}

mayya-sharipova · 2018-07-03T21:17:43Z

@james-daily Thanks for your feedback, James. Sorry, for a single value per document, we meant a single value per field, so it would be possible to have several vector fields.

djptek · 2018-07-27T07:02:43Z

Have you considered Manhattan distance as a cheaper alternative in terms of processing? Though this will not deliver the same result, it can be comparable in terms of ranking vectors while delivering higher throughput than Euclidean/cosine.
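
As a rough illustration of the trade-off raised here, the Python sketch below ranks two made-up documents against a query by both measures. The vectors and names are hypothetical; the point is only that Manhattan distance needs no multiplications or square roots, and that the two rankings can differ when vector magnitudes differ.

```python
import math

def manhattan_distance(a, b):
    """Sum of absolute per-dimension differences (smaller means closer)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (larger means more similar)."""
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

# Hypothetical data: one document points the same way as the query but is much
# longer, the other is close in space but points in a different direction.
query = [1.0, 0.0]
docs = {"same_direction": [10.0, 0.0], "nearby": [1.0, 1.0]}

by_cosine = sorted(docs, key=lambda d: cosine_similarity(docs[d], query), reverse=True)
by_manhattan = sorted(docs, key=lambda d: manhattan_distance(docs[d], query))
```

If the vectors are normalized to unit length before indexing, the two orderings tend to agree much more closely, which may be the case the comment has in mind.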

jtibshirani · 2018-07-27T21:39:25Z

In case it’s useful, here’s another datapoint from @gangeli, who also expressed interest in the feature:

Their use case also involves retrieving sentences or short paragraphs. Both the query and documents would be modelled using a sentence embedding (based on an RNN).
Vectors are dense and can have from 50 - 1000 dimensions, but are concentrated in the 200 - 300 range.
Ideally, cosine similarity would be applied to all documents when scoring (as opposed to just during a rescoring phase). In their use case, sentence retrieval is a component of a fairly general NLP pipeline, and they rely strongly on these sentence embeddings to understand synonyms/ textual similarity.

mayya-sharipova · 2018-07-27T22:01:01Z

@djp-search thanks for a suggestion, we will study Manhattan distance

@jtibshirani thanks for another use-case

1. Dense vector

PUT dindex
{
  "mappings": {
    "_doc": {
      "properties": {
        "my_vector": { "type": "dense_vector" },
        "my_text": { "type": "keyword" }
      }
    }
  }
}

PUT dindex/_doc/1
{ "my_text": "text1", "my_vector": [0.5, 10, 6] }

PUT dindex/_doc/2
{ "my_text": "text2", "my_vector": [0.5, 10, 10] }

GET dindex/_search
{
  "query": {
    "vector": {
      "field": "my_vector",
      "query_vector": [0.5, 10, 10]
    }
  }
}

Result:

....
"hits": [
  {
    "_index": "dindex", "_type": "_doc", "_id": "2", "_score": 1.0000001,
    "_source": { "my_text": "text2", "my_vector": [0.5, 10, 10] }
  },
  {
    "_index": "dindex", "_type": "_doc", "_id": "1", "_score": 0.97016037,
    "_source": { "my_text": "text1", "my_vector": [0.5, 10, 6] }
  }
]

2. Sparse vector

PUT sindex
{
  "mappings": {
    "_doc": {
      "properties": {
        "my_vector": { "type": "sparse_vector" },
        "my_text": { "type": "keyword" }
      }
    }
  }
}

PUT sindex/_doc/1
{ "my_text": "text1", "my_vector": {"1": 0.5, "99": -0.5, "5": 1} }

PUT sindex/_doc/2
{ "my_text": "text2", "my_vector": {"103": 0.5, "4": -0.5, "5": 1} }

GET sindex/_search
{
  "query": {
    "vector": {
      "field": "my_vector",
      "query_vector": {"99": -0.5, "1": 0.5, "5": 1}
    }
  }
}

Result:

"hits": [
  {
    "_index": "sindex", "_type": "_doc", "_id": "1", "_score": 0.99999994,
    "_source": { "my_text": "text1", "my_vector": {"1": 0.5, "99": -0.5, "5": 1} }
  },
  {
    "_index": "sindex", "_type": "_doc", "_id": "2", "_score": 0.6666666,
    "_source": { "my_text": "text2", "my_vector": {"103": 0.5, "4": -0.5, "5": 1} }
  }
]

Search with filter:

GET sindex/_search
{
  "query": {
    "bool": {
      "must": { "match": { "my_text": "text2" } },
      "should": {
        "vector": {
          "field": "my_vector",
          "query_vector": {"99": -0.5, "1": 0.5, "5": 1}
        }
      }
    }
  }
}

Result:

"hits": [
  {
    "_index": "sindex", "_type": "_doc", "_id": "2", "_score": 0.6931472,
    "_source": { "my_text": "text2", "my_vector": {"103": 0.5, "4": -0.5, "5": 1} }
  }
]

3. Implementation details

3.1 Dense Vector
- BinaryDocValuesField
- byte array ->
  - integer (number of dimensions)
  - array of integers (encoded array of float values)

3.2 Sparse Vector
- BinaryDocValuesField
- byte array ->
  - integer (number of dimensions)
  - array of integers (encoded array of float values)
  - array of integers (array of integer dimensions)

Relates to elastic#31615
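
The dense layout described above (a dimension count followed by the float values encoded as integers) can be sketched in Python with the standard `struct` module. The big-endian byte order and the use of each value's IEEE 754 single-precision bit pattern as the integer encoding are assumptions for illustration; the commit message does not specify them, and the function names are hypothetical.

```python
import struct

def encode_dense(values):
    """Pack [dimension count][float32 bit patterns...] as big-endian 4-byte integers."""
    # Reinterpret each float32 as a signed 32-bit integer (its raw bit pattern).
    bits = [struct.unpack(">i", struct.pack(">f", float(v)))[0] for v in values]
    return struct.pack(f">i{len(bits)}i", len(bits), *bits)

def decode_dense(data):
    """Inverse of encode_dense: read the count, then turn each integer back into a float."""
    (count,) = struct.unpack_from(">i", data, 0)
    bits = struct.unpack_from(f">{count}i", data, 4)
    return [struct.unpack(">f", struct.pack(">i", b))[0] for b in bits]

# Vector from document 1 of the dense example.
encoded = encode_dense([0.5, 10, 6])
decoded = decode_dense(encoded)
```

Under these assumptions a vector with n dimensions occupies 4 * (n + 1) bytes. A sparse vector would add a second integer array holding the dimension numbers.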

softwaredoug · 2018-12-21T20:48:06Z

Are there plans to use this to control matching as well? Such as filter in/out based on proximity (maybe some kind of distance) to a point being queried? Then it would be applicable outside a rescoring context

mayya-sharipova · 2018-12-27T12:32:07Z

@softwaredoug We are still debating if we should use this field for matching, as it may make queries slow. For now the plan is to introduce two functions cosineSimilarity and dotProduct as a part of script score query. The idea is that these functions will be used for scoring after the match is already done.

JnBrymn-EB · 2018-12-29T01:56:21Z

We've been discussing this a bit in Relevant Search slack. I'm hoping we can use this field for matching too.

Certainly matching with this field will be a little slower, but there aren't any real surprises here. For instance, normal search with posting lists, etc. executes in O(num_docs), this field will surely still be O(num_docs) right? And if it's slower, I bet it's not that much slower is it? (Is it?)
The users of this field are likely to be the more sophisticated users who would more likely know the issues they are getting into.
Part of the nice value of using this field for matching is that presumably you would also be able to use it with other normal fields. For instance, I could have an index of "users" and I could say, "find me all users that are in San Francisco (geo search), that are most similar to this sample user (vector similarity)".
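
A sketch of what such a combined request might look like, built as a Python dict. It assumes the `script_score` form that appears later in this thread and a standard `geo_distance` filter. The field names `location` and `profile_vector`, the helper name, and the coordinates are hypothetical.

```python
import json

def similar_users_nearby(query_vector, lat, lon, distance="50km"):
    """Build a search body: filter by location, then score by vector similarity."""
    return {
        "query": {
            "script_score": {
                # Matching is done by the ordinary (geo) query ...
                "query": {
                    "bool": {
                        "filter": {
                            "geo_distance": {
                                "distance": distance,
                                "location": {"lat": lat, "lon": lon},
                            }
                        }
                    }
                },
                # ... and only the matching documents are scored by the script.
                # Adding 1.0 keeps the score non-negative, since cosine
                # similarity ranges from -1 to 1.
                "script": {
                    "source": "cosineSimilarity(params.queryVector, doc['profile_vector'].value) + 1.0",
                    "params": {"queryVector": query_vector},
                },
            }
        }
    }

body = similar_users_nearby([4, 3.4, -1.2], lat=37.77, lon=-122.42)
request_json = json.dumps(body, indent=2)
```

Here the vector function is still used only for scoring; the geo filter decides which documents match.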

cailurus · 2019-01-28T08:16:00Z

Hey guys, awesome job. By the way, has this feature been added in 7.0-alpha2? I'm testing dense vector rescoring but I couldn't find the right way to query.
I've tried:

POST /_search
{
   "query" : {"<user-query>"},
   "rescore" : {
      "window_size" : 50,
      "query" : {
         "rescore_query" : {
            "vector" : {
               "field" : "my_feature",
               "query_vector": {"1": 3, "5": 10.5,  "101": 12}
            }
         }
      }
   }
}

and I got:

"error":{"root_cause":[{"type":"parsing_exception","reason":"no [query] registered for [vector]","line":9,"col":24}],

Introduce painless functions of cosineSimilarity and dotProduct distance measures for dense and sparse vector fields.

```js
{
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "cosineSimilarity(params.queryVector, doc['my_dense_vector'].value)",
        "params": {
          "queryVector": [4, 3.4, -1.2]
        }
      }
    }
  }
}
```

```js
{
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "cosineSimilaritySparse(params.queryVector, doc['my_sparse_vector'].value)",
        "params": {
          "queryVector": {"2": -0.5, "10": 111.3, "50": -13.0, "113": 14.8, "4545": -156.0}
        }
      }
    }
  }
}
```

Closes elastic#31615

mayya-sharipova · 2019-01-29T00:43:06Z

@ailurus1991 Yes, you are right, currently there is no way to query vector fields.
We are working on introducing ways to do so through painless script functions.

cailurus · 2019-02-05T14:49:50Z

@mayya-sharipova wow I see, great work!

arpsyapathy · 2019-03-06T20:35:18Z

@mayya-sharipova
Hello Mayya, thank you for your work!

I need help. I just installed the new Elasticsearch, created an index, and tried the mapping from your example:

{
 "properties": {
   "my_vector": {
     "type": "dense_vector"
    },
    "my_text" : {
      "type" : "keyword"
    }
  }
}

and I get this error:

{
    "error": {
        "root_cause": [
            {
                "type": "mapper_parsing_exception",
                "reason": "No handler for type [dense_vector] declared on field [my_vector]"
            }
        ],
        "type": "mapper_parsing_exception",
        "reason": "No handler for type [dense_vector] declared on field [my_vector]"
    },
    "status": 400
}

Thank you in advance for your reply!

mayya-sharipova · 2019-03-07T10:39:32Z

@psyapathy What version of elasticsearch have you installed?

Indexing of vectors is available from v7.0.0-beta1, but querying them will be available only from v7.1.

arpsyapathy · 2019-03-11T08:18:09Z

@mayya-sharipova Thank you for reply!
it's happy and sad at the same time.
is there an alternative still under development?

cailurus · 2019-05-21T23:55:07Z

@mayya-sharipova Hi Mayya, I've installed ES 7.1 and indexed documents with a dense vector mapping successfully, but I couldn't find the right way to query in the documentation. Could you give me a hint?

mayya-sharipova · 2019-05-22T13:24:28Z

@ailurus1991 Sorry, this is a deficiency of our documentation. The scoring is available only from 7.2.
From 7.2, two functions will be available as a part of script_score: cosineSimilarity and dotProduct.

prem6667 · 2019-06-26T14:20:33Z

@mayya-sharipova I just set up the version 7.2, but both the functions are not there. I can see that branch 7.x has these functions. Is there a way I can manually add these functions?

mayya-sharipova · 2019-06-27T21:30:56Z

@prem6667 Sorry, we have decided to move these functions to 7.3.
Adding these functions manually involves a non-trivial amount of work because, besides the painless functions, we need to add classes for supporting Doc and script values.
Also, please be aware that these features are still experimental and may change.

LiuGangR · 2019-08-01T07:09:18Z

@mayya-sharipova Is this feature published in 7.3? I couldn't find it.

lior-k · 2019-08-01T11:48:07Z

I believe it's mentioned here: https://www.elastic.co/blog/elasticsearch-7-3-0-released (see "Built-in vector similarity functions for document script scoring").


adouib · 2019-10-03T20:04:01Z

I thank you for this clear presentation, and thank you to the participants for the exchange.

Since we wanted to index several documents with several sentences each, which data structure is the most suitable of the two I present? and what will be the good mapping?

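PUT my_index_1/_doc/1
{
  "sentences": [
    {"sentence_text": "my first sentence",
     "sentence_vector": [0.0,0.0,0.0,0.0]},
    {"sentence_text": "my second sentence",
     "sentence_vector": [0.0,0.0,0.1,0.1]}
  ]
}

or

PUT my_index_1/_doc/1
{
  "sentence_text": ["my first sentence", "my second sentence"],
  "sentence_vector": [[0.0,0.0,0.0,0.0], [0.0,0.0,0.1,0.1]]
}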

jtibshirani · 2019-10-03T23:49:10Z

@adouib this seems like a good question for our discuss forums, would you be able to create a discuss post and we can continue the conversation there? We usually try to keep GitHub focused on development efforts like bug reports and feature requests.

dragon-warrior-nyc · 2020-01-26T12:16:55Z

@mayya-sharipova I do not quite understand why we need to encode vectors as a sorted hash. Why do we have to do so? And what does it mean that vectors are encoded as binaries?

mayya-sharipova · 2020-01-27T17:16:58Z

@dragon-warrior-nyc Please refer to our official documentation.

The details on this PR are potential implementations we have considered that may not be relevant any more.

"Vectors are encoded as binaries" means that vectors are encoded as Lucene BinaryDocValues.

dragon-warrior-nyc · 2020-01-29T01:50:30Z

@mayya-sharipova got it and thanks for the explanation!

devendrathomare · 2020-09-12T15:34:33Z

I thank you for this clear presentation, and thank you to the participants for the exchange.

Since we wanted to index several documents with several sentences each, which data structure is the most suitable of the two I present? and what will be the good mapping?

PUT my_index_1/_doc/1
{
  "sentences": [
    {"sentence_text": "my first sentence",
     "sentence_vector": [0.0,0.0,0.0,0.0]},
    {"sentence_text": "my second sentence",
     "sentence_vector": [0.0,0.0,0.1,0.1]}
  ]
}

or

PUT my_index_1/_doc/1
{
  "sentence_text": ["my first sentence", "my second sentence"],
  "sentence_vector": [[0.0,0.0,0.0,0.0], [0.0,0.0,0.1,0.1]]
}

@adouib Hi, did we get a resolution for this?

Brentbin · 2021-01-27T07:54:04Z

Anything new here?

mayya-sharipova added the :Search Relevance/Ranking Scoring, rescoring, rank evaluation. label Jun 27, 2018

mayya-sharipova self-assigned this Aug 13, 2018

mayya-sharipova mentioned this issue Aug 21, 2018

Vector field #33022

Merged

mayya-sharipova mentioned this issue Jan 29, 2019

Distance measures for dense and sparse vectors #37947

Merged

mayya-sharipova closed this as completed in #37947 Feb 20, 2019

jtibshirani added :Search Relevance/Vectors Vector search and removed :Search Relevance/Ranking Scoring, rescoring, rank evaluation. labels Jul 21, 2022

javanna added the Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch label Jul 12, 2024
