Introduce vector field, vector query and rescoring based on them #31615

Closed
mayya-sharipova opened this issue Jun 27, 2018 · 30 comments · Fixed by #37947

@mayya-sharipova
Contributor

Introduce a new field of type vector, on which vector calculations can be done during the rescoring phase.

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "my_feature": {
          "type": "vector"
        }
      }
    }
  }
}

Indexing

Allow only a single value per document
Allow indexing of both dense and sparse vectors?

Dense form:

PUT my_index/_doc/1
{
  "my_feature":   [11.5, 10.4, 23.0]
}

Sparse form (represented as list of dimension names and values for corresponding dimensions):

PUT my_index/_doc/1
{
  "my_feature": {"1": 11.5, "5": 10.5,  "101": 23.0}
}

Query and Rescoring

Introduce a special type of vector query:

"vector" : {
   "field" : "my_feature",
    "query_vector": {"1": 3, "5": 10.5,  "101": 12}
}

This query can only be used in the rescoring context.
This query produces a score for every document in the rescoring context in the following way:

  1. If a document doesn't have a vector value for field, a score of 0 is returned.
  2. If a document does have a vector value for field (doc_vector), the cosine similarity between doc_vector and query_vector is calculated:
    dotProduct(doc_vector, query_vector) / (sqrt(dotProduct(doc_vector, doc_vector)) * sqrt(dotProduct(query_vector, query_vector)))

POST /_search
{
   "query" : {"<user-query>"},
   "rescore" : {
      "window_size" : 50,
      "query" : {
         "rescore_query" : {
            "vector" : {
               "field" : "my_feature",
               "query_vector": {"1": 3, "5": 10.5,  "101": 12}
            }
         }
      }
   }
}
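
The scoring rule proposed above can be sketched in a few lines of Python. This is only an illustration of the arithmetic, not the Elasticsearch implementation: the helper names `to_sparse` and `cosine_similarity` are hypothetical, the denominator is taken as the product of the two vectors' Euclidean norms, and returning 0 for a zero-length vector is an assumption that the proposal does not address.

```python
import math

def to_sparse(vec):
    """Normalize a dense list or a {dimension: value} mapping to {int: float}."""
    if isinstance(vec, dict):
        return {int(dim): float(val) for dim, val in vec.items()}
    return {i: float(val) for i, val in enumerate(vec)}

def cosine_similarity(doc_vector, query_vector):
    """Cosine similarity between two vectors; 0.0 if the document has no vector."""
    if doc_vector is None:
        return 0.0  # rule 1 of the proposal: missing vector scores 0
    d, q = to_sparse(doc_vector), to_sparse(query_vector)
    # Only dimensions present in both vectors contribute to the dot product.
    dot = sum(val * q[dim] for dim, val in d.items() if dim in q)
    norm_d = math.sqrt(sum(val * val for val in d.values()))
    norm_q = math.sqrt(sum(val * val for val in q.values()))
    if norm_d == 0.0 or norm_q == 0.0:
        return 0.0  # assumption: a zero vector is treated as "no similarity"
    return dot / (norm_d * norm_q)
```

Because both forms are normalized to the same mapping, the same function handles the dense form (`[11.5, 10.4, 23.0]`) and the sparse form (`{"1": 11.5, "5": 10.5, "101": 23.0}`).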

Internal encoding

  1. Encoding of vectors:
    Internally both dense and sparse vectors are encoded as sorted hash?
    Thus dense array is transformed:
    [4, 12] -> {0: 4, 1: 12}
    Keys are sorted, so we can iterate over them instead of calculating hash

  2. What should be values in vectors?

    • floats?
    • smaller than floats? (lose some precision here, but smaller index size)
  3. Vectors are encoded as binaries.
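
To make the sorted-key idea concrete, here is a minimal Python sketch, assuming vectors are held as two parallel lists (dimensions and values) ordered by dimension. The function names `encode_sorted` and `dot_product_sorted` are hypothetical; the point is only that sorted keys let a dot product walk both vectors with two cursors instead of doing hash lookups.

```python
def encode_sorted(vec):
    """Turn a dense list or a {dim: value} mapping into (dims, values) sorted by dim."""
    if isinstance(vec, dict):
        items = sorted((int(dim), float(val)) for dim, val in vec.items())
    else:
        # A dense array [4, 12] becomes dimensions 0, 1 with values 4, 12.
        items = [(i, float(val)) for i, val in enumerate(vec)]
    dims = [dim for dim, _ in items]
    values = [val for _, val in items]
    return dims, values

def dot_product_sorted(a, b):
    """Dot product of two encoded vectors via a two-pointer merge over sorted dims."""
    (dims_a, vals_a), (dims_b, vals_b) = a, b
    i = j = 0
    total = 0.0
    while i < len(dims_a) and j < len(dims_b):
        if dims_a[i] == dims_b[j]:
            total += vals_a[i] * vals_b[j]
            i += 1
            j += 1
        elif dims_a[i] < dims_b[j]:
            i += 1  # dimension only present in a; contributes nothing
        else:
            j += 1  # dimension only present in b; contributes nothing
    return total
```

The merge runs in time proportional to the number of stored dimensions in the two vectors, regardless of how large the dimension numbers themselves are.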

@mayya-sharipova mayya-sharipova added the :Search Relevance/Ranking Scoring, rescoring, rank evaluation. label Jun 27, 2018
@elasticmachine
Collaborator

Pinging @elastic/es-search-aggs

@jpountz
Contributor

jpountz commented Jun 27, 2018

> This query can only be used in the rescoring context.

If we want to enforce this, then it might be easier to have a rescorer rather than a query (today we only have one rescore implementation: QueryRescorer, but we can add more of them, see eg. https://github.com/elastic/elasticsearch/tree/master/plugins/examples/rescore). We might also want to give it a more explicit name like cosine_similarity?

@etienne1985

Hi, commenting here on @mayya-sharipova 's invitation. Our use case is that we'd want to use ES to search for sentences that have similar meaning to the sentence in the query, based on each sentence having an embedding. Vectors would be dense. Dimensionality would be 100-300 most of the time presumably. Cosine similarity would be my starting point for computing the similarity of embeddings.

@james-daily

james-daily commented Jul 3, 2018

> Allow only a single value per document

Do you mean only one vector field per document or only one value for each field? It would be useful to allow more than one vector field per document for testing different embeddings, dimensionalities, etc. Something like:

PUT my_index
{
  "mappings": {
    "_doc": {
      "properties": {
        "GloVe": {
          "type": "vector"
        },
        "word2vec": {
          "type": "vector"
        }
      }
    }
  }
}

@mayya-sharipova
Contributor Author

@james-daily Thanks for your feedback, James. Sorry, for a single value per document, we meant a single value per field, so it would be possible to have several vector fields.

@djptek
Contributor

djptek commented Jul 27, 2018

Have you considered Manhattan distance as a computationally cheaper alternative? Although it will not deliver the same results, it can be comparable for ranking vectors while delivering higher throughput than Euclidean distance or cosine similarity.
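
For readers comparing the two measures, here is a small Python sketch. The vectors are made up purely for illustration and the function names are hypothetical. It shows why the rankings need not agree: Manhattan distance is sensitive to vector magnitude, while cosine similarity only looks at direction.

```python
import math

def manhattan_distance(a, b):
    """Sum of absolute per-dimension differences (no multiplications or square roots)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def cosine_similarity(a, b):
    """Dot product divided by the product of the Euclidean norms."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Illustrative data only: one query and two candidate documents.
query = [1.0, 0.0]
doc_a = [10.0, 0.0]   # same direction as the query, much larger magnitude
doc_b = [1.0, 1.0]    # different direction, similar magnitude

# Cosine ranks doc_a first (higher similarity is better) ...
best_by_cosine = max([doc_a, doc_b], key=lambda d: cosine_similarity(query, d))
# ... while Manhattan ranks doc_b first (lower distance is better).
best_by_manhattan = min([doc_a, doc_b], key=lambda d: manhattan_distance(query, d))
```

If the stored vectors are normalized to unit length beforehand, the magnitude effect disappears and the two orderings tend to be much closer, though still not guaranteed to be identical.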

@jtibshirani
Contributor

jtibshirani commented Jul 27, 2018

In case it’s useful, here’s another datapoint from @gangeli, who also expressed interest in the feature:

  • Their use case also involves retrieving sentences or short paragraphs. Both the query and documents would be modelled using a sentence embedding (based on an RNN).
  • Vectors are dense and can have from 50 - 1000 dimensions, but are concentrated in the 200 - 300 range.
  • Ideally, cosine similarity would be applied to all documents when scoring (as opposed to just during a rescoring phase). In their use case, sentence retrieval is a component of a fairly general NLP pipeline, and they rely strongly on these sentence embeddings to understand synonyms/ textual similarity.

@mayya-sharipova
Contributor Author

@djp-search Thanks for the suggestion; we will look into Manhattan distance.

@jtibshirani thanks for another use-case

@mayya-sharipova mayya-sharipova self-assigned this Aug 13, 2018
mayya-sharipova added a commit to mayya-sharipova/elasticsearch that referenced this issue Aug 21, 2018
1. Dense vector

PUT dindex
{
  "mappings": {
    "_doc": {
      "properties": {
        "my_vector": {
          "type": "dense_vector"
        },
        "my_text" : {
          "type" : "keyword"
        }
      }
    }
  }
}

PUT dindex/_doc/1
{
  "my_text" : "text1",
  "my_vector" : [ 0.5, 10, 6 ]
}

PUT dindex/_doc/2
{
  "my_text" : "text2",
  "my_vector" : [ 0.5, 10, 10]
}

GET dindex/_search
{
  "query" : {
        "vector" : {
            "field" : "my_vector",
            "query_vector": [ 0.5, 10, 10]
        }
    }
}

Result:
....
"hits": [
    {
        "_index": "dindex",
        "_type": "_doc",
        "_id": "2",
        "_score": 1.0000001,
        "_source": {
            "my_text": "text2",
            "my_vector": [
                0.5,
                10,
                10
            ]
        }
    },
    {
        "_index": "dindex",
        "_type": "_doc",
        "_id": "1",
        "_score": 0.97016037,
        "_source": {
            "my_text": "text1",
            "my_vector": [
                0.5,
                10,
                6
            ]
        }
    }
]
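
As a sanity check, the score of document 1 can be reproduced by hand from the two indexed vectors, assuming the query computes plain cosine similarity between the document vector $d_1 = (0.5, 10, 6)$ and the query vector $q = (0.5, 10, 10)$:

```latex
\cos(d_1, q)
  = \frac{0.5 \cdot 0.5 + 10 \cdot 10 + 6 \cdot 10}
         {\sqrt{0.5^2 + 10^2 + 6^2}\,\sqrt{0.5^2 + 10^2 + 10^2}}
  = \frac{160.25}{\sqrt{136.25}\,\sqrt{200.25}}
  \approx 0.9702
```

Document 2 is identical to the query vector, so its exact similarity is 1; the reported 1.0000001 is presumably a floating-point rounding artifact.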

2. Sparse vector

PUT sindex
{
  "mappings": {
    "_doc": {
      "properties": {
        "my_vector": {
          "type": "sparse_vector"
        },
        "my_text" : {
          "type" : "keyword"
        }
      }
    }
  }
}

PUT sindex/_doc/1
{
  "my_text" : "text1",
  "my_vector" : {"1": 0.5, "99": -0.5,  "5": 1}
}

PUT sindex/_doc/2
{
  "my_text" : "text2",
  "my_vector" : {"103": 0.5, "4": -0.5,  "5": 1}
}

GET sindex/_search
{
  "query" : {
        "vector" : {
            "field" : "my_vector",
            "query_vector": {"99": -0.5,  "1": 0.5,  "5": 1}
        }
    }
}

Result:
"hits": [
    {
        "_index": "sindex",
        "_type": "_doc",
        "_id": "1",
        "_score": 0.99999994,
        "_source": {
            "my_text": "text1",
            "my_vector": {
                "1": 0.5,
                "99": -0.5,
                "5": 1
            }
        }
    },
    {
        "_index": "sindex",
        "_type": "_doc",
        "_id": "2",
        "_score": 0.6666666,
        "_source": {
            "my_text": "text2",
            "my_vector": {
                "103": 0.5,
                "4": -0.5,
                "5": 1
            }
        }
    }
]
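
The score of document 2 can be checked by hand in the same way, again assuming plain cosine similarity. The document vector and the query vector share only dimension 5, so only that dimension contributes to the dot product, while each vector has squared norm $0.5^2 + 0.5^2 + 1^2 = 1.5$:

```latex
\cos(d_2, q)
  = \frac{1 \cdot 1}{\sqrt{1.5}\,\sqrt{1.5}}
  = \frac{1}{1.5}
  \approx 0.6667
```

Document 1 has exactly the same entries as the query vector, which gives a similarity of 1 up to floating-point rounding (reported as 0.99999994).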

Search with filter:

GET sindex/_search
{
  "query": {
    "bool": {
      "must" : {
        "match": {
          "my_text": "text2"
        }
      },
      "should" : {
        "vector" : {
            "field" : "my_vector",
            "query_vector": {"99": -0.5,  "1": 0.5,  "5": 1}
        }
      }
    }
  }
}

Result:
"hits": [
    {
        "_index": "sindex",
        "_type": "_doc",
        "_id": "2",
        "_score": 0.6931472,
        "_source": {
            "my_text": "text2",
            "my_vector": {
                "103": 0.5,
                "4": -0.5,
                "5": 1
            }
        }
    }
]

3. Implementation details

3.1 Dense Vector
- BinaryDocValuesField
- byte array ->
    - integer (number of dimensions)
    - array of integers (encoded array of float values)

3.2 Sparse Vector
- BinaryDocValuesField
- byte array ->
    - integer (number of dimensions)
    - array of integers (encoded array of float values)
    - array of integers (array of integer dimensions)

Relates to elastic#31615
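
The byte layouts listed under "Implementation details" can be illustrated with Python's `struct` module. This is a sketch under stated assumptions, not the actual Lucene/Elasticsearch encoding: it assumes 4-byte big-endian integers, 4-byte IEEE 754 floats (whose bit patterns are what "encoded array of float values" is taken to mean here), and, for the sparse case, dimensions stored in ascending order. All function names are hypothetical.

```python
import struct

def encode_dense(values):
    """Layout: [count:int32][value_0:float32]...[value_n-1:float32]."""
    floats = [float(v) for v in values]
    n = len(floats)
    return struct.pack(f">i{n}f", n, *floats)

def decode_dense(data):
    (n,) = struct.unpack_from(">i", data, 0)
    return list(struct.unpack_from(f">{n}f", data, 4))

def encode_sparse(vec):
    """Layout: [count:int32][values:float32 * n][dims:int32 * n]."""
    # Assumption: entries are sorted by dimension before being written.
    items = sorted((int(dim), float(val)) for dim, val in vec.items())
    n = len(items)
    values = [val for _, val in items]
    dims = [dim for dim, _ in items]
    return struct.pack(f">i{n}f{n}i", n, *values, *dims)

def decode_sparse(data):
    (n,) = struct.unpack_from(">i", data, 0)
    values = struct.unpack_from(f">{n}f", data, 4)
    dims = struct.unpack_from(f">{n}i", data, 4 + 4 * n)
    return dict(zip(dims, values))
```

With this layout a dense vector of n dimensions takes 4 + 4n bytes and a sparse vector with n stored entries takes 4 + 8n bytes, which makes the index-size trade-off between the two forms easy to estimate.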
mayya-sharipova added a commit to mayya-sharipova/elasticsearch that referenced this issue Nov 6, 2018
Relates to elastic#31615
@softwaredoug
Contributor

Are there plans to use this to control matching as well? Such as filter in/out based on proximity (maybe some kind of distance) to a point being queried? Then it would be applicable outside a rescoring context

@mayya-sharipova
Contributor Author

@softwaredoug We are still debating if we should use this field for matching, as it may make queries slow. For now the plan is to introduce two functions cosineSimilarity and dotProduct as a part of script score query. The idea is that these functions will be used for scoring after the match is already done.

@JnBrymn-EB

We've been discussing this a bit in Relevant Search slack. I'm hoping we can use this field for matching too.

  • Certainly matching with this field will be a little slower, but there aren't any real surprises here. For instance, normal search with posting lists, etc. executes in O(num_docs), this field will surely still be O(num_docs) right? And if it's slower, I bet it's not that much slower is it? (Is it?)
  • The users of this field are likely to be the more sophisticated users who would more likely know the issues they are getting into.
  • Part of the nice value of using this field for matching is that presumably you would also be able to use it with other normal fields. For instance, I could have an index of "users" and I could say, "find me all users that are in San Francisco (geo search), that are most similar to this sample user (vector similarity)".
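
The "filter with ordinary fields, then rank by vector similarity" idea from the last bullet can be sketched as a brute-force scan in Python. This is an illustration of the approach only, with hypothetical names (`filter_then_rank`, the `city` and `vector` keys, the sample users); it says nothing about how Elasticsearch would execute such a query. The cost is linear in the number of documents that pass the filter.

```python
import math

def cosine(a, b):
    """Cosine similarity of two dense vectors of equal length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def filter_then_rank(docs, predicate, query_vector, field="vector", k=10):
    """Keep docs matching the predicate, then return the top k by cosine similarity."""
    matched = [doc for doc in docs if predicate(doc)]
    scored = [(cosine(doc[field], query_vector), doc) for doc in matched]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:k]

# Hypothetical "users" index: a filterable field plus an embedding per user.
users = [
    {"id": "a", "city": "San Francisco", "vector": [1.0, 0.0]},
    {"id": "b", "city": "San Francisco", "vector": [0.0, 1.0]},
    {"id": "c", "city": "New York", "vector": [1.0, 0.0]},
]

top = filter_then_rank(
    users,
    predicate=lambda doc: doc["city"] == "San Francisco",
    query_vector=[1.0, 0.1],
)
ranked_ids = [doc["id"] for _, doc in top]
```

Here user "c" is excluded by the filter even though its vector is as close to the query as user "a", which is the combination of matching and similarity scoring described in the comment.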

@cailurus

Hey guys, awesome job. By the way, has this feature been added in 7.0-alpha2? I'm testing dense vector rescoring, but I couldn't find the right way to query...
I've tried

POST /_search
{
   "query" : {"<user-query>"},
   "rescore" : {
      "window_size" : 50,
      "query" : {
         "rescore_query" : {
            "vector" : {
               "field" : "my_feature",
               "query_vector": {"1": 3, "5": 10.5,  "101": 12}
            }
         }
      }
   }
}

and I got:

"error":{"root_cause":[{"type":"parsing_exception","reason":"no [query] registered for [vector]","line":9,"col":24}],

mayya-sharipova added a commit to mayya-sharipova/elasticsearch that referenced this issue Jan 29, 2019
Introduce painless functions of
cosineSimilarity and dotProduct distance
measures for dense and sparse vector fields.

```js
{
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "cosineSimilarity(params.queryVector, doc['my_dense_vector'].value)",
        "params": {
          "queryVector": [4, 3.4, -1.2]
        }
      }
    }
  }
}
```

```js
{
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "cosineSimilaritySparse(params.queryVector, doc['my_sparse_vector'].value)",
        "params": {
          "queryVector": {"2": -0.5, "10" : 111.3, "50": -13.0, "113": 14.8, "4545": -156.0}
        }
      }
    }
  }
}
```

Closes elastic#31615
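
For anyone building these requests programmatically, the following Python sketch assembles a request body with the same shape as the two examples in this commit message. The helper name `script_score_body` is hypothetical, and the script syntax is copied from the commit message as-is; since the feature is described in this thread as experimental, the syntax in a released version may differ.

```python
import json

def script_score_body(field, query_vector, function="cosineSimilarity"):
    """Build a script_score request body shaped like the commit message examples."""
    source = f"{function}(params.queryVector, doc['{field}'].value)"
    return {
        "query": {
            "script_score": {
                "query": {"match_all": {}},
                "script": {
                    "source": source,
                    "params": {"queryVector": query_vector},
                },
            }
        }
    }

# Dense example: the query vector is a list.
dense_body = script_score_body("my_dense_vector", [4, 3.4, -1.2])

# Sparse example: the query vector maps dimension names to values.
sparse_body = script_score_body(
    "my_sparse_vector",
    {"2": -0.5, "10": 111.3, "50": -13.0, "113": 14.8, "4545": -156.0},
    function="cosineSimilaritySparse",
)

dense_json = json.dumps(dense_body, indent=2)
```

Passing the vector through `params` rather than inlining it in the script source keeps the script text constant across requests, which is how both examples in the commit message are written.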
@mayya-sharipova
Contributor Author

@ailurus1991 Yes, you are right: there is currently no way to query vector fields.
We are working on adding this through Painless script functions.

@cailurus

cailurus commented Feb 5, 2019

@mayya-sharipova wow I see, great work!

mayya-sharipova added a commit that referenced this issue Feb 20, 2019
* Distance measures for dense and sparse vectors

Closes #31615
mayya-sharipova added a commit to mayya-sharipova/elasticsearch that referenced this issue Feb 22, 2019
Closes elastic#31615
@arpsyapathy

@mayya-sharipova
Hello Mayya, thank you for your work!

I need help. I just installed a new Elasticsearch version, created an index, and tried the mapping from your example:

{
  "properties": {
    "my_vector": {
      "type": "dense_vector"
    },
    "my_text": {
      "type": "keyword"
    }
  }
}

and I get this error:

{
    "error": {
        "root_cause": [
            {
                "type": "mapper_parsing_exception",
                "reason": "No handler for type [dense_vector] declared on field [my_vector]"
            }
        ],
        "type": "mapper_parsing_exception",
        "reason": "No handler for type [dense_vector] declared on field [my_vector]"
    },
    "status": 400
}

Thank you in advance for your reply!

@mayya-sharipova
Contributor Author

@psyapathy What version of elasticsearch have you installed?

Indexing of vectors is available from v7.0.0-beta1, but querying them will be available only from v7.1.

@arpsyapathy

@mayya-sharipova Thank you for the reply!
That's happy and sad news at the same time.
Is there an alternative while this is still under development?

@cailurus

@mayya-sharipova hi mayya, I've installed ES7.1 and indexed documents with dense vector mapping successfully, but I didn't find a right way to query in documentation. Could you give me a hint?

@mayya-sharipova
Contributor Author

@ailurus1991 Sorry, this is a deficiency in our documentation. Scoring is available only from 7.2.
From 7.2, two functions, cosineSimilarity and dotProduct, will be available as part of script_score.

@prem6667

prem6667 commented Jun 26, 2019

@mayya-sharipova I just set up the version 7.2, but both the functions are not there. I can see that branch 7.x has these functions. Is there a way I can manually add these functions?

@mayya-sharipova
Contributor Author

@prem6667 Sorry, we have decided to postpone these functions to 7.3.
Adding these functions manually involves a non-trivial amount of work: besides the Painless functions, we need to add classes to support doc and script values.
Also, please be aware that these features are still experimental and may change.

@LiuGangR

LiuGangR commented Aug 1, 2019

@mayya-sharipova Is this feature published in 7.3? I couldn't find it.

@lior-k

lior-k commented Aug 1, 2019 via email

@adouib

adouib commented Oct 3, 2019

I thank you for this clear presentation, and thank you to the participants for the exchange.

Since we want to index several documents with several sentences each, which of the two data structures below is the most suitable, and what would be a good mapping?

PUT my_index_1/_doc/1
{
  "sentences": [
    {"sentence_text": "my first sentence",
     "sentence_vector": [0.0, 0.0, 0.0, 0.0]},
    {"sentence_text": "my second sentence",
     "sentence_vector": [0.0, 0.0, 0.1, 0.1]}
  ]
}

or

PUT my_index_1/_doc/1
{
  "sentence_text": ["my first sentence", "my second sentence"],
  "sentence_vector": [[0.0, 0.0, 0.0, 0.0], [0.0, 0.0, 0.1, 0.1]]
}

@jtibshirani
Contributor

@adouib this seems like a good question for our discuss forums, would you be able to create a discuss post and we can continue the conversation there? We usually try to keep GitHub focused on development efforts like bug reports and feature requests.

@dragon-warrior-nyc

@mayya-sharipova I don't quite understand why we need to encode vectors as a sorted hash. Why is that necessary? And what does it mean that vectors are encoded as binaries?

@mayya-sharipova
Contributor Author

@dragon-warrior-nyc Please refer to our official documentation.

The details on this PR are potential implementations we have considered that may not be relevant any more.

"Vectors are encoded as binaries" means that vectors are encoded as Lucene BinaryDocValues.

@dragon-warrior-nyc

@mayya-sharipova got it and thanks for the explanation!

@devendrathomare

@adouib Hi, did you get a resolution for your question about indexing several sentences per document?

@Brentbin

Anything new here?

@jtibshirani jtibshirani added :Search Relevance/Vectors Vector search and removed :Search Relevance/Ranking Scoring, rescoring, rank evaluation. labels Jul 21, 2022
@javanna javanna added the Team:Search Relevance Meta label for the Search Relevance team in Elasticsearch label Jul 12, 2024