Distance measures for dense and sparse vectors #37947
Introduce painless functions of cosineSimilarity and dotProduct distance measures for dense and sparse vector fields.

```js
{
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "cosineSimilarity(params.queryVector, doc['my_dense_vector'].value)",
        "params": {
          "queryVector": [4, 3.4, -1.2]
        }
      }
    }
  }
}
```

```js
{
  "query": {
    "script_score": {
      "query": {
        "match_all": {}
      },
      "script": {
        "source": "cosineSimilaritySparse(params.queryVector, doc['my_sparse_vector'].value)",
        "params": {
          "queryVector": {"2": -0.5, "10": 111.3, "50": -13.0, "113": 14.8, "4545": -156.0}
        }
      }
    }
  }
}
```

Closes elastic#31615
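For readers following the thread, here is a minimal Java sketch of what the two measures compute for dense vectors. It is not the code from this PR: `DenseVectorSketch` and its method names are hypothetical, and rejecting mismatched dimensions or zero-magnitude vectors is an assumption of the sketch, not documented behaviour.

```java
// Illustrative sketch only: DenseVectorSketch and its methods are hypothetical
// names, not the implementation added by this PR.
final class DenseVectorSketch {

    // Dot product of a query vector and a document vector, accumulated in double.
    static double dotProduct(double[] query, float[] doc) {
        // Assumption of this sketch: mismatched dimensions are rejected.
        if (query.length != doc.length) {
            throw new IllegalArgumentException(
                "Vectors must have the same number of dimensions: "
                    + query.length + " vs " + doc.length);
        }
        double sum = 0;
        for (int i = 0; i < query.length; i++) {
            sum += query[i] * doc[i];
        }
        return sum;
    }

    // Cosine similarity: dot product divided by the product of both magnitudes.
    static double cosineSimilarity(double[] query, float[] doc) {
        double querySquares = 0;
        for (double q : query) {
            querySquares += q * q;
        }
        double docSquares = 0;
        for (float d : doc) {
            docSquares += (double) d * d;
        }
        double denominator = Math.sqrt(querySquares) * Math.sqrt(docSquares);
        // Assumption of this sketch: a zero-magnitude vector is an error.
        if (denominator == 0) {
            throw new IllegalArgumentException(
                "Cosine similarity is undefined for a zero-magnitude vector");
        }
        return dotProduct(query, doc) / denominator;
    }

    public static void main(String[] args) {
        double[] queryVector = {4, 3.4, -1.2}; // same values as the example query
        float[] docVector = {1f, 2f, 3f};      // made-up document vector
        System.out.println(dotProduct(queryVector, docVector));
        System.out.println(cosineSimilarity(queryVector, docVector));
    }
}
```

The sketch works on already-decoded arrays; how the field values are stored and read is a separate concern discussed later in the review.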
Pinging @elastic/es-search |
I only had a quick look, one concern that I have is that we are leaking the internal representation of vector fields.
I believe we should instead expose vectors in scripts via a dedicated ScriptDocValues sub-class, like we are doing for dates for instance, or only give access to vector fields via functions, whose signature would look like dotProduct(queryVector, fieldName).
@@ -9,7 +9,8 @@ not exceed 500. The number of dimensions can be
 different across documents. A `dense_vector` field is
 a single-valued field.
 
-These vectors can be used for document scoring.
+These vectors can be used for
+{ref}/query-dsl-script-score-query.html#vector-functions[document scoring].
Is there a reason not to use an internal link, e.g. <<vector-functions,document scoring>>?
Thanks Adrien. I think we can use internal links only to reference within the same document. What I wanted to do here is reference a section of an external document.
I'm still a bit confused, this is the same document, isn't it?
@jpountz Sorry Adrien, I meant that inside one asciidoc doc, dense-vector.asciidoc, we want to reference a section of another asciidoc doc, script-score-query.asciidoc.
We can indeed use an easier format: <<query-dsl-script-score-query,document_scoring>>, but this will link to the whole document. And as I understood after talking with the documentation team, the only way to link to a section of another doc is to use this full html link.
ok
@jpountz Sorry Adrien, please disregard my previous comments. I have followed your advice to use internal links and it looks like documentation CI passed.
        this.queryVectorMagnitude = (float) Math.sqrt(dotProduct);
    }

    public float cosineSimilarity(BytesRef docVectorBR) {
I would make these methods return a double. We only support floats at index time because of space constraints, but this isn't a problem here.
@jpountz Thanks for the review, Adrien. I will change this to double. The main reason for float was that it is a document's score, and all other Scorers are returning floats.
@jpountz Thanks for the initial review, Adrien. I have tried to address your comments and this PR is ready for the review when you have time:
About exposing vectors in scripts via a dedicated ScriptDocValues sub-class - this was already initially implemented through About leaking the internal representation of vector fields - I have made
|
How do I use cosineSimilarity? It just tells me '"lang":"painless","caused_by":{"type":"illegal_argument_exception","reason":"Variable [my_feature] is not defined'
and this is my mapping |
Thanks Mayya, I like this approach much more. I left some minor comments. One additional thing that would be nice to address would be to make sure that users get a nice error if they call the sparse functions on dense vectors or vice-versa, I have the feeling that users would get cryptic decoding errors if they do that with the current state of your PR?
@Override
public SortedBinaryDocValues getBytesValues() {
    return null;
can you throw an exception instead?
@@ -74,6 +74,108 @@ to be the most efficient by using the internal mechanisms.
 --------------------------------------------------
 // NOTCONSOLE
 
+[[vector-functions]]
+===== Distance functions for vector fields
+These functions are used to calculate distances
Let's maybe avoid mentioning "distance" since e.g. cosineSimilarity measures the similarity between two vectors rather than their distance?
// NOTCONSOLE

NOTE: If a document doesn't have a value for a vector field on which
a distance function is executed, 0 will be returned as a result.
Let's also clarify what happens for dense vectors if they don't have the same number of dimensions?
public static int[] decodeSparseVectorDims(BytesRef vectorBR) {
    if (vectorBR == null) {
        throw new IllegalStateException("A document doesn't have a value for a vector field!");
    }
Shouldn't this be an illegal argument exception?
int i = 0;
for (Map.Entry<String, Number> dimValue : queryVector.entrySet()) {
    queryDims[i] = Integer.parseInt(dimValue.getKey());
    queryValues[i] = dimValue.getValue().floatValue();
s/floatValue/doubleValue/?
double dotProduct = 0;
int i = 0;
for (Map.Entry<String, Number> dimValue : queryVector.entrySet()) {
    queryDims[i] = Integer.parseInt(dimValue.getKey());
catch the NumberFormatException to return a more user-friendly exception?
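A minimal sketch of what this suggestion, together with the doubleValue one above, could look like. `SparseQueryVectorSketch` is a hypothetical class and the error message wording is made up; only the `Map<String, Number>` shape of the query vector is taken from the diff.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Illustrative sketch only: class name and message text are hypothetical.
final class SparseQueryVectorSketch {
    final int[] dims;
    final double[] values;

    SparseQueryVectorSketch(Map<String, Number> queryVector) {
        dims = new int[queryVector.size()];
        values = new double[queryVector.size()];
        int i = 0;
        for (Map.Entry<String, Number> dimValue : queryVector.entrySet()) {
            try {
                dims[i] = Integer.parseInt(dimValue.getKey());
            } catch (NumberFormatException e) {
                // Translate the low-level parsing failure into an error that
                // names the offending key and says what was expected.
                throw new IllegalArgumentException(
                    "Failed to parse dimension [" + dimValue.getKey()
                        + "] of the query vector: dimensions must be integers", e);
            }
            // Keep full precision for the query side, as suggested in the review.
            values[i] = dimValue.getValue().doubleValue();
            i++;
        }
    }

    public static void main(String[] args) {
        Map<String, Number> queryVector = new LinkedHashMap<>();
        queryVector.put("2", -0.5);
        queryVector.put("ten", 111.3); // not an integer key
        try {
            new SparseQueryVectorSketch(queryVector);
        } catch (IllegalArgumentException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Keeping the original NumberFormatException as the cause preserves the stack trace for debugging while the message stays readable for users.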
// calculate docVector magnitude
double dotProduct = 0;
for (float value : docValues) {
    dotProduct += value * value;
cast one of the values to a double to have better accuracy and avoid overflows?
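To illustrate the point: in Java, `value * value` with two floats is evaluated in float arithmetic before it is added to the double accumulator, so a large component can overflow to Infinity. The sketch below contrasts the two variants; `MagnitudeSketch` and its method names are hypothetical.

```java
// Illustrative sketch only: names are hypothetical, not code from this PR.
final class MagnitudeSketch {

    // The product is computed as a float, so it can overflow before widening.
    static double magnitudeWithFloatProduct(float[] values) {
        double sumOfSquares = 0;
        for (float value : values) {
            sumOfSquares += value * value;
        }
        return Math.sqrt(sumOfSquares);
    }

    // Casting one operand widens the whole product to double.
    static double magnitudeWithDoubleProduct(float[] values) {
        double sumOfSquares = 0;
        for (float value : values) {
            sumOfSquares += (double) value * value;
        }
        return Math.sqrt(sumOfSquares);
    }

    public static void main(String[] args) {
        float[] large = {3e20f}; // its square exceeds Float.MAX_VALUE
        System.out.println(magnitudeWithFloatProduct(large));
        System.out.println(magnitudeWithDoubleProduct(large));
    }
}
```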
VectorDVAtomicFieldData(BinaryDocValues values) {
    super();
    this.values = values;
let's take a LeafReader and a String field like other impls do and re-pull binary doc values each time, this way calling getScriptDocValues() multiple times on the same AtomicFieldData instance will work as expected
}

// package private access only for {@link ScoreScriptUtils}
BytesRef getValue() {
let's call it something like getEncodedValue to clarify what it is about?
@LiuGangR You need to put quotes around the field name. |
@jpountz Thanks! |
But there is a new problem: script score function must not produce negative scores
|
This is a good point, we should update examples so that they may only create positive scores, regardless of what vectors are indexed. |
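One way to keep example scores non-negative, sketched here as an assumption and not as the change that was eventually made: cosine similarity lies in [-1, 1], so adding 1.0 moves it into [0, 2]. In a script this would correspond to appending `+ 1.0` to the cosineSimilarity call. The class and method names below are hypothetical, and the clamp is there only to absorb floating-point rounding just outside the range.

```java
// Illustrative sketch only: names are hypothetical.
final class NonNegativeScoreSketch {

    // Shifts a cosine similarity from [-1, 1] into [0, 2].
    static double shiftedCosineScore(double cosine) {
        if (Double.isNaN(cosine)) {
            throw new IllegalArgumentException("Cosine similarity must not be NaN");
        }
        // Clamp first so rounding errors such as 1.0000001 cannot leak through.
        double clamped = Math.max(-1.0, Math.min(1.0, cosine));
        return clamped + 1.0;
    }

    public static void main(String[] args) {
        System.out.println(shiftedCosineScore(-1.0));
        System.out.println(shiftedCosineScore(0.25));
    }
}
```

The same shift does not bound a raw dot product, whose range depends on the magnitudes of the indexed vectors.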
@jpountz |
@LiuGangR Hopefully 7.1. |
@jpountz |
@LiuGangR yes, the only way to use |
@jpountz Thanks Adrien for another review. I have addressed all your feedback except 1 comment, and this PR is ready for another round of review whenever you have time. Unaddressed feedback:
Users can make two mistakes here:
About changing the encoding for vector fields, I was also thinking of possibly encoding the magnitude of a document vector, so that it does not have to be calculated each time. What do you think about this? |
@jpountz
|
@mayya-sharipova - For clarification, does this native vector function use Only asking because the documentation seems to suggest the use of _source values - https://www.elastic.co/guide/en/elasticsearch/reference/master/query-dsl-script-score-query.html#vector-functions Also - do you have any performance numbers you've run/tested? Someone mentioned this feature was being added and said a test with 5 Million documents with vectors of dim=300 took 5 seconds to return results, which seems like pretty anemic response times. |
@wmelton Answering your questions:
We use binary document values, we encode vectors as binaries during indexing, and decode them back to numeric vectors during search.
No, currently we don't have any, but we plan to work on adding some benchmarks. Vector functions use a linear scan over all matched docs, so the response time should increase linearly with the number of matched docs. Also, I would like to note that vector fields are an experimental feature, and the APIs and the way the vectors are indexed and encoded may change in a non-backward-compatible way. |
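The encode-at-index-time, decode-at-search-time round trip described in this answer can be pictured with a small sketch. The byte layout below (consecutive big-endian 4-byte floats) is an assumption chosen for illustration and is not claimed to be the actual on-disk format; `VectorCodecSketch` is a hypothetical class.

```java
import java.nio.ByteBuffer;

// Illustrative sketch only: the layout is an assumption, not the real encoding.
final class VectorCodecSketch {

    // Packs each float into 4 bytes, in order.
    static byte[] encode(float[] vector) {
        ByteBuffer buffer = ByteBuffer.allocate(vector.length * Float.BYTES);
        for (float value : vector) {
            buffer.putFloat(value);
        }
        return buffer.array();
    }

    // Reads the floats back; rejects missing or malformed input.
    static float[] decode(byte[] encoded) {
        if (encoded == null) {
            throw new IllegalArgumentException(
                "A document doesn't have a value for a vector field!");
        }
        if (encoded.length % Float.BYTES != 0) {
            throw new IllegalArgumentException(
                "Encoded vector length must be a multiple of " + Float.BYTES);
        }
        ByteBuffer buffer = ByteBuffer.wrap(encoded);
        float[] vector = new float[encoded.length / Float.BYTES];
        for (int i = 0; i < vector.length; i++) {
            vector[i] = buffer.getFloat();
        }
        return vector;
    }

    public static void main(String[] args) {
        float[] original = {4f, 3.4f, -1.2f};
        float[] roundTripped = decode(encode(original));
        System.out.println(java.util.Arrays.toString(roundTripped));
    }
}
```

Because every matched document has to be decoded and scored this way, the cost grows with the number of matched documents, which is the linear behaviour mentioned above.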
Hi @mayya-sharipova - Thank you for your responses. Regarding "Vector functions use linear scan over all matched docs, so the response time should increase linearly with the number of matched docs." - I think taking the linear approach for this is a mistake, personally. The pL2AP algorithm and Facebook's open source FAISS (Facebook AI Similarity Search) both highlight ways to parallelize the search space. I think implementing a linear search approach will be frustrating to the type of users who are actually the most likely to want to use the dense or sparse vector field type you are proposing adding. |
@wmelton Thanks for your comment. Indeed linear scan would not scale, and it is intended mostly to score a limited set of documents. About |
@mayya-sharipova with 200 floating point numbers |
Your example above with "queryVector": [ 4.5, 3.4, -1.2] works fine, but when it comes to [0.7831882238388062, 0.8473913073539734, 0.6641695499420166...] vectors, I get an error: |
@ra1ski What do you mean by "long dense vectors"? Do you mean to use 200 dimensions? Yes, you can use up to 1024 dimensions. It should be fine. |
Yes, 200 dimensions. Here is the query:
{
"query": {
"script_score": {
"query": {
"match_all": {}
},
"script": {
"source": "cosineSimilarity(params.queryVector, doc['vector'])",
"params": {
"queryVector": [0.7831882238388062, 0.8473913073539734, 0.6641695499420166, -0.7800988554954529, 0.6427151560783386, 0.8618375062942505, -0.7508959174156189, 0.8940073251724243, -0.8382183313369751, -0.8465797305107117, 0.8887408375740051, 0.8348124623298645, 0.7685972452163696, -0.8586599230766296, 0.7378193140029907, -0.7119467854499817, -0.8077011108398438, 0.8601088523864746, 0.8935535550117493, 0.6392208337783813, 0.8716743588447571, -0.7871374487876892, 0.6682323217391968, -0.8151301145553589, -0.8227899670600891, -0.7399943470954895, -0.897373378276825, 0.8426622152328491, 0.8269796371459961, 0.8424233198165894, 0.8509830236434937, -0.7777097821235657, 0.8377213478088379, 0.9059052467346191, 0.7352653741836548, -0.7400990128517151, -0.8934587240219116, -0.9130118489265442, -0.8574285507202148, -0.8946468234062195, 0.8552821278572083, 0.8763160705566406, -0.7989016771316528, -0.642711341381073, -0.7476733922958374, -0.8486865758895874, 0.8278630971908569, -0.8525271415710449, -0.8806391954421997, -0.6730614304542542, -0.881908118724823, 0.7430080771446228, 0.7847618460655212, 0.8260719180107117, -0.8224948644638062, -0.7607067823410034, 0.8367544412612915, 0.20206642150878906, 0.7692943215370178, -0.8679789304733276, -0.7517973780632019, -0.8642300367355347, -0.7322789430618286, -0.8890762329101562, -0.8113778829574585, -0.8182528614997864, -0.8263254165649414, 0.8806875944137573, -0.8628260493278503, 0.838936984539032, 0.8677369952201843, -0.776382565498352, 0.8289804458618164, 0.6592877507209778, -0.8425590395927429, -0.763074517250061, 0.8569432497024536, -0.7417001128196716, 0.8681409955024719, -0.8540714979171753, -0.8500930070877075, -0.8368064761161804, -0.8406449556350708, -0.8733716011047363, -0.8958595991134644, 0.8130819201469421, -0.8314911723136902, 0.8423287272453308, 0.8449920415878296, -0.8795095682144165, 0.7511520981788635, -0.8035956621170044, 0.7193001508712769, 0.7730565071105957, -0.857988178730011, 0.8187726140022278, 0.831302285194397, 
0.8996239900588989, -0.863531231880188, 0.8358138799667358, -0.8426796197891235, 0.8390976190567017, 0.7986222505569458, -0.8568884134292603, 0.8369844555854797, 0.8447090983390808, 0.8311792612075806, -0.8208156824111938, -0.7700560092926025, -0.784808874130249, -0.874031662940979, 0.8473763465881348, 0.8083603978157043, 0.8634394407272339, 0.8724079132080078, -0.7952577471733093, 0.5091663599014282, 0.656829833984375, -0.8029653429985046, -0.8171727061271667, 0.8314194679260254, -0.8559287190437317, 0.8022019267082214, 0.7917070388793945, -0.8446627855300903, -0.7673274278640747, 0.832277774810791, -0.8024963140487671, 0.9498147964477539, -0.7452983856201172, 0.8978539705276489, 0.8834426999092102, 0.8543949127197266, 0.8466156721115112, -0.8207280039787292, 0.8191858530044556, -0.8309515118598938, 0.7519159317016602, 0.8341091275215149, -0.8656532168388367, 0.8573458790779114, -0.8247408866882324, 0.9135391116142273, -0.8272571563720703, -0.8448845148086548, -0.8408781290054321, -0.8409822583198547, -0.842566967010498, 0.7356223464012146, 0.8904960751533508, 0.8448322415351868, -0.8642748594284058, 0.8605462908744812, 0.8045945167541504, -0.8715876340866089, -0.8079540133476257, -0.8474785089492798, -0.8472393155097961, 0.8432945013046265, -0.8253397941589355, 0.7905577421188354, 0.7081928253173828, 0.6722716093063354, 0.8101333379745483, -0.8465112447738647, 0.8858150243759155, 0.8352972269058228, -0.7904651761054993, -0.8659583330154419, -0.8847810626029968, -0.762391209602356, -0.7752716541290283, -0.7860286831855774, -0.8350412249565125, -0.8377161026000977, -0.8326281309127808, 0.6579743027687073, -0.8490581512451172, 0.7932018041610718, 0.7292879819869995, 0.8307300806045532, 0.8333244323730469, -0.7778127193450928, -0.8621459007263184, -0.8240952491760254, 0.8149698376655579, 0.8036678433418274, 0.7759568691253662, -0.8074528574943542, -0.8319423794746399, -0.685379683971405, -0.6155311465263367, 0.771338701248169, 0.7577664256095886, 0.7837430238723755, 
-0.7604954838752747, 0.8120626211166382, -0.8959243893623352, -0.7081544995307922, 0.8636442422866821]
}
}
}
}
} |
@ra1ski Vector functions are available starting from 7.2 |
Add L1norm - Manhattan distance Add L2norm - Euclidean distance relates to elastic#37947
Hi @mayya-sharipova. These two functions are not available in painless when they are used for sorting with a painless script. I think they are not available in the so-called Sort Context. Is there a reason for that? Example query:
{
"_source": {
"excludes": [
"*"
]
},
"from": 0,
"profile": true,
"query": {
"bool": {
"filter": [
{
"match_all": {}
}
]
}
},
"sort": {
"_script": {
"order": "desc",
"script": {
"lang": "painless",
"params": {
"user_vector": [
140,
45
]
},
"source": "\nreturn dotProduct(params.user_vector, doc[\"vctr\"]);"
},
"type": "number"
}
}
}
{
"error": {
"root_cause": [
{
"type": "script_exception",
"reason": "compile error",
"script_stack": [
"\nreturn dotProduct(params.user_ve ...",
" ^---- HERE"
],
"script": "\nreturn dotProduct(params.user_vector, doc[\"vctr\"]);",
"lang": "painless"
}
],
"type": "search_phase_execution_exception",
"reason": "all shards failed",
"phase": "query",
"grouped": true,
"failed_shards": [
{
"shard": 0,
"index": "test",
"node": "LwRebTTXS3CgSzdYSqpFFA",
"reason": {
"type": "script_exception",
"reason": "compile error",
"script_stack": [
"\nreturn dotProduct(params.user_ve ...",
" ^---- HERE"
],
"script": "\nreturn dotProduct(params.user_vector, doc[\"vctr\"]);",
"lang": "painless",
"caused_by": {
"type": "illegal_argument_exception",
"reason": "Unknown call [dotProduct] with [2] arguments."
}
}
}
]
},
"status": 400
} |
This limitation makes vectors a bit useless in script sorting. One can try to implement dotProduct or cosineSimilarity in painless, but it is not possible since this is package private, which makes it impossible to decode the values. |
@csenol Thanks for letting us know about this issue. Indeed, vector functions are not available in the Sort Script Context in 7.3. We have made a patch to add them to the Sort Context from 7.4. |
@mayya-sharipova Thanks a million for such a quick action. I wish it were a fix in 7.3.x instead of waiting for 7.4, but anyway, thanks a lot |
Hi, @mayya-sharipova! Thanks for the great work with vector scoring! In my setup it can sort 3m 512d vectors in ~1200ms, and paired with LSH it can achieve around 100ms while scoring 120k top matches (10k per shard). The only issue I've found so far is that when using dotProduct on normalized vectors, the score might be in the range (-1,1), which can cause an error with a negative score. Currently I'm fixing it by normalizing the score to the range (0,1) with the following: |
@SthPhoenix Thanks for the info on your setup.
I was wondering which LSH you are using? |
I'm using my fork of @alexklibisz elastik-nearest-neighbors plugin. |
@@ -119,8 +120,7 @@ public Query existsQuery(QueryShardContext context) {
 
     @Override
     public IndexFieldData.Builder fielddataBuilder(String fullyQualifiedIndexName) {
-        throw new UnsupportedOperationException(
-            "Field [" + name() + "] of type [" + typeName() + "] doesn't support sorting, scripting or aggregating");
+        return new VectorDVIndexFieldData.Builder(true);
Hi @mayya-sharipova, I've been doing some digging to figure out why fielddataBuilder looks like it makes the dense vector field aggregatable, while docValueFormat says throw new UnsupportedOperationException("Field [" + name() + "] of type [" + typeName() + "] doesn't support docvalue_fields or aggregations");. Is that correct or does docValueFormat need updating? I am totally fine with fixing if needed, just trying to put all the pieces together first. ;)
Introduce painless functions of
cosineSimilarity and dotProduct distance
measures for dense and sparse vector fields.
Closes #31615