Making vector similarity functions pluggable #12219

benwtrent · 2023-03-31T11:55:57Z

Description

There are two major reasons for adding a custom vector similarity function:

- Adding new, more nuanced similarity functions (Jaccard, Hamming, etc.)
- Providing external support for unstable JVM APIs (the incubating Vector API)

I am not 100% sure Lucene itself should go through the work of consistently adding new similarity functions.

We should make these pluggable in such a way that developers using Lucene can provide specialized distance functions.

I think the main issue is that the Vector Similarity function is tied to the FieldType and currently that is not pluggable via any external configuration.

There are two ways I can think of for doing this:

- Providing the functions in the indexing and search configurations (not a fan of this option).
- Using an SPI (seems like a more natural option).
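
For illustration, the SPI option could look roughly like the sketch below. Every name in it (`PluggableVectorSimilarity`, `HammingSimilarity`, `SimilarityLookup`) is hypothetical and not part of Lucene; the only real API used is the JDK's `java.util.ServiceLoader`. It assumes providers are registered the usual way, through a `META-INF/services` file named after the interface.

```java
import java.util.ServiceLoader;

/** Hypothetical SPI contract; this is not an existing Lucene interface. */
interface PluggableVectorSimilarity {
  /** Unique name used to look the function up, e.g. "hamming". */
  String name();

  /** Higher scores mean more similar vectors. */
  float compare(byte[] a, byte[] b);
}

/** Example provider: a score derived from the bitwise Hamming distance. */
final class HammingSimilarity implements PluggableVectorSimilarity {
  @Override
  public String name() {
    return "hamming";
  }

  @Override
  public float compare(byte[] a, byte[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException(
          "vector dimensions differ: " + a.length + " != " + b.length);
    }
    int distance = 0;
    for (int i = 0; i < a.length; i++) {
      // Count the bits that differ between the two bytes.
      distance += Integer.bitCount((a[i] ^ b[i]) & 0xFF);
    }
    // Map the distance onto a score in (0, 1]; identical vectors score 1.
    return 1f / (1f + distance);
  }
}

final class SimilarityLookup {
  private SimilarityLookup() {}

  /** Finds a provider by name among those registered under META-INF/services. */
  static PluggableVectorSimilarity forName(String name) {
    for (PluggableVectorSimilarity s : ServiceLoader.load(PluggableVectorSimilarity.class)) {
      if (s.name().equals(name)) {
        return s;
      }
    }
    throw new IllegalArgumentException("no vector similarity registered as '" + name + "'");
  }
}
```

With this shape, only the name would need to be stored with the field, and the function itself would be resolved at read time, much as codecs are looked up by name.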

Opening this issue for discussion.

jpountz · 2023-03-31T17:15:18Z

What about doing this through vector formats? Vector formats could take a similarity parameter, which would win over the one configured on the field type when set. The benefit is that it builds on the fact that vector formats are already pluggable, which makes things simpler in my view. The downside is that you can't plug in similarity functions independently from vector formats.

This makes me wonder more generally if the similarity should have been an implementation detail of vector formats, like maxConn and beamWidth. Having it on the field type is a bit more user friendly if we think it's important for users to be able to choose between cosine, dot-product and euclidean, but a downside of this choice is that it requires any legal KNN vectors format to support all 3 similarity functions.
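
A minimal sketch of the precedence rule described above, assuming a format that accepts an optional similarity alongside maxConn and beamWidth. `FieldSimilarity` and `SketchVectorFormat` are stand-in names invented for this example, not Lucene classes.

```java
/** Stand-in for the field-level choice; mirrors the three functions named above. */
enum FieldSimilarity {
  EUCLIDEAN,
  DOT_PRODUCT,
  COSINE
}

/** Hypothetical format that may carry its own similarity, like maxConn and beamWidth. */
final class SketchVectorFormat {
  private final int maxConn;
  private final int beamWidth;
  // null means "no override": fall back to the field type's similarity.
  private final FieldSimilarity similarityOverride;

  SketchVectorFormat(int maxConn, int beamWidth, FieldSimilarity similarityOverride) {
    this.maxConn = maxConn;
    this.beamWidth = beamWidth;
    this.similarityOverride = similarityOverride;
  }

  /** The format-level value wins when set; otherwise the field type decides. */
  FieldSimilarity effectiveSimilarity(FieldSimilarity fromFieldType) {
    return similarityOverride != null ? similarityOverride : fromFieldType;
  }

  @Override
  public String toString() {
    return "SketchVectorFormat(maxConn=" + maxConn + ", beamWidth=" + beamWidth
        + ", similarityOverride=" + similarityOverride + ")";
  }
}
```

The sketch also shows the trade-off mentioned above: the override is tied to the format instance, so a similarity cannot be swapped without also choosing a format.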

benwtrent · 2023-04-17T18:02:46Z

@msokolov what do you think about this?

msokolov · 2023-04-18T20:35:21Z

It makes sense to me. I think we got where we are because initially all these things were field-level and then some of them got migrated to the format. Now we're in a middle place straddling two different APIs. Adrien's suggestion to override the field-level setting seems a bit odd from a user perspective though -- I guess a per-field vector format can choose to ignore the similarity function, but then it feels like we ought to avoid setting it in the first place. Maybe we could create a FieldType similarity function "DEFAULT" that is whatever default is provided by the format?

benwtrent · 2023-04-26T17:30:32Z

OK, I think this design would look like this:

- When creating the format, a similarity function can optionally be provided.
  - The vector similarity function will be a new Java interface.
  - We will default to the same default as the current vector similarity enumeration.
- Add a new value, DEFAULT, to the existing vector similarity enumeration, indicating that the format's similarity interface must be used. Its implementation will throw UnsupportedOperationException for its similarity functions.
- Add a vector similarity parameter for the vector reader & writer that the format will provide.
  - If we didn't do this, there would have to be some serialization support for storing custom vector similarity functions, which seems weird to me.
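
The design above might be sketched as follows. This is an illustration under assumptions, not the proposed patch: `FormatVectorSimilarity`, `SketchSimilarityFunction` and `SketchVectorsWriter` are hypothetical names, and the enum is reduced to one built-in function plus the DEFAULT sentinel.

```java
/** Hypothetical interface for a similarity supplied by the vector format. */
interface FormatVectorSimilarity {
  float compare(float[] v1, float[] v2);
}

/** Stand-in for the existing enumeration, extended with a DEFAULT sentinel. */
enum SketchSimilarityFunction {
  EUCLIDEAN {
    @Override
    float compare(float[] v1, float[] v2) {
      float squaredDistance = 0f;
      for (int i = 0; i < v1.length; i++) {
        float diff = v1[i] - v2[i];
        squaredDistance += diff * diff;
      }
      // Turn a distance into a score where higher means more similar.
      return 1f / (1f + squaredDistance);
    }
  },
  DEFAULT {
    @Override
    float compare(float[] v1, float[] v2) {
      // The sentinel has no function of its own; the format must supply one.
      throw new UnsupportedOperationException(
          "DEFAULT defers to the similarity provided by the vector format");
    }
  };

  abstract float compare(float[] v1, float[] v2);
}

/** Hypothetical writer that receives the similarity from its format. */
final class SketchVectorsWriter {
  private final FormatVectorSimilarity formatSimilarity;

  SketchVectorsWriter(FormatVectorSimilarity formatSimilarity) {
    this.formatSimilarity = formatSimilarity;
  }

  /** Uses the format's function only when the field asks for DEFAULT. */
  float score(SketchSimilarityFunction fieldSimilarity, float[] a, float[] b) {
    if (fieldSimilarity == SketchSimilarityFunction.DEFAULT) {
      return formatSimilarity.compare(a, b);
    }
    return fieldSimilarity.compare(a, b);
  }
}
```

Because the format hands the function to the reader and writer directly, nothing about the custom function has to be serialized; only the DEFAULT marker would be stored with the field.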

rmuir · 2023-05-17T02:00:56Z

Sorry, I'm -1 to this. This is going with the approach that only those with the resources of Amazon or Elastic can have performant search. IMO it is OUR JOB AS A LIBRARY to implement these functions in a performant way, for everyone, not just those with the resources of big tech to plug in some custom shit because openjdk is a shitshow.

taking a stand.

I propose an alternative approach here: #12302

benwtrent · 2023-05-17T18:09:31Z

@rmuir I was not suggesting it as a way to only get performance for "some big company". I just thought using an incubating API was out of the question in Lucene (you have implied as much in other vector API discussions) and I was hoping to find a way forward while we were stalled. You have mentioned before that it must at least be "preview".

I am very happy to see you initiating the work for Vector API support in Lucene. I think it will make so many things faster! I will be glad to help where I can in using the Vector API in Lucene.

I will happily pause any further work here for the time being :)

benwtrent · 2023-12-06T14:54:07Z

We can close this. We added Panama Vector API support to Lucene directly, which addressed my main concern with this issue.

benwtrent added the type:enhancement label Mar 31, 2023

uschindler mentioned this issue May 17, 2023

vector API integration, plan B #12302

Closed

ashvardanian mentioned this issue Aug 11, 2023

USearch integration and potential Vector Search performance improvements #12502

Open

benwtrent closed this as completed Dec 6, 2023

benwtrent mentioned this issue Feb 6, 2024

Adding binary Hamming distance as similarity option for byte vectors #13076

Closed

alessandrobenedetti added the vector-based-search label May 20, 2024
