Making vector similarity functions pluggable #12219

Closed
benwtrent opened this issue Mar 31, 2023 · 7 comments

Comments

@benwtrent
Member

Description

There are two major reasons for adding a custom vector similarity function:

  • Adding new nuanced similarity functions (jaccard, hamming, etc.)
  • Adding external support for unstable JVM APIs (the incubating Vector API)

I am not 100% sure Lucene itself should go through the work of consistently adding new similarity functions.

We should make these pluggable in such a way that developers using Lucene can provide specialized distance functions.

I think the main issue is that the Vector Similarity function is tied to the FieldType and currently that is not pluggable via any external configuration.

There are two ways I can think of for doing this:

  • The functions being provided in indexing and search configurations (not a fan of this option).
  • Using an SPI (seems like a more natural option).
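
To make the SPI option concrete, here is a minimal sketch of how a lookup by name could work with `java.util.ServiceLoader`. Everything named here (`VectorSimilarityProvider`, `HammingSimilarityProvider`, `VectorSimilarities`) is hypothetical and not an existing Lucene API; the Hamming scoring formula is also just an assumption for illustration.

```java
import java.util.Optional;
import java.util.ServiceLoader;

/** Hypothetical SPI contract (not a Lucene interface); higher scores mean more similar. */
interface VectorSimilarityProvider {
  /** Name used to look the function up, e.g. from field or format configuration. */
  String name();

  float score(byte[] a, byte[] b);
}

/**
 * Example provider. To be discoverable it would have to be listed in a
 * META-INF/services file named after the fully qualified interface name.
 */
final class HammingSimilarityProvider implements VectorSimilarityProvider {
  @Override
  public String name() {
    return "hamming";
  }

  @Override
  public float score(byte[] a, byte[] b) {
    if (a.length != b.length) {
      throw new IllegalArgumentException("vector lengths differ");
    }
    int distance = 0;
    for (int i = 0; i < a.length; i++) {
      // Count differing bits; mask to undo sign extension of the byte XOR.
      distance += Integer.bitCount((a[i] ^ b[i]) & 0xFF);
    }
    // Assumed mapping from distance to score: identical vectors score 1.
    return 1f / (1f + distance);
  }
}

/** Hypothetical registry that resolves providers through ServiceLoader. */
final class VectorSimilarities {
  private VectorSimilarities() {}

  static Optional<VectorSimilarityProvider> forName(String name) {
    for (VectorSimilarityProvider p : ServiceLoader.load(VectorSimilarityProvider.class)) {
      if (p.name().equals(name)) {
        return Optional.of(p);
      }
    }
    return Optional.empty();
  }
}
```

With this shape, a custom function only needs a jar on the classpath plus the services file; no indexing or search configuration has to carry the function itself.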

Opening this issue for discussion.

@jpountz
Contributor

jpountz commented Mar 31, 2023

What about doing this through vector formats: vector formats could take a similarity parameter, which would win over the one configured on the field type when set. The benefit is that it builds on the fact that vector formats are already pluggable, which makes things simpler in my view. The downside is that you can't plug in similarity functions independently from vector formats.

This makes me wonder more generally if the similarity should have been an implementation detail of vector formats, like maxConn and beamWidth. Having it on the field type is a bit more user friendly if we think it's important for users to be able to choose between cosine, dot-product and euclidean, but a downside of this choice is that it requires any legal KNN vectors format to support all 3 similarity functions.
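
The precedence rule described above (a similarity set on the format wins over the one on the field type) could be sketched roughly as follows. `FieldSimilarity` and `SketchVectorsFormat` are made-up names for illustration only, not Lucene classes.

```java
/** Stand-in for the field-type similarity choices mentioned in the thread. */
enum FieldSimilarity {
  EUCLIDEAN,
  DOT_PRODUCT,
  COSINE
}

/** Hypothetical vector format that may carry its own similarity. */
final class SketchVectorsFormat {
  // null means the format does not override the field-type similarity.
  private final FieldSimilarity override;

  SketchVectorsFormat() {
    this.override = null;
  }

  SketchVectorsFormat(FieldSimilarity override) {
    this.override = override;
  }

  /** The format-level similarity wins when set; otherwise fall back to the field type. */
  FieldSimilarity effectiveSimilarity(FieldSimilarity fieldTypeSimilarity) {
    return override != null ? override : fieldTypeSimilarity;
  }
}
```

The trade-off is visible in the sketch: the override lives entirely inside the format, so a similarity cannot be swapped without also choosing (or wrapping) a format.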

@benwtrent
Member Author

@msokolov what do you think about this?

@msokolov
Contributor

It makes sense to me. I think we got where we are because initially all these things were field-level and then some of them got migrated to the format. Now we're in a middle place straddling two different APIs. Adrien's suggestion to override the field-level setting seems a bit odd from a user perspective though -- I guess a per-field vector format can choose to ignore the similarity function, but then it feels like we ought to avoid setting it in the first place. Maybe we could create a FieldType similarity function "DEFAULT" that is whatever default is provided by the format?

@benwtrent
Member Author

OK, I think this design would look like this:

  • When creating the format, a similarity function can be optionally provided
    • The vector similarity function will be a new java interface
    • We will default to the same default as the current vector similarity enumeration.
  • Add a new DEFAULT value to the existing vector similarity enumeration, indicating that the format's similarity interface must be used. The implementation of this value will throw UnsupportedOperationException for its similarity functions.
  • Add a vector similarity parameter for the vector reader & writer that the format will provide.
    • If we didn't do this, there would have to be some serialization support for storing custom vector similarity functions, which seems weird to me.
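
A rough sketch of how the pieces of this design might fit together, under the assumption that the reader is handed the format's similarity at construction time. `PluggableSimilarity`, `SketchSimilarityFunction`, and `SketchVectorReader` are hypothetical names, not the actual Lucene types.

```java
/** Hypothetical pluggable similarity interface; higher scores mean more similar. */
interface PluggableSimilarity {
  float compare(float[] a, float[] b);
}

/** Stand-in for the existing similarity enumeration, extended with DEFAULT. */
enum SketchSimilarityFunction {
  DOT_PRODUCT {
    @Override
    float compare(float[] a, float[] b) {
      float sum = 0f;
      for (int i = 0; i < a.length; i++) {
        sum += a[i] * b[i];
      }
      return sum;
    }
  },
  DEFAULT {
    @Override
    float compare(float[] a, float[] b) {
      // DEFAULT is only a marker: scoring must go through the format's similarity.
      throw new UnsupportedOperationException("DEFAULT defers to the format's similarity");
    }
  };

  abstract float compare(float[] a, float[] b);
}

/** Hypothetical reader that receives the similarity from its format. */
final class SketchVectorReader {
  private final PluggableSimilarity formatSimilarity;

  SketchVectorReader(PluggableSimilarity formatSimilarity) {
    this.formatSimilarity = formatSimilarity;
  }

  float score(SketchSimilarityFunction fieldSimilarity, float[] query, float[] doc) {
    if (fieldSimilarity == SketchSimilarityFunction.DEFAULT) {
      return formatSimilarity.compare(query, doc);
    }
    return fieldSimilarity.compare(query, doc);
  }
}
```

Because the format hands the function to the reader and writer in code, only the DEFAULT marker would need to be stored with the field, not the custom function itself.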

@rmuir
Member

rmuir commented May 17, 2023

Sorry, I'm -1 to this. This is going with the approach that only those with the resources of Amazon or Elastic can have performant search. IMO it is OUR JOB AS A LIBRARY to implement these functions in a performant way, for everyone, not just those with the resources of big tech to plug in some custom shit because OpenJDK is a shitshow.

taking a stand.

I propose an alternative approach here: #12302

@benwtrent
Member Author

@rmuir I was not suggesting it as a way to only get performance for "some big company". I just thought using an incubating API was out of the question in Lucene (you have implied as much in other vector API discussions) and I was hoping to find a way forward while we were stalled. You have mentioned before that it must at least be "preview".

I am very happy to see you initiating the work for Vector API support in Lucene. I think it will make so many things faster! I will be glad to help where I can in using the Vector API in Lucene.

I will happily pause any further work here for the time being :)

@benwtrent
Member Author

We can close this. We added the Panama Vector API to Lucene directly, which was my main concern with this issue.
