FieldInfosFormat translation should be independent of VectorSimilartyFunction enum #13119
This PR is a prerequisite for future work to make the similarity function in the format symbolic and lookup-able, see #13076 (comment).
lucene/core/src/java/org/apache/lucene/codecs/lucene94/Lucene94FieldInfosFormat.java
Index format wise, I think the index corruption can occur when reading a Lucene 9.8.0 index with Lucene 9.7.0, as the format would allow that, but I am not sure this is an expected scenario.
lucene/core/src/java/org/apache/lucene/codecs/lucene94/Lucene94FieldInfosFormat.java
Hi,
as stated in the other issue: I am not really happy to have that enum at all! The similarity/distance functions should be pluggable using NamedSPILoader. To implement that, the ordinals must be removed in a new file format version and names written instead, using the Codec utility classes.
As a first step this PR is fine, as it does not change the file format and just decouples the ordinals from the enum. In future, when we have SPI, we can use the current code of the ordinals.
In my opinion, the strings as lookup keys are not needed: just define it as List<VectorSimilarityFunction> to get the link between them. At a later stage the backwards layer could then fall back to the list with SPI instances to look up the legacy ordinals. The codec and the enum are still decoupled enough.
This is perfectly fine.
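To make the list-based suggestion concrete, here is a minimal sketch. It assumes a stand-in enum `SimilarityFunction` and a helper class `FormatSimilarityOrdinals`; both names are hypothetical and are not the code in this PR. The enum constants are deliberately declared in a different order than the list, to show that the ordinal comes from the list position and not from the declaration order.

```java
import java.util.List;

// Hypothetical stand-in for VectorSimilarityFunction. The declaration order
// here intentionally differs from the format's ordinal order below.
enum SimilarityFunction {
  MAXIMUM_INNER_PRODUCT,
  COSINE,
  DOT_PRODUCT,
  EUCLIDEAN
}

// Hypothetical helper owned by the format: list position == on-disk ordinal.
final class FormatSimilarityOrdinals {
  // The order of this list is part of the file format. Append new entries at the end only.
  static final List<SimilarityFunction> SIMILARITY_FUNCTIONS =
      List.of(
          SimilarityFunction.EUCLIDEAN,
          SimilarityFunction.DOT_PRODUCT,
          SimilarityFunction.COSINE,
          SimilarityFunction.MAXIMUM_INNER_PRODUCT);

  private FormatSimilarityOrdinals() {}

  // Translates a function to the ordinal the format writes.
  static byte toOrdinal(SimilarityFunction function) {
    int index = SIMILARITY_FUNCTIONS.indexOf(function);
    if (index < 0) {
      throw new IllegalArgumentException("no format ordinal for " + function);
    }
    return (byte) index;
  }

  // Translates an ordinal read from the format back to a function.
  static SimilarityFunction fromOrdinal(byte ordinal) {
    if (ordinal < 0 || ordinal >= SIMILARITY_FUNCTIONS.size()) {
      throw new IllegalArgumentException("invalid similarity function ordinal: " + ordinal);
    }
    return SIMILARITY_FUNCTIONS.get(ordinal);
  }
}
```

With this shape, reordering or extending the enum does not change what is written to disk; only an edit to the list itself can do that.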
Agreed on where we wanna get to. Just trying to get there incrementally, since format changes are quite noisy.
Exactly, this is just a first step. It (for the most part) encapsulates the translation in the format. When we add a new format and/or evolve VectorSimilarityFunction, this format should be largely immune to the change.
Yeah, that's probably good enough for now. Updated.
Ok cool. I was worried for nothing then.
The comment is a bit outdated. I was thinking of making it even more verbose by having 2 maps for the lookup... +1 looks fine.
Depending on enum order was always trappy. Thanks for decoupling!
if you fix the comment and remove "names" from
maybe be sure to explicitly say: add new ones always at end of list :-)
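The "always at end of list" rule can be checked mechanically. The following is a sketch only: `OrdinalListCompatibility` and `isAppendOnly` are hypothetical names, not part of Lucene. It treats an updated list as safe only when the previous list is an exact prefix of it, so that every ordinal already written keeps its meaning.

```java
import java.util.List;

// Hypothetical check, not part of Lucene: an updated ordinal list is
// compatible only if it appends entries and leaves existing positions alone.
final class OrdinalListCompatibility {
  private OrdinalListCompatibility() {}

  static <T> boolean isAppendOnly(List<T> previous, List<T> updated) {
    if (updated.size() < previous.size()) {
      return false; // an entry was removed, so some old ordinal is now undefined
    }
    // The previous list must be an exact prefix of the updated one.
    return updated.subList(0, previous.size()).equals(previous);
  }
}
```

A check of this kind could live in a unit test that pins the previously released list as a literal and compares it against the current one.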
I see now that we have a similar dependency in
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsReader.java
Thanks for the reviews. All comments have been addressed.
…Function enum (#13119)
This commit updates the FieldInfosFormat translation of vector similarity functions to be independent of the VectorSimilarityFunction enum. The VectorSimilarityFunction enum lives outside of the codec format, and the format should not inadvertently depend upon the declaration order or values in VectorSimilarityFunction. The format should be in charge of the translation of similarity function to format ordinal (and vice versa). In reality, and for now, the translation remains the same as the declaration order, but this may not be the case in the future.
@benwtrent honestly don't remember, but I do know that early on we tried things a few different ways. There was some discussion about whether the similarity function and dimensions should be in the codec vs in the field info. I suspect it evolved and we did not end up removing the redundant version?
@msokolov thanks for clarifying. I just wanted to make sure there wasn't an important reason that I missed.
This commit updates the FieldInfosFormat translation of vector similarity functions to be independent of the VectorSimilarityFunction enum.
The VectorSimilarityFunction enum lives outside of the codec format, and the format should not inadvertently depend upon the declaration order or values in VectorSimilarityFunction. The format should be in charge of the translation of similarity function to format ordinal (and vice versa). In reality, and for now, the translation remains the same as the declaration order, but this may not be the case in the future.
Note: did we introduce a potential index corruption issue when adding maximum inner product in 9.8.0? The format was not updated when the enum value was added, so the ordinal for maximum inner product is unknown to Lucene 9.7.0, which uses the same format.
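The scenario in the note can be illustrated with a small simulation. This is a sketch under stated assumptions: `SimilarityOrdinalReader` and `inputOf` are hypothetical names, the function names are plain strings, and a plain `IOException` stands in for whatever exception a real reader would raise. An "older" reader that knows three functions rejects an ordinal written by a "newer" writer that knows four, instead of silently misreading it.

```java
import java.io.ByteArrayInputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.util.List;

// Hypothetical reader-side check; real Lucene readers use their own
// input abstractions and exception types.
final class SimilarityOrdinalReader {
  private final List<String> knownFunctions;

  SimilarityOrdinalReader(List<String> knownFunctions) {
    this.knownFunctions = List.copyOf(knownFunctions);
  }

  // Reads one ordinal byte and fails loudly if this reader does not know it.
  String read(DataInputStream in) throws IOException {
    byte ordinal = in.readByte();
    if (ordinal < 0 || ordinal >= knownFunctions.size()) {
      throw new IOException("invalid similarity function ordinal: " + ordinal);
    }
    return knownFunctions.get(ordinal);
  }

  // Test helper: a stream containing a single ordinal byte.
  static DataInputStream inputOf(int ordinal) {
    return new DataInputStream(new ByteArrayInputStream(new byte[] {(byte) ordinal}));
  }
}
```

For example, a reader built with `List.of("EUCLIDEAN", "DOT_PRODUCT", "COSINE")` throws on ordinal 3, while a reader whose list also ends with `"MAXIMUM_INNER_PRODUCT"` decodes it.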