-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add BitVectors format and make flat vectors format easier to extend #13288
Add BitVectors format and make flat vectors format easier to extend #13288
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I only had a quick look but I like the idea. (I also like that you removed the generics!)
lucene/core/src/java/org/apache/lucene/codecs/FlatVectorsScorer.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/codecs/lucene95/OffHeapFloatVectorValues.java
Outdated
Show resolved
Hide resolved
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/bitvectors/HnswBitVectorsFormat.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this, and will do a more detailed review once I get it into my IDE.
lucene/analysis/common/src/java/org/apache/lucene/analysis/synonym/word2vec/Word2VecModel.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I like this approach, it isolates the customisation and extensibility to a specific case (the flat format). We have some cleanup to do with all the random vector scorers and suppliers but these is a step forward in terms of simplification, thanks @benwtrent
lucene/core/src/java/org/apache/lucene/codecs/lucene95/OffHeapRandomAccessVectorValues.java
Outdated
Show resolved
Hide resolved
...sandbox/src/java/org/apache/lucene/sandbox/codecs/bitvectors/OnHeapFlatBitVectorsScorer.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99FlatVectorsWriter.java
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/codecs/hnsw/OnHeapFlatVectorScorer.java
Outdated
Show resolved
Hide resolved
@@ -28,7 +28,7 @@ | |||
|
|||
/** Read the vector values from the index input. This supports both iterated and random access. */ | |||
abstract class OffHeapFloatVectorValues extends FloatVectorValues | |||
implements RandomAccessVectorValues<float[]> { | |||
implements RandomAccessVectorValues.Floats { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not for this PR but I would like to try splitting FloatVectorValues
and RandomAccessVectorValues.Floats
. Having a single hierarchy that mixes the access pattern is not ideal. With the FlatVectorFomat
in the mix we should be able to produce RandomAccessVectorValues
and FloatVectorValues
independently. This change should help this simplification :)
lucene/core/src/java/org/apache/lucene/codecs/lucene99/Lucene99HnswVectorsWriter.java
Outdated
Show resolved
Hide resolved
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/bitvectors/HnswBitVectorsFormat.java
Outdated
Show resolved
Hide resolved
lucene/sandbox/src/java/org/apache/lucene/sandbox/codecs/bitvectors/HnswBitVectorsFormat.java
Outdated
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
My main suggestion is to move the new format to lucene/codecs
. Otherwise LGTM.
lucene/codecs/src/java/org/apache/lucene/codecs/simpletext/SimpleTextKnnVectorsReader.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/codecs/hnsw/DefaultFlatVectorScorer.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/codecs/hnsw/FlatBitVectorsScorer.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/codecs/hnsw/ScalarQuantizedVectorScorer.java
Outdated
Show resolved
Hide resolved
lucene/core/src/java/org/apache/lucene/codecs/hnsw/ScalarQuantizedVectorScorer.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 on moving the new format to codecs
, other than that LGTM.
lucene/core/src/java/org/apache/lucene/codecs/hnsw/package-info.java
Outdated
Show resolved
Hide resolved
…ible-flat-vector-storage
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM.
Hey @uschindler I didn't want to move forward on merging without your thoughts. This is a separate idea from: #13200 This change is more inline to what we do with custom compression functions for other formats. It continues to rely on the reliance of "default formats", which is still an enumeration. However, this allows for a custom set of scorers be provided by a custom codec. The first example of this is the bit vector codec. While working on #13200, it just kept looking more and more like a backwards compatibility nightmare and I just couldn't figure out a good interface with formats (like Scalar quantization) that need to know the exact similarity kind. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@benwtrent I wish we can avoid introducing another hierarchy of vector scorer (FlatVectorScorer
) and reuse the original RandomVectorScorer(Supplier)
. We have too many overlapping concepts imo so I tried to simplify here:
jimczi@cd7d6bf
The proposed simplification is to use the RandomVectorScorerSupplier
consistently in the HNSW graph and in the flat vectors codec for customisation.
The change is built on top of this PR, let me know what you think.
@jimczi I do like the further simplification. I can see about pulling in some of your ideas |
Will check tomorrow. |
I would like to suggest that we reintroduce |
@jimczi OK, I read a bit more of your suggestion. I am not a huge fan of how every scorer can now just get a "queryOrdinal" and overwrite whatever query was passed to it. Some of the code reduction you did does seem nice, I am not sure I like the API however. I would need to fully flesh it out to see it in action. |
…ible-flat-vector-storage
Yep that's tricky. I couldn't find a better way since my goal was to avoid having three level of vector scorers (FlatVectorScorer -> RandomVectorSupplier -> RandomVectorScorer). I'd still argue that this way of exposing things is more straightforward and allows to reduce the amount of code that relies on generic interface that needs to be casted. This |
…pache#13288) Instead of making a separate thing pluggable inside of the FieldFormat, this instead keeps the vector similarities as they are, but allows a custom scorer to be provided to the FlatVector storage used by HNSW. This idea is akin to the compression extensions we have. But in this case, its for vector scorers. To show how this would work in practice, I took the liberty of adding a new HnswBitVectorsFormat in the sandbox module. A larger part of the change is a refactor of the `RandomAccessVectorValues<T>` to remove the `<T>`. Nothing actually uses that any longer, and we should instead rely on well defined classes and stop relying on casting with generics (yuck).
…13288) (#13316) Instead of making a separate thing pluggable inside of the FieldFormat, this instead keeps the vector similarities as they are, but allows a custom scorer to be provided to the FlatVector storage used by HNSW. This idea is akin to the compression extensions we have. But in this case, its for vector scorers. To show how this would work in practice, I took the liberty of adding a new HnswBitVectorsFormat in the sandbox module. A larger part of the change is a refactor of the `RandomAccessVectorValues<T>` to remove the `<T>`. Nothing actually uses that any longer, and we should instead rely on well defined classes and stop relying on casting with generics (yuck).
@benwtrent I see that with this PR and enabled the flat vectors format easier to extend. You showed it with an example for BitVectorsFormat.
If the format is not intended for production use, I would like to enhance the format. Please let me know your thoughts. |
|
I am little confused here. I am still looking for an ans of this question: Another place where I don't have clarity is: what is the point of VectorSimilarity functions in case of bitvectors format. I can set a |
The answer is no.
The answer is yes.
Currently there is none. But I could see it being updated where cosine and dot-product aren't actually just hamming distance (as hamming is more akin to euclidean).
For the default and core codecs, keeping a nice separation so that users don't have to know about the codec and trusting it is doing the right thing is important. Using the similarity in Field Info allows users to have a pick of some default supported vector similarity functions without futzing around with codecs (which is complicated for normal Lucene users). It is important for ease of use. As for the format summarily ignoring the input, this could always be done. The format stores, reads, scores, etc. any way it wants. If the advance user chooses a custom format that ignores the similarity applied to the field, its their prerogative. For example, its conceptual that a format could actually ignore cosine altogether and always normalize, store the magnitude, and always do dot-product. I do not think the bit-vector format necessitates a different contract between vector similarities and formats. |
Instead of making a separate thing pluggable inside of the FieldFormat, this instead keeps the vector similarities as they are, but allows a custom scorer to be provided to the FlatVector storage used by HNSW.
This idea is akin to the compression extensions we have. But in this case, its for vector scorers.
To show how this would work in practice, I took the liberty of adding a new HnswBitVectorsFormat in the sandbox module.
A larger part of the change is a refactor of the
RandomAccessVectorValues<T>
to remove the<T>
. Nothing actually uses that any longer, and we should instead rely on well defined classes and stop relying on casting with generics (yuck).