
Add BitVectors format and make flat vectors format easier to extend #13288

Merged

Conversation

benwtrent
Member

@benwtrent benwtrent commented Apr 10, 2024

Instead of making a separate thing pluggable inside of the FieldFormat, this instead keeps the vector similarities as they are, but allows a custom scorer to be provided to the FlatVector storage used by HNSW.

This idea is akin to the compression extensions we have. But in this case, it's for vector scorers.

To show how this would work in practice, I took the liberty of adding a new HnswBitVectorsFormat in the sandbox module.

A larger part of the change is a refactor of RandomAccessVectorValues&lt;T&gt; to remove the &lt;T&gt;. Nothing actually uses it any longer, and we should instead rely on well-defined classes and stop relying on casting with generics (yuck).
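To make the division of labor concrete, here is a minimal, self-contained sketch of a bit-vector scorer over flat storage. The interface and names below are illustrative only, not the actual API added by this PR: bits are packed 8 per byte, hamming distance is the popcount of the XOR, and one plausible mapping turns it into a similarity in [0, 1].

```java
import java.util.List;

// Illustrative sketch only: a pluggable scorer over flat bit-vector
// storage. Names are hypothetical, not the Lucene interfaces.
public class BitScorerSketch {
  // Given an ordinal into flat storage, return a similarity score
  // against a fixed query vector.
  interface FlatScorer {
    float score(int ord);
  }

  // Bits are packed 8 dimensions per byte; hamming distance is the
  // popcount of the XOR. Normalize (dims - distance) / dims into [0, 1].
  static FlatScorer hammingScorer(List<byte[]> storage, byte[] query) {
    return ord -> {
      byte[] doc = storage.get(ord);
      int dist = 0;
      for (int i = 0; i < doc.length; i++) {
        dist += Integer.bitCount((doc[i] ^ query[i]) & 0xFF);
      }
      int dims = doc.length * 8;
      return (dims - dist) / (float) dims;
    };
  }
}
```

With this shape, the graph layer only ever sees a score(ord) call; how the stored bytes are interpreted stays inside the pluggable scorer.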

Contributor

@jpountz jpountz left a comment

I only had a quick look but I like the idea. (I also like that you removed the generics!)

Contributor

@ChrisHegarty ChrisHegarty left a comment

I like this, and will do a more detailed review once I get it into my IDE.

Contributor

@jimczi jimczi left a comment

I like this approach; it isolates the customisation and extensibility to a specific case (the flat format). We have some cleanup to do with all the random vector scorers and suppliers, but this is a step forward in terms of simplification, thanks @benwtrent

@@ -28,7 +28,7 @@

 /** Read the vector values from the index input. This supports both iterated and random access. */
 abstract class OffHeapFloatVectorValues extends FloatVectorValues
-    implements RandomAccessVectorValues<float[]> {
+    implements RandomAccessVectorValues.Floats {
Contributor

Not for this PR, but I would like to try splitting FloatVectorValues and RandomAccessVectorValues.Floats. Having a single hierarchy that mixes the access patterns is not ideal. With the FlatVectorsFormat in the mix we should be able to produce RandomAccessVectorValues and FloatVectorValues independently. This change should help this simplification :)
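A rough sketch of what that split might look like (purely illustrative interfaces, not a proposal for the actual Lucene types): one forward-only view for search/merge time and one ordinal-addressed view for graph building, both of which can be produced independently from the same backing storage.

```java
import java.util.List;

public class AccessPatterns {
  // Forward-only iteration (search/merge time).
  interface IteratedFloatValues {
    int nextDoc();           // next doc id, or -1 when exhausted
    float[] vectorValue();   // value for the current doc
  }

  // Random access by dense ordinal (graph building/scoring time).
  interface RandomAccessFloatValues {
    int size();
    float[] vectorValue(int ord);
  }

  // A single array-backed store can expose both views independently.
  static RandomAccessFloatValues randomAccess(List<float[]> vectors) {
    return new RandomAccessFloatValues() {
      public int size() { return vectors.size(); }
      public float[] vectorValue(int ord) { return vectors.get(ord); }
    };
  }

  static IteratedFloatValues iterated(List<float[]> vectors) {
    return new IteratedFloatValues() {
      int doc = -1;
      public int nextDoc() { return ++doc < vectors.size() ? doc : -1; }
      public float[] vectorValue() { return vectors.get(doc); }
    };
  }
}
```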

Contributor

@jpountz jpountz left a comment

My main suggestion is to move the new format to lucene/codecs. Otherwise LGTM.

Contributor

@tteofili tteofili left a comment

+1 on moving the new format to codecs, other than that LGTM.

Contributor

@ChrisHegarty ChrisHegarty left a comment

LGTM.

@benwtrent
Member Author

Hey @uschindler I didn't want to move forward on merging without your thoughts. This is a separate idea from: #13200

This change is more in line with what we do with custom compression functions for other formats. It continues to rely on "default formats", which are still an enumeration.

However, this allows a custom set of scorers to be provided by a custom codec. The first example of this is the bit vector codec.

While working on #13200, it just kept looking more and more like a backwards-compatibility nightmare, and I couldn't figure out a good interface for formats (like scalar quantization) that need to know the exact similarity kind.

Contributor

@jimczi jimczi left a comment

@benwtrent I wish we could avoid introducing another hierarchy of vector scorers (FlatVectorScorer) and instead reuse the original RandomVectorScorer(Supplier). We have too many overlapping concepts imo, so I tried to simplify here:
jimczi@cd7d6bf
The proposed simplification is to use the RandomVectorScorerSupplier consistently in the HNSW graph and in the flat vectors codec for customisation.
The change is built on top of this PR; let me know what you think.

@benwtrent
Member Author

@jimczi I do like the further simplification. I can see about pulling in some of your ideas.

@uschindler
Contributor

> Hey @uschindler I didn't want to move forward on merging without your thoughts. This is a separate idea from: #13200

Will check tomorrow.

@ChrisHegarty
Contributor

I would like to suggest that we reintroduce getSlice. The getSlice method is critical to any serious implementation that wants to take things into its own hands: it allows an implementation to store and retrieve additional metadata per vector, like, for example, the per-vector float offset values the current int8 SQ stores. The interfaces here are "expert", so I see no issue with getSlice. While not a requirement of this work, I would expect it to be possible to rewrite the existing int8 SQ atop this interface, which is a good reason to reintroduce getSlice. (I also eventually want to move towards direct off-heap access, but that is orthogonal.)
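The kind of per-vector metadata being described can be sketched with a flat record layout. The layout below is hypothetical (not the actual int8 SQ file format), and a ByteBuffer stands in for the raw slice that getSlice would expose: each record is the quantized vector followed by a per-vector float correction, and ordinal arithmetic recovers either part.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

// Hypothetical flat layout, one record per vector:
//   [dims quantized bytes][4-byte float correction]
// A raw slice (ByteBuffer here, standing in for an IndexInput slice)
// plus ordinal arithmetic gives random access to either part.
public class SlicedRecords {
  final ByteBuffer slice;
  final int dims;
  final int recordSize;

  SlicedRecords(ByteBuffer slice, int dims) {
    this.slice = slice.order(ByteOrder.LITTLE_ENDIAN);
    this.dims = dims;
    this.recordSize = dims + Float.BYTES;
  }

  byte[] vector(int ord) {
    byte[] v = new byte[dims];
    int base = ord * recordSize;
    for (int i = 0; i < dims; i++) {
      v[i] = slice.get(base + i);  // absolute get; position unchanged
    }
    return v;
  }

  float correction(int ord) {
    return slice.getFloat(ord * recordSize + dims);
  }
}
```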

@benwtrent
Member Author

@jimczi OK, I read a bit more of your suggestion.

I am not a huge fan of how every scorer can now just get a "queryOrdinal" and overwrite whatever query was passed to it.

Some of the code reduction you did does seem nice; I am not sure I like the API, however. I would need to fully flesh it out to see it in action.

@jimczi
Contributor

jimczi commented Apr 16, 2024

> I am not a huge fan of how every scorer can now just get a "queryOrdinal" and overwrite whatever query was passed to it.

Yep, that's tricky. I couldn't find a better way, since my goal was to avoid having three levels of vector scorers (FlatVectorScorer -> RandomVectorScorerSupplier -> RandomVectorScorer). I'd still argue that this way of exposing things is more straightforward and reduces the amount of code that relies on generic interfaces that need to be cast. setQueryOrd is only for the builder case, though, so it is completely internal and not something a custom scorer should worry about.
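To illustrate what "query by ordinal" buys during graph building, here is a toy supplier (illustrative names only, not the real RandomVectorScorerSupplier API): both sides of a comparison are stored vectors, so binding a scorer to a query ordinal avoids exposing a separate query object to the graph builder.

```java
import java.util.List;

// Sketch of a scorer supplier keyed by ordinal: scorer(queryOrd)
// returns a scorer whose query is the stored vector at queryOrd.
// This is the graph-building case, where both sides live in flat storage.
public class OrdinalScorerSupplier {
  interface Scorer {
    float score(int ord);
  }

  final List<float[]> vectors;

  OrdinalScorerSupplier(List<float[]> vectors) {
    this.vectors = vectors;
  }

  Scorer scorer(int queryOrd) {
    float[] query = vectors.get(queryOrd);
    return ord -> {
      float[] doc = vectors.get(ord);
      float dot = 0;
      for (int i = 0; i < doc.length; i++) {
        dot += query[i] * doc[i];
      }
      return dot;  // dot-product similarity, just for the sketch
    };
  }
}
```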

@benwtrent benwtrent merged commit 3d86ff2 into apache:main Apr 17, 2024
4 checks passed
@benwtrent benwtrent deleted the feature/more-extensible-flat-vector-storage branch April 17, 2024 17:13
benwtrent added a commit to benwtrent/lucene that referenced this pull request Apr 17, 2024
…pache#13288)

benwtrent added a commit that referenced this pull request Apr 22, 2024
…13288) (#13316)

@navneet1v
Contributor

navneet1v commented May 26, 2024

@benwtrent I see that this PR made the flat vectors format easier to extend, and you showed it with an example BitVectorsFormat.

  1. Does Lucene now support BitVectorsFormat officially? Or was it more of a prototype, not intended for production use?
  2. Another reason I am asking is that I cannot find Hamming in the VectorSimilarityFunction enum. So if the bit vector format is meant for production use, which VectorSimilarityFunction should be used for bit vectors? Ref:
  3. If a user overrides the Scorer for the flat vectors format, does this mean the VectorSimilarityFunction is no longer a required attribute? If so, are there plans to remove the VectorSimilarity param when creating the vector field? Ref:
     public void setVectorAttributes(
         int numDimensions, VectorEncoding encoding, VectorSimilarityFunction similarity) {
       checkIfFrozen();
       if (numDimensions <= 0) {
         throw new IllegalArgumentException("vector numDimensions must be > 0; got " + numDimensions);
       }
       this.vectorDimension = numDimensions;
       this.vectorSimilarityFunction = Objects.requireNonNull(similarity);
       this.vectorEncoding = Objects.requireNonNull(encoding);
     }

If the format is not intended for production use, I would like to enhance it. Please let me know your thoughts.

@benwtrent
Member Author

  1. No, the BitVector format is not in the backwards-compatibility package.
  2. Correct. There have been previous discussions about adding it as a similarity value, but those conversations are blocked until we come up with a better system. We don't want to add fully backwards-compatible similarities that our core formats must support until we have a road for deprecating the existing ones.
  3. "If a user overrides the Scorer for the flat vectors format..." implies a custom vector format, so the user will handle that themselves. However, this doesn't obviate the need for configured similarities, as the default core (and fully bwc) formats still use them.

@navneet1v
Contributor

@benwtrent

I am a little confused here. I am still looking for an answer to this question: does Lucene now support BitVectorsFormat officially, or was it more of a prototype, not intended for production use?

Another place where I lack clarity: what is the point of VectorSimilarityFunction in the case of the bit vectors format? I can set MAX_INNER_PRODUCT for bit vectors, but the codec will use Hamming distance for the similarity calculation. So the vector similarity set on a field is not the source of truth for which similarity function is actually used, and implementations have to come up with other ways to know the effective similarity function.

@benwtrent
Member Author

@navneet1v

> Does Lucene now support BitVectorsFormat officially?

The answer is no.

> Or was it more of a prototype, not intended for production use?

The answer is yes.

> What is the point of VectorSimilarityFunction in the case of the bit vectors format?

Currently there is none. But I could see it being updated so that cosine and dot-product aren't actually just hamming distance (as hamming is more akin to euclidean).

> So the vector similarity set on a field is not the source of truth for which similarity function is actually used.

For the default and core codecs, keeping a clean separation, so that users don't have to know about the codec and can trust it to do the right thing, is important.

Using the similarity in FieldInfo lets users pick from some default supported vector similarity functions without futzing around with codecs (which is complicated for normal Lucene users). It is important for ease of use.

As for a format summarily ignoring the input, that could always be done. The format stores, reads, scores, etc. any way it wants. If an advanced user chooses a custom format that ignores the similarity applied to the field, it's their prerogative.

For example, it's conceivable that a format could ignore cosine altogether and instead always normalize, store the magnitude, and do dot-product.

I do not think the bit-vector format necessitates a different contract between vector similarities and formats.
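As a sanity check on the cosine example above: cosine over the raw vectors equals dot-product over normalized copies, so a format that normalizes at write time and stores the magnitude gives up nothing. A standalone check of that identity, independent of any Lucene API:

```java
// cosine(a, b) == dot(normalize(a), normalize(b)); storing |a| alongside
// the normalized vector lets a format recover the original if needed.
public class CosineCheck {
  static float dot(float[] a, float[] b) {
    float d = 0;
    for (int i = 0; i < a.length; i++) d += a[i] * b[i];
    return d;
  }

  static float magnitude(float[] a) {
    return (float) Math.sqrt(dot(a, a));
  }

  static float[] normalize(float[] a) {
    float m = magnitude(a);
    float[] out = new float[a.length];
    for (int i = 0; i < a.length; i++) out[i] = a[i] / m;
    return out;
  }

  static float cosine(float[] a, float[] b) {
    return dot(a, b) / (magnitude(a) * magnitude(b));
  }
}
```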
