First-class random access API for KnnVectorValues #13779

msokolov · 2024-09-12T18:43:23Z

addresses #13778

Key things in this PR:

Introduces abstract KnnVectorValues from which ByteVectorValues and FloatVectorValues derive;
Folds RandomAccessVectorValues into KnnVectorValues thus eliminating some casts.
RandomAccessVectorValues.Floats becomes FloatVectorValues and RandomAccessVectorValues.Bytes becomes ByteVectorValues. RandomAccessQuantizedByteVectorValues folded into QuantizedByteVectorValues.
IndexInput getSlice() moved to a new HasIndexSlice interface.
Introduces VectorEncoding KnnVectorValues.getEncoding() to enable type-specific branches in a few places where we are dealing with abstract KnnVectorValues (tests only IIRC). Also used that to provide a default getVectorByteLength().
KnnVectorValues no longer extends DocIdSetIterator; rather it provides a tightly-coupled iterator(). This enables refactoring common iteration patterns that were repeated many times in the code base. This new iterator, DocIndexIterator provides an additional method index() analogous to IndexedDISI.
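
To make the iterator-plus-ordinal pattern concrete, here is a minimal, self-contained sketch. It does not use Lucene; `SketchDocIndexIterator`, `SketchFloatVectors`, and `DocIndexIterationDemo` are hypothetical stand-ins, and the in-memory `ordToDoc` array is an assumption made only for illustration.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical, simplified stand-in for the iterator described above; not Lucene's class. */
interface SketchDocIndexIterator {
  int NO_MORE_DOCS = Integer.MAX_VALUE;

  /** Advances to the next document that has a vector, or returns NO_MORE_DOCS. */
  int nextDoc();

  /** Ordinal of the current vector, usable for random access into the values. */
  int index();
}

/** Hypothetical sparse vector store: the vector with ordinal i belongs to doc ordToDoc[i]. */
final class SketchFloatVectors {
  private final int[] ordToDoc; // assumed strictly increasing
  private final float[][] vectors;

  SketchFloatVectors(int[] ordToDoc, float[][] vectors) {
    this.ordToDoc = ordToDoc;
    this.vectors = vectors;
  }

  float[] vectorValue(int ord) {
    return vectors[ord];
  }

  /** Returns a fresh iterator on every call; no iteration state lives on the values object. */
  SketchDocIndexIterator iterator() {
    return new SketchDocIndexIterator() {
      private int ord = -1;

      @Override
      public int nextDoc() {
        if (ord + 1 >= ordToDoc.length) {
          ord = ordToDoc.length; // stay exhausted on repeated calls
          return NO_MORE_DOCS;
        }
        ord++;
        return ordToDoc[ord];
      }

      @Override
      public int index() {
        return ord;
      }
    };
  }
}

final class DocIndexIterationDemo {
  /** The common loop: iterate docs, then fetch each vector by the iterator's ordinal. */
  static List<String> describe(SketchFloatVectors values) {
    List<String> out = new ArrayList<>();
    SketchDocIndexIterator it = values.iterator();
    for (int doc = it.nextDoc(); doc != SketchDocIndexIterator.NO_MORE_DOCS; doc = it.nextDoc()) {
      out.add("doc=" + doc + " ord=" + it.index() + " v0=" + values.vectorValue(it.index())[0]);
    }
    return out;
  }

  public static void main(String[] args) {
    SketchFloatVectors values =
        new SketchFloatVectors(new int[] {2, 5, 9}, new float[][] {{1f}, {2f}, {3f}});
    describe(values).forEach(System.out::println);
  }
}
```

The point of the sketch is that the values object holds no iteration state, so the same loop shape works for every implementation that exposes `iterator()` and `vectorValue(int)`.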

Some of the methods on KnnVectorValues have default impls that throw UnsupportedOperationException, enabling subclasses to provide partial implementations and relying on testing to catch missing required methods. I'd like feedback on this. Should we provide implementations we never use, just to make these classes complete? That didn't make sense to me. But the previous alternative of attempting to provide strict adherence to declarative contracts was becoming, in my view, overly restrictive and leading to hard-to-maintain code. Some of these readers would only ever be used iteratively. Random access is required for search, but is not used when merging the values themselves; when we merge we do search, but using a temporary file, so that searching is always done over a file-based value. Random access also gets used during merging when the index is sorted, but again this is provided by specialized readers, so not every reader needs to implement random access. The API maintenance is greatly simplified if we allow partial implementation. Anyway, that is the idea I am trying out here. Can we live with a little less API purity and gain some simplicity and less boilerplate?
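
As a rough illustration of the partial-implementation trade-off, the sketch below uses a hypothetical base class, `SketchVectorValues`, whose optional random-access method throws by default. The class names and the choice of which method is optional are assumptions for the example, not the PR's actual signatures.

```java
/** Hypothetical base class, loosely modeled on the description above; not KnnVectorValues. */
abstract class SketchVectorValues {
  /** Required: every implementation knows how many vectors it holds. */
  abstract int size();

  /** Optional random access; implementations that are only iterated can skip it. */
  float[] vectorValue(int ord) {
    throw new UnsupportedOperationException(
        getClass().getSimpleName() + " does not support random access");
  }
}

/** Supports random access by overriding the optional method. */
final class InMemoryVectors extends SketchVectorValues {
  private final float[][] vectors;

  InMemoryVectors(float[][] vectors) {
    this.vectors = vectors;
  }

  @Override
  int size() {
    return vectors.length;
  }

  @Override
  float[] vectorValue(int ord) {
    return vectors[ord];
  }
}

/** Partial implementation: only size() is provided, so vectorValue(int) throws. */
final class SizeOnlyVectors extends SketchVectorValues {
  private final int size;

  SizeOnlyVectors(int size) {
    this.size = size;
  }

  @Override
  int size() {
    return size;
  }
}

final class PartialImplDemo {
  /** Returns the first component of vector 0, or NaN if the source has no random access. */
  static float firstComponentOrNaN(SketchVectorValues values) {
    try {
      return values.vectorValue(0)[0];
    } catch (UnsupportedOperationException e) {
      return Float.NaN;
    }
  }

  public static void main(String[] args) {
    System.out.println(firstComponentOrNaN(new InMemoryVectors(new float[][] {{7f, 8f}})));
    System.out.println(firstComponentOrNaN(new SizeOnlyVectors(3)));
  }
}
```

The cost of this design is visible in the sketch: the compiler accepts `SizeOnlyVectors`, and a missing method only surfaces at run time, which is why the approach leans on test coverage.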

Notes for reviewers:

There is a lot of code change here, but much of it is repetitive. I recommend starting with KnnVectorValues and checking its DocIndexIterator inner class. The rest of the changes are basically consequences of introducing those abstractions in place of the Random*Values we removed.

msokolov · 2024-09-12T18:51:10Z

another concern I have is how this would impact ongoing work to enable multiple vectors per doc/field. There would almost certainly be conflicts with that PR on the surface, but I hope this could actually simplify things in that the new DocIndexIterator class could be enhanced / extended to provide access to a series of values (maybe a list or array?) instead of (or in addition to?) a single one, possibly centralizing the required changes (since we have many fewer iterator implementations after this change).

benwtrent · 2024-09-12T19:41:59Z

but I hope this could actually simplify things

That is my intuition as well.

jpountz

I left a few thoughts/questions. In general, I see how such a random-access API change could help with e.g. your BP reordering work and be valuable in general. I was wondering if this API may be too tailored to HNSW and prevent us from supporting other interesting algorithms, but actually I don't think that this is the case?

jpountz · 2024-09-12T20:23:57Z

lucene/core/src/java/org/apache/lucene/index/KnnVectorValues.java

+   * Creates a new copy of this {@link KnnVectorValues}. This is helpful when you need to access
+   * different values at once, to avoid overwriting the underlying vector returned.
+   */
+  public abstract KnnVectorValues copy() throws IOException;


I wonder if we could make the API a bit nicer by removing this copy() and instead have something like a FloatVectorDictionary { float[] vectorValue(int ord); } and a method here that can return a new FloatVectorDictionary (a bit like SortedDocValues and TermsEnum).

The way SortedDocValuesTermsEnum is, calling its next method will overwrite the internal buffer of the SortedDocValues on which it is built, defeating the purpose of copy(), which is to provide two completely independent sources. Another thing we could do is to add vectorValue(int ord, float[] scratch), allowing the caller to provide the memory to write to. If we had that, we wouldn't need copy(). Maybe we could manage to squeeze that into 10.0 too, but I'd rather do it in a separate PR.

But if you call SortedDocValues#termsEnum twice, this would give you two independent sources of terms?

I always found copy very strange, but I get why it is there. I'd be tempted to leave it as is in this PR, changing the access model and cache of 1 float[] will be a bit tricky.
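
The buffer-overwrite problem that motivates copy() can be shown with a small stand-alone sketch. `SharedBufferVectors` is a hypothetical reader that decodes every vector into one reusable array. The flat in-memory layout is an assumption for illustration, and the two-argument `vectorValue` follows the scratch idea floated in this thread rather than an existing method.

```java
import java.util.Arrays;

/** Hypothetical reader that decodes every vector into one reusable buffer. */
final class SharedBufferVectors {
  private final float[] flat; // all vectors stored back to back
  private final int dim;
  private final float[] buffer; // reused by every vectorValue(int) call

  SharedBufferVectors(float[] flat, int dim) {
    this.flat = flat;
    this.dim = dim;
    this.buffer = new float[dim];
  }

  /** Returns the internal buffer; the next call on this instance overwrites it. */
  float[] vectorValue(int ord) {
    System.arraycopy(flat, ord * dim, buffer, 0, dim);
    return buffer;
  }

  /** Scratch variant: the caller owns the destination array, so nothing is shared. */
  float[] vectorValue(int ord, float[] scratch) {
    System.arraycopy(flat, ord * dim, scratch, 0, dim);
    return scratch;
  }

  /** Same underlying data, separate buffer. */
  SharedBufferVectors copy() {
    return new SharedBufferVectors(flat, dim);
  }
}

final class CopyDemo {
  public static void main(String[] args) {
    SharedBufferVectors values = new SharedBufferVectors(new float[] {1f, 2f, 3f, 4f}, 2);

    float[] a = values.vectorValue(0);
    float[] b = values.vectorValue(1); // overwrites the array that a points to
    System.out.println("aliased: " + (a == b) + " a=" + Arrays.toString(a));

    float[] c = values.vectorValue(0);
    float[] d = values.copy().vectorValue(1); // independent buffer, c is untouched
    System.out.println("c=" + Arrays.toString(c) + " d=" + Arrays.toString(d));
  }
}
```

Comparing two vectors from the same instance silently compares a vector with itself, which is the situation copy() exists to avoid; the scratch overload reaches the same goal by letting the caller own the memory.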

jpountz · 2024-09-12T20:26:54Z

lucene/core/src/java/org/apache/lucene/index/KnnVectorValues.java

+    if (iterator == null) {
+      iterator = createIterator();
+    }
+    return iterator;


Could we make this return a new iterator every time to make the API a bit nicer? From a quick look, it seems that call sites could easily be adjusted to not rely on this method returning a shared instance?

Let me try - I was also a bit unhappy about this, but at one point along this journey I was relying on being able to recover the shared state - maybe I finally was able to get rid of that and just didn't notice!

a new iterator would be cleaner, if the use sites allow for it.

jpountz · 2024-09-12T20:30:19Z

lucene/core/src/java/org/apache/lucene/index/KnnVectorValues.java

+   * Creates an iterator from this instance's ordinal-to-docid mapping which must be monotonic
+   * (docid increases when ordinal does).
+   */
+  protected DocIndexIterator fromOrdToDoc() {


nit: could we make it look a bit more like DocIdSetIterator#all by moving it to DocIndexIterator#all?

ah, you mean rename this method to all? sure, makes sense

jpountz · 2024-09-12T20:32:24Z

lucene/core/src/java/org/apache/lucene/index/KnnVectorValues.java

+    @Override
+    public int advance(int target) throws IOException {
+      return slowAdvance(target);
+    }


This looks like it could be a performance trap, which is why DocIdSetIterator offers this helper method without making it the default impl. Should we leave it without a default impl here too?

yes, I don't think anything relies on this, makes sense
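
To illustrate why a default advance() built on slowAdvance can be a performance trap, the sketch below compares a linear advance with a binary-search advance over a sorted doc array. `SparseDocIterator` and its `nextDocCalls` counter are hypothetical and exist only to make the cost visible.

```java
import java.util.Arrays;

/** Hypothetical iterator over a sorted array of doc IDs, one per vector ordinal. */
final class SparseDocIterator {
  static final int NO_MORE_DOCS = Integer.MAX_VALUE;

  private final int[] docs; // assumed sorted, no duplicates
  private int ord = -1;
  int nextDocCalls = 0; // instrumentation for the demo only

  SparseDocIterator(int[] docs) {
    this.docs = docs;
  }

  int index() {
    return ord;
  }

  int nextDoc() {
    nextDocCalls++;
    ord = Math.min(ord + 1, docs.length);
    return ord < docs.length ? docs[ord] : NO_MORE_DOCS;
  }

  /** Linear scan: correct, but visits every doc between the current position and target. */
  int slowAdvance(int target) {
    int doc;
    do {
      doc = nextDoc();
    } while (doc < target);
    return doc;
  }

  /** Binary search over the remaining docs: logarithmic in the distance skipped. */
  int advance(int target) {
    int from = Math.min(ord + 1, docs.length);
    int pos = Arrays.binarySearch(docs, from, docs.length, target);
    ord = pos >= 0 ? pos : -pos - 1; // insertion point is the first doc >= target
    return ord < docs.length ? docs[ord] : NO_MORE_DOCS;
  }

  public static void main(String[] args) {
    int[] docs = new int[1000];
    for (int i = 0; i < docs.length; i++) {
      docs[i] = i * 3;
    }
    SparseDocIterator slow = new SparseDocIterator(docs);
    SparseDocIterator fast = new SparseDocIterator(docs);
    System.out.println(slow.slowAdvance(2000) + " after " + slow.nextDocCalls + " nextDoc calls");
    System.out.println(fast.advance(2000) + " after " + fast.nextDocCalls + " nextDoc calls");
  }
}
```

Both methods land on the same document; the difference is that the linear version does work proportional to the number of skipped docs, which an implementer may not notice if it is inherited silently.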

jpountz · 2024-09-12T20:33:04Z

lucene/core/src/java/org/apache/lucene/index/KnnVectorValues.java

+
+    @Override
+    public long cost() {
+      throw new UnsupportedOperationException();


Likewise here, I'd rather leave it unimplemented to force implementers to decide if having cost() throw an exception is fine. Presumably, most of the time it's not.

hmm I think cost() is rarely used in the vector reader/writers which instead are concerned with KnnVectorValues.size() -- they typically want to know how many vector values there are and to the extent they care about the number of docs it's only when they must iterate through all of them and have no use for an estimate. These iterators aren't really used during searching?

If we default cost() to returning size(), that would work for me. But I'm not comfortable with having implementations of DocIdSetIterator#cost that may throw, which means e.g. that they cannot be put in a ConjunctionDISI (which will want to sort its clauses by cost).

+1. Even in FloatVectorValues, cost() just returns size(). https://github.com/apache/lucene/blob/main/lucene/core/src/java/org/apache/lucene/index/FloatVectorValues.java#L46-L48

I agree here. Either it should default to size() via some provided dependency or it shouldn't implement at all and force sub-classes.
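
The concern about a throwing cost() can be sketched without Lucene. Below, `CostedIterator` and its subclasses are hypothetical, and `cheapestFirst` only imitates the "sort clauses by cost" step mentioned above; it is not ConjunctionDISI's actual code.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

/** Hypothetical minimal iterator contract; only the parts needed to discuss cost(). */
abstract class CostedIterator {
  final String name;

  CostedIterator(String name) {
    this.name = name;
  }

  abstract long cost();
}

/** Iterator over docs that have a vector: the count is known, so cost() can return it. */
final class VectorDocsIterator extends CostedIterator {
  private final int size;

  VectorDocsIterator(String name, int size) {
    super(name);
    this.size = size;
  }

  @Override
  long cost() {
    return size;
  }
}

/** Iterator whose cost() throws, as with a default implementation that is unsupported. */
final class NoCostIterator extends CostedIterator {
  NoCostIterator(String name) {
    super(name);
  }

  @Override
  long cost() {
    throw new UnsupportedOperationException("cost() not supported by " + name);
  }
}

final class CostOrderingDemo {
  /** Orders clauses cheapest first, the way a conjunction picks its lead iterator. */
  static List<String> cheapestFirst(List<CostedIterator> clauses) {
    List<CostedIterator> sorted = new ArrayList<>(clauses);
    sorted.sort(Comparator.comparingLong((CostedIterator c) -> c.cost()));
    List<String> names = new ArrayList<>();
    for (CostedIterator c : sorted) {
      names.add(c.name);
    }
    return names;
  }

  public static void main(String[] args) {
    List<CostedIterator> ok = Arrays.<CostedIterator>asList(
        new VectorDocsIterator("a", 500),
        new VectorDocsIterator("b", 10),
        new VectorDocsIterator("c", 100));
    System.out.println(cheapestFirst(ok));
  }
}
```

With a `NoCostIterator` in the list, the sort itself throws, so such an iterator cannot take part in any composition that orders clauses by cost.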

jpountz · 2024-09-12T20:45:35Z

Am I guessing correctly that you're targeting 10.0 for this change?

msokolov · 2024-09-12T21:00:57Z

Thanks for the quick review! I will get started on addressing the comments. As for the timeline for this change, it would definitely be convenient to get it into the 10.0 release. I think you had said 9/22 would be a feature freeze date; it seems we could possibly meet that timeline. I will be traveling starting tomorrow for a week, but I should be able to put in some quality time on the plane LOL

jpountz · 2024-09-13T05:50:22Z

lucene/core/src/java/org/apache/lucene/codecs/KnnVectorsWriter.java

-      public byte[] vectorValue() throws IOException {
-        return current.values.vectorValue();
+      public byte[] vectorValue(int ord) throws IOException {
+        return current.values.vectorValue(current.values.iterator().index());


This part feels a bit hacky, could we instead merge the ord->vector mappings of the various vector values by concatenating them?

Maybe we can enhance DocIDMerger by adding random access to it
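
One way to picture "concatenating the ord->vector mappings" is a prefix-sum table that maps a global ordinal to a source and a local ordinal. The sketch below is a hypothetical illustration with in-memory arrays. It ignores doc ID remapping, deletions, and index sorting, all of which a real merge would have to handle, and it is not how the PR or DocIDMerger works.

```java
/** Hypothetical concatenation of several vector sources behind one global ordinal space. */
final class ConcatenatedOrds {
  private final float[][][] sources;
  private final int[] starts; // starts[i] = global ord of source i's first vector; last = total

  ConcatenatedOrds(float[][]... sources) {
    this.sources = sources;
    this.starts = new int[sources.length + 1];
    for (int i = 0; i < sources.length; i++) {
      starts[i + 1] = starts[i] + sources[i].length;
    }
  }

  int size() {
    return starts[starts.length - 1];
  }

  /** Finds the largest i with starts[i] <= ord; empty sources are skipped automatically. */
  int sourceOf(int ord) {
    if (ord < 0 || ord >= size()) {
      throw new IndexOutOfBoundsException("ord=" + ord);
    }
    int lo = 0;
    int hi = sources.length - 1;
    while (lo < hi) {
      int mid = (lo + hi + 1) >>> 1;
      if (starts[mid] <= ord) {
        lo = mid;
      } else {
        hi = mid - 1;
      }
    }
    return lo;
  }

  /** Random access across all sources using the global ordinal. */
  float[] vectorValue(int ord) {
    int source = sourceOf(ord);
    return sources[source][ord - starts[source]];
  }

  public static void main(String[] args) {
    ConcatenatedOrds merged = new ConcatenatedOrds(
        new float[][] {{1f}, {2f}}, new float[][] {}, new float[][] {{3f}});
    for (int ord = 0; ord < merged.size(); ord++) {
      System.out.println("ord=" + ord + " source=" + merged.sourceOf(ord)
          + " v0=" + merged.vectorValue(ord)[0]);
    }
  }
}
```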

jpountz · 2024-09-13T10:43:39Z

think you had said 9/22 would be a feature freeze date

I was thinking of doing it next week, but we can backport this PR even though the branch has been cut if it looks ready/safe.

ChrisHegarty

I really like this change. I see a lot of refactoring similar to what I half started at one point or the other, but never finished. There are some specific comments to be addressed, but otherwise the approach LGTM.

ChrisHegarty · 2024-09-13T15:47:18Z

lucene/core/src/java/org/apache/lucene/codecs/lucene95/HasIndexSlice.java

-  @Override
-  RandomAccessQuantizedByteVectorValues copy() throws IOException;
+  /** Returns an IndexInput from which to read this instance's values. */
+  IndexInput getSlice();


I very much like this, and had something similar in a past unmarked PR. 👍

ChrisHegarty · 2024-09-13T15:49:25Z

lucene/core/src/java/org/apache/lucene/index/KnnVectorValues.java

+   * Creates a new copy of this {@link KnnVectorValues}. This is helpful when you need to access
+   * different values at once, to avoid overwriting the underlying vector returned.
+   */
+  public abstract KnnVectorValues copy() throws IOException;


I always found copy very strange, but I get why it is there. I'd be tempted to leave it as is in this PR, changing the access model and cache of 1 float[] will be a bit tricky.

ChrisHegarty · 2024-09-13T15:52:31Z

lucene/core/src/java/org/apache/lucene/index/KnnVectorValues.java

+    if (iterator == null) {
+      iterator = createIterator();
+    }
+    return iterator;


a new iterator would be cleaner, if the use sites allow for it.

msokolov · 2024-09-16T14:09:13Z

I pushed a new revision here addressing some of the major comments:

KnnVectorValues.iterator() now generally provides a new iterator; no caching is done. I removed createIterator(). Main impact was on VectorScorer (and in tests) where we now create iterators and store them locally. This is much better; thanks for the feedback.
I added implementations for advance() and got rid of the default impl.
I removed impls of cost() and added a default impl that throws UOE. This method is only ever used during search() and most of these values sources will never be searched. The exceptions are those that can be used by the ValueSource API: basically the indexed values returned by a reader. We have lots and lots of other values impls that are used during indexing for which we don't need cost. I briefly considered separating these new iterators from DISI, but that ended up in some trouble.
re: getVectorByteLength() @ChrisHegarty is correct that this is needed as it is today. We could in theory make it final (or inline it whatever) if we added some more VectorEncodings to represent the compressed cases, but I'm inclined to leave it as is. This way we could in theory support a variable size encoding? And anyway it isn't clear we want to mix up the "encoding" with compression?

I didn't have a chance to look seriously at removing copy() API. I don't think we ought to create a simple wrapper though since afaict it would require an additional memory copy of every vector value.

msokolov · 2024-09-16T14:32:13Z

OK there seem to be some test failures ... I did a complete run, but randomized testing always seems to ferret out something interesting!

Actually those really should have failed on any test run -- not sure how I missed them, oops

msokolov · 2024-09-16T15:16:26Z

Regarding the rename of fromOrdToDoc to all I think it was not helpful and plan to revert or maybe come up with some other name. The problem is we also have createDenseIterator which is also all. Essentially we have Sparse and Dense all-iterators. Maybe instead of fromOrdToDoc we can say createSparseIterator?

jpountz · 2024-09-16T16:22:41Z

FWIW I started playing with removing copy() by replacing it with a factory method for a dictionary: msokolov@ae7aca3. Not sure how far I'll go. :)

msokolov · 2024-09-19T00:23:09Z

I'll post one more iteration here addressing the concerns about dangerous default impls that adds back impls of copy() and cost(). I also added a test-and-throw ensuring that the vectorValues impls that require forward-iteration enforce it. We can fully implement random access later without breaking any APIs.

I also think we should go ahead with Adrien's Dictionary idea, but do this in two steps because there is a lot going on here already.

benwtrent · 2024-09-19T11:20:31Z

The dictionary idea is OK, but I still don't see how it removes copy(). Besides the caching of values, copy gives us multi-threaded safety by copying the underlying index readers. Otherwise we are using the same reader between threads. For concurrent merging of graphs, this is important.

I agree, any further refactoring should be done in another PR.

msokolov · 2024-09-19T23:11:57Z

I think the idea w/Dictionary is that callers, instead of calling copy().vectorValue(int ord) would call dictionary().vectorValue(int ord). So then the scratch vector storage (if needed) would be in the Dictionary not in the VectorValues, and thus not shared by multiple users of the same values instance. In some sense it's not very different, but in the sense that the Dictionary has a much more limited API than the source it came from, it is different.

jpountz · 2024-09-20T09:37:05Z

Exactly. I tried to model it similarly to what doc values do, where SortedDocValues#termsEnum() returns a dictionary with a different backing IndexInput clone on every call.
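
A minimal sketch of the dictionary idea as described in the two comments above: each call to `dictionary()` hands out a lookup object that owns its scratch storage. `SketchFloatDictionary` and `DictionaryBackedVectors` are hypothetical names, and the flat in-memory array stands in for what would be a cloned IndexInput in a real reader.

```java
/** Hypothetical narrow lookup interface, following the shape suggested in this thread. */
interface SketchFloatDictionary {
  float[] vectorValue(int ord);
}

/** Hypothetical values source that hands out independent dictionaries instead of copies. */
final class DictionaryBackedVectors {
  private final float[] flat; // all vectors stored back to back
  private final int dim;

  DictionaryBackedVectors(float[] flat, int dim) {
    this.flat = flat;
    this.dim = dim;
  }

  int size() {
    return flat.length / dim;
  }

  /** Every call returns a new dictionary with its own scratch array. */
  SketchFloatDictionary dictionary() {
    final float[] scratch = new float[dim]; // owned by this dictionary only
    return ord -> {
      System.arraycopy(flat, ord * dim, scratch, 0, dim);
      return scratch;
    };
  }
}

final class DictionaryDemo {
  /** Dot product of two vectors read through two independent dictionaries. */
  static float dot(DictionaryBackedVectors values, int ordA, int ordB) {
    float[] a = values.dictionary().vectorValue(ordA);
    float[] b = values.dictionary().vectorValue(ordB);
    float sum = 0f;
    for (int i = 0; i < a.length; i++) {
      sum += a[i] * b[i];
    }
    return sum;
  }

  public static void main(String[] args) {
    DictionaryBackedVectors values =
        new DictionaryBackedVectors(new float[] {1f, 2f, 3f, 4f}, 2);
    System.out.println(dot(values, 0, 1));
  }
}
```

Because the values object itself holds no scratch state in this sketch, two callers can read different vectors at once without either needing a full copy of the source.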

msokolov · 2024-09-20T11:24:05Z

OK I think we've addressed the blocking concerns that have been raised here and I plan to push later today if nothing else comes up. Regarding removing copy() in favor of dictionary() I'll open a separate issue. If Adrien finishes it up, great, but I may also see if I can find time to take that forward soon; it would be good to get it done for 10 since it would be a breaking change and ideally we don't want copy() to linger as deprecated. As for implementing better random access in merged values I think we can take that up at a more relaxed pace since it doesn't require any API change.

msokolov · 2024-09-20T12:05:20Z

hm interesting there was an EOFException in there - I'll dig

msokolov · 2024-09-20T18:56:23Z

OK, I found an off-by-one error plus a problem with lazy iterator creation that slipped in when we got rid of createIterator(). It makes me a little nervous these didn't show up in earlier testing. I'm now running with tests.iter=20

msokolov · 2024-09-28T13:13:45Z

OK, I think this is ready after a few minor issues have been addressed. I opened #13831 to track replacing copy() with dictionary()

javanna · 2024-09-29T08:26:29Z

Should there be a migrate entry added with this change?

msokolov · 2024-09-29T20:53:44Z

Should there be a migrate entry added with this change?

oh thanks, yes, and a CHANGES entry. I opened #13833 if you want to review

Our lucene_snapshot branch requires updating after apache/lucene#13779

…13850) introduced in the major refactor #13779 Off-heap scoring is only present for byte[] vectors, and it isn't enough to verify that the vector provider also satisfies the HasIndexSlice interface. The vectors need to be byte vectors otherwise, the slice iterations and scoring are completely nonsensical leading to HNSW graph building to run until the heat-death of the universe.
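
The guard described in this commit message can be sketched as a check on both the slice capability and the encoding. All types below (`SketchEncoding`, `SketchHasIndexSlice`, `SketchValues`, and the three value classes) are hypothetical stand-ins, and `canScoreOffHeap` is an illustrative name rather than the method changed in #13850.

```java
/** Hypothetical encoding enum mirroring the byte/float distinction discussed above. */
enum SketchEncoding {
  BYTE,
  FLOAT32
}

/** Hypothetical marker for values that can expose their backing storage. */
interface SketchHasIndexSlice {
  Object getSlice(); // placeholder return type for the sketch
}

abstract class SketchValues {
  abstract SketchEncoding getEncoding();
}

final class OffHeapBytes extends SketchValues implements SketchHasIndexSlice {
  @Override
  SketchEncoding getEncoding() {
    return SketchEncoding.BYTE;
  }

  @Override
  public Object getSlice() {
    return new byte[0];
  }
}

final class OffHeapFloats extends SketchValues implements SketchHasIndexSlice {
  @Override
  SketchEncoding getEncoding() {
    return SketchEncoding.FLOAT32;
  }

  @Override
  public Object getSlice() {
    return new byte[0];
  }
}

final class OnHeapBytes extends SketchValues {
  @Override
  SketchEncoding getEncoding() {
    return SketchEncoding.BYTE;
  }
}

final class OffHeapGuardDemo {
  /** Having a slice is not enough: the byte-oriented scorer also needs byte vectors. */
  static boolean canScoreOffHeap(SketchValues values) {
    return values instanceof SketchHasIndexSlice && values.getEncoding() == SketchEncoding.BYTE;
  }

  public static void main(String[] args) {
    System.out.println(canScoreOffHeap(new OffHeapBytes()));
    System.out.println(canScoreOffHeap(new OffHeapFloats()));
    System.out.println(canScoreOffHeap(new OnHeapBytes()));
  }
}
```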

Michael Sokolov added 12 commits September 12, 2024 14:19

compiles!

cd9c486

adding some ordToDoc

2bbf8f1

restore vector count argument to scalarquantizer methods

a451fdb

remove docToOrd; mostly can use iterator.index()

8152b9d

Make KnnVectorValues primarily a random access API

dce766c

HasIndexSlice

2f0cc8c

remove RandomAccessVectorValues

327b930

tests pass

98ab0a6

fixing up javadocs and making iterator methods instance methods

1450b44

rename DocIterator to DocIndexIterator

8d087e2

clean up some comments

c2ae86b

fix case where index is reordered

ff7a317

jpountz reviewed Sep 12, 2024

jpountz reviewed Sep 13, 2024

ChrisHegarty reviewed Sep 13, 2024

Michael Sokolov added 4 commits September 15, 2024 15:52

rename 'fromOrdToDoc' to 'all'; move fromIndexedDISI to codecs/lucene90

9e5b9f9

no default advance(); default cost() unsupported

d43785d

make iterator() API sane

787e89c

Merge branch 'main' into knn-vector-random

1873955

Rename IteratorSupplier->SortingIteratorSupplier and add javadoc

4feecf8

cache vector values iterators in VectorFieldSources

abc1713

add implementations of KnnVectorValues.copy()

a2ca172

Merge remote-tracking branch 'origin/main' into knn-vector-random

274859f

fix SlowCOmpositeCodecReaderWrapper; off-by-one AND lazy iterator access

2b21668

Michael Sokolov added 5 commits September 20, 2024 18:56

Merge remote-tracking branch 'origin/main' into knn-vector-random

2a284f2

resolve merge conflicts

cb62025

fix NPE introduced in recent patch when segment has no vectors

29c9e00

fix failing test due to leaking static in test class when iters>0

e219f3b

remove stray print in test

a8dfe68

msokolov mentioned this pull request Sep 28, 2024

Replace need for KnnVectorValues.copy() with a dictionary interface #13831

Open

msokolov merged commit 6053e1e into apache:main Sep 28, 2024
4 checks passed

msokolov deleted the knn-vector-random branch September 28, 2024 13:14

javanna added a commit to javanna/elasticsearch that referenced this pull request Sep 30, 2024

Address compile errors after vector api changes upstream

4f16206

Our lucene_snapshot branch requires updating after apache/lucene#13779

javanna mentioned this pull request Sep 30, 2024

Address compile errors after vector api changes upstream elastic/elasticsearch#113766

Merged

javanna added the type:enhancement label Sep 30, 2024

javanna added this to the 10.0.0 milestone Sep 30, 2024

javanna added a commit to elastic/elasticsearch that referenced this pull request Sep 30, 2024

Address compile errors after vector api changes upstream (#113766)

63e524d

Our lucene_snapshot branch requires updating after apache/lucene#13779

benwtrent mentioned this pull request Oct 2, 2024

Fix bug where off-heap scorer would kick on even for float vectors #13850

Merged
