[DRAFT] Load vector data directly from the memory segment #12703
Conversation
Some benchmark results. Mac M2, 128 bit:
[benchmark table omitted]
Rocket Lake, AVX 512:
[benchmark table omitted]
Thanks for investigating this! Can we just fix the vector code to take MemorySegment and wrap the array code? I don't think we should add yet another factor that multiplies the number of vector impls; there are already too many (encoding * function * cpu). Stuff using IndexInput can be slow and wrap arrays with MemorySegment. IMO we should deprecate IndexInput for Lucene 10.
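A minimal sketch of that direction, assuming JDK 21's java.lang.foreign plus the incubating jdk.incubator.vector API, and little-endian data as Lucene writes it (names here are illustrative, not the PR's code):

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;
import java.nio.ByteOrder;
import jdk.incubator.vector.FloatVector;
import jdk.incubator.vector.VectorOperators;
import jdk.incubator.vector.VectorSpecies;

final class SegmentDotProduct {
  private static final VectorSpecies<Float> SPECIES = FloatVector.SPECIES_PREFERRED;

  /** Dot product reading both operands straight from MemorySegments, so one
   *  kernel serves mmap'd files and heap-backed data alike. */
  static float dotProduct(MemorySegment a, MemorySegment b, int dims) {
    FloatVector acc = FloatVector.zero(SPECIES);
    int i = 0;
    int bound = SPECIES.loopBound(dims);
    for (; i < bound; i += SPECIES.length()) {
      long off = (long) i * Float.BYTES;
      FloatVector va = FloatVector.fromMemorySegment(SPECIES, a, off, ByteOrder.LITTLE_ENDIAN);
      FloatVector vb = FloatVector.fromMemorySegment(SPECIES, b, off, ByteOrder.LITTLE_ENDIAN);
      acc = va.fma(vb, acc);
    }
    float sum = acc.reduceLanes(VectorOperators.ADD);
    for (; i < dims; i++) { // scalar tail
      sum += a.get(ValueLayout.JAVA_FLOAT_UNALIGNED, (long) i * Float.BYTES)
           * b.get(ValueLayout.JAVA_FLOAT_UNALIGNED, (long) i * Float.BYTES);
    }
    return sum;
  }
}
```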
As far as performance in practice: what kind of alignment is necessary such that it is reasonable for mmap'd files? Please, let it not be 64-byte alignment for AVX-512; that's too wasteful.
@rmuir If I understand your comment correctly: I unaligned the vector data in the mmap file in the benchmark. The results are close enough to the aligned numbers, maybe a little lower when the alignment is terrible! See b02b79b.
Rocket Lake, AVX 512:
[benchmark table omitted]
Mac M2, 128 bit:
[benchmark table omitted]
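For context on what "unaligned" means here: whether a mapped vector's loads straddle cache lines depends on its absolute address modulo the load width. A trivial check over a native (mmap'd) segment might look like this sketch (illustrative only, not the benchmark code):

```java
import java.lang.foreign.MemorySegment;

final class AlignmentCheck {
  /** True if the slice starting at offset within a native segment is aligned
   *  to the given boundary (e.g. 64 for full-width AVX-512 loads). */
  static boolean isAligned(MemorySegment seg, long offset, long alignment) {
    return (seg.address() + offset) % alignment == 0;
  }
}
```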
Yes, that is a good idea. I'll do it and see how poorly it performs. If we do this, then we still need to handle the case where a vector spans more than one segment (well, it should only ever span at most two). We can just copy the bytes into a float[] for this and get a memory segment view over it - this will happen only infrequently.
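A hedged sketch of that fallback. It copies into a byte[] rather than a float[], since (as comes up just below) heap segments backed by non-byte arrays currently trip over a Vector API restriction; the names here are illustrative:

```java
import java.io.IOException;
import java.lang.foreign.MemorySegment;
import org.apache.lucene.store.IndexInput;

final class SpanningVectorFallback {
  /** Rare path: the vector straddles two mapped chunks, so copy its bytes
   *  onto the heap and return a segment view over the copy. */
  static MemorySegment copyToHeapSegment(IndexInput in, long pos, int dims) throws IOException {
    byte[] scratch = new byte[dims * Float.BYTES];
    in.seek(pos);
    in.readBytes(scratch, 0, scratch.length);
    return MemorySegment.ofArray(scratch); // byte[]-backed heap segment
  }
}
```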
Agreed.
I'm not sure how this would work, but I accept the sentiment. And I think I see/agree where you are going longer term with this (and Java 22).
Well... simply wrapping a float[] into a MemorySegment is not going to work out; the Vector API does not like it due to alignment constraints (which seems overly pedantic, since it can operate fine when accessing a float[] directly!).
This approach would of course work fine for the binary computations, since they are byte[]-backed. The current approach has three variants:
[variant list omitted]
No. 1 is used as a fallback when we span segments. I suspect that no. 1 and no. 2 are not as perf-sensitive as no. 3? In fact, we could try to trace the float[] back to where it originates; maybe read and store it as bytes but access it as floats. Maybe there are other ways to invert things? I'll need to sleep on it!!
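To make the restriction concrete: as of JDK 21, vector loads from heap segments are only allowed when the segment is backed by a byte[] - which is exactly why the "store as bytes, access as floats" idea can work. A small demonstration, assuming JDK 21 semantics:

```java
import java.lang.foreign.MemorySegment;
import java.nio.ByteOrder;
import jdk.incubator.vector.FloatVector;

final class HeapSegmentRestriction {
  static void demo() {
    MemorySegment floats = MemorySegment.ofArray(new float[] {1f, 2f, 3f, 4f});
    // Throws IllegalArgumentException: heap segment not backed by byte[].
    // FloatVector.fromMemorySegment(FloatVector.SPECIES_128, floats, 0, ByteOrder.nativeOrder());

    MemorySegment bytes = MemorySegment.ofArray(new byte[16]); // same 16 bytes, stored as bytes
    FloatVector ok = FloatVector.fromMemorySegment(FloatVector.SPECIES_128, bytes, 0, ByteOrder.nativeOrder());
  }
}
```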
Hi, my idea was to have a new method on IndexInput / RandomAccessInput, returning a ByteBuffer slice for the vector, that can be passed around and handled in a typesafe way, without hacking "shortcut" paths between the various components and removing abstractions which were introduced to separate IO from Lucene logic. I am out of office next week, and I'd like to participate in the discussion; we should not rush anything.
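A rough shape of that idea, with a hypothetical method name and signature (not an actual Lucene API):

```java
import java.io.IOException;
import java.nio.ByteBuffer;

abstract class RandomAccessInputSketch {
  /** Hypothetical: a read-only ByteBuffer view of {@code length} bytes at
   *  {@code pos}, letting callers stay typesafe without reaching through
   *  the IO abstraction. */
  abstract ByteBuffer byteBufferSlice(long pos, int length) throws IOException;
}
```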
Take your time. Your input and ideas are very much welcome. We will continue to iterate here and try out different things, but we ultimately will not finalize anything until you have had time to spend on it.
It's just a leftover API restriction that will be removed in a future JDK version; see https://mail.openjdk.org/pipermail/panama-dev/2023-October/020106.html. The upshot is that we cannot easily wrap these float[]s right now. But that may not be strictly necessary if we move to a different approach, as suggested by Uwe.
For what it's worth, the changes currently in this PR do not perform generally well, since we can have a mix of how we represent the underlying vector values and where they come from. I agree with Uwe: it will be much more flexible if we consistently represent the values as ByteBuffer everywhere - that way they can be on- or off-heap, represent float or byte values, etc., and can have a MemorySegment view in the Panama vector implementation.
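A sketch of why ByteBuffer works as the common currency here: the same buffer can be heap or direct, can be read as floats or bytes, and the Panama implementation can view it as a MemorySegment without copying (assuming JDK 21's MemorySegment.ofBuffer):

```java
import java.lang.foreign.MemorySegment;
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

final class ByteBufferViews {
  static void views(ByteBuffer bb) {
    bb.order(ByteOrder.LITTLE_ENDIAN);              // Lucene writes little-endian
    FloatBuffer asFloats = bb.asFloatBuffer();      // typed float view, no copy
    MemorySegment seg = MemorySegment.ofBuffer(bb); // zero-copy view for Panama code
  }
}
```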
I've not been able to spend all that much time on this during the week, but here's my current thinking. The abstractions in the PR are currently not great (as discussed above), but putting that aside for now, we can get a sense of the potential real performance impact from this approach as it stands - so I did some performance experiments beyond micro JMH benchmarks.

It seems that this change improves the merge performance of vector data in segments by about 10% - not great; I was hoping for better. Is it worth proceeding with, or maybe looking elsewhere? I'm not sure.

Here's how I determined the 10%: by hacking on KnnGraphTester from luceneUtil so that it creates an index with more than one segment when ingesting 100k+ vectors of 768 dimensions, then timing the forceMerge. This is a very rough experiment, but it shows that the potential gain is much less than I expected. Caution - I could have goofed up several things, from the actual implementation to the merge experiment.
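Roughly how such a merge timing could be taken (my reconstruction under stated assumptions, not the actual KnnGraphTester patch):

```java
import java.nio.file.Path;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.store.FSDirectory;

final class MergeTiming {
  /** Opens an existing multi-segment vector index and times forceMerge(1). */
  static long timeForceMergeNanos(Path indexPath) throws Exception {
    try (FSDirectory dir = FSDirectory.open(indexPath);
         IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(new StandardAnalyzer()))) {
      long start = System.nanoTime();
      writer.forceMerge(1);
      return System.nanoTime() - start;
    }
  }
}
```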
[ This PR is a draft - not ready to be merged. It is intended to help facilitate a discussion. ]
This PR enhances the vector similarity functions so that they can access the underlying memory directly, rather than first copying it into a primitive array on the Java heap.
- I added a number of overloads to VectorSimilarityFunction, to allow the retrieval of the vector data to be pushed down. This way, the actual retrieval can be pushed into the provider implementation. This feels right, to me.
- RandomAccessVectorValues encapsulates and provides access to the underlying vector data. I added the ability to retrieve the backing IndexInput here, so that it's possible to bypass the accessor that does the copy. This is not great, but maybe ok, especially if we could restrict access?
- I updated MemorySegmentIndexInput to support retrieval of the backing segment for a given position. That way the vector provider can access this directly. This kinda feels ok; it makes the vector provider and the memory segment index input closer in nature, without imposing incubating APIs onto the index implementation that is using preview APIs.

There are now a couple of extra variants of the distance calculation functions, but they are largely a cut'n'paste of sections of each other. We might not need them all, but that kinda depends on which are more performance-sensitive than others. A rough sketch of the new access shape follows below.
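The sketch mentioned above - illustrative guesses at the shape of the pushed-down paths, not the PR's actual signatures:

```java
import java.io.IOException;
import java.lang.foreign.MemorySegment;

/** MemorySegmentIndexInput side (hypothetical signature). */
interface BackingSegmentAccess {
  /** The backing segment slice covering {@code length} bytes at {@code pos},
   *  or null if the range spans two mapped chunks. */
  MemorySegment segmentSliceOrNull(long pos, long length) throws IOException;
}

/** VectorSimilarityFunction side (hypothetical overload). */
abstract class VectorSimilaritySketch {
  /** Existing heap-array entry point. */
  abstract float compare(float[] v1, float[] v2);

  /** New overload: retrieval is pushed down so the provider implementation
   *  can read straight from mapped memory instead of copying to the heap. */
  abstract float compare(MemorySegment v1, MemorySegment v2, int dims);
}
```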
Currently only float dot product has been updated, so as to first determine the potential performance benefits as well as the approach.
Outstanding work and areas for improvement:
[task list omitted]
relates #12482