
[DRAFT] Load vector data directly from the memory segment #12703

Closed · wants to merge 6 commits

Conversation

@ChrisHegarty (Contributor) commented Oct 20, 2023

[This PR is a draft and not ready to be merged. It is intended to help facilitate a discussion.]

This PR enhances the vector similarity functions so that they can access the underlying memory directly, rather than first copying to a primitive array within the Java heap.

I added a number of overloads to VectorSimilarityFunction so that retrieval of the vector data can be pushed down into the provider implementation. This feels right to me.

RandomAccessVectorValues encapsulates and provides access to the underlying vector data. I added the ability to retrieve the backing IndexInput here, so that it's possible to bypass the accessor that does the copy. This is not great, but maybe ok, especially if we could restrict access?

I updated MemorySegmentIndexInput to support retrieval of the backing segment for a given position, so that the vector provider can access it directly. This feels acceptable: it brings the vector provider and the memory-segment index input closer in nature, without imposing incubating APIs on the index implementation that is using preview APIs.

There are now a couple of extra variants of the distance calculation functions, but they are largely cut-and-paste copies of each other. We might not need them all; that depends on which are more performance sensitive than others.

Currently only the float dot product has been updated, in order to first determine the potential performance benefits as well as validate the approach.
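To make the shape of the idea concrete, here is a minimal, hypothetical sketch (the interface and method names are illustrative stand-ins, not the actual Lucene APIs) of an overload that reads the second vector directly from its source instead of requiring the caller to copy it into a float[] first:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class DotProductSketch {

  // Stand-in for RandomAccessVectorValues: exposes a vector's raw bytes by ordinal.
  interface VectorSource {
    ByteBuffer vectorBytes(int ord); // little-endian encoded floats
    int dimension();
  }

  // Classic array-based variant: both vectors already copied on-heap.
  static float dotProduct(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
    return sum;
  }

  // Pushed-down variant: reads the second vector straight from its source,
  // skipping the intermediate float[] copy.
  static float dotProduct(float[] a, VectorSource src, int ord) {
    ByteBuffer bytes = src.vectorBytes(ord).order(ByteOrder.LITTLE_ENDIAN);
    float sum = 0f;
    for (int i = 0; i < a.length; i++) sum += a[i] * bytes.getFloat(i * Float.BYTES);
    return sum;
  }
}
```

A real implementation would vectorize the loop; the sketch only shows how pushing the accessor down removes the copy from the call path.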

Outstanding work and areas for improvement:

  1. Figure out something better than exposing IndexInput in RandomAccessVectorValues
  2. Binary and other distance calculations
  3. Evaluate perf impact with macro benchmarks ( only micro done so far )
  4. There are a couple of hacks to get the benchmark running - remove these
  5. ..

relates #12482

@ChrisHegarty (Contributor, Author)

Some benchmark results.

Mac M2, 128 bit

INFO: Java vector incubator API enabled; uses preferredBitSize=128
...
Benchmark                                     (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatDotProductVector       1024  thrpt    5   6.568 ± 0.189  ops/us
VectorUtilBenchmark.floatDotProductVectorMS1    1024  thrpt    5   8.823 ± 0.157  ops/us
VectorUtilBenchmark.floatDotProductVectorMS2    1024  thrpt    5  12.561 ± 0.298  ops/us

Rocket Lake, AVX 512

INFO: Java vector incubator API enabled; uses preferredBitSize=512
...
Benchmark                                     (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatDotProductVector       1024  thrpt    5   9.911 ± 0.004  ops/us
VectorUtilBenchmark.floatDotProductVectorMS1    1024  thrpt    5  10.081 ± 0.004  ops/us
VectorUtilBenchmark.floatDotProductVectorMS2    1024  thrpt    5  14.162 ± 0.020  ops/us

@rmuir (Member) commented Oct 20, 2023

Thanks for investigating this! Can we just fix vector code to take MemorySegment and wrap array code?

I don't think we should add yet another factor to multiply the number of vector impls, there are already too many (encoding * function * cpu).

Stuff using IndexInput can be slow and wrap arrays with MemorySegment. IMO we should deprecate IndexInput for lucene 10.

@rmuir (Member) commented Oct 20, 2023

as far as performance in practice, what kind of alignment is necessary such that it is reasonable for mmap'd files? Please, let it not be 64 bytes alignment for avx-512, that's too wasteful.

@ChrisHegarty (Contributor, Author)

@rmuir If I understand your comment correctly: I unaligned the vector data in the mmap'd file in the benchmark. The results are similar enough to the aligned case, maybe a little lower when the alignment is terrible! See b02b79b.

Rocket Lake, AVX 512

INFO: Java vector incubator API enabled; uses preferredBitSize=512
...
VectorUtilBenchmark.floatDotProductVector       1024  thrpt    5   9.994 ± 0.002  ops/us
VectorUtilBenchmark.floatDotProductVectorMS1    1024  thrpt    5  11.171 ± 0.007  ops/us
VectorUtilBenchmark.floatDotProductVectorMS2    1024  thrpt    5  13.159 ± 0.017  ops/us

Mac M2, 128 bit

INFO: Java vector incubator API enabled; uses preferredBitSize=128
...
VectorUtilBenchmark.floatDotProductVector       1024  thrpt    5   5.242 ± 0.025  ops/us
VectorUtilBenchmark.floatDotProductVectorMS1    1024  thrpt    5   7.928 ± 0.155  ops/us
VectorUtilBenchmark.floatDotProductVectorMS2    1024  thrpt    5  12.546 ± 0.077  ops/us

@ChrisHegarty (Contributor, Author)

> Thanks for investigating this! Can we just fix vector code to take MemorySegment and wrap array code?

Yes, that is a good idea. I'll do it and see how poorly it performs. If we do this, then we still need to handle the case where a vector spans more than one segment (it should only ever span at most two). We can copy the bytes into a float[] for that case and get a memory segment view over it, since this will happen infrequently.
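The spanning-segment fallback described above could be sketched roughly like this (a hypothetical helper, not Lucene code), assuming the two raw-byte pieces of the vector have already been read from either side of the boundary:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class SpanningReadSketch {

  // first + second are the two raw-byte pieces of one little-endian float
  // vector that straddles a segment boundary. We pay one bulk copy here,
  // then the normal float[]-based distance function can be used.
  static float[] readSpanning(byte[] first, byte[] second) {
    ByteBuffer merged = ByteBuffer.allocate(first.length + second.length)
        .order(ByteOrder.LITTLE_ENDIAN);
    merged.put(first).put(second).flip();
    float[] vec = new float[merged.remaining() / Float.BYTES];
    merged.asFloatBuffer().get(vec);
    return vec;
  }
}
```

Since a vector can cross at most one boundary, and boundaries are gigabytes apart, this slow path should be hit rarely enough not to matter.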

> I don't think we should add yet another factor to multiply the number of vector impls, there are already too many (encoding * function * cpu).

Agreed.

> Stuff using IndexInput can be slow and wrap arrays with MemorySegment. IMO we should deprecate IndexInput for lucene 10.

I'm not sure how this would work, but I accept the sentiment. And I think I see, and agree with, where you are going longer term with this (and Java 22).

@ChrisHegarty (Contributor, Author)

Well... a simple wrapping of float[] into MemorySegment is not going to work out; the Vector API does not like it due to alignment constraints (which seems overly pedantic, since it can operate fine when accessing a float[] directly!)

> FloatVector fromMemorySegment(...)
>
> throws IllegalArgumentException - if the memory segment is a heap segment that is not backed by a byte[] array.

This approach would of course work fine for the binary computations, since they are byte[] backed.

The current approach has three variants:

  1. dotProduct(float[] a, float[] b)
  2. dotProduct(float[] a, RandomAccessVectorValues<float[]> b, int bOffset)
  3. dotProduct(RandomAccessVectorValues<float[]> a, int aOffset, RandomAccessVectorValues<float[]> b, int bOffset)

No. 1 is used as a fallback when a vector spans segments.

I suspect that no.1 and no.2 are not as perf sensitive as no.3? In fact, we could try to trace the float[] back to where it originates, maybe read and store it as bytes but access it as a float.

Maybe there are other ways to invert things? I'll need to sleep on it!!
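The "store it as bytes but access it as a float" idea mentioned above might look something like this (illustrative only, hypothetical names): the raw bytes read from the index are viewed as floats without a per-vector conversion copy.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

public class FloatViewSketch {

  // View the raw little-endian bytes as floats; no element-by-element
  // conversion copy is made, just a typed window over the same memory.
  static FloatBuffer asFloats(byte[] raw) {
    return ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer();
  }

  static float dot(float[] a, byte[] rawB) {
    FloatBuffer b = asFloats(rawB);
    float sum = 0f;
    for (int i = 0; i < a.length; i++) sum += a[i] * b.get(i);
    return sum;
  }
}
```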

@rmuir (Member) commented Oct 21, 2023

> Well... a simple wrapping of float[] into MemorySegment is not going to work out; the Vector API does not like it due to alignment constraints (which seems overly pedantic, since it can operate fine when accessing a float[] directly!)

But all float[] on the heap are just like any other heap object and aligned to 8 bytes or something?

$ java -XX:+PrintFlagsFinal -version | grep -i align
     bool AlignVector                              = false                                  {C2 product} {default}
     intx InteriorEntryAlignment                   = 16                                  {C2 pd product} {default}
     intx NumberOfLoopInstrToAlign                 = 4                                      {C2 product} {default}
      int ObjectAlignmentInBytes                   = 8                              {product lp64_product} {default}
     intx OptoLoopAlignment                        = 16                                     {pd product} {default}
     bool UseUnalignedLoadStores                   = true                                 {ARCH product} {default}

@uschindler (Contributor)

Hi,
I was also thinking about this but came to a somewhat different setup. My problem here is that it directly links the Java 20+ code paths to each other and adds instanceof checks.

My idea was to have a new method on IndexInput, returning a ByteBuffer slice on RandomAccessInput for the vector, which can be passed around and handled in a typesafe way, without hacking "shortcut" paths between the various components and removing abstractions that were introduced to separate IO from Lucene logic.

I am out of office next week, but I'd like to participate in the discussion; we should not rush anything.

@ChrisHegarty (Contributor, Author)

> I am out of office next week, but I'd like to participate in the discussion; we should not rush anything.

Take your time. Your input and ideas are very much welcome. We will continue to iterate here and try out different things, but ultimately will not finalize until you have had time to spend on it.

@ChrisHegarty (Contributor, Author)

> Well... a simple wrapping of float[] into MemorySegment is not going to work out; the Vector API does not like it due to alignment constraints (which seems overly pedantic, since it can operate fine when accessing a float[] directly!)

> But all float[] on the heap are just like any other heap object and aligned to 8 bytes or something?

It's just a leftover API restriction that will be removed in a future JDK version, see

https://mail.openjdk.org/pipermail/panama-dev/2023-October/020106.html
https://bugs.openjdk.org/browse/JDK-8318678

The upshot is that we cannot easily wrap these float[]s now. But that may not be strictly necessary if we move to a different approach, as suggested by Uwe.
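For illustration, a byte[]-backed workaround could look like the following. This is a sketch assuming the finalized java.lang.foreign API (JDK 22+), not Lucene code; note that a byte[]-backed heap segment only guarantees 1-byte alignment, so the unaligned float layout is needed. Class and method names are hypothetical.

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class ByteBackedSegmentSketch {

  // FloatVector.fromMemorySegment rejects heap segments not backed by byte[],
  // so one option is to keep vector data in byte[] and view it as floats.
  static MemorySegment encode(float... values) {
    MemorySegment seg = MemorySegment.ofArray(new byte[values.length * Float.BYTES]);
    for (int i = 0; i < values.length; i++) {
      seg.set(ValueLayout.JAVA_FLOAT_UNALIGNED, (long) i * Float.BYTES, values[i]);
    }
    return seg;
  }

  static float dot(MemorySegment a, MemorySegment b, int dims) {
    float sum = 0f;
    for (int i = 0; i < dims; i++) {
      // JAVA_FLOAT has 4-byte alignment, which a byte[]-backed heap segment
      // cannot guarantee, hence the unaligned layout here.
      sum += a.get(ValueLayout.JAVA_FLOAT_UNALIGNED, (long) i * Float.BYTES)
           * b.get(ValueLayout.JAVA_FLOAT_UNALIGNED, (long) i * Float.BYTES);
    }
    return sum;
  }
}
```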

@ChrisHegarty (Contributor, Author)

For what it's worth, the changes currently in this PR do not perform well in general, since we can have a mix of how the underlying vector values are represented and where they come from. I agree with Uwe: it will be much more flexible if we consistently represent the values as ByteBuffer everywhere. That way they can be on or off heap, represent float or byte values, etc., and can have a MemorySegment view in the Panama vector implementation.
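As a rough sketch of that direction (hypothetical names, not a Lucene API): with ByteBuffer as the common representation, the exact same access code serves both heap-backed and direct (off-heap) buffers.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class ByteBufferVectorsSketch {

  // Works identically for array-backed and direct buffers: the ByteBuffer
  // abstraction hides where the bytes actually live.
  static float dot(ByteBuffer a, ByteBuffer b, int dims) {
    float sum = 0f;
    for (int i = 0; i < dims; i++) {
      sum += a.getFloat(i * Float.BYTES) * b.getFloat(i * Float.BYTES);
    }
    return sum;
  }

  // On-heap representation.
  static ByteBuffer heapVector(float... values) {
    ByteBuffer buf = ByteBuffer.allocate(values.length * Float.BYTES)
        .order(ByteOrder.LITTLE_ENDIAN);
    for (float v : values) buf.putFloat(v);
    return buf;
  }

  // Off-heap representation; an mmap'd file region would behave the same way.
  static ByteBuffer directVector(float... values) {
    ByteBuffer buf = ByteBuffer.allocateDirect(values.length * Float.BYTES)
        .order(ByteOrder.LITTLE_ENDIAN);
    for (float v : values) buf.putFloat(v);
    return buf;
  }
}
```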

@ChrisHegarty (Contributor, Author)

I've not been able to spend much time on this during the week, but here's my current thinking.

The abstractions in the PR are currently not great (as discussed above), but putting that aside for now, we can still get a sense of the potential real-world performance impact of this approach as it stands, so I ran some performance experiments beyond the micro JMH benchmarks.

It seems that this change improves the merge performance of vector data in segments by about 10%. Not great; I was hoping for better. Is it worth proceeding with, or should we look elsewhere? I'm not sure. Here's how I determined the 10%: I hacked on KnnGraphTester from luceneUtil so that it creates an index with more than one segment when ingesting 100k+ vectors with 768 dimensions, then timed the forceMerge. This is a very rough experiment, but it shows that the potential gain is much smaller than I expected. Caution: I could have goofed up several things, from the actual implementation to the merge experiment.
