
[DRAFT] Load vector data directly from the memory segment #12703

Closed · wants to merge 6 commits

Conversation

@ChrisHegarty (Contributor) commented Oct 20, 2023

[This PR is a draft and not ready to be merged. It is intended to help facilitate a discussion.]

This PR enhances the vector similarity functions so that they can access the underlying memory directly, rather than first copying to a primitive array within the Java heap.

I added a number of overloads to VectorSimilarityFunction so that retrieval of the vector data can be pushed down into the provider implementation. This feels right to me.

RandomAccessVectorValues encapsulates and provides access to the underlying vector data. I added the ability to retrieve the backing IndexInput here, so that it's possible to bypass the accessor that does the copy. This is not great, but maybe ok, especially if we could restrict access?

I updated MemorySegmentIndexInput to support retrieval of the backing segment for a given position, so that the vector provider can access it directly. This feels acceptable: it brings the vector provider and the memory-segment index input closer in nature, without imposing incubating APIs on the index implementation that is using preview APIs.

There are now a couple of extra variants of the distance calculation functions, but they are largely cut-and-paste copies of each other. We might not need them all; that depends on which are more performance sensitive than others.

Currently only the float dot product has been updated, in order to first determine the potential performance benefits as well as validate the approach.
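To make the shape of the idea concrete, here is a minimal, hypothetical sketch (the interface and method names are illustrative stand-ins, not the actual Lucene APIs) of an overload that reads the second vector directly from its source instead of requiring the caller to copy it into a float[] first:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class DotProductSketch {

  // Stand-in for RandomAccessVectorValues: exposes a vector's raw bytes by ordinal.
  interface VectorSource {
    ByteBuffer vectorBytes(int ord); // little-endian encoded floats
    int dimension();
  }

  // Classic array-based variant: both vectors already copied on-heap.
  static float dotProduct(float[] a, float[] b) {
    float sum = 0f;
    for (int i = 0; i < a.length; i++) sum += a[i] * b[i];
    return sum;
  }

  // Pushed-down variant: reads the second vector straight from its source,
  // skipping the intermediate float[] copy.
  static float dotProduct(float[] a, VectorSource src, int ord) {
    ByteBuffer bytes = src.vectorBytes(ord).order(ByteOrder.LITTLE_ENDIAN);
    float sum = 0f;
    for (int i = 0; i < a.length; i++) sum += a[i] * bytes.getFloat(i * Float.BYTES);
    return sum;
  }
}
```

A real implementation would vectorize the loop; the sketch only shows how pushing the accessor down removes the copy from the call path.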

Outstanding work and areas for improvement:

  1. Figure out something better than exposing IndexInput in RandomAccessVectorValues
  2. Binary and other distance calculations
  3. Evaluate perf impact with macro benchmarks ( only micro done so far )
  4. There are a couple of hacks to get the benchmark running - remove these
  5. ..

relates #12482

@ChrisHegarty (Contributor, Author)

Some benchmark results.

Mac M2, 128 bit

INFO: Java vector incubator API enabled; uses preferredBitSize=128
...
Benchmark                                     (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatDotProductVector       1024  thrpt    5   6.568 ± 0.189  ops/us
VectorUtilBenchmark.floatDotProductVectorMS1    1024  thrpt    5   8.823 ± 0.157  ops/us
VectorUtilBenchmark.floatDotProductVectorMS2    1024  thrpt    5  12.561 ± 0.298  ops/us

Rocket Lake, AVX 512

INFO: Java vector incubator API enabled; uses preferredBitSize=512
...
Benchmark                                     (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatDotProductVector       1024  thrpt    5   9.911 ± 0.004  ops/us
VectorUtilBenchmark.floatDotProductVectorMS1    1024  thrpt    5  10.081 ± 0.004  ops/us
VectorUtilBenchmark.floatDotProductVectorMS2    1024  thrpt    5  14.162 ± 0.020  ops/us

@rmuir (Member) commented Oct 20, 2023

Thanks for investigating this! Can we just fix vector code to take MemorySegment and wrap array code?

I don't think we should add yet another factor to multiply the number of vector impls, there are already too many (encoding * function * cpu).

Stuff using IndexInput can be slow and wrap arrays with MemorySegment. IMO we should deprecate IndexInput for lucene 10.

@rmuir (Member) commented Oct 20, 2023

as far as performance in practice, what kind of alignment is necessary such that it is reasonable for mmap'd files? Please, let it not be 64 bytes alignment for avx-512, that's too wasteful.

@ChrisHegarty (Contributor, Author)

@rmuir If I understand your comment correctly: I unaligned the vector data in the mmap'd file in the benchmark. The results are similar enough to the aligned case, maybe a little lower when the alignment is terrible! See b02b79b.

Rocket Lake, AVX 512

INFO: Java vector incubator API enabled; uses preferredBitSize=512
...
VectorUtilBenchmark.floatDotProductVector       1024  thrpt    5   9.994 ± 0.002  ops/us
VectorUtilBenchmark.floatDotProductVectorMS1    1024  thrpt    5  11.171 ± 0.007  ops/us
VectorUtilBenchmark.floatDotProductVectorMS2    1024  thrpt    5  13.159 ± 0.017  ops/us

Mac M2, 128 bit

INFO: Java vector incubator API enabled; uses preferredBitSize=128
...
VectorUtilBenchmark.floatDotProductVector       1024  thrpt    5   5.242 ± 0.025  ops/us
VectorUtilBenchmark.floatDotProductVectorMS1    1024  thrpt    5   7.928 ± 0.155  ops/us
VectorUtilBenchmark.floatDotProductVectorMS2    1024  thrpt    5  12.546 ± 0.077  ops/us

@ChrisHegarty (Contributor, Author)

> Thanks for investigating this! Can we just fix vector code to take MemorySegment and wrap array code?

Yes, that is a good idea. I'll do it and see how poorly it performs. If we do this, then we still need to handle the case where a vector spans more than one segment (it should only ever span at most two). We can copy the bytes into a float[] for that case and get a memory segment view over it, since this will happen infrequently.
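The spanning-segment fallback described above could be sketched roughly like this (a hypothetical helper, not Lucene code), assuming the two raw-byte pieces of the vector have already been read from either side of the boundary:

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class SpanningReadSketch {

  // first + second are the two raw-byte pieces of one little-endian float
  // vector that straddles a segment boundary. We pay one bulk copy here,
  // then the normal float[]-based distance function can be used.
  static float[] readSpanning(byte[] first, byte[] second) {
    ByteBuffer merged = ByteBuffer.allocate(first.length + second.length)
        .order(ByteOrder.LITTLE_ENDIAN);
    merged.put(first).put(second).flip();
    float[] vec = new float[merged.remaining() / Float.BYTES];
    merged.asFloatBuffer().get(vec);
    return vec;
  }
}
```

Since a vector can cross at most one boundary, and boundaries are gigabytes apart, this slow path should be hit rarely enough not to matter.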

> I don't think we should add yet another factor to multiply the number of vector impls, there are already too many (encoding * function * cpu).

Agreed.

> Stuff using IndexInput can be slow and wrap arrays with MemorySegment. IMO we should deprecate IndexInput for lucene 10.

I'm not sure how this would work, but I accept the sentiment. And I think I see, and agree with, where you are going longer term with this (and Java 22).

@ChrisHegarty (Contributor, Author)

Well... a simple wrapping of float[] into MemorySegment is not going to work out; the Vector API does not like it due to alignment constraints (which seems overly pedantic, since it can operate fine when accessing a float[] directly!)

> FloatVector fromMemorySegment(...)
>
> throws IllegalArgumentException - if the memory segment is a heap segment that is not backed by a byte[] array.

This approach would of course work fine for the binary computations, since they are byte[] backed.

The current approach has three variants:

  1. dotProduct(float[] a, float[] b)
  2. dotProduct(float[] a, RandomAccessVectorValues<float[]> b, int bOffset)
  3. dotProduct(RandomAccessVectorValues<float[]> a, int aOffset, RandomAccessVectorValues<float[]> b, int bOffset)

No. 1 is used as a fallback when a vector spans segments.

I suspect that no.1 and no.2 are not as perf sensitive as no.3? In fact, we could try to trace the float[] back to where it originates, maybe read and store it as bytes but access it as a float.

Maybe there are other ways to invert things? I'll need to sleep on it!!
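The "store it as bytes but access it as a float" idea mentioned above might look something like this (illustrative only, hypothetical names): the raw bytes read from the index are viewed as floats without a per-vector conversion copy.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.FloatBuffer;

public class FloatViewSketch {

  // View the raw little-endian bytes as floats; no element-by-element
  // conversion copy is made, just a typed window over the same memory.
  static FloatBuffer asFloats(byte[] raw) {
    return ByteBuffer.wrap(raw).order(ByteOrder.LITTLE_ENDIAN).asFloatBuffer();
  }

  static float dot(float[] a, byte[] rawB) {
    FloatBuffer b = asFloats(rawB);
    float sum = 0f;
    for (int i = 0; i < a.length; i++) sum += a[i] * b.get(i);
    return sum;
  }
}
```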

@rmuir (Member) commented Oct 21, 2023

> Well... a simple wrapping of float[] into MemorySegment is not going to work out; the Vector API does not like it due to alignment constraints (which seems overly pedantic, since it can operate fine when accessing a float[] directly!)

But all float[] on the heap are just like any other heap object and aligned to 8 bytes or something?

$ java -XX:+PrintFlagsFinal -version | grep -i align
     bool AlignVector                              = false                                  {C2 product} {default}
     intx InteriorEntryAlignment                   = 16                                  {C2 pd product} {default}
     intx NumberOfLoopInstrToAlign                 = 4                                      {C2 product} {default}
      int ObjectAlignmentInBytes                   = 8                              {product lp64_product} {default}
     intx OptoLoopAlignment                        = 16                                     {pd product} {default}
     bool UseUnalignedLoadStores                   = true                                 {ARCH product} {default}

@uschindler (Contributor)

Hi,
I was also thinking about this but came to a somewhat different setup. My problem here is that it directly links the Java 20+ code paths to each other and adds instanceof checks.

My idea was to have a new method on IndexInput, returning a ByteBuffer slice on RandomAccessInput for the vector, which can be passed around and handled in a typesafe way, without hacking "shortcut" paths between the various components and removing abstractions that were introduced to separate IO from Lucene logic.

I am out of office next week, but I'd like to participate in the discussion; we should not rush anything.

@ChrisHegarty (Contributor, Author)

> I am out of office next week, but I'd like to participate in the discussion; we should not rush anything.

Take your time. Your input and ideas are very much welcome. We will continue to iterate here and try out different things, but ultimately will not finalize until you have had time to spend on it.

@ChrisHegarty (Contributor, Author)

> Well... a simple wrapping of float[] into MemorySegment is not going to work out; the Vector API does not like it due to alignment constraints (which seems overly pedantic, since it can operate fine when accessing a float[] directly!)

> But all float[] on the heap are just like any other heap object and aligned to 8 bytes or something?

It's just a leftover API restriction that will be removed in a future JDK version, see

https://mail.openjdk.org/pipermail/panama-dev/2023-October/020106.html
https://bugs.openjdk.org/browse/JDK-8318678

The upshot is that we cannot easily wrap these float[]s now. But that may not be strictly necessary if we move to a different approach, as suggested by Uwe.
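For illustration, a byte[]-backed workaround could look like the following. This is a sketch assuming the finalized java.lang.foreign API (JDK 22+), not Lucene code; note that a byte[]-backed heap segment only guarantees 1-byte alignment, so the unaligned float layout is needed. Class and method names are hypothetical.

```java
import java.lang.foreign.MemorySegment;
import java.lang.foreign.ValueLayout;

public class ByteBackedSegmentSketch {

  // FloatVector.fromMemorySegment rejects heap segments not backed by byte[],
  // so one option is to keep vector data in byte[] and view it as floats.
  static MemorySegment encode(float... values) {
    MemorySegment seg = MemorySegment.ofArray(new byte[values.length * Float.BYTES]);
    for (int i = 0; i < values.length; i++) {
      seg.set(ValueLayout.JAVA_FLOAT_UNALIGNED, (long) i * Float.BYTES, values[i]);
    }
    return seg;
  }

  static float dot(MemorySegment a, MemorySegment b, int dims) {
    float sum = 0f;
    for (int i = 0; i < dims; i++) {
      // JAVA_FLOAT has 4-byte alignment, which a byte[]-backed heap segment
      // cannot guarantee, hence the unaligned layout here.
      sum += a.get(ValueLayout.JAVA_FLOAT_UNALIGNED, (long) i * Float.BYTES)
           * b.get(ValueLayout.JAVA_FLOAT_UNALIGNED, (long) i * Float.BYTES);
    }
    return sum;
  }
}
```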

@ChrisHegarty (Contributor, Author)

For what it's worth, the changes currently in this PR do not perform well in general, since we can have a mix of how the underlying vector values are represented and where they come from. I agree with Uwe: it will be much more flexible if we consistently represent the values as ByteBuffer everywhere. That way they can be on or off heap, represent float or byte values, etc., and can have a MemorySegment view in the Panama vector implementation.
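As a rough sketch of that direction (hypothetical names, not a Lucene API): with ByteBuffer as the common representation, the exact same access code serves both heap-backed and direct (off-heap) buffers.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;

public class ByteBufferVectorsSketch {

  // Works identically for array-backed and direct buffers: the ByteBuffer
  // abstraction hides where the bytes actually live.
  static float dot(ByteBuffer a, ByteBuffer b, int dims) {
    float sum = 0f;
    for (int i = 0; i < dims; i++) {
      sum += a.getFloat(i * Float.BYTES) * b.getFloat(i * Float.BYTES);
    }
    return sum;
  }

  // On-heap representation.
  static ByteBuffer heapVector(float... values) {
    ByteBuffer buf = ByteBuffer.allocate(values.length * Float.BYTES)
        .order(ByteOrder.LITTLE_ENDIAN);
    for (float v : values) buf.putFloat(v);
    return buf;
  }

  // Off-heap representation; an mmap'd file region would behave the same way.
  static ByteBuffer directVector(float... values) {
    ByteBuffer buf = ByteBuffer.allocateDirect(values.length * Float.BYTES)
        .order(ByteOrder.LITTLE_ENDIAN);
    for (float v : values) buf.putFloat(v);
    return buf;
  }
}
```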

@ChrisHegarty (Contributor, Author)

I've not been able to spend much time on this during the week, but here's my current thinking.

The abstractions in the PR are currently not great (as discussed above), but putting that aside for now, we can still get a sense of the potential real-world performance impact of this approach as it stands, so I ran some performance experiments beyond the micro JMH benchmarks.

It seems that this change improves the merge performance of vector data in segments by about 10%. Not great; I was hoping for better. Is it worth proceeding with, or should we look elsewhere? I'm not sure. Here's how I determined the 10%: I hacked on KnnGraphTester from luceneUtil so that it creates an index with more than one segment when ingesting 100k+ vectors with 768 dimensions, then timed the forceMerge. This is a very rough experiment, but it shows that the potential gain is much smaller than I expected. Caution: I could have goofed up several things, from the actual implementation to the merge experiment.
