Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure Panama float vector distance impls inlinable #14031

Merged
merged 7 commits into from
Dec 3, 2024

Conversation

ChrisHegarty
Copy link
Contributor

@ChrisHegarty ChrisHegarty commented Dec 2, 2024

This commit reduces the Panama vector distance float implementations to less than the maximum bytecode size of a hot method to be inlined (325).

E.g. Previously: org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport::dotProductBody (355 bytes) failed to inline: callee is too large.

After: org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport::dotProductBody (311 bytes) inline (hot)

This helps things a little.

Copy link

@john-wagster john-wagster left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

very cool; LGTM

@rmuir
Copy link
Member

rmuir commented Dec 2, 2024

good here too. we can also save another 5 bytes with something like this. it seems to help me a tiny bit according to the JMH too.

not sure if it makes the code harder or easier to read/maintain. i sorta like today that it is clear at a glance there are no data dependencies. We could also move the i2/i3/4 to top of loop to accomplish that if we wanted.

--- a/lucene/core/src/java21/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java
+++ b/lucene/core/src/java21/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java
@@ -129,18 +129,21 @@ final class PanamaVectorUtilSupport implements VectorUtilSupport {
       acc1 = fma(va, vb, acc1);
 
       // two
-      FloatVector vc = FloatVector.fromArray(FLOAT_SPECIES, a, i + floatSpeciesLength);
-      FloatVector vd = FloatVector.fromArray(FLOAT_SPECIES, b, i + floatSpeciesLength);
+      final int i2 = i + floatSpeciesLength;
+      FloatVector vc = FloatVector.fromArray(FLOAT_SPECIES, a, i2);
+      FloatVector vd = FloatVector.fromArray(FLOAT_SPECIES, b, i2);
       acc2 = fma(vc, vd, acc2);
 
       // three
-      FloatVector ve = FloatVector.fromArray(FLOAT_SPECIES, a, i + 2 * floatSpeciesLength);
-      FloatVector vf = FloatVector.fromArray(FLOAT_SPECIES, b, i + 2 * floatSpeciesLength);
+      final int i3 = i2 + floatSpeciesLength;
+      FloatVector ve = FloatVector.fromArray(FLOAT_SPECIES, a, i3);
+      FloatVector vf = FloatVector.fromArray(FLOAT_SPECIES, b, i3);
       acc3 = fma(ve, vf, acc3);
 
       // four
-      FloatVector vg = FloatVector.fromArray(FLOAT_SPECIES, a, i + 3 * floatSpeciesLength);
-      FloatVector vh = FloatVector.fromArray(FLOAT_SPECIES, b, i + 3 * floatSpeciesLength);
+      final int i4 = i3 + floatSpeciesLength;
+      FloatVector vg = FloatVector.fromArray(FLOAT_SPECIES, a, i4);
+      FloatVector vh = FloatVector.fromArray(FLOAT_SPECIES, b, i4);
       acc4 = fma(vg, vh, acc4);
     }
     // vector tail: less scalar computations for unaligned sizes, esp with big vector sizes

@rmuir
Copy link
Member

rmuir commented Dec 2, 2024

We can iterate on last patch and save a few more bytes (302b) if we just pull out into a static final constant instead, too:

--- a/lucene/core/src/java21/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java
+++ b/lucene/core/src/java21/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java
@@ -75,6 +75,9 @@ final class PanamaVectorUtilSupport implements VectorUtilSupport {
     }
   }
 
+  // cached vector sizes for smaller method bodies
+  private static final int FLOAT_SPECIES_LENGTH = FLOAT_SPECIES.length();
+
   // the way FMA should work! if available use it, otherwise fall back to mul/add
   private static FloatVector fma(FloatVector a, FloatVector b, FloatVector c) {
     if (Constants.HAS_FAST_VECTOR_FMA) {
@@ -99,7 +102,7 @@ final class PanamaVectorUtilSupport implements VectorUtilSupport {
     float res = 0;
 
     // if the array size is large (> 2x platform vector size), its worth the overhead to vectorize
-    if (a.length > 2 * FLOAT_SPECIES.length()) {
+    if (a.length > 2 * FLOAT_SPECIES_LENGTH) {
       i += FLOAT_SPECIES.loopBound(a.length);
       res += dotProductBody(a, b, i);
     }
@@ -120,31 +123,33 @@ final class PanamaVectorUtilSupport implements VectorUtilSupport {
     FloatVector acc2 = FloatVector.zero(FLOAT_SPECIES);
     FloatVector acc3 = FloatVector.zero(FLOAT_SPECIES);
     FloatVector acc4 = FloatVector.zero(FLOAT_SPECIES);
-    final int floatSpeciesLength = FLOAT_SPECIES.length();
-    final int unrolledLimit = limit - 3 * floatSpeciesLength;
-    for (; i < unrolledLimit; i += 4 * floatSpeciesLength) {
+    final int unrolledLimit = limit - 3 * FLOAT_SPECIES_LENGTH;
+    for (; i < unrolledLimit; i += 4 * FLOAT_SPECIES_LENGTH) {
       // one
       FloatVector va = FloatVector.fromArray(FLOAT_SPECIES, a, i);
       FloatVector vb = FloatVector.fromArray(FLOAT_SPECIES, b, i);
       acc1 = fma(va, vb, acc1);
 
       // two
-      FloatVector vc = FloatVector.fromArray(FLOAT_SPECIES, a, i + floatSpeciesLength);
-      FloatVector vd = FloatVector.fromArray(FLOAT_SPECIES, b, i + floatSpeciesLength);
+      final int i2 = i + FLOAT_SPECIES_LENGTH;
+      FloatVector vc = FloatVector.fromArray(FLOAT_SPECIES, a, i2);
+      FloatVector vd = FloatVector.fromArray(FLOAT_SPECIES, b, i2);
       acc2 = fma(vc, vd, acc2);
 
       // three
-      FloatVector ve = FloatVector.fromArray(FLOAT_SPECIES, a, i + 2 * floatSpeciesLength);
-      FloatVector vf = FloatVector.fromArray(FLOAT_SPECIES, b, i + 2 * floatSpeciesLength);
+      final int i3 = i2 + FLOAT_SPECIES_LENGTH;
+      FloatVector ve = FloatVector.fromArray(FLOAT_SPECIES, a, i3);
+      FloatVector vf = FloatVector.fromArray(FLOAT_SPECIES, b, i3);
       acc3 = fma(ve, vf, acc3);
 
       // four
-      FloatVector vg = FloatVector.fromArray(FLOAT_SPECIES, a, i + 3 * floatSpeciesLength);
-      FloatVector vh = FloatVector.fromArray(FLOAT_SPECIES, b, i + 3 * floatSpeciesLength);
+      final int i4 = i3 + FLOAT_SPECIES_LENGTH;
+      FloatVector vg = FloatVector.fromArray(FLOAT_SPECIES, a, i4);
+      FloatVector vh = FloatVector.fromArray(FLOAT_SPECIES, b, i4);
       acc4 = fma(vg, vh, acc4);
     }
     // vector tail: less scalar computations for unaligned sizes, esp with big vector sizes
-    for (; i < limit; i += floatSpeciesLength) {
+    for (; i < limit; i += FLOAT_SPECIES_LENGTH) {
       FloatVector va = FloatVector.fromArray(FLOAT_SPECIES, a, i);
       FloatVector vb = FloatVector.fromArray(FLOAT_SPECIES, b, i);
       acc1 = fma(va, vb, acc1);

I feel like it makes the code a bit easier on the eyes, and benchie is happy:

Benchmark                                         (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatDotProductVector (main)    1024  thrpt   75  12.347 ± 0.148  ops/us
VectorUtilBenchmark.floatDotProductVector (patch)   1024  thrpt   75  12.754 ± 0.106  ops/us

@ChrisHegarty
Copy link
Contributor Author

@rmuir nice!!! wanna push that to the branch? Then I’ll do some more benchmark runs tomorrow too.

@rmuir
Copy link
Member

rmuir commented Dec 3, 2024

I applied and tested the same approach with the other 2 functions too. cosine was already underweight: it is only unrolled twice due to complexity of the mathematical formula, but it keeps the floats consistent. we could tidy up the binary ones in similar fashion as a followup for more consistency, but since jvm can already unroll the integer math, they arent unrolled and i expect they are already under limit. microbenchmarks seem happy but I assume the real gains are from more macrobenchmark where the inlining can help.

Before:
Benchmark                                  (size)   Mode  Cnt   Score   Error   Units "body" size
VectorUtilBenchmark.floatCosineVector        1024  thrpt   75   8.216 ± 0.026  ops/us 345 bytes
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  12.466 ± 0.100  ops/us 355 bytes
VectorUtilBenchmark.floatSquareVector        1024  thrpt   75  11.986 ± 0.074  ops/us 400 bytes

After:
Benchmark                                  (size)   Mode  Cnt   Score   Error   Units "body" size
VectorUtilBenchmark.floatCosineVector        1024  thrpt   75   8.377 ± 0.040  ops/us 320 bytes
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  12.917 ± 0.113  ops/us 302 bytes
VectorUtilBenchmark.floatSquareVector        1024  thrpt   75  12.365 ± 0.085  ops/us 302 bytes

edit: re-ran square after getting it under the limit, too.

@ChrisHegarty
Copy link
Contributor Author

we could tidy up the binary ones in similar fashion as a followup for more consistency, but since jvm can already unroll the integer math, they arent unrolled and i expect they are already under limit.

++

microbenchmarks seem happy but I assume the real gains are from more macrobenchmark where the inlining can help.

Right. The microbenchmarks show some modest improvement, but it seemed reasonably straightforward to eliminate not being inlined from the equation when trawling over local luceneutil runs. I don't have specific numbers yet, but let's merge this change as an incremental improvement and keep an eye on Mike's nightly benchmark runs. :-)

@ChrisHegarty
Copy link
Contributor Author

FTR

Apple M2 Pro

baseline 
VectorUtilBenchmark.floatCosineVector        1024  thrpt   75   7.966 ± 0.093  ops/us
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  16.439 ± 0.432  ops/us
VectorUtilBenchmark.floatSquareVector        1024  thrpt   75  14.562 ± 0.152  ops/us

candidate 
VectorUtilBenchmark.floatCosineVector        1024  thrpt   75   8.190 ± 0.089  ops/us
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  17.063 ± 0.440  ops/us
VectorUtilBenchmark.floatSquareVector        1024  thrpt   75  14.614 ± 0.241  ops/us

Intel SkyLake

baseline
VectorUtilBenchmark.floatCosineVector        1024  thrpt   75  15.118 ± 0.128  ops/us
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  26.564 ± 0.714  ops/us
VectorUtilBenchmark.floatSquareVector        1024  thrpt   75  25.131 ± 0.406  ops/us  

candidate
VectorUtilBenchmark.floatCosineVector        1024  thrpt   75  15.111 ± 0.136  ops/us
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  29.269 ± 0.765  ops/us
VectorUtilBenchmark.floatSquareVector        1024  thrpt   75  24.599 ± 0.135  ops/us

@ChrisHegarty ChrisHegarty changed the title Reduce dotProductBody to less than the maximum bytecode size of a hot method to be inlined (325) Ensure Panama float distance impls inlinable Dec 3, 2024
@ChrisHegarty ChrisHegarty changed the title Ensure Panama float distance impls inlinable Ensure Panama float vector distance impls inlinable Dec 3, 2024
@ChrisHegarty ChrisHegarty merged commit 4f08f3d into apache:main Dec 3, 2024
3 checks passed
@ChrisHegarty ChrisHegarty deleted the dotProduct_codeSize branch December 3, 2024 10:49
ChrisHegarty added a commit that referenced this pull request Dec 3, 2024
This commit reduces the Panama vector distance float implementations to less than the maximum bytecode size of a hot method to be inlined (325).

E.g. Previously:  org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport::dotProductBody (355 bytes)   failed to inline: callee is too large.

After: org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport::dotProductBody (3xx bytes)   inline (hot)

This helps things a little.

Co-authored-by: Robert Muir <rmuir@apache.org>
@jpountz
Copy link
Contributor

jpountz commented Dec 4, 2024

FYI nightly benchmarks had a big regression last night, and this is the only change I can find that could have caused this: https://benchmarks.mikemccandless.com/VectorSearch.html.

@rmuir
Copy link
Member

rmuir commented Dec 4, 2024

@jpountz lets just revert it and figure it out separately?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants