Ensure Panama float vector distance impls inlinable #14031

ChrisHegarty · 2024-12-02T16:37:29Z

This commit shrinks the bytecode of the Panama float vector distance implementations so that each body fits under the JVM's maximum bytecode size for inlining a hot method (325 bytes).

E.g. Previously: org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport::dotProductBody (355 bytes) failed to inline: callee is too large.

After: org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport::dotProductBody (311 bytes) inline (hot)

This helps things a little.
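
For anyone who wants to check the limit on their own JVM, here is a minimal sketch. It assumes a HotSpot-based JVM, where the cap for hot call sites is exposed as the `FreqInlineSize` flag. The class and method names are hypothetical and not part of Lucene; the 355 and 311 byte figures are the ones quoted above.

```java
import com.sun.management.HotSpotDiagnosticMXBean;
import java.lang.management.ManagementFactory;

// Hypothetical helper for illustration only; not part of Lucene.
final class InlineBudgetCheck {

  // Reads FreqInlineSize from the running JVM (assumes HotSpot).
  static int freqInlineSize() {
    HotSpotDiagnosticMXBean bean =
        ManagementFactory.getPlatformMXBean(HotSpotDiagnosticMXBean.class);
    return Integer.parseInt(bean.getVMOption("FreqInlineSize").getValue());
  }

  // Assumption: a hot callee is rejected only when its bytecode size exceeds the limit.
  static boolean fitsHotInlineBudget(int bytecodeSize, int limit) {
    return bytecodeSize <= limit;
  }

  public static void main(String[] args) {
    int limit = freqInlineSize();
    System.out.println("FreqInlineSize = " + limit);
    System.out.println("355 bytes fits: " + fitsHotInlineBudget(355, limit));
    System.out.println("311 bytes fits: " + fitsHotInlineBudget(311, limit));
  }
}
```

Inlining decisions like the two log lines above can usually be printed by running with `-XX:+UnlockDiagnosticVMOptions -XX:+PrintInlining`.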

john-wagster

very cool; LGTM

rmuir · 2024-12-02T21:19:57Z

Good here too. We can also save another 5 bytes with something like this. It seems to help a tiny bit according to JMH too.

Not sure if it makes the code harder or easier to read/maintain. I sort of like that today it is clear at a glance that there are no data dependencies. We could also move the i2/i3/i4 computations to the top of the loop to keep that property if we wanted.

--- a/lucene/core/src/java21/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java
+++ b/lucene/core/src/java21/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java
@@ -129,18 +129,21 @@ final class PanamaVectorUtilSupport implements VectorUtilSupport {
       acc1 = fma(va, vb, acc1);
 
       // two
-      FloatVector vc = FloatVector.fromArray(FLOAT_SPECIES, a, i + floatSpeciesLength);
-      FloatVector vd = FloatVector.fromArray(FLOAT_SPECIES, b, i + floatSpeciesLength);
+      final int i2 = i + floatSpeciesLength;
+      FloatVector vc = FloatVector.fromArray(FLOAT_SPECIES, a, i2);
+      FloatVector vd = FloatVector.fromArray(FLOAT_SPECIES, b, i2);
       acc2 = fma(vc, vd, acc2);
 
       // three
-      FloatVector ve = FloatVector.fromArray(FLOAT_SPECIES, a, i + 2 * floatSpeciesLength);
-      FloatVector vf = FloatVector.fromArray(FLOAT_SPECIES, b, i + 2 * floatSpeciesLength);
+      final int i3 = i2 + floatSpeciesLength;
+      FloatVector ve = FloatVector.fromArray(FLOAT_SPECIES, a, i3);
+      FloatVector vf = FloatVector.fromArray(FLOAT_SPECIES, b, i3);
       acc3 = fma(ve, vf, acc3);
 
       // four
-      FloatVector vg = FloatVector.fromArray(FLOAT_SPECIES, a, i + 3 * floatSpeciesLength);
-      FloatVector vh = FloatVector.fromArray(FLOAT_SPECIES, b, i + 3 * floatSpeciesLength);
+      final int i4 = i3 + floatSpeciesLength;
+      FloatVector vg = FloatVector.fromArray(FLOAT_SPECIES, a, i4);
+      FloatVector vh = FloatVector.fromArray(FLOAT_SPECIES, b, i4);
       acc4 = fma(vg, vh, acc4);
     }
     // vector tail: less scalar computations for unaligned sizes, esp with big vector sizes
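
To make the "i2/i3/i4 at the top of the loop" variant concrete, here is a scalar sketch of the same loop shape. It is only a stand-in under stated assumptions: `LANES` is a hypothetical fixed substitute for `FLOAT_SPECIES.length()`, `block` replaces the `fromArray` + `fma` pair, and the class is not Lucene code. It shows the offsets computed up front and four accumulators that do not depend on each other.

```java
// Scalar stand-in for the unrolled vector loop; names are hypothetical.
final class UnrolledDotSketch {
  // Assumed lane count, standing in for FLOAT_SPECIES.length().
  static final int LANES = 8;

  static float dot(float[] a, float[] b) {
    int i = 0;
    float acc1 = 0, acc2 = 0, acc3 = 0, acc4 = 0;
    // Largest multiple of LANES that fits, similar in spirit to loopBound().
    final int limit = a.length - (a.length % LANES);
    final int unrolledLimit = limit - 3 * LANES;
    for (; i < unrolledLimit; i += 4 * LANES) {
      // All offsets are computed first, so the four blocks below are
      // visibly independent of each other.
      final int i2 = i + LANES;
      final int i3 = i2 + LANES;
      final int i4 = i3 + LANES;
      acc1 += block(a, b, i);
      acc2 += block(a, b, i2);
      acc3 += block(a, b, i3);
      acc4 += block(a, b, i4);
    }
    // "Vector" tail: remaining whole blocks.
    for (; i < limit; i += LANES) {
      acc1 += block(a, b, i);
    }
    float res = acc1 + acc2 + acc3 + acc4;
    // Scalar tail: leftover elements that do not fill a block.
    for (; i < a.length; i++) {
      res += a[i] * b[i];
    }
    return res;
  }

  // Multiplies and sums one block of LANES elements starting at off.
  private static float block(float[] a, float[] b, int off) {
    float s = 0;
    for (int k = 0; k < LANES; k++) {
      s += a[off + k] * b[off + k];
    }
    return s;
  }
}
```

Whether this ordering is easier to read than the interleaved version in the diff is a matter of taste; the sketch says nothing about bytecode size.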

rmuir · 2024-12-02T21:47:54Z

We can iterate on the last patch and save a few more bytes (down to 302 bytes) if we just pull the species length out into a static final constant instead, too:

--- a/lucene/core/src/java21/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java
+++ b/lucene/core/src/java21/org/apache/lucene/internal/vectorization/PanamaVectorUtilSupport.java
@@ -75,6 +75,9 @@ final class PanamaVectorUtilSupport implements VectorUtilSupport {
     }
   }
 
+  // cached vector sizes for smaller method bodies
+  private static final int FLOAT_SPECIES_LENGTH = FLOAT_SPECIES.length();
+
   // the way FMA should work! if available use it, otherwise fall back to mul/add
   private static FloatVector fma(FloatVector a, FloatVector b, FloatVector c) {
     if (Constants.HAS_FAST_VECTOR_FMA) {
@@ -99,7 +102,7 @@ final class PanamaVectorUtilSupport implements VectorUtilSupport {
     float res = 0;
 
     // if the array size is large (> 2x platform vector size), its worth the overhead to vectorize
-    if (a.length > 2 * FLOAT_SPECIES.length()) {
+    if (a.length > 2 * FLOAT_SPECIES_LENGTH) {
       i += FLOAT_SPECIES.loopBound(a.length);
       res += dotProductBody(a, b, i);
     }
@@ -120,31 +123,33 @@ final class PanamaVectorUtilSupport implements VectorUtilSupport {
     FloatVector acc2 = FloatVector.zero(FLOAT_SPECIES);
     FloatVector acc3 = FloatVector.zero(FLOAT_SPECIES);
     FloatVector acc4 = FloatVector.zero(FLOAT_SPECIES);
-    final int floatSpeciesLength = FLOAT_SPECIES.length();
-    final int unrolledLimit = limit - 3 * floatSpeciesLength;
-    for (; i < unrolledLimit; i += 4 * floatSpeciesLength) {
+    final int unrolledLimit = limit - 3 * FLOAT_SPECIES_LENGTH;
+    for (; i < unrolledLimit; i += 4 * FLOAT_SPECIES_LENGTH) {
       // one
       FloatVector va = FloatVector.fromArray(FLOAT_SPECIES, a, i);
       FloatVector vb = FloatVector.fromArray(FLOAT_SPECIES, b, i);
       acc1 = fma(va, vb, acc1);
 
       // two
-      FloatVector vc = FloatVector.fromArray(FLOAT_SPECIES, a, i + floatSpeciesLength);
-      FloatVector vd = FloatVector.fromArray(FLOAT_SPECIES, b, i + floatSpeciesLength);
+      final int i2 = i + FLOAT_SPECIES_LENGTH;
+      FloatVector vc = FloatVector.fromArray(FLOAT_SPECIES, a, i2);
+      FloatVector vd = FloatVector.fromArray(FLOAT_SPECIES, b, i2);
       acc2 = fma(vc, vd, acc2);
 
       // three
-      FloatVector ve = FloatVector.fromArray(FLOAT_SPECIES, a, i + 2 * floatSpeciesLength);
-      FloatVector vf = FloatVector.fromArray(FLOAT_SPECIES, b, i + 2 * floatSpeciesLength);
+      final int i3 = i2 + FLOAT_SPECIES_LENGTH;
+      FloatVector ve = FloatVector.fromArray(FLOAT_SPECIES, a, i3);
+      FloatVector vf = FloatVector.fromArray(FLOAT_SPECIES, b, i3);
       acc3 = fma(ve, vf, acc3);
 
       // four
-      FloatVector vg = FloatVector.fromArray(FLOAT_SPECIES, a, i + 3 * floatSpeciesLength);
-      FloatVector vh = FloatVector.fromArray(FLOAT_SPECIES, b, i + 3 * floatSpeciesLength);
+      final int i4 = i3 + FLOAT_SPECIES_LENGTH;
+      FloatVector vg = FloatVector.fromArray(FLOAT_SPECIES, a, i4);
+      FloatVector vh = FloatVector.fromArray(FLOAT_SPECIES, b, i4);
       acc4 = fma(vg, vh, acc4);
     }
     // vector tail: less scalar computations for unaligned sizes, esp with big vector sizes
-    for (; i < limit; i += floatSpeciesLength) {
+    for (; i < limit; i += FLOAT_SPECIES_LENGTH) {
       FloatVector va = FloatVector.fromArray(FLOAT_SPECIES, a, i);
       FloatVector vb = FloatVector.fromArray(FLOAT_SPECIES, b, i);
       acc1 = fma(va, vb, acc1);

I feel like it makes the code a bit easier on the eyes, and benchie is happy:

Benchmark                                         (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatDotProductVector (main)    1024  thrpt   75  12.347 ± 0.148  ops/us
VectorUtilBenchmark.floatDotProductVector (patch)   1024  thrpt   75  12.754 ± 0.106  ops/us

ChrisHegarty · 2024-12-02T23:20:10Z

@rmuir nice!!! wanna push that to the branch? Then I’ll do some more benchmark runs tomorrow too.

rmuir · 2024-12-03T00:50:10Z

I applied and tested the same approach with the other 2 functions too. Cosine was already underweight: it is only unrolled twice due to the complexity of the mathematical formula, but this keeps the float implementations consistent. We could tidy up the binary ones in a similar fashion as a followup for more consistency, but since the JVM can already unroll the integer math, they aren't unrolled and I expect they are already under the limit. Microbenchmarks seem happy, but I assume the real gains will show up in macrobenchmarks, where the inlining can help.

Before:
Benchmark                                  (size)   Mode  Cnt   Score   Error   Units "body" size
VectorUtilBenchmark.floatCosineVector        1024  thrpt   75   8.216 ± 0.026  ops/us 345 bytes
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  12.466 ± 0.100  ops/us 355 bytes
VectorUtilBenchmark.floatSquareVector        1024  thrpt   75  11.986 ± 0.074  ops/us 400 bytes

After:
Benchmark                                  (size)   Mode  Cnt   Score   Error   Units "body" size
VectorUtilBenchmark.floatCosineVector        1024  thrpt   75   8.377 ± 0.040  ops/us 320 bytes
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  12.917 ± 0.113  ops/us 302 bytes
VectorUtilBenchmark.floatSquareVector        1024  thrpt   75  12.365 ± 0.085  ops/us 302 bytes

edit: re-ran square after getting it under the limit, too.

ChrisHegarty · 2024-12-03T10:12:37Z

> We could tidy up the binary ones in a similar fashion as a followup for more consistency, but since the JVM can already unroll the integer math, they aren't unrolled and I expect they are already under the limit.

++

> Microbenchmarks seem happy, but I assume the real gains will show up in macrobenchmarks, where the inlining can help.

Right. The microbenchmarks show some modest improvement, but it seemed reasonably straightforward to take the failure to inline out of the equation when trawling through local luceneutil runs. I don't have specific numbers yet, but let's merge this change as an incremental improvement and keep an eye on Mike's nightly benchmark runs. :-)

ChrisHegarty · 2024-12-03T10:20:17Z

FTR

Apple M2 Pro

baseline 
VectorUtilBenchmark.floatCosineVector        1024  thrpt   75   7.966 ± 0.093  ops/us
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  16.439 ± 0.432  ops/us
VectorUtilBenchmark.floatSquareVector        1024  thrpt   75  14.562 ± 0.152  ops/us

candidate 
VectorUtilBenchmark.floatCosineVector        1024  thrpt   75   8.190 ± 0.089  ops/us
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  17.063 ± 0.440  ops/us
VectorUtilBenchmark.floatSquareVector        1024  thrpt   75  14.614 ± 0.241  ops/us

Intel SkyLake

baseline
VectorUtilBenchmark.floatCosineVector        1024  thrpt   75  15.118 ± 0.128  ops/us
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  26.564 ± 0.714  ops/us
VectorUtilBenchmark.floatSquareVector        1024  thrpt   75  25.131 ± 0.406  ops/us  

candidate
VectorUtilBenchmark.floatCosineVector        1024  thrpt   75  15.111 ± 0.136  ops/us
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  29.269 ± 0.765  ops/us
VectorUtilBenchmark.floatSquareVector        1024  thrpt   75  24.599 ± 0.135  ops/us
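
As a reading aid for these tables, here is a small sketch that turns a baseline/candidate pair into a relative change and checks whether the two reported error ranges overlap. It treats Score ± Error as a simple interval, which is an assumption about how to read the JMH output rather than a statistical test, and the class name is hypothetical. The numbers in `main` are the Skylake rows above.

```java
// Hypothetical helper for reading baseline/candidate throughput rows.
final class ThroughputCompare {

  // Relative change of candidate over baseline, in percent.
  static double percentChange(double baseline, double candidate) {
    return (candidate - baseline) / baseline * 100.0;
  }

  // True if [s1 - e1, s1 + e1] and [s2 - e2, s2 + e2] share any point.
  static boolean intervalsOverlap(double s1, double e1, double s2, double e2) {
    return (s1 - e1) <= (s2 + e2) && (s2 - e2) <= (s1 + e1);
  }

  public static void main(String[] args) {
    // Intel Skylake, floatDotProductVector: 26.564 ± 0.714 vs 29.269 ± 0.765
    System.out.printf("dotProduct change: %.2f%%%n", percentChange(26.564, 29.269));
    System.out.println("dotProduct ranges overlap: "
        + intervalsOverlap(26.564, 0.714, 29.269, 0.765));
    // Intel Skylake, floatSquareVector: 25.131 ± 0.406 vs 24.599 ± 0.135
    System.out.printf("square change: %.2f%%%n", percentChange(25.131, 24.599));
    System.out.println("square ranges overlap: "
        + intervalsOverlap(25.131, 0.406, 24.599, 0.135));
  }
}
```

When the ranges overlap, the difference between the two rows is hard to distinguish from run-to-run noise.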

This commit reduces the Panama vector distance float implementations to less than the maximum bytecode size of a hot method to be inlined (325).

E.g. Previously: org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport::dotProductBody (355 bytes) failed to inline: callee is too large.

After: org.apache.lucene.internal.vectorization.PanamaVectorUtilSupport::dotProductBody (3xx bytes) inline (hot)

This helps things a little.

Co-authored-by: Robert Muir <rmuir@apache.org>

jpountz · 2024-12-04T17:41:57Z

FYI nightly benchmarks had a big regression last night, and this is the only change I can find that could have caused this: https://benchmarks.mikemccandless.com/VectorSearch.html.

rmuir · 2024-12-04T19:39:29Z

@jpountz let's just revert it and figure it out separately?

Reduce dotProductBody to less than the maximum bytecode size of a hot method to be inlined (325)

1286119

john-wagster approved these changes Dec 2, 2024

rmuir approved these changes Dec 2, 2024

rmuir added 2 commits December 2, 2024 18:48

simplify dotProductBody a bit more: use static final

775b005

use same approach for square and cosine

9c0e2b6

reduce square distance to 302 bytes

ad786f4

Merge branch 'main' into dotProduct_codeSize

8326fe4

ChrisHegarty changed the title ~~Reduce dotProductBody to less than the maximum bytecode size of a hot method to be inlined (325)~~ Ensure Panama float distance impls inlinable Dec 3, 2024

add changes entry

fd246b2

ChrisHegarty changed the title ~~Ensure Panama float distance impls inlinable~~ Ensure Panama float vector distance impls inlinable Dec 3, 2024

minor

a9e46d3

ChrisHegarty merged commit 4f08f3d into apache:main Dec 3, 2024
3 checks passed

ChrisHegarty deleted the dotProduct_codeSize branch December 3, 2024 10:49

rmuir added a commit that referenced this pull request Dec 4, 2024

Revert "Ensure Panama float vector distance impls inlinable (#14031)"

d74e970

This reverts commit 4f08f3d.

rmuir mentioned this pull request Dec 4, 2024

Revert "Ensure Panama float vector distance impls inlinable " #14041

Merged

rmuir added a commit that referenced this pull request Dec 4, 2024

Revert "Ensure Panama float vector distance impls inlinable (#14031)" (#14041)

c1362cc

This reverts commit 4f08f3d.

asfgit pushed a commit that referenced this pull request Dec 4, 2024

Revert "Ensure Panama float vector distance impls inlinable (#14031)" (#14041)

8951778

This reverts commit 4f08f3d.

rmuir mentioned this pull request Dec 4, 2024

debug what happened with 14031 #14042

Open
