Reduce unrolling in Panama dotProduct float variant #14071

Draft
wants to merge 1 commit into
base: main

ChrisHegarty
Contributor

Reduce unrolling in Panama dotProduct float variants.

@msokolov
Contributor

Q: were you able to confirm what happens on ARM?

@ChrisHegarty
Contributor Author

Yeah, I do have some ARM results. I'll post them shortly.

@rmuir
Member

rmuir commented Dec 16, 2024

You have Rocket Lake, right? With only 2 FMA units? So the 2x unrolling may work fine for you because of that.

I haven't looked at the assembly, but if the JVM is unrolling 4x, we shouldn't need to unroll at all to keep your CPU busy. Last time I checked, it didn't do this.

I can run the script in the repo against various aws instances so that we are sure.
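
For readers following along, here is a minimal scalar sketch of what "unrolling" means in this discussion: each unrolled variant keeps several independent accumulators so that consecutive FMAs do not wait on each other. This is not the Lucene Panama code under review. The class and method names (`UnrolledDotSketch`, `dot2`, `dot4`) are hypothetical, and scalar `Math.fma` stands in for the vector FMA.

```java
import java.util.Random;

/** Hypothetical scalar sketch; not the actual Panama vector implementation. */
final class UnrolledDotSketch {

  /** 2x unrolled: two independent accumulator chains. */
  static float dot2(float[] a, float[] b) {
    float acc0 = 0f, acc1 = 0f;
    int i = 0;
    int bound = a.length & ~1; // largest multiple of 2 <= length
    for (; i < bound; i += 2) {
      acc0 = Math.fma(a[i], b[i], acc0);
      acc1 = Math.fma(a[i + 1], b[i + 1], acc1);
    }
    float res = acc0 + acc1;
    for (; i < a.length; i++) { // scalar tail
      res = Math.fma(a[i], b[i], res);
    }
    return res;
  }

  /** 4x unrolled: four independent accumulator chains. */
  static float dot4(float[] a, float[] b) {
    float acc0 = 0f, acc1 = 0f, acc2 = 0f, acc3 = 0f;
    int i = 0;
    int bound = a.length & ~3; // largest multiple of 4 <= length
    for (; i < bound; i += 4) {
      acc0 = Math.fma(a[i], b[i], acc0);
      acc1 = Math.fma(a[i + 1], b[i + 1], acc1);
      acc2 = Math.fma(a[i + 2], b[i + 2], acc2);
      acc3 = Math.fma(a[i + 3], b[i + 3], acc3);
    }
    float res = (acc0 + acc1) + (acc2 + acc3);
    for (; i < a.length; i++) { // scalar tail
      res = Math.fma(a[i], b[i], res);
    }
    return res;
  }

  public static void main(String[] args) {
    // Size 1024 matches the benchmark parameter; the data itself is arbitrary.
    Random r = new Random(42);
    float[] a = new float[1024];
    float[] b = new float[1024];
    for (int i = 0; i < a.length; i++) {
      a[i] = r.nextFloat();
      b[i] = r.nextFloat();
    }
    System.out.println("dot2 = " + dot2(a, b));
    System.out.println("dot4 = " + dot4(a, b));
  }
}
```

The two variants can differ slightly in the last bits because they sum in a different order. Which factor is faster depends on FMA latency and the number of FMA units on the CPU, which is why it needs measuring per machine.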

@rmuir
Member

rmuir commented Dec 17, 2024

Did a quick run: looks OK on most machines, but it hurts haswell, zen2, and zen3. I'll do a pass on the instance types, since they are a bit outdated, and try to bring them up to date (e.g. graviton4 is not represented).

cascadelake: ['0', 'GenuineIntel', 'Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz', '1', 'GenuineIntel', 'Intel(R) Xeon(R) Platinum 8275CL CPU @ 3.00GHz']

main

Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  14.038 ± 0.140  ops/us

patch

Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  14.229 ± 0.092  ops/us

graviton2: ['0', '1']

main

Benchmark                                  (size)   Mode  Cnt  Score   Error   Units
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  7.671 ± 0.040  ops/us

patch

Benchmark                                  (size)   Mode  Cnt  Score   Error   Units
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  8.116 ± 0.172  ops/us

graviton3: ['0', '1']

main

Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  10.894 ± 0.233  ops/us

patch

Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  11.194 ± 0.241  ops/us

haswell: ['0', 'GenuineIntel', 'Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz', '1', 'GenuineIntel', 'Intel(R) Xeon(R) CPU E5-2666 v3 @ 2.90GHz']

main

Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  13.091 ± 0.079  ops/us

patch

Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  11.177 ± 0.046  ops/us

icelake: ['0', 'GenuineIntel', 'Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz', '1', 'GenuineIntel', 'Intel(R) Xeon(R) Platinum 8375C CPU @ 2.90GHz']

main

Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  14.924 ± 0.777  ops/us

patch

Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  15.803 ± 0.974  ops/us

sapphirerapids: ['0', 'GenuineIntel', 'Intel(R) Xeon(R) Platinum 8488C', '1', 'GenuineIntel', 'Intel(R) Xeon(R) Platinum 8488C']

main

Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  20.011 ± 0.716  ops/us

patch

Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  21.082 ± 1.011  ops/us

zen2: ['0', 'AuthenticAMD', 'AMD EPYC 7R32', '1', 'AuthenticAMD', 'AMD EPYC 7R32']

main

Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  16.704 ± 0.239  ops/us

patch

Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  10.656 ± 0.016  ops/us

zen3: ['0', 'AuthenticAMD', 'AMD EPYC 7R13 Processor', '1', 'AuthenticAMD', 'AMD EPYC 7R13 Processor']

main

Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  18.279 ± 0.395  ops/us

patch

Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  17.248 ± 0.092  ops/us

zen4: ['0', 'AuthenticAMD', 'AMD EPYC 9R14', '1', 'AuthenticAMD', 'AMD EPYC 9R14']

main

Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  18.610 ± 0.328  ops/us

patch

Benchmark                                  (size)   Mode  Cnt   Score   Error   Units
VectorUtilBenchmark.floatDotProductVector    1024  thrpt   75  19.149 ± 0.456  ops/us
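
To compare the runs above at a glance, a small sketch like the following computes the relative change from main to patch for each instance type. The scores are copied from the tables above; the class name `RelativeChangeSketch` and its methods are hypothetical, and the calculation ignores the reported error margins, so small differences should be read with care.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper: relative change of the JMH scores (ops/us) listed above. */
final class RelativeChangeSketch {

  /** Percent change from main to patch; negative means the patch is slower. */
  static double percentChange(double main, double patch) {
    return (patch - main) / main * 100.0;
  }

  /** Instance type -> {main score, patch score}, copied from the benchmark output. */
  static Map<String, double[]> scores() {
    Map<String, double[]> m = new LinkedHashMap<>();
    m.put("cascadelake", new double[] {14.038, 14.229});
    m.put("graviton2", new double[] {7.671, 8.116});
    m.put("graviton3", new double[] {10.894, 11.194});
    m.put("haswell", new double[] {13.091, 11.177});
    m.put("icelake", new double[] {14.924, 15.803});
    m.put("sapphirerapids", new double[] {20.011, 21.082});
    m.put("zen2", new double[] {16.704, 10.656});
    m.put("zen3", new double[] {18.279, 17.248});
    m.put("zen4", new double[] {18.610, 19.149});
    return m;
  }

  public static void main(String[] args) {
    for (Map.Entry<String, double[]> e : scores().entrySet()) {
      double[] s = e.getValue();
      System.out.printf("%-15s main=%7.3f patch=%7.3f change=%+6.1f%%%n",
          e.getKey(), s[0], s[1], percentChange(s[0], s[1]));
    }
  }
}
```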


3 participants