Complex Bilinear Forms for Computational Physics #240
Merged
ashvardanian force-pushed the main-dev branch from 191de7c to 8941462 (November 24, 2024 13:22)
ashvardanian force-pushed the main-dev branch from 8b22064 to 08c7ac0 (November 26, 2024 12:11)
ashvardanian force-pushed the main-dev branch 2 times, most recently from fcd5991 to 48ac9e4 (November 26, 2024 13:44)
ashvardanian force-pushed the main-dev branch from 48ac9e4 to 5d9a219 (November 26, 2024 13:44)
Bilinear Forms are essential in Scientific Computing. Some of the most computationally intensive cases arise in Quantum systems and their simulations, as discussed on r/Quantum. This PR adds support for complex inputs to make it more broadly applicable.
In Python, you can compute a bilinear form by calling 2 NumPy functions in sequence, ideally reusing a buffer for the intermediate results:
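The original snippet did not survive on this page. A minimal sketch of the two-call NumPy approach might look like the following, assuming real-valued np.float32 inputs; the variable names and the seeded random data are illustrative only and not taken from the PR:

```python
import numpy as np

ndim = 128
rng = np.random.default_rng(0)  # seeded only to make the sketch reproducible

# Hypothetical inputs: two vectors and a square matrix of matching size.
first_vector = rng.standard_normal(ndim).astype(np.float32)
second_vector = rng.standard_normal(ndim).astype(np.float32)
matrix = rng.standard_normal((ndim, ndim)).astype(np.float32)

# Preallocated buffer, so repeated evaluations do not allocate a temporary.
buffer = np.empty(ndim, dtype=np.float32)

np.matmul(first_vector, matrix, out=buffer)  # buffer = a @ X
result = np.dot(buffer, second_vector)       # result = (a @ X) @ b
```

The buffer only pays off when the same shapes are evaluated many times; for a one-off call, `first_vector @ matrix @ second_vector` computes the same value.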
With SimSIMD, the two NumPy calls are fused into a single call.
For 128-dimensional np.float32, the latency of 2.11 μs with NumPy went down to 1.31 μs. For smaller 16-dimensional np.float32, the latency of 1.31 μs with NumPy went down to 202 ns. As always, the gap is wider for low-precision np.float16 representations: 2.68 μs with NumPy vs 313 ns with SimSIMD.
Small Matrices and AVX-512
In the past, developers dealing with small matrices typically shipped a separate precompiled kernel for every reasonable matrix size. That inflates the binary size and makes the CPU L1i instruction cache ineffective. With AVX-512, however, we can reuse the same single-instruction vectorized loops for different matrix sizes, with just a single additional BZHI instruction precomputing the load masks.
Avoiding Data Dependency
A common approach in dot products is to use a single register to accumulate the partial products, typically with the VFMADD132PS instruction. Assuming it can run on 2 ports simultaneously, introducing a data dependency between consecutive statements is inefficient even on modern hardware. In future generations, we may be able to compute this on more ports, so to "futureproof" the solution, I use 4 intermediaries.
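The idea can be illustrated with a scalar sketch in plain Python. This is not the SIMD kernel from the PR, and the function names are hypothetical; it only shows how splitting one accumulator into 4 independent ones removes the dependency between consecutive multiply-add steps:

```python
def dot_single_accumulator(a, b):
    # Every update depends on the previous value of `total`.
    total = 0.0
    for x, y in zip(a, b):
        total += x * y
    return total


def dot_four_accumulators(a, b):
    # Four independent partial sums: updates within one iteration
    # do not depend on each other, only on the previous iteration.
    acc0 = acc1 = acc2 = acc3 = 0.0
    n = len(a)
    main = n - n % 4  # largest multiple of 4 not exceeding n
    for i in range(0, main, 4):
        acc0 += a[i] * b[i]
        acc1 += a[i + 1] * b[i + 1]
        acc2 += a[i + 2] * b[i + 2]
        acc3 += a[i + 3] * b[i + 3]
    # Leftover elements when n is not a multiple of 4.
    for i in range(main, n):
        acc0 += a[i] * b[i]
    # The partial sums are combined once, at the very end.
    return (acc0 + acc1) + (acc2 + acc3)
```

In an interpreter this brings no speedup; the benefit only appears in compiled code, where independent accumulators allow several fused multiply-adds to be in flight at once. Floating-point addition is not associative, so the two variants may differ in the last bits on general inputs.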
Avoiding Horizontal Reductions
When computing $a \cdot X \cdot b$, we may prefer to evaluate $X \cdot b$ first, which the associativity of matrix multiplication allows. On tiny inputs, the operation may be bottlenecked by computing a horizontal reduction for every one of the rows in $X$. Instead, we use more serial loads and broadcasts but only perform one horizontal accumulation at the end, assuming all of the needed intermediaries fit into a single register (or a few if we minimize the data dependency).
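A NumPy sketch of the two strategies, under the assumption that a 1-D array stands in for a SIMD register and `sum()` stands in for a horizontal reduction. The function names are hypothetical and this is not the PR's kernel:

```python
import numpy as np


def bilinear_rowwise(a, X, b):
    # One horizontal reduction (the inner sum) for every row of X.
    total = 0.0
    for i in range(X.shape[0]):
        total += a[i] * np.sum(X[i] * b)
    return total


def bilinear_single_reduction(a, X, b):
    # Broadcast a[i] across the lanes and accumulate element-wise;
    # the horizontal reduction happens only once, after the loop.
    lanes = np.zeros(X.shape[1], dtype=X.dtype)
    for i in range(X.shape[0]):
        lanes += a[i] * (X[i] * b)
    return np.sum(lanes)
```

Both functions compute the same bilinear form; the second trades a reduction per row for a scalar broadcast per row, which matches the trade-off described above.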
Intel Sapphire Rapids Benchmarks
Running on recent Intel Sapphire Rapids CPUs, one can expect the following performance metrics for 128-dimensional Bilinear Forms for SimSIMD and OpenBLAS:
Highlights:
bf16 and f16 kernels provide linear speedups proportional to the number of bits in the data type.
On low-dimensional inputs, the performance gap is larger:
Highlights:
For f32, the performance grew from 31.07 to 137.34 million operations per second.
For f64, the performance grew from 23.44 to 139.63 million operations per second.