Shrink x86_64 blas library size #144
I wonder what else we could do to tweak the dynamic kernels we ship. For instance, can we boost the minimum x86_64 on Linux from PRESCOTT? |
It would be worth summarizing what we are actually building in this repo, I think. It looks like we are using Other questions I'd have:
|
TARGET=PRESCOTT in combination with DYNAMIC_ARCH means "use compiler options for Prescott when compiling the common code (thread setup, interfaces, LAPACK)". DYNAMIC_ARCH on x86_64 builds kernels for a list of about 15 Intel and AMD CPUs unless you specify your own subset via DYNAMIC_LIST. I can't give an exact answer for the per-model overhead, but it is something like 50 BLAS kernel objects plus parameters and function pointer table setup. Any non-included target gets supported by the next best available - again no exact figures, but I'd guess at most a 10 percent performance loss unless the fallback incurs restrictions like going from AVX to SSE. |
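For concreteness, a minimal build sketch using the OpenBLAS Makefile variables mentioned above; the particular DYNAMIC_LIST subset here is only illustrative, not the selection discussed in this issue:

```sh
# Hypothetical example: keep runtime CPU detection but restrict the set of
# x86_64 kernels that get built. TARGET sets the compiler baseline for the
# common (non-kernel) code; DYNAMIC_LIST names the kernel subset.
make TARGET=PRESCOTT DYNAMIC_ARCH=1 \
     DYNAMIC_LIST="NEHALEM SANDYBRIDGE HASWELL SKYLAKEX"
```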
We discussed this a bit in the NumPy optimization team meeting yesterday. It seemed reasonable to everyone to build for fewer target architectures. When selecting the ones to ship, the expectation was that Haswell (first AVX2) and SkylakeX (first AVX512) would be important. With target
With
With a custom selection (Prescott baseline, plus 3 newer architectures):
Note: see So it's about 1.5 MB of shared library size extra per architecture that is built for. The compression factor for the shared library is about 3.5x, meaning that the current contribution of |
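As a rough way to reproduce this kind of size accounting (a sketch; the library filename is a placeholder, and gzip -9 is only an approximation of the DEFLATE compression used inside wheels):

```sh
# Compare the uncompressed and compressed size of the shared library for two
# builds with different DYNAMIC_LIST selections; the compressed delta is
# roughly what a dropped kernel costs in download size.
ls -lh libopenblas*.so
gzip -k -9 libopenblas*.so
ls -lh libopenblas*.so.gz
```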
Given the current traffic for numpy/scipy on PyPI, such a 4 MB reduction would save about 17 PB/year (petabytes - how often do you get to use those :)) of download volume. I think we should make a selection of architectures based on what we know, then do some performance testing, and ship the reduced-size wheels unless we find a serious performance concern. |
Makes sense to me. We should keep the PRESCOTT target for low-end processors, and add a few others based on some measure of middle-end and high-end processors. The aarch64 wheels, with only a few kernels shipped, are much smaller than the x86_64 ones. |
Something I only recently learned about is the psABI level (https://gitlab.com/x86-psABIs/x86-64-ABI). This is what Linux distros have recently been using to select and deploy different optimization levels. The levels are x86-64-v1 (the original baseline), x86-64-v2 (adds SSE3/SSSE3/SSE4.x/POPCNT), x86-64-v3 (adds AVX/AVX2/FMA/BMI) and x86-64-v4 (adds a core subset of AVX-512). For NumPy, SSE3 has been part of the baseline for quite a while now, so we're kinda halfway to x86-64-v2. |
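As an aside (not from the original thread): on Linux with glibc 2.33 or newer, the dynamic loader can report which of these psABI levels the running machine supports, which is a quick way to check what a given CPU would qualify for. A sketch, assuming the standard x86_64 loader path:

```sh
# Prints lines like "x86-64-v3 (supported, searched)" for each level that the
# CPU and kernel support; the exact output depends on the glibc version.
/lib64/ld-linux-x86-64.so.2 --help | grep "x86-64-v"
```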
For NumPy we pick different levels per target, making the performance/size tradeoff per set of functions based on benchmarking or knowledge of what instructions are used:
Baseline is SSE3 as the highest level, which matches Prescott. We should hopefully have some more benchmarking results to decide on this soon. |
Is there any resolution of a recommended set of |
I spent some time looking through the Wikipedia page about AVX-512. It seems there are many more instruction variations than actual chips that support them. AMD Zen 4 (Ryzen 7000) has AVX-512F, but it seems Intel has stopped shipping consumer chips with AVX-512 variations; they are available for server-level (Xeon) processors. There is also some talk about use of AVX-512 actually harming overall performance due to thermal and power considerations: although the vector operations themselves are faster, they come at a cost. |
From my (very) limited understanding of Apple products, the only relevant "flavor" of AVX512 support in terms of OpenBLAS' TARGET names would be SKYLAKEX for the 2017 iMac Pro, 2019 Mac Pro and 2020 MacBook Pro. |
@ev-br ran the numpy linalg benchmarks against all the possible values of From what I can tell, we can see 5 different groupings with similar performance on these benchmarks:
Does that make sense? Are there other benchmarks we should explore? Next steps: if we agree these are the 5 groups, I should try to make a scipy-openblas wheel using |
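For context on how per-target timings like these can be collected on a single machine: a DYNAMIC_ARCH build lets you override kernel auto-detection at runtime via the OPENBLAS_CORETYPE environment variable. A sketch only; the benchmark command and the target list are placeholders, not what was actually run:

```sh
# Force each kernel in turn and run the benchmark suite under it.
for core in PRESCOTT NEHALEM SANDYBRIDGE HASWELL SKYLAKEX; do
    OPENBLAS_CORETYPE=$core ./run_benchmarks.sh "$core"
done
```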
The main difference between SkylakeX and the newer AVX512 targets should be availability of BFLOAT16 instructions - if numpy/scipy does not benefit from them, you should stick to SKYLAKEX. Currently Cooperlake and SkylakeX would both fall back "downwards" to Haswell if no SkylakeX support is available (as Sapphire Rapids is the one with the most advanced instruction set, which may not be supported on the older models). |
Ahh, so the correct invocation would be
NumPy (and so SciPy) does not yet have a native bfloat16 dtype. |
For those who prefer figures to text: https://github.com/ev-br/numpy-benchmarks-openblas/blob/main/viz_perf_arch.svg, generated by https://github.com/ev-br/numpy-benchmarks-openblas/blob/main/viz.benchmark.outputs.ipynb (basically, the figure is just what Matti said) |
That list looks right to me - the only thing I'm not sure about by looking at it is whether ATOM falls back to NEHALEM or not. |
Yes it does (see driver/others/dynamic.c); further fallback is to PRESCOTT if NEHALEM is not available. |
@ev-br could you (on the m5 machine)
|
Here are the relative timings: https://github.com/ev-br/numpy-benchmarks-openblas/blob/reduced_kernel_list/results/benchmarks.output.txt
Results are generated with pretty much the procedure of #144 (comment), only accounting for the fact that I'm building with The relative timings indeed seem to form pretty much the same groups. @czgdp1807, would you be able to double-check
EDIT: edited to give the direct link to the txt file with relative timings. |
Is there a way to query OpenBLAS and get back the list of dynamic kernels built into the library? |
No, there is no function for that. The fallback mechanism is supposed to ensure that you always get some runnable kernel for your hardware, and you can set OPENBLAS_VERBOSE to get runtime feedback on which it is. As the list can get quite long (especially on x86_64 with the optional DYNAMIC_OLDER), I would not want to add it to the output of openblas_get_config. |
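For reference, a quick way to get that runtime feedback (a sketch; the exact wording of the message varies between OpenBLAS versions):

```sh
# With a DYNAMIC_ARCH build, OPENBLAS_VERBOSE=2 makes the library report which
# core/kernel it selected for the current CPU when it initializes.
OPENBLAS_VERBOSE=2 python -c "import numpy as np; np.dot(np.ones((2, 2)), np.ones((2, 2)))"
# Typical output includes a line such as: Core: Haswell
```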
Awesome. @mattip probably still just in time for numpy 2.1.0rc1? That would be very nice to get in. |
Any new version of the wheels needs aarch64 wheels as well, and the aarch64 build is currently failing. So something like #169 is required to unblock that. |
Can you recheck #169 with Mousius' fix for your gcc 10.2 build (and the target list typo corrected), please? |
That seems to work, thanks |
Regarding this, I noticed one problem with |
Hmm. On that benchmark in the many-kernel run the |
So I added this script to parse and compare the benchmarking results produced by my repository - https://github.com/czgdp1807/numpy-benchmarks-openblas. The results are in this gist. The question is how much difference between the absolute timings in these two files is acceptable. I use 10%, but it can be increased to 20% or 30%. If we are willing to accept a 38% difference in the absolute timings then we are good to go. This is what I get when I set the threshold to 0.38:
(scipy-dev) 15:54:04:~/Quansight/numpy-benchmarks-openblas % python CompareAndParseMarkdownResults.py --file1=../benchmarks.output.txt --file2=../benchmarks.output_reduced_kernel.txt --threshold=0.38
Machine info for file ../benchmarks.output.txt is {'arch': 'x86_64', 'cpu': 'Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz', 'machine': 'asv machine', 'num_cpu': '2', 'os': 'Linux 6.8.0-1009-aws', 'ram': '8008632', 'python': '3.12', 'Cython': '', 'build': '', 'packaging': ''}
Machine info for file ../benchmarks.output_reduced_kernel.txt is {'arch': 'x86_64', 'cpu': 'Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz', 'machine': 'ip-172-31-25-34', 'num_cpu': '2', 'os': 'Linux 6.8.0-1009-aws', 'ram': '8008640', 'python': '3.12', 'Cython': '', 'build': '', 'packaging': ''} |
I can't say I understand the results yet. Two questions:
|
So here is the explanation behind the results in this gist. Consider the following snippet from the above gist. First, what each tuple containing 2 floating point numbers means here: the number on the left in each tuple comes from the main branch's results/benchmarks.output.txt and the number on the right comes from the reduced_kernel branch's results/benchmarks.output.txt. For example Now if you calculate the percentage change from The results in my
bench_linalg.Eindot.time_dot_d_dot_b_c
|
So there is no difference between |
Yes indeed, but if both the 15-kernel and 5-kernel builds contain HASWELL, then the "mean" result for HASWELL should be identical. Variation in absolute timing is due to something external influencing the absolute timings. It's a cloud environment and the benchmarks for 15 were run far apart from those for 5, so perhaps there was something else going on on the machine. 10% variation isn't strange, but it doesn't mean anything. |
Ah! I see. Exactly, 100% correct. Well then I think absolute timings shouldn't be compared; what should be compared is `perf_ratio`. In fact, if I compare
Click to see `perf_ratio` comparison between `main` and `reduced_kernel`:
bench_linalg.Eindot.time_matmul_d_matmul_b_c
('bench_linalg.Einsum.time_einsum_noncon_outer',"<class'numpy.float32'>")
('bench_linalg.Einsum.time_einsum_noncon_outer',"<class'numpy.float64'>")
('bench_linalg.Linalg.time_det',"'complex128'")
('bench_linalg.Linalg.time_det',"'int16'")
('bench_linalg.Linalg.time_det',"'int32'")
('bench_linalg.Linalg.time_pinv',"'complex128'")
('bench_linalg.Linalg.time_pinv',"'complex64'")
('bench_linalg.Linalg.time_svd',"'complex128'")
('bench_linalg.Linalg.time_svd',"'complex64'")
('bench_linalg.LinalgNorm.time_norm',"'complex128'")
bench_linalg.LinalgSmallArrays.time_det_small_array
|
Only |
OK, I spent a while going over the two results. It seems to me everything is as expected. There were 5 clear groupings in the |
As @martin-frbg says in a comment elsewhere