
Shrink x86_64 blas library size #144

Closed
mattip opened this issue Mar 6, 2024 · 36 comments · Fixed by #179

Comments


mattip commented Mar 6, 2024

As @martin-frbg says in a comment elsewhere

Given that there never were Macs with AMD processors, or (AFAIK) with AVX-512, you could reduce the size of your library build by adding DYNAMIC_LIST="CORE2 NEHALEM SANDYBRIDGE HASWELL", removing the dedicated BLAS kernels for 10 irrelevant cpus (if you don't already do this). Probably no longer worth it for you, but I guess I should add this as the default for x86_64 builds with OSNAME=Darwin...
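
Spelled out, that suggestion would look roughly like the following for a macOS x86_64 build (a sketch only; DYNAMIC_ARCH, TARGET, and DYNAMIC_LIST are the relevant OpenBLAS make variables, but the TARGET=CORE2 baseline and the omission of this repo's other build flags are assumptions):

$ make DYNAMIC_ARCH=1 TARGET=CORE2 \
       DYNAMIC_LIST="CORE2 NEHALEM SANDYBRIDGE HASWELL"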


mattip commented Mar 6, 2024

I wonder what else we could do to tweak the dynamic kernels we ship. For instance, can we raise the minimum x86_64 target on Linux above PRESCOTT?


rgommers commented Mar 7, 2024

It would be worth summarizing what we are actually building in this repo I think. It looks like we are using DYNAMIC_ARCH=1 TARGET=PRESCOTT, but it's not exactly clear to me from the OpenBLAS README what that does. I am guessing "all architectures from PRESCOTT up" - but if so, that seems a little excessive?

Other questions I'd have:

  • What is the binary size impact of including/excluding a single architecture?
  • What happens on non-included architectures? Does it use the next-older target? If so, what performance is left on the table on average?

@martin-frbg

TARGET=PRESCOTT in combination with DYNAMIC_ARCH means "use compiler options for Prescott when compiling the common code (thread setup, interfaces, LAPACK)". DYNAMIC_ARCH on x86_64 builds a list of about 15 Intel and AMD cpus unless you specify your own subset via DYNAMIC_LIST. Can't give an exact answer for the per-model overhead, but it is something like 50 BLAS kernel objects plus parameters and function pointer table setup. Any non-included target gets supported by the next best available - again no exact figures, but I'd guess at most 10 percent performance loss unless the fallback incurs restrictions like going from AVX to SSE.


rgommers commented Apr 16, 2024

We discussed this a bit in the NumPy optimization team meeting yesterday. It seemed reasonable to everyone to build for fewer target architectures. When selecting the ones to ship, the expectation was that Haswell (first AVX2) and SkylakeX (first AVX512) would be important.

With target PRESCOTT (current default build flags on Linux):

$ CFLAGS="$CFLAGS -fvisibility=protected -Wno-uninitialized" make BUFFERSIZE=20 DYNAMIC_ARCH=1 USE_OPENMP=0 NUM_THREADS=64 \
    OBJCONV=$PWD/objconv/objconv SYMBOLPREFIX=scipy_ LIBNAMEPREFIX=scipy_ FIXED_LIBNAME=1 \
    TARGET=PRESCOTT
$ ls -lh libscipy_openblas.so
35M

With DYNAMIC_ARCH disabled:

$ CFLAGS="$CFLAGS -fvisibility=protected -Wno-uninitialized" make BUFFERSIZE=20 DYNAMIC_ARCH=0 USE_OPENMP=0 NUM_THREADS=64 \
    OBJCONV=$PWD/objconv/objconv SYMBOLPREFIX=scipy_ LIBNAMEPREFIX=scipy_ FIXED_LIBNAME=1
$ ls -lh libscipy_openblas.so
15M

With a custom selection (Prescott baseline, plus 3 newer architectures):

$ CFLAGS="$CFLAGS -fvisibility=protected -Wno-uninitialized" make BUFFERSIZE=20 DYNAMIC_ARCH=1 USE_OPENMP=0 NUM_THREADS=64 \
    OBJCONV=$PWD/objconv/objconv SYMBOLPREFIX=scipy_ LIBNAMEPREFIX=scipy_ FIXED_LIBNAME=1 \
    TARGET=PRESCOTT DYNAMIC_LIST="HASWELL SKYLAKEX SAPPHIRERAPIDS"
$ ls -lh libscipy_openblas.so
21M

Note: see Makefile.system for details on how DYNAMIC_CORE/DYNAMIC_LIST select the architectures to build.

So it's about 1.5 MB of extra shared library size per architecture that is built for. The compression factor for the shared library is about 3.5x, meaning the current contribution of libopenblas.so to x86-64 numpy/scipy wheels is ~9.5 MB; if we went from 15 to, say, 5 architectures, we'd reduce wheel sizes by about 4 MB.
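
For reference, the back-of-the-envelope arithmetic behind those numbers (the "about 14 extra architectures" figure is an approximation of the full DYNAMIC_ARCH list):

# per-architecture cost: (35 MB - 15 MB) / ~14 extra architectures ≈ 1.5 MB uncompressed
# current wheel contribution: 35 MB / 3.5 ≈ 9.5-10 MB compressed
# 5-architecture build: 21 MB / 3.5 ≈ 6 MB compressed, i.e. roughly 4 MB saved per wheel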

@rgommers

Given the current traffic for numpy/scipy on PyPI, such a 4 MB reduction would save about 17 PB/year (petabytes - how often do you get to use those:)) of download volume. I think we should make a selection of architectures based on what we know, then do some performance testing, and ship the reduced-size wheels unless we find a serious performance concern.
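
For scale, the arithmetic implied by that 17 PB figure (a sketch; the combined download count is inferred from the numbers above rather than stated):

# ~4 MB saved per wheel x roughly 4 x 10^9 numpy + scipy downloads/year ≈ 17 PB/year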


mattip commented Apr 16, 2024

Makes sense to me. We should keep the PRESCOTT target for low-end processors, and add a few others based on some measure of middle-end and high-end processors. The aarch64 wheels, with only a few kernels shipped, are much smaller than the x86_64 ones.

@rgommers

Something I only recently learned about is psABI level (https://gitlab.com/x86-psABIs/x86-64-ABI). This is what Linux distros have been using recently to select and deploy different optimization levels. The levels are:

[image: table of the x86-64 psABI microarchitecture levels (baseline, x86-64-v2, x86-64-v3, x86-64-v4) and the instruction set extensions each level requires]

For NumPy, SSE3 has been part of the baseline for quite a while now, so we're kinda halfway to x86-64-v2. v2 is still "very old machines", v3 probably roughly lines up with Haswell, and v4 with SkylakeX.
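
One quick way to check these levels from the command line (a sketch; assumes GCC 11 or newer, which accepts the -march=x86-64-v2/v3/v4 names, and a Linux /proc/cpuinfo):

$ gcc -march=x86-64-v3 -E -x c /dev/null > /dev/null && echo "compiler accepts x86-64-v3"
$ grep -o -E 'avx512f|avx2|sse4_2' /proc/cpuinfo | sort -u   # rough mapping to v4/v3/v2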

@rgommers

For NumPy we pick different levels per target, making the performance/size tradeoff per set of functions based on benchmarking or knowledge of what instructions are used:

Generating multi-targets for "argfunc.dispatch.h" 
  Enabled targets: AVX512_SKX, AVX2, SSE42, baseline
Generating multi-targets for "x86_simd_argsort.dispatch.h" 
  Enabled targets: AVX512_SKX, AVX2
Generating multi-targets for "x86_simd_qsort.dispatch.h" 
  Enabled targets: AVX512_SKX, AVX2
Generating multi-targets for "x86_simd_qsort_16bit.dispatch.h" 
  Enabled targets: AVX512_SPR, AVX512_ICL
Generating multi-targets for "highway_qsort.dispatch.h" 
  Enabled targets: 
Generating multi-targets for "highway_qsort_16bit.dispatch.h" 
  Enabled targets: 
Generating multi-targets for "loops_arithm_fp.dispatch.h" 
  Enabled targets: FMA3__AVX2, baseline
Generating multi-targets for "loops_arithmetic.dispatch.h" 
  Enabled targets: AVX512_SKX, AVX512F, AVX2, SSE41, baseline
...
Generating multi-targets for "_simd.dispatch.h" 
  Enabled targets: SSE42, AVX2, FMA3, FMA3__AVX2, AVX512F, AVX512_SKX, baseline

Baseline is SSE3 as the highest level, which matches Prescott.
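
To see which of these dispatch targets a given machine actually picks up, numpy.show_runtime() (available in NumPy >= 1.24) prints the detected SIMD baseline and extensions; a sketch:

$ python -c "import numpy; numpy.show_runtime()"
# lists the SIMD baseline (SSE, SSE2, SSE3 for the x86-64 wheels) and which of the
# dispatched extensions (SSE42, AVX2, AVX512F, AVX512_SKX, ...) were found on this CPU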

We should hopefully have some more benchmarking results to decide on this soon.


mattip commented Jul 22, 2024

Is there any resolution on a recommended set of DYNAMIC_LIST=??? that we can add for x86_64?


mattip commented Jul 23, 2024

I spent some time looking through the Wikipedia page about AVX-512. It seems there are many more instruction variations than actual chips that support them. AMD Zen 4 (Ryzen 7000) has AVX-512F, but it seems Intel has stopped shipping consumer chips with AVX-512 variations; they are available for server (Xeon) level processors.

There is also some talk about use of AVX-512 actually harming overall performance due to thermal and power considerations: although the vectorized operations are faster, they come at a cost.

@martin-frbg

From my (very) limited understanding of Apple products, the only relevant "flavor" of AVX512 support in terms of OpenBLAS' TARGET names would be SKYLAKEX for the 2017 iMac Pro, 2019 Mac Pro and 2020 MacBook Pro.


mattip commented Jul 25, 2024

@ev-br ran the numpy linalg benchmarks against all the possible values of DYNAMIC_CORE (for Linux x86_64) on an AWS m5 machine, using a script written by @czgdp1807. Here are the results as text.

From what I can tell, we can see 5 different groupings with similar performance on these benchmarks:

  • COOPERLAKE, SKYLAKEX, SAPPHIRERAPIDS (corresponding to AVX512)
  • HASWELL (AVX2)
  • SANDYBRIDGE (SSE4.2, AVX, VT-x, VT-d)
  • NEHALEM, ATOM (SSE4.2, VT-x, VT-d)
  • PENRYN, CORE2, DUNNINGTON, PRESCOTT, KATMAI, NORTHWOOD, COPPERMINE, BANIAS (SSE3)

Does that make sense? Are there other benchmarks we should explore?

Next steps: if we agree these are the 5 groups, I should try to make a scipy-openblas wheel using DYNAMIC_LIST="PRESCOTT NEHALEM SANDYBRIDGE HASWELL SAPPHIRERAPIDS". @martin-frbg, is that the correct way to specify these groupings? I mean, if I specify SAPPHIRERAPIDS, will a COOPERLAKE or SKYLAKEX cpu hit those kernels? Once the wheel is available, we can check the size reduction and re-run the benchmarks to see that the performance is similar.

@martin-frbg

The main difference between SkylakeX and the newer AVX512 targets should be availability of BFLOAT16 instructions - if numpy/scipy does not benefit from them, you should stick to SKYLAKEX. Currently Cooperlake and SkylakeX would both fall back "downwards" to Haswell if no SkylakeX support is available (as Sapphire Rapids is the one with the most advanced instruction set, which may not be supported on the older models).


mattip commented Jul 25, 2024

Ahh, so the correct invocation would be DYNAMIC_LIST="PRESCOTT NEHALEM SANDYBRIDGE HASWELL SKYLAKEX" (substitute SKYLAKEX for SAPPHIRERAPIDS).
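
Spelled out with the same flags as the builds earlier in this thread, that would be something like (a sketch; whether PRESCOTT needs to appear in DYNAMIC_LIST when it is already the TARGET is worth confirming against Makefile.system):

$ CFLAGS="$CFLAGS -fvisibility=protected -Wno-uninitialized" make BUFFERSIZE=20 DYNAMIC_ARCH=1 USE_OPENMP=0 NUM_THREADS=64 \
    OBJCONV=$PWD/objconv/objconv SYMBOLPREFIX=scipy_ LIBNAMEPREFIX=scipy_ FIXED_LIBNAME=1 \
    TARGET=PRESCOTT DYNAMIC_LIST="PRESCOTT NEHALEM SANDYBRIDGE HASWELL SKYLAKEX"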

NumPy (and so SciPy) does not yet have a native BFLOAT16 dtype. There are extension modules on PyPI that add one to NumPy, but that does not percolate into linalg calls to OpenBLAS.


ev-br commented Jul 25, 2024

For those who prefer figures to text: https://github.com/ev-br/numpy-benchmarks-openblas/blob/main/viz_perf_arch.svg, generated by https://github.com/ev-br/numpy-benchmarks-openblas/blob/main/viz.benchmark.outputs.ipynb

(basically, the figure is just what Matti said)

@rgommers

That list looks right to me - the only thing I'm not sure about by looking at it is whether ATOM falls back to NEHALEM or not.

@martin-frbg

Yes it does (see driver/others/dynamic.c); the further fallback is to PRESCOTT if NEHALEM is not available.


mattip commented Jul 26, 2024

@ev-br could you (on the m5 machine)

  • download the build artifact wheels-ubuntu-latest-x86_64-1-manylinux from the PR
  • unzip it into a temporary directory
  • unzip the wheel inside it as well
  • compare the size of the scipy_openblas64/lib/libscipy_openblas64_.so to the libscipy_openblas64*.so in numpy.libs; it should be 21 MB vs. 35 MB
  • copy the new shared object over the one in the virtual env (note the name is mangled to make sure the proper shared object is loaded into numpy)
  • rerun the benchmarks; there should still be 5 clear groupings with no real change in performance (see the shell sketch below)
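
A rough shell sketch of those steps (the paths, wheel filename, and Python version are placeholders; the mangled target filename has to be taken from whatever is already in numpy.libs):

$ unzip wheels-ubuntu-latest-x86_64-1-manylinux.zip -d /tmp/reduced
$ unzip /tmp/reduced/scipy_openblas64-*.whl -d /tmp/reduced/wheel
$ ls -lh /tmp/reduced/wheel/scipy_openblas64/lib/libscipy_openblas64_.so                # expect ~21 MB
$ ls -lh $VIRTUAL_ENV/lib/python3.12/site-packages/numpy.libs/libscipy_openblas64_*.so  # expect ~35 MB
$ cp /tmp/reduced/wheel/scipy_openblas64/lib/libscipy_openblas64_.so \
     $VIRTUAL_ENV/lib/python3.12/site-packages/numpy.libs/<existing mangled name>.so
$ # then rerun the benchmarks against this build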


ev-br commented Jul 28, 2024

Here are the relative timings: https://github.com/ev-br/numpy-benchmarks-openblas/blob/reduced_kernel_list/results/benchmarks.output.txt
and the raw benchmark data are in this commit: ev-br/numpy-benchmarks-openblas@d04e34d

Results are generated with pretty much the procedure of #144 (comment), only accounting for the fact that I'm building with scipy-openblas32, so I've downloaded and replaced the relevant .so in dist-packages/scipy-openblas32/libs. And the size of the .so is indeed about 21 MB.

The relative timings indeed seem to form pretty much the same groups. @czgdp1807, would you be able to double-check that the absolute timings did not change appreciably between #144 (comment) and ev-br/numpy-benchmarks-openblas@d04e34d?

EDIT: edited to give the direct link to the txt file with relative timings.


mattip commented Jul 28, 2024

Is there a way to query OpenBLAS and get back the list of dynamic kernels built into the library?

@martin-frbg

No, there is no function for that. The fallback mechanism is supposed to ensure that you always get some runnable kernel for your hardware, and you can set OPENBLAS_VERBOSE to get runtime feedback on which it is. As the list can get quite long (especially on x86_64 with the optional DYNAMIC_OLDER), I would not want to add it to the output of openblas_get_config.
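
For example, something along these lines shows which kernel set gets picked at runtime (OPENBLAS_VERBOSE is the variable mentioned above; the exact output text may differ between OpenBLAS versions):

$ OPENBLAS_VERBOSE=2 python -c "import numpy"
Core: Haswell
$ # a CPU whose model is not in DYNAMIC_LIST should report the next best included core instead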

@rgommers

Awesome. @mattip probably still just in time for numpy 2.1.0rc1? That would be very nice to get in.


mattip commented Jul 29, 2024

Any new version of the wheels needs aarch64 wheels too, and that build is currently failing. So something like #169 is required to unblock this.

@martin-frbg

can you recheck #169 with Mousius' fix for your gcc 10.2 build (and the target list typo corrected) please?


mattip commented Jul 29, 2024

can you recheck #169 with Mousius' fix for your gcc 10.2 build (and the target list typo corrected) please?

That seems to work, thanks


czgdp1807 commented Aug 7, 2024

Regarding this, I noticed one problem with bench_linalg.Eindot.time_einsum_ij_jk_a_b: the absolute timings for COOPERLAKE. In https://github.com/ev-br/numpy-benchmarks-openblas/blob/main/results/benchmarks.output.txt I see 1.23024 with a spread of 0.0542, however in https://github.com/ev-br/numpy-benchmarks-openblas/blob/reduced_kernel_list/results/benchmarks.output.txt it's 1.52858 with a spread of 0.01885. So overall, a difference of 24% with respect to main, which I think is significant. For more detailed results for each of the kernels, I will have to parse the results and compare the differences. I think the main branch has only txt results, correct? That is, there are no .json files? Analysing results for each of the kernels for both branches is difficult to do manually (because there are 83 * number_of_kernels comparisons to be made). For COOPERLAKE I found the above difference in absolute timings, so there might be more such differences.


mattip commented Aug 7, 2024

Hmm. On that benchmark, the fastest-to-slowest spread ratio in the many-kernel run is 1:1.12, and in the reduced-kernel run it is 1:1.03. However on other benchmarks, like time_dot_trans_a_atc, the spread ratio is similar between the two runs. I wonder why that particular benchmark changed significantly?


czgdp1807 commented Aug 8, 2024

So I added this script to parse and compare the benchmarking results produced by my repository - https://github.com/czgdp1807/numpy-benchmarks-openblas.

The results are in this gist.

The question is how much difference between the absolute timings in these two files is acceptable. I use 10%, but it can be increased to 20% or 30%. If we are willing to accept a 38% difference in the absolute timings then we are good to go. This is what I get when I set the threshold to 0.38, which means all the absolute timings in the reduced_kernel branch are within 38% of the main branch.

(scipy-dev) 15:54:04:~/Quansight/numpy-benchmarks-openblas % python CompareAndParseMarkdownResults.py --file1=../benchmarks.output.txt --file2=../benchmarks.output_reduced_kernel.txt --threshold=0.38
Machine info for file ../benchmarks.output.txt is {'arch': 'x86_64', 'cpu': 'Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz', 'machine': 'asv machine', 'num_cpu': '2', 'os': 'Linux 6.8.0-1009-aws', 'ram': '8008632', 'python': '3.12', 'Cython': '', 'build': '', 'packaging': ''}

Machine info for file ../benchmarks.output_reduced_kernel.txt is {'arch': 'x86_64', 'cpu': 'Intel(R) Xeon(R) Platinum 8259CL CPU @ 2.50GHz', 'machine': 'ip-172-31-25-34', 'num_cpu': '2', 'os': 'Linux 6.8.0-1009-aws', 'ram': '8008640', 'python': '3.12', 'Cython': '', 'build': '', 'packaging': ''}


rgommers commented Aug 8, 2024

I can't say I understand the results yet. Two questions:

  1. Why are there two numbers under perf_ratio? This is hard to interpret for me.
  2. Why isn't the ratio 1.0 for the architectures that were kept in this list: DYNAMIC_LIST="PRESCOTT NEHALEM SANDYBRIDGE HASWELL SKYLAKEX"? Aren't the identical kernels supposed to be executed for those?

@czgdp1807

So here is the explanation behind the results in this gist.

Consider the following snippet from the above gist. First, what each tuple of two floating point numbers means: the number on the left comes from the main branch's results/benchmarks.output.txt and the number on the right comes from the reduced_kernel branch's results/benchmarks.output.txt. For example, (0.00169936, 0.00198691) under mean for HASWELL means 0.00169936 is the mean from main and 0.00198691 is the mean from reduced_kernel. mean is simply the absolute timing; the same holds for spread and perf_ratios.

Now, the percentage change from main to reduced_kernel is abs(0.00198691 - 0.00169936)/0.00169936 = 0.169, or 16.9%. I hope this is making sense. The threshold is 10%, so any absolute timing change greater than 10% is included in my gist. The percentage is calculated using just the mean.

The results in my gist basically mean that for each benchmark (in the following snippet it's bench_linalg.Eindot.time_dot_d_dot_b_c) some kernels' absolute timings deviate by more than a certain threshold, with the left number coming from main and the right one from reduced_kernel.

bench_linalg.Eindot.time_dot_d_dot_b_c

arch mean spread perf_ratios
SAPPHIRERAPIDS (0.00169783, 0.00215896) (4.24e-05, 3.94e-05) (1.0, 1.08659)
HASWELL (0.00169936, 0.00198691) (3.28e-05, 3.865e-05) (1.0009, 1.0)


rgommers commented Aug 8, 2024

Okay, so then I get mean, but perf_ratios is something different, because it doesn't show that 16.9%. That's still a "difference in performance between the current and the best architecture" - where "best" is still arbitrary based on absolute timings.

For that gist, if the mean column is absolute timing and the average of the mean of the first run (first number in the tuples) and the average of the second run are different like this:

[image: excerpt from the gist showing the mean values of the first and second runs differing]

that doesn't mean that performance of the first run is better - it's just variation due to the machine being more loaded or something like that. If so, averaging a few runs should fix that.


We had something like this already a few months ago (plus the changes after that to order the architecture names on the x-axis always in the same order, and the buggy point for P2 dropped):

[image: benchmark plot from a few months earlier, with architecture names on the x-axis and per-benchmark timings]

That's what we can draw conclusions from. If a second set of points were shown for the same benchmark (with the openblas build with 5 kernels), then the data points for the kernels that are still included (e.g., Haswell) should overlap (modulo noise - and if not, something is wrong with the builds), and the ones that were dropped should be slightly slower. I don't think we can do that from the current data.


I think we can either decide to ignore the absolute timing data, because it's not meaningful for a single run, and just deploy the new builds (unless someone can explain why the code paths taken aren't identical). Or we can redo the benchmarks and generate the desired plots.

@czgdp1807

For that gist, if the mean column is absolute timing and the average of the mean of the first run (first number in the tuples) and the average of the second run are different like this:

So there is no difference between main and reduced_kernel? I thought that the timings coming from main are the ones where we are using all 15 kernels but in reduced_kernel we are just using 5 kernels.


rgommers commented Aug 8, 2024

I thought that the timings coming from main are the ones where we are using all 15 kernels but in reduced_kernel we are just using 5 kernels.

Yes indeed, but if both the 15-kernel and 5-kernel builds contain HASWELL, then the "mean" result for HASWELL should be identical. Variation in absolute timing is due to something external influencing the absolute timings. It's a cloud environment and the benchmarks for 15 were run far apart from those for 5, so perhaps there was something else going on on the machine. 10% variation isn't strange, but it doesn't mean anything.


czgdp1807 commented Aug 8, 2024

Ah! I see. Exactly, 100% correct. Well then I think absolute timings shouldn't be compared. What should be compared is perf_ratios (they reflect relative timings I guess). I can do that with a minor change.

In fact, if I compare perf_ratios (see czgdp1807/numpy-benchmarks-openblas@de9c56a), the deviations of more than 10% are as follows:

Click to see `perf_ratio` comparison between `main` and `reduced_kernel`

bench_linalg.Eindot.time_matmul_d_matmul_b_c

arch mean spread perf_ratios
NORTHWOOD (0.00190728, 0.00209914) (3.035e-05, 6.255e-05) (1.13067, 1.0091)
NEHALEM (0.00190955, 0.00208044) (3.26e-05, 3.86e-05) (1.13201, 1.0001)

('bench_linalg.Einsum.time_einsum_noncon_outer',"<class'numpy.float32'>")

arch mean spread perf_ratios
COOPERLAKE (0.00242309, 0.00293519) (8.345e-05, 0.00012445) (1.00105, 1.11775)
KATMAI (0.00246848, 0.00304582) (0.00012245, 0.0001641) (1.01981, 1.15988)

('bench_linalg.Einsum.time_einsum_noncon_outer',"<class'numpy.float64'>")

arch mean spread perf_ratios
PENRYN (0.00505452, 0.00592042) (0.00024495, 0.00026975) (1.11693, 1.0)
CORE2 (0.00513161, 0.00597742) (0.0002461, 0.00030465) (1.13397, 1.00963)

('bench_linalg.Linalg.time_det',"'complex128'")

arch mean spread perf_ratios
CORE2 (0.367204, 0.347862) (0.00124, 0.002455) (5.54315, 4.98735)

('bench_linalg.Linalg.time_det',"'int16'")

arch mean spread perf_ratios
CORE2 (0.0948102, 0.0905149) (0.000398, 0.000204) (3.07572, 2.76414)

('bench_linalg.Linalg.time_det',"'int32'")

arch mean spread perf_ratios
DUNNINGTON (0.0941703, 0.0891724) (0.000336, 0.0002815) (3.0469, 2.71364)

('bench_linalg.Linalg.time_pinv',"'complex128'")

arch mean spread perf_ratios
CORE2 (5.01568, 4.53378) (0.0033, 0.00125) (3.54598, 3.12886)
PENRYN (5.0174, 4.54075) (0.00405, 0.00195) (3.5472, 3.13368)
DUNNINGTON (5.02215, 4.53558) (0.0049, 0.00545) (3.55055, 3.1301)

('bench_linalg.Linalg.time_pinv',"'complex64'")

arch mean spread perf_ratios
DUNNINGTON (4.63324, 3.95039) (0.0022, 0.0023) (3.52593, 2.93189)
PENRYN (4.63416, 3.96162) (0.00195, 0.00285) (3.52663, 2.94022)
CORE2 (4.64078, 3.95499) (0.00165, 0.00415) (3.53167, 2.9353)

('bench_linalg.Linalg.time_svd',"'complex128'")

arch mean spread perf_ratios
DUNNINGTON (4.09252, 3.55301) (0.00205, 0.0023) (3.28673, 2.7962)
PENRYN (4.09357, 3.55384) (0.0013, 0.00215) (3.28758, 2.79685)
CORE2 (4.09359, 3.54911) (0.004, 0.0038) (3.28759, 2.79313)

('bench_linalg.Linalg.time_svd',"'complex64'")

arch mean spread perf_ratios
DUNNINGTON (4.09305, 3.55684) (0.0021, 0.00295) (3.27557, 2.77797)
CORE2 (4.09412, 3.5488) (0.0014, 0.0031) (3.27643, 2.77168)
PENRYN (4.09826, 3.55646) (0.0028, 0.0056) (3.27974, 2.77767)

('bench_linalg.LinalgNorm.time_norm',"'complex128'")

arch mean spread perf_ratios
DUNNINGTON (0.00316832, 0.00436316) (5.115e-05, 5.745e-05) (1.10445, 1.26898)
BANIAS (0.00318585, 0.00422341) (5.51e-05, 0.0002009) (1.11056, 1.22834)

bench_linalg.LinalgSmallArrays.time_det_small_array

arch mean spread perf_ratios
DUNNINGTON (4.78878e-06, 4.2362e-06) (1.465e-08, 3.825e-08) (1.17728, 1.0425)

@czgdp1807

Only NEHALEM's perf_ratio exceeds a 10% deviation; maybe that's just some noise, so it should be fine I think. The rest of the 4 kernels in the selected grouping are within 10%.


mattip commented Aug 8, 2024

OK, I spent a while going over the two results. It seems to me everything is as expected. There were 5 clear groupings in the main results, and those 5 clear groupings remain in the reduced results. The same core type stayed in the same group. This indicates to me that the way we chose the 5 core types is correct, and we should move forward with the change in openblas-libs #166.

mattip changed the title from "Shrink macos x86_64 blas library size" to "Shrink x86_64 blas library size" on Aug 11, 2024