SGEMM is tangled and cumbersome to edit #86

Open
jerinphilip opened this issue Mar 28, 2022 · 3 comments
There are several problems with the existing sgemm integration. The main one is that an if-else/ifdef-else ladder determines the BLAS provider, verbosely encoding both precedence and existence checks.

This begins in CMakeLists.txt and trickles down into the C++ source. sgemm routes through another sgemm following the BLAS API, which dispatches to MKL, CBLAS or ONNX SGEMM under ifdefs and multiple wraps. There is also a ProdBatched giving way to a ProdBatchedOld as yet another level of indirection.

Up to two ifdefs/ifs are acceptable. Beyond two, some form of switch/dispatch needs to appear to reduce the headache. The suggestion in #79 (review) to do f32 sgemm with ruy would also add a fourth provider.

There are ODR-compatible ways for multiple sgemm implementations to exist, and what seems better to me is a provider::sgemm(...) where each sgemm follows a standard API. It is possible to let multiple providers (MKL, OpenBLAS, Accelerate etc.) coexist in an ODR-compatible way, give them ranks, and add a mechanism to prioritize one at runtime (or even decide at compile time).
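
A minimal sketch of the shape I have in mind (all names here are hypothetical, not the current API): each provider wraps its library behind the same signature inside its own namespace, so definitions never collide, and the compile-time precedence ladder lives in exactly one place instead of at every call site.

```cpp
#include <cstddef>

namespace provider {

// Standard argument set, mirroring the usual cblas_sgemm parameters.
struct GemmArgs {
  bool transA, transB;
  std::size_t M, N, K;
  float alpha;
  const float *A; std::size_t lda;
  const float *B; std::size_t ldb;
  float beta;
  float *C; std::size_t ldc;
};

#ifdef USE_MKL
namespace mkl { void sgemm(const GemmArgs &args); }  // would wrap MKL's cblas_sgemm
#endif
#ifdef USE_RUY
namespace ruy { void sgemm(const GemmArgs &args); }  // would wrap ruy::Mul
#endif

// Always-available fallback: a naive reference loop standing in for an
// Eigen-backed implementation (row-major, no transpose support, for brevity).
namespace eigen {
inline void sgemm(const GemmArgs &args) {
  for (std::size_t i = 0; i < args.M; ++i)
    for (std::size_t j = 0; j < args.N; ++j) {
      float sum = 0.0f;
      for (std::size_t k = 0; k < args.K; ++k)
        sum += args.A[i * args.lda + k] * args.B[k * args.ldb + j];
      args.C[i * args.ldc + j] = args.alpha * sum + args.beta * args.C[i * args.ldc + j];
    }
}
}  // namespace eigen

// Single dispatch point; provider precedence is decided here, once.
inline void sgemm(const GemmArgs &args) {
#if defined(USE_MKL)
  mkl::sgemm(args);
#elif defined(USE_RUY)
  ruy::sgemm(args);
#else
  eigen::sgemm(args);
#endif
}

}  // namespace provider
```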


jerinphilip commented Mar 29, 2022

https://github.com/pytorch/pytorch/blob/master/caffe2/utils/math_cpu.cc uses an Eigen fallback. Otherwise, it takes MKL (for GemmBatched) or some BLAS provider (for an explicit batch loop over Gemm).

https://github.com/microsoft/onnxjs/blob/8143e052c05c4b08798889c08a88c760b0238eab/src/wasm-ops/gemm.cpp#L22-L79 should be the same as https://github.com/pytorch/pytorch/blob/3cbf308a1fcc13aa9411740ce0a56fc18eb5e302/caffe2/utils/math_cpu.cc#L80. The comments in the PyTorch code indicate Eigen is used as the cross-platform fallback. We could drop USE_ONNX_SGEMM and simply call it USE_EIGEN, to avoid cloning more submodules than necessary.

> There are ODR-compatible ways for multiple sgemm implementations to exist, and what seems better to me is a provider::sgemm(...) where each sgemm follows a standard API.

There probably isn't. It's a C API, so the symbols will conflict between OpenBLAS and MKL. I'm stupid. 🤦 We should still be able to flatten some of the nesting and remove the redundant layers in between, though.

jerinphilip commented

I have managed to do something at https://github.com/jerinphilip/sgemm/blob/c1e0f921b06882e75bedce5381772191b5d81a46/src/gemm.h, with mixed feelings on whether the state of the source is better or worse. The good news is that the foundation needed to measure and improve is now in place. I have also managed to skip onnx-sgemm and tap into Eigen directly instead.
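
For the --provider flag in the runs below, a sketch of how the string could resolve to a backend at runtime, reusing the hypothetical provider::sgemm wrappers from the sketch in the issue description; only providers that were compiled in appear in the registry.

```cpp
#include <functional>
#include <map>
#include <stdexcept>
#include <string>

using SgemmFn = std::function<void(const provider::GemmArgs &)>;

inline SgemmFn resolveProvider(const std::string &name) {
  // Registered entries depend on compile-time availability; lookup and
  // prioritization happen at runtime.
  static const std::map<std::string, SgemmFn> registry = {
#ifdef USE_MKL
      {"mkl", provider::mkl::sgemm},
#endif
#ifdef USE_RUY
      {"ruy", provider::ruy::sgemm},
#endif
      {"eigen", provider::eigen::sgemm},  // always-available fallback
  };
  auto entry = registry.find(name);
  if (entry == registry.end())
    throw std::runtime_error("unknown or unavailable provider: " + name);
  return entry->second;
}
```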

ThinkPad (AVX2, MKL):

```
+ ./benchmark --batchSize 1000 --provider eigen

real    1m29.110s
user    1m10.299s
sys     0m18.329s

+ ./benchmark --batchSize 1000 --provider ruy

real    1m7.825s
user    0m52.234s
sys     0m15.150s

+ ./benchmark --batchSize 1000 --provider mkl

real    0m43.771s
user    0m31.833s
sys     0m11.817s
```

Oracle-Cloud ARM:

```
+ ./benchmark --batchSize 100 --provider blas

real    0m9.388s
user    0m17.695s
sys     0m19.640s

+ ./benchmark --batchSize 100 --provider ruy

real    0m11.816s
user    0m19.854s
sys     0m24.024s

+ ./benchmark --batchSize 100 --provider eigen

real    3m7.278s
user    3m14.561s
sys     0m24.816s
```

jerinphilip commented

Android (Termux)

```
$ bash ../scripts/run.sh
+ ./benchmark --batchSize 1000 --provider eigen

real    5m27.436s
user    1m35.707s
sys     0m20.999s

+ ./benchmark --batchSize 1000 --provider ruy

real    5m12.531s
user    1m25.261s
sys     0m30.092s
```
