
(Question) Performance on SIMD #391

Open
hardBSDk opened this issue May 19, 2024 · 3 comments
Labels
question Further information is requested

Comments


hardBSDk commented May 19, 2024

How do you plan to optimize serial processing with SIMD?

Another thing: how does your project get around Amdahl's law?


chadbrewbaker commented May 19, 2024

  1. I reached out to Geoff Langdale to see if he would be interested in porting his SIMD superoptimizer to an interaction combinator implementation. I have one in mothballs that uses e-graphs: https://github.com/chadbrewbaker/egg/blob/finset/tests/finset.rs

Another, in Haskell: https://github.com/chadbrewbaker/endoscope/blob/master/src/Endoscope.hs - it's janky and shells out to z3py for some things, but it's generic over all semigroups.

My advice is to keep things 32-bit in the generic case for now, so you can bit-blast the full 2^32 input range of a reference implementation and a candidate implementation to prove correctness.
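A minimal sketch of that bit-blasting idea: exhaustively compare a reference and a candidate function over their entire input range. The function names here are hypothetical; a full 32-bit sweep is feasible but slow for a demo, so this uses `u16` (widen the loop to `u32` for the full check).

```rust
// Ground-truth implementation (hypothetical example function).
fn reference(x: u16) -> u16 {
    x.wrapping_mul(2)
}

// Optimized candidate: shift-left equals multiply-by-two mod 2^16.
fn candidate(x: u16) -> u16 {
    x << 1
}

// Exhaustively "bit-blast" every input; u32 here would cover the
// full 2^32 range the comment above describes.
fn equivalent_over_full_range() -> bool {
    (0..=u16::MAX).all(|x| reference(x) == candidate(x))
}

fn main() {
    println!("equivalent: {}", equivalent_over_full_range());
}
```

The appeal of staying 32-bit is exactly this: equivalence becomes a finite, brute-forceable check rather than a proof obligation.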

As I understand Mojo, you might be able to pick up some of Chris Lattner's MLIR tricks for free if you can target MLIR.

A lot of the Mojo tricks are just making the runtime ergonomic for benchmarking, such as tuning buffer sizes for a new arch.

  1. You can't bypass Amdahl's law, but you can do my grand-advisor Gustafson's trick of using smaller problem blocks in parallel so you are hitting fewer levels of cache latency. https://en.wikipedia.org/wiki/Gustafson%27s_law
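For reference, Gustafson's scaled-speedup formula is S(n) = n − s·(n − 1), where s is the serial fraction and n the processor count; unlike Amdahl's fixed-size view, the problem grows with n. A quick sketch:

```rust
// Gustafson's law: scaled speedup on n processors with serial fraction s.
// S(n) = n - s * (n - 1)
fn gustafson_speedup(n: f64, serial_fraction: f64) -> f64 {
    n - serial_fraction * (n - 1.0)
}

fn main() {
    // With a 10% serial fraction on 64 processors: 64 - 0.1 * 63 ≈ 57.7,
    // versus an Amdahl-style fixed-size speedup of under 10.
    println!("{}", gustafson_speedup(64.0, 0.1));
}
```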

For many problems you have to use cache-oblivious algorithms and succinct data structures. The compiler can only get you so far.

  • 32-bit operation bit-blaster to compare a candidate and a reference function over the full 2^32 input range.
  • IO buffer size auto-benchmarking.
  • Lifting of simple/small interaction graphs to combinatorial-species e-graphs.
  • Berger COZ-style solver: add intentional latency to measure how slowdown affects runtime. Slowdown-insensitive code isn't worth optimizing, since something else is the bottleneck.
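The COZ-style point in the last bullet can be sketched with a toy probe (real COZ uses "virtual speedups" via profiling; this naive version just injects a sleep). The pipeline here is hypothetical: a 50 ms bottleneck thread alongside a 5 ms fast thread.

```rust
use std::thread;
use std::time::{Duration, Instant};

// Run two parallel phases; inject extra latency into the fast one.
fn run_parallel(extra_in_fast: Duration) -> Duration {
    let start = Instant::now();
    let slow = thread::spawn(|| thread::sleep(Duration::from_millis(50)));
    let fast = thread::spawn(move || {
        thread::sleep(Duration::from_millis(5) + extra_in_fast)
    });
    slow.join().unwrap();
    fast.join().unwrap();
    start.elapsed()
}

fn main() {
    let baseline = run_parallel(Duration::ZERO);
    let perturbed = run_parallel(Duration::from_millis(20));
    // Even with +20 ms injected into the fast thread, total runtime stays
    // near 50 ms: the fast path is slowdown-insensitive, so optimizing it
    // wouldn't help. The slow thread is the real bottleneck.
    println!("baseline {:?}, perturbed {:?}", baseline, perturbed);
}
```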

@developedby developedby added the question Further information is requested label May 19, 2024
@developedby
Member

About sequential SIMD operations: I haven't thought about it yet; maybe Victor has some ideas. But it will most likely be done at some point.
We plan on working very hard to optimize all these common programming practices and to have fast single-thread performance.
We're only on the first version; the only thing we've really tried to optimize so far is proper distribution of work across threads.

@hardBSDk
Author

@developedby Amazing to know that you aim to improve the single-thread performance!

I saw the benchmark on the website, and the performance is only good with many GPU cores.
