Use more floating point intrinsics #5697
Conversation
Achieves up to a 5x speed-up! However, the intrinsics seem to do bad things to the stack, especially sin, cos and exp (rust-lang#5686 has discussion). This also adds f{32,64,loat}::powi, and reorganises the delegation code so that the functions carry the #[inline(always)] annotation and the repetition of delegate!(..) is reduced.
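The delegation pattern the description mentions can be sketched in modern Rust. This is a hypothetical illustration, not the PR's actual delegate!(..) macro; the macro body, the set of delegated functions, and the hand-written powi wrapper are all assumptions:

```rust
// Hypothetical sketch (not the PR's actual code): a delegate! macro that
// stamps out #[inline(always)] wrappers, reducing the repetition the
// PR description refers to.
macro_rules! delegate {
    ($($name:ident => $target:path),* $(,)?) => {
        $(
            #[inline(always)]
            pub fn $name(x: f64) -> f64 { $target(x) }
        )*
    };
}

delegate! {
    sin => f64::sin,
    cos => f64::cos,
    sqrt => f64::sqrt,
}

// powi takes an extra integer argument, so it is written out by hand.
#[inline(always)]
pub fn powi(x: f64, n: i32) -> f64 {
    x.powi(n)
}

fn main() {
    assert_eq!(sqrt(16.0), 4.0);
    assert_eq!(powi(2.0, 10), 1024.0);
    println!("ok");
}
```

Marking the wrappers #[inline(always)] matters for cross-crate callers: without it, each call would be an outlined function call rather than the underlying operation inlined at the call site.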
It looks like a lot of the intrinsics are implemented on x86 by just calling libc, which doesn't do our stack switch correctly. I imagine there are some other places where LLVM could be betraying us similarly, like with the
Since there are stack problems and we don't know which platforms/llvm-versions require stack for their intrinsics, wouldn't it make sense to use the awesome new inline-assembly functionality and just emit the few instructions by hand? As you can see from the following overly-simplified test program, the implementation wouldn't be huge, and we could fall back to libc/rust on platforms without an asm implementation.
PS. The assembly and intrinsic versions are about as quick on my MacBook, the asm one maybe 2-3% faster, but that is probably just noise.
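The "emit the few instructions by hand" idea can be sketched with today's inline-assembly syntax (the 2013-era asm! looked different; this is an assumed modernization, not the comment's actual test program). Only sqrt is shown, since it maps to a single SSE2 instruction; sin/cos/exp do not, which is part of why they are the problematic ones:

```rust
// Hypothetical sketch of the asm-fallback approach: use sqrtsd directly
// on x86_64, and fall back to the library implementation elsewhere, as
// the comment suggests.
#[cfg(target_arch = "x86_64")]
fn sqrt_asm(x: f64) -> f64 {
    let r: f64;
    // sqrtsd computes the square root of a scalar double in an XMM
    // register, with no call into libc and no stack switching.
    unsafe {
        std::arch::asm!("sqrtsd {r}, {x}", r = out(xmm_reg) r, x = in(xmm_reg) x);
    }
    r
}

#[cfg(not(target_arch = "x86_64"))]
fn sqrt_asm(x: f64) -> f64 {
    // Fallback for platforms without an asm implementation.
    x.sqrt()
}

fn main() {
    assert_eq!(sqrt_asm(9.0), 3.0);
    println!("sqrt_asm(2.0) = {}", sqrt_asm(2.0));
}
```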
Ok, I did some digging into LLVM. It seems that a patch to (I also tried some Googling, but either I don't know the right jargon, or no one else has met this problem.)
(I forgot to mention, but this seems to give an average speedup across all the new intrinsics of about 40%, and at least 400% for some, e.g.
On a similar topic, have we considered implementing the intrinsics like this, https://gist.github.com/JensNockert/6da886103fc30c852032, instead of adding each one to the compiler? Maybe I am missing something, but it seems to work? PS. I made a pull request if anyone wants to look at how I implemented it in libcore, #5724; cross-crate inlining works fine as well.
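The library-side approach that comment describes (declare the intrinsics in libcore rather than special-casing each one in the compiler) can be illustrated roughly as below. The module name and wrapper are hypothetical; the 2013 patch reached the LLVM intrinsics through `extern "rust-intrinsic"` declarations, which this sketch replaces with stable standard-library calls that already lower to the same LLVM intrinsics:

```rust
// Illustrative sketch only, not the #5724 patch: thin #[inline(always)]
// wrappers defined in library code, visible to the optimizer across
// crates, instead of per-intrinsic compiler support.
mod fast_math {
    #[inline(always)]
    pub fn sqrt(x: f64) -> f64 {
        // On a modern toolchain f64::sqrt lowers to llvm.sqrt.f64;
        // the original patch declared the intrinsic directly.
        x.sqrt()
    }
}

fn main() {
    assert_eq!(fast_math::sqrt(25.0), 5.0);
    println!("ok");
}
```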
I r+'ed @JensNockert's patch in #5724, which simplifies the intrinsic declarations. That patch also adds the optimization to use both

@dbaupp It does seem like a general solution to this (including #5702) would require patching LLVM. I'm not sure whether we could convince them it's their bug, though. The full stack-growth solution as implemented by gcc includes a link-time step for rewriting calls to external library functions that do not participate in stack growth. In this case, these calls would theoretically be modified by the linker to call

That said, we are already diverging from the gcc stack-growth protocol, having implemented our own support for Mac, Windows and Android. @pcwalton is now working on further LLVM changes to stack growth that are not implemented by gcc (and which we have no indication that LLVM upstream will want). I'd suggest we solve this on our end unless that becomes impossible.

@JensNockert, all things equal, I would prefer to delegate to LLVM instead of writing assembly. There was a report recently about all the bad assembly in the world that reinforced this point to me.
@brson I changed my mind, I fully agree with you.
@brson, ah ok, it's annoying that it looks like Rust might permanently have a custom LLVM. I'll leave the changes to @pcwalton since he's already working on it, and I don't have much to offer in terms of C++ anyway. (@JensNockert nice work!) I'll close this, but if/when the stack changes get made, I'll be happy to fix and reopen.
…lton This implements the fixed_stack_segment for items with the rust-intrinsic abi, and then uses it to make f32 and f64 use intrinsics where appropriate, but without overflowing stacks and killing canaries (cf. #5686 and #5697). Hopefully.

@pcwalton, the fixed_stack_segment implementation involved mirroring its implementation in `base.rs` in `trans_closure`, but without adding the `set_no_inline` (reasoning: that would defeat the purpose of intrinsics), which is possibly incorrect. I'm a little hazy about how the underlying structure works, so I've annotated the 4 that have caused problems so far, but there's no guarantee that the other intrinsics are entirely well-behaved.

Anyway, it has good results (the following are just summing the result of each function for 1 up to 100 million):

```
$ ./intrinsics-perf.sh f32
func    new    old    speedup
sin     0.80   2.75   3.44
cos     0.80   2.76   3.45
sqrt    0.56   2.73   4.88
ln      1.01   2.94   2.91
log10   0.97   2.90   2.99
log2    1.01   2.95   2.92
exp     0.90   2.85   3.17
exp2    0.92   2.87   3.12
pow     6.95   8.57   1.23

geometric mean: 2.97

$ ./intrinsics-perf.sh f64
func    new    old    speedup
sin     12.08  14.06  1.16
cos     12.04  13.67  1.14
sqrt    0.49   2.73   5.57
ln      4.11   5.59   1.36
log10   5.09   6.54   1.28
log2    2.78   5.10   1.83
exp     2.00   3.97   1.99
exp2    1.71   3.71   2.17
pow     5.90   7.51   1.27

geometric mean: 1.72
```

So about 3x faster on average for f32, and 1.7x for f64. This isn't exactly apples to apples though, since this patch also adds #[inline(always)] to all the function definitions too, which possibly gives a speedup.

(fwiw, GitHub is showing 93c0888 after d9c54f8 (since I cherry-picked the latter from #5697), but git's order is the other way.)
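The measurement described above ("summing the result of each function for 1 up to 100 million") can be sketched as a small harness. This is not the actual intrinsics-perf.sh script; the helper name sum_over and the reduced iteration count are assumptions for a quick, self-contained run:

```rust
use std::time::Instant;

/// Sum f(x) for x = 1.0, 2.0, ..., n — the shape of the benchmark loop
/// described in the comment (the original ran to 1e8; hypothetical helper).
fn sum_over(n: f64, f: impl Fn(f64) -> f64) -> f64 {
    let mut acc = 0.0;
    let mut x = 1.0;
    while x <= n {
        acc += f(x);
        x += 1.0;
    }
    acc
}

fn main() {
    let t = Instant::now();
    let s = sum_over(1_000_000.0, f64::sqrt);
    // Summing (rather than discarding) the results keeps the optimizer
    // from eliminating the calls entirely.
    println!("sum = {s:.0}, elapsed = {:?}", t.elapsed());
}
```

Timing whole-program runs like this measures call overhead plus the operation itself, which is why the #[inline(always)] change alone can shift the numbers, as the comment's "not exactly apples to apples" caveat notes.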
…anishearth redundant_pattern_matching: avoid non-`const fn` calls in const contexts changelog: Avoid suggesting non-`const fn` calls in const contexts in [`redundant_pattern_matching`] Fixes rust-lang#5697
See #5686 before any sort of review. (Opened for discussion, etc.)
Uses the LLVM intrinsics for several of the floating point operations. Unfortunately, these seem to cause problems with the stack (e.g. https://gist.github.com/dbaupp/5300202). The most dangerous ones seem to be `sin`, `cos` and `exp`, so their intrinsic definitions are `sin_intr` etc.; unsuffixed is still libc.

A testcase for all of the 1-argument ones: (asm)

Crashes during the `cos_intr` run with: