Use more floating point intrinsics #5697
Conversation
Achieves up to a 5x speed-up! However, the intrinsics seem to do bad things to the stack, especially sin, cos and exp (rust-lang#5686 has discussion). This also adds f{32,64,loat}::powi, and reorganises the delegation code so that the functions carry the #[inline(always)] annotation and the repetition of delegate!(..) is reduced.
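The delegation pattern the description mentions can be sketched in modern Rust. This is a hypothetical illustration, not the PR's actual delegate!(..) macro; the macro body, the set of delegated functions, and the hand-written powi wrapper are all assumptions:

```rust
// Hypothetical sketch (not the PR's actual code): a delegate! macro that
// stamps out #[inline(always)] wrappers, reducing the repetition the
// PR description refers to.
macro_rules! delegate {
    ($($name:ident => $target:path),* $(,)?) => {
        $(
            #[inline(always)]
            pub fn $name(x: f64) -> f64 { $target(x) }
        )*
    };
}

delegate! {
    sin => f64::sin,
    cos => f64::cos,
    sqrt => f64::sqrt,
}

// powi takes an extra integer argument, so it is written out by hand.
#[inline(always)]
pub fn powi(x: f64, n: i32) -> f64 {
    x.powi(n)
}

fn main() {
    assert_eq!(sqrt(16.0), 4.0);
    assert_eq!(powi(2.0, 10), 1024.0);
    println!("ok");
}
```

Marking the wrappers #[inline(always)] matters for cross-crate callers: without it, each call would be an outlined function call rather than the underlying operation inlined at the call site.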
It looks like a lot of the intrinsics are implemented on x86 by just calling libc, which doesn't do our stack switch correctly. I imagine there are some other places where LLVM could be betraying us similarly, like with the
Since there are stack problems and we don't know which platforms/llvm-versions require stack for their intrinsics, wouldn't it make sense to use the awesome new inline-assembly functionality and just emit the few instructions by hand? As you can see from the following overly-simplified test program, the implementation wouldn't be huge, and we could fall back to libc/rust on platforms without an asm implementation.
PS. The assembly and intrinsic versions are about as quick on my MacBook, the asm one maybe 2-3% faster, but that is probably just noise.
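The "emit the few instructions by hand" idea can be sketched with today's inline-assembly syntax (the 2013-era asm! looked different; this is an assumed modernization, not the comment's actual test program). Only sqrt is shown, since it maps to a single SSE2 instruction; sin/cos/exp do not, which is part of why they are the problematic ones:

```rust
// Hypothetical sketch of the asm-fallback approach: use sqrtsd directly
// on x86_64, and fall back to the library implementation elsewhere, as
// the comment suggests.
#[cfg(target_arch = "x86_64")]
fn sqrt_asm(x: f64) -> f64 {
    let r: f64;
    // sqrtsd computes the square root of a scalar double in an XMM
    // register, with no call into libc and no stack switching.
    unsafe {
        std::arch::asm!("sqrtsd {r}, {x}", r = out(xmm_reg) r, x = in(xmm_reg) x);
    }
    r
}

#[cfg(not(target_arch = "x86_64"))]
fn sqrt_asm(x: f64) -> f64 {
    // Fallback for platforms without an asm implementation.
    x.sqrt()
}

fn main() {
    assert_eq!(sqrt_asm(9.0), 3.0);
    println!("sqrt_asm(2.0) = {}", sqrt_asm(2.0));
}
```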
Ok, I did some digging into LLVM. It seems that a patch to (I also tried some Googling, but either I don't know the right jargon, or no one else has met this problem.)
(I forgot to mention, but this seems to give an average speedup across all the new intrinsics of about 40%, and at least 400% for some, e.g.
On a similar topic, have we considered implementing the intrinsics like this, https://gist.github.com/JensNockert/6da886103fc30c852032, instead of adding each one to the compiler? Maybe I am missing something, but it seems to work? PS. I made a pull request if anyone wants to look at how I implemented it in libcore, #5724; cross-crate inlining works fine as well.
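The library-side approach that comment describes (declare the intrinsics in libcore rather than special-casing each one in the compiler) can be illustrated roughly as below. The module name and wrapper are hypothetical; the 2013 patch reached the LLVM intrinsics through `extern "rust-intrinsic"` declarations, which this sketch replaces with stable standard-library calls that already lower to the same LLVM intrinsics:

```rust
// Illustrative sketch only, not the #5724 patch: thin #[inline(always)]
// wrappers defined in library code, visible to the optimizer across
// crates, instead of per-intrinsic compiler support.
mod fast_math {
    #[inline(always)]
    pub fn sqrt(x: f64) -> f64 {
        // On a modern toolchain f64::sqrt lowers to llvm.sqrt.f64;
        // the original patch declared the intrinsic directly.
        x.sqrt()
    }
}

fn main() {
    assert_eq!(fast_math::sqrt(25.0), 5.0);
    println!("ok");
}
```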
I r+'ed @JensNockert's patch in #5724, which simplifies the intrinsic declarations. That patch also adds the optimization to use both

@dbaupp It does seem like a general solution to this (including #5702) would require patching LLVM. I'm not sure whether we could convince them it's their bug, though. The full stack-growth solution as implemented by gcc includes a link-time step for rewriting calls to external library functions that do not participate in stack growth. In this case, these calls would theoretically be modified by the linker to call

That said, we are already diverging from the gcc stack-growth protocol, having implemented our own support for Mac, Windows and Android. @pcwalton is now working on further LLVM changes to stack growth that are not implemented by gcc (and which we have no indication that LLVM upstream will want). I'd suggest we solve this on our end unless that becomes impossible.

@JensNockert, all things equal, I would prefer to delegate to LLVM instead of writing assembly. There was a report recently about all the bad assembly in the world that reinforced this point to me.
@brson I changed my mind, I fully agree with you.
@brson, ah ok, it's annoying that it looks like Rust might permanently have a custom LLVM. I'll leave the changes to @pcwalton since he's already working on it, and I don't have much to offer in terms of C++ anyway. (@JensNockert nice work!) I'll close this, but if/when the stack changes get made, I'll be happy to fix and reopen.
…lton This implements the fixed_stack_segment for items with the rust-intrinsic abi, and then uses it to make f32 and f64 use intrinsics where appropriate, but without overflowing stacks and killing canaries (cf. #5686 and #5697). Hopefully.

@pcwalton, the fixed_stack_segment implementation involved mirroring its implementation in `base.rs` in `trans_closure`, but without adding the `set_no_inline` (reasoning: that would defeat the purpose of intrinsics), which is possibly incorrect. I'm a little hazy about how the underlying structure works, so I've annotated the 4 that have caused problems so far, but there's no guarantee that the other intrinsics are entirely well-behaved.

Anyway, it has good results (the following are just summing the result of each function for 1 up to 100 million):

```
$ ./intrinsics-perf.sh f32
func    new    old    speedup
sin     0.80   2.75   3.44
cos     0.80   2.76   3.45
sqrt    0.56   2.73   4.88
ln      1.01   2.94   2.91
log10   0.97   2.90   2.99
log2    1.01   2.95   2.92
exp     0.90   2.85   3.17
exp2    0.92   2.87   3.12
pow     6.95   8.57   1.23

geometric mean: 2.97

$ ./intrinsics-perf.sh f64
func    new    old    speedup
sin     12.08  14.06  1.16
cos     12.04  13.67  1.14
sqrt    0.49   2.73   5.57
ln      4.11   5.59   1.36
log10   5.09   6.54   1.28
log2    2.78   5.10   1.83
exp     2.00   3.97   1.99
exp2    1.71   3.71   2.17
pow     5.90   7.51   1.27

geometric mean: 1.72
```

So about 3x faster on average for f32, and 1.7x for f64. This isn't exactly apples to apples though, since this patch also adds #[inline(always)] to all the function definitions too, which possibly gives a speedup.

(fwiw, GitHub is showing 93c0888 after d9c54f8 (since I cherry-picked the latter from #5697), but git's order is the other way.)
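The measurement described above ("summing the result of each function for 1 up to 100 million") can be sketched as a small harness. This is not the actual intrinsics-perf.sh script; the helper name sum_over and the reduced iteration count are assumptions for a quick, self-contained run:

```rust
use std::time::Instant;

/// Sum f(x) for x = 1.0, 2.0, ..., n — the shape of the benchmark loop
/// described in the comment (the original ran to 1e8; hypothetical helper).
fn sum_over(n: f64, f: impl Fn(f64) -> f64) -> f64 {
    let mut acc = 0.0;
    let mut x = 1.0;
    while x <= n {
        acc += f(x);
        x += 1.0;
    }
    acc
}

fn main() {
    let t = Instant::now();
    let s = sum_over(1_000_000.0, f64::sqrt);
    // Summing (rather than discarding) the results keeps the optimizer
    // from eliminating the calls entirely.
    println!("sum = {s:.0}, elapsed = {:?}", t.elapsed());
}
```

Timing whole-program runs like this measures call overhead plus the operation itself, which is why the #[inline(always)] change alone can shift the numbers, as the comment's "not exactly apples to apples" caveat notes.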
…anishearth redundant_pattern_matching: avoid non-`const fn` calls in const contexts changelog: Avoid suggesting non-`const fn` calls in const contexts in [`redundant_pattern_matching`] Fixes rust-lang#5697
See #5686 before any sort of review. (Opened for discussion, etc.)
Uses the LLVM intrinsics for several of the floating point operations. Unfortunately, these seem to cause problems with the stack (e.g. https://gist.github.com/dbaupp/5300202). The most dangerous ones seem to be `sin`, `cos` and `exp`, so their intrinsic definitions are `sin_intr` etc.; unsuffixed is still libc.

A testcase for all of the 1-argument ones: (asm)

Crashes during the `cos_intr` run with: