Use more floating point intrinsics #5697

Closed · huonw wants to merge 3 commits from the core-num-intrinsics branch

Conversation

@huonw (Member) commented Apr 3, 2013

See #5686 before any sort of review. (Opened for discussion, etc.)

Uses the LLVM intrinsics for several of the floating-point operations. Unfortunately, these seem to cause problems with the stack (e.g. https://gist.github.com/dbaupp/5300202). The most dangerous ones seem to be sin, cos and exp, so their intrinsic definitions are named sin_intr etc.; the unsuffixed names still call libc.

A test case for all of the one-argument functions:

```
use core::f64::*;
type T = f64;

macro_rules! run (
    ($f:ident) => {
        {
            let mut sum = 0.0;
            for core::uint::range(1, 100000000) |i| {
                sum += $f(i as T);
            }
            io::println(fmt!("%f", sum as float));
        }
    }
)

fn main() {
    run!(abs);
    run!(exp2);
    run!(floor);
    run!(ln);
    run!(log10);
    run!(log2);
    run!(sqrt);
    run!(cos_intr);
    run!(sin_intr);
    run!(exp_intr);
}
```

(asm)

Crashes during the cos_intr run with:

```
intrinsics-bench: /home/huon/rust/src/rt/rust_stack.cpp:53: void check_stack_canary(stk_seg*): Assertion `stk->canary == canary_value && "Somebody killed the canary"' failed.
Aborted
```

Achieves up to a 5x speed-up! However, the intrinsics seem to do bad
things to the stack, especially sin, cos and exp (rust-lang#5686 has
discussion).

Also adds f{32,64,loat}::powi, and reorganises the delegation code so
that the functions carry the #[inline(always)] annotation and the
repetition of delegate!(..) is reduced.
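
To make that concrete, here is a minimal sketch of the delegate!-style wrapper generation (current Rust syntax, not the actual 2013 libcore code; only the delegate! name comes from the commit message):

```
// Generate thin forwarding wrappers so every public math function
// carries #[inline(always)] without repeating the boilerplate by hand.
macro_rules! delegate {
    ($name:ident, $target:path) => {
        #[inline(always)]
        pub fn $name(x: f64) -> f64 {
            $target(x)
        }
    };
}

delegate!(sqrt, f64::sqrt);
delegate!(ln, f64::ln);

fn main() {
    println!("{} {}", sqrt(2.0), ln(2.0));
}
```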
@brson (Contributor) commented Apr 3, 2013

It looks like a lot of the intrinsics are implemented on x86 by just calling libc, which doesn't do our stack switch correctly. sqrt, though, does use the sqrtss instruction on x86. So to use these I think we need to examine every platform/intrinsic combination to see whether it will emit assembly or a libc call, and then use conditional compilation to enable the intrinsics only where we know they will be lowered to assembly.

I imagine there are some other places where LLVM could be betraying us similarly, for example with fmod.
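
To illustrate the platform gating described above, a hedged sketch in current Rust syntax (fast_sqrt and the choice of an x86-64 cfg are illustrative assumptions; x.sqrt() stands in for the LLVM intrinsic):

```
// Use the intrinsic only on targets where it is known to lower to an
// instruction; fall back to libm through the ordinary FFI path elsewhere.
#[cfg(target_arch = "x86_64")]
pub fn fast_sqrt(x: f64) -> f64 {
    // On x86-64 the sqrt intrinsic lowers to sqrtsd, so no libc call
    // (and no stack switch) is involved; x.sqrt() stands in for it here.
    x.sqrt()
}

#[cfg(not(target_arch = "x86_64"))]
pub fn fast_sqrt(x: f64) -> f64 {
    extern "C" {
        fn sqrt(x: f64) -> f64; // libm, reached via the normal FFI path
    }
    unsafe { sqrt(x) }
}

fn main() {
    println!("{}", fast_sqrt(2.0));
}
```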

@auroranockert (Contributor)

Since there are stack problems, and we don't know which platform/LLVM-version combinations require stack space for their intrinsics, wouldn't it make sense to use the awesome new inline-assembly functionality and just emit the few instructions by hand?

As you can see from the following overly simplified test program, the implementation wouldn't be huge, and we could fall back to libc/Rust on platforms without an asm implementation.

```
mod intrinsics {
  pub extern "rust-intrinsic" {
      fn sqrtf64(x: f64) -> f64;
  }
}

mod assembly {
  pub fn sqrtf64(x: f64) -> f64 {
    let mut r = 0.0f64;
    // AT&T operand order: source ($1 = x) first, destination ($0 = r) second.
    unsafe { asm!("sqrtsd $1, $0" : "=x"(r) : "x"(x)); }
    r
  }
}

fn main() {
  let mut s_a: f64 = 0.0;
  let mut s_i: f64 = 0.0;

  for core::uint::range(0, 1000000000) |i| {
    s_i += unsafe { intrinsics::sqrtf64(i as f64) };
    s_a += unsafe { assembly::sqrtf64(i as f64) };
  }

  io::println(fmt!("Intrinsics: %?, Assembly: %?", s_i, s_a));
}
```

PS. The assembly and intrinsic versions are about as quick on my MacBook; the asm one is maybe 2-3% faster, but that is probably just noise.
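
For comparison, a minimal sketch of the same idea in current Rust's stable asm! macro, which uses Intel operand order (destination first); x86-64 only:

```
use std::arch::asm;

// Same sqrtsd-by-hand idea, written with today's asm! macro.
pub fn sqrtf64(x: f64) -> f64 {
    let r: f64;
    // Intel syntax: sqrtsd <dst>, <src>.
    unsafe { asm!("sqrtsd {}, {}", out(xmm_reg) r, in(xmm_reg) x); }
    r
}

fn main() {
    assert_eq!(sqrtf64(9.0), 3.0);
}
```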

@huonw (Member, Author) commented Apr 4, 2013

OK, I did some digging into LLVM. It seems that a patch to ExpandNode (specifically the floating-point operations within it), or more usefully to ExpandLibCall, that managed the C stack would fix these problems without resorting to fragile architecture-specific code.

(I also tried some Googling, but either I don't know the right jargon, or no one else has met this problem.)

@huonw (Member, Author) commented Apr 4, 2013

(I forgot to mention: this seems to give an average speedup across all the new intrinsics of about 40%, and at least 400% for some, e.g. sqrt, which puts some microbenchmarks on par with, or slightly faster than, gcc and clang at -O3!)

@auroranockert (Contributor)

On a similar topic, have we considered implementing the intrinsics like this (https://gist.github.com/JensNockert/6da886103fc30c852032) instead of adding each one to the compiler?

Maybe I am missing something, but it seems to work?

PS. I made a pull request (#5724) if anyone wants to see how I implemented it in libcore; cross-crate inlining works fine as well.
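
A minimal sketch of that library-side shape, using today's unstable core_intrinsics feature as a stand-in for the extern "rust-intrinsic" declarations in #5724 (nightly toolchain assumed; the wrapper names are illustrative):

```
#![feature(core_intrinsics)] // nightly-only

// Ordinary #[inline(always)] library wrappers over compiler intrinsics,
// instead of special-casing each function inside the compiler.
#[inline(always)]
pub fn sqrt(x: f64) -> f64 {
    // Lowers to llvm.sqrt.f64 rather than a call into libm.
    unsafe { core::intrinsics::sqrtf64(x) }
}

#[inline(always)]
pub fn powi(x: f64, n: i32) -> f64 {
    // Lowers to llvm.powi.f64.
    unsafe { core::intrinsics::powif64(x, n) }
}

fn main() {
    println!("{} {}", sqrt(2.0), powi(2.0, 10));
}
```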

@brson (Contributor) commented Apr 4, 2013

I r+'ed @JensNockert's patch in #5724, which simplifies the intrinsic declarations. That patch also adds the optimisation of using both the sqrt and floor intrinsics in the math modules, so it will close #5686.

@dbaupp It does seem like a general solution to this (including #5702) would require patching LLVM. I'm not sure whether we could convince them it's their bug, though. The full stack-growth solution as implemented by gcc includes a link-time step that rewrites calls to external library functions that do not participate in stack growth. In this case, these calls would theoretically be modified by the linker to call __morestack_nosplit. Rust doesn't rely on this, though, mostly because such linkers don't exist everywhere.

That said, we are already diverging from the gcc stack growth protocol, having implemented our own support for Mac, Windows and Android. @pcwalton is now working on further LLVM changes to stack growth that are not implemented by gcc (and which we have no indication that LLVM upstream will want).

I'd suggest we solve this on our end unless that becomes impossible.

@JensNockert, all things being equal, I would prefer to delegate to LLVM instead of writing assembly. There was a report recently about all the bad assembly in the world that reinforced this point for me.

@auroranockert (Contributor)

@brson I changed my mind; I fully agree with you.

@huonw (Member, Author) commented Apr 4, 2013

@brson, ah, OK. It's annoying that it looks like Rust might permanently need a custom LLVM. I'll leave the changes to @pcwalton since he's already working on it, and I don't have much to offer in terms of C++ anyway.

(@JensNockert nice work!)

I'll close this, but if/when the stack changes get made, I'll be happy to fix and reopen.

@huonw closed this Apr 4, 2013
bors added a commit that referenced this pull request Apr 20, 2013
…lton

This implements fixed_stack_segment for items with the rust-intrinsic ABI, and then uses it to make f32 and f64 use intrinsics where appropriate, but without overflowing stacks and killing canaries (cf. #5686 and #5697). Hopefully.

@pcwalton, the fixed_stack_segment implementation involved mirroring its implementation from `base.rs` in `trans_closure`, but without adding the `set_no_inline` call (reasoning: that would defeat the purpose of intrinsics), which is possibly incorrect.

I'm a little hazy about how the underlying structure works, so I've annotated the four intrinsics that have caused problems so far, but there's no guarantee that the others are entirely well-behaved.

Anyway, it has good results (the following numbers come from summing the result of each function for 1 up to 100 million):

```
$ ./intrinsics-perf.sh f32
func   new   old   speedup
sin    0.80  2.75  3.44
cos    0.80  2.76  3.45
sqrt   0.56  2.73  4.88
ln     1.01  2.94  2.91
log10  0.97  2.90  2.99
log2   1.01  2.95  2.92
exp    0.90  2.85  3.17
exp2   0.92  2.87  3.12
pow    6.95  8.57  1.23

   geometric mean: 2.97

$ ./intrinsics-perf.sh f64
func   new    old    speedup
sin    12.08  14.06  1.16
cos    12.04  13.67  1.14
sqrt   0.49   2.73   5.57
ln     4.11   5.59   1.36
log10  5.09   6.54   1.28
log2   2.78   5.10   1.83
exp    2.00   3.97   1.99
exp2   1.71   3.71   2.17
pow    5.90   7.51   1.27

   geometric mean: 1.72
```

So: about 3x faster on average for f32, and 1.7x for f64. This isn't exactly apples to apples, though, since this patch also adds #[inline(always)] to all the function definitions, which possibly contributes to the speedup.

(fwiw, GitHub is showing 93c0888 after d9c54f8 (since I cherry-picked the latter from #5697), but git's order is the other way.)
@huonw deleted the core-num-intrinsics branch December 4, 2014 02:02
flip1995 added a commit to flip1995/rust that referenced this pull request Jun 23, 2020
…anishearth

redundant_pattern_matching: avoid non-`const fn` calls in const contexts

changelog: Avoid suggesting non-`const fn` calls in const contexts in [`redundant_pattern_matching`]

Fixes rust-lang#5697