
Imprecise floating point operations (fast-math) #21690

Open · Tracked by #2
mpdn opened this issue Jan 27, 2015 · 75 comments
Labels
  • A-floating-point: Floating point numbers and arithmetic
  • A-LLVM: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues.
  • C-enhancement: An issue proposing an enhancement or a PR with one.
  • C-feature-request: A feature request, i.e. not implemented / a PR.
  • I-slow: Problems and improvements with respect to performance of generated code.
  • T-lang: Relevant to the language team, which will review and decide on the PR/issue.

Comments

@mpdn
Contributor

mpdn commented Jan 27, 2015

There should be a way to use imprecise floating point operations like GCC's and Clang's -ffast-math. The simplest option would be to follow GCC and Clang and add a command-line flag, but I think a better approach would be to create f32fast and f64fast types that call the fast LLVM math operations. That way you can easily mix fast and "slow" floating point operations.

I think this could be implemented as a library if LLVM assembly could be used in the asm macro.
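For illustration, a minimal sketch of the newtype idea, assuming nightly and the unsafe fadd_fast/fmul_fast intrinsics; the F32Fast wrapper itself is hypothetical:

#![feature(core_intrinsics)]
use std::intrinsics::{fadd_fast, fmul_fast};
use std::ops::{Add, Mul};

// Hypothetical wrapper: arithmetic lowers to LLVM's `fast` operations.
#[derive(Clone, Copy, Debug, PartialEq)]
struct F32Fast(f32);

impl Add for F32Fast {
    type Output = Self;
    fn add(self, rhs: Self) -> Self {
        // Safety: undefined behavior if an operand or result is NaN or infinite.
        Self(unsafe { fadd_fast(self.0, rhs.0) })
    }
}

impl Mul for F32Fast {
    type Output = Self;
    fn mul(self, rhs: Self) -> Self {
        Self(unsafe { fmul_fast(self.0, rhs.0) })
    }
}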

@kmcallister kmcallister added the A-LLVM Area: Code generation parts specific to LLVM. Both correctness bugs and optimization-related issues. label Jan 28, 2015
@kmcallister
Contributor

Inline IR was discussed on #15180. Another option is extern "llvm-intrinsic" { ... } which I vaguely think we had at some point. If we added more intrinsics to std::intrinsics would that be sufficient?

@huonw huonw added the I-slow Issue: Problems and improvements with respect to performance of generated code. label Jan 28, 2015
@mpdn
Contributor Author

mpdn commented Jan 28, 2015

Yeah, adding it as a function in std::intrinsics could definitely work as well.

There are a few different fast-math flags, but the fast flag is probably the most important, as it implies all the others. Adding every flag as a separate intrinsic would be unreasonable, but I don't think all of them are necessary.
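A minimal sketch of what the intrinsics route looks like on today's nightly; the functions are unsafe because NaN or infinite inputs or results are undefined behavior:

#![feature(core_intrinsics)]
use std::intrinsics::fadd_fast;

// Sums a slice with LLVM's `fast` addition, allowing the reduction to
// be reassociated and vectorized.
fn sum_fast(xs: &[f32]) -> f32 {
    let mut acc = 0.0f32;
    for &x in xs {
        // Safety: caller must ensure no NaN or infinite values occur here.
        acc = unsafe { fadd_fast(acc, x) };
    }
    acc
}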

@emberian emberian self-assigned this Mar 25, 2015
@bluss
Member

bluss commented Aug 17, 2015

This forum thread has examples of loops that LLVM vectorizes well for integers but not for floats (a dot product).

@bluss bluss changed the title Imprecise floating point operations Imprecise floating point operations (fast-math) Dec 20, 2015
@emberian emberian removed their assignment Jan 5, 2016
@kornelski
Contributor

kornelski commented Jun 8, 2017

I've prototyped it using a newtype: https://gitlab.com/kornelski/ffast-math (https://play.rust-lang.org/?gist=d516771d1d002f740cc9bf6eb5cacdf0&version=nightly&backtrace=0)

It works in simple cases, but the newtype solution is insufficient:

  • it doesn't work with floating-point literals. That's a huge pain when converting programs to this newtype.
  • it doesn't work with the as operator, and a trait to make that possible has been rejected before.
  • the wrapper type and extra level of indirection affect inlining of the code using it. I've found some large functions where the newtype was slower than the regular float, not because of the float math itself but because the structs and calls around it weren't optimized as well. I wasn't able to reproduce this in simple cases, so I'm not sure what exactly is going on.

So I'm very keen on seeing it supported natively in Rust.

@bluss
Member

bluss commented Jun 8, 2017

@pornel The issue #24963 had a test case where a newtype impacted vectorization. That example was fixed (great!), but it sounds like the bug is probably still visible in similar code.

@pedrocr

pedrocr commented Jun 8, 2017

I've tried -ffast-math in my C vs Rust benchmark of some graphics code:

https://github.com/pedrocr/rustc-math-bench

In the C code it's a ~20% improvement with clang but no benefit with GCC. In both cases it returns a wrong result, and the math is extremely simple (multiplying a vector by a matrix). According to this:

https://stackoverflow.com/questions/38978951/can-ffast-math-be-safely-used-on-a-typical-project#38981307

-ffast-math is generally too unsafe for normal usage, as it implies some strange things (e.g., NaN checks always return false). So it seems sensible to have a way to opt in to only the more benign optimizations.
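As a sketch of the NaN-check hazard, using the nightly Rust intrinsics, which carry the same finite-math assumption:

#![feature(core_intrinsics)]
use std::intrinsics::fdiv_fast;

fn mean(sum: f64, n: f64) -> f64 {
    // Undefined behavior if the division produces NaN (e.g. 0.0 / 0.0),
    // so the optimizer may assume it never does...
    let m = unsafe { fdiv_fast(sum, n) };
    // ...and is entitled to fold this check to `false`.
    if m.is_nan() {
        return 0.0;
    }
    m
}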

@kornelski
Contributor

kornelski commented Jun 8, 2017

@pedrocr Your benchmark loses precision in the sum regardless of fast-math mode. Both the slow and fast versions give a wrong result compared to summing with a double accumulator.

With a double for the sum you'll get the correct result, even with -ffast-math.

You get a significantly different sum with a float accumulator because fast-math introduces a small systematic rounding error, which accumulates over 100 million additions.

All values from the matrix multiplication are the same to at least 6 digits (I've diffed printf("%f", out[i]) for all values and they're all identical).
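A small self-contained sketch of the accumulator effect; the stream of 0.1 values is an arbitrary stand-in for the benchmark's checksum:

fn main() {
    let n = 100_000_000;

    // f32 accumulator: rounding error builds up over 1e8 additions.
    let single: f32 = std::iter::repeat(0.1f32).take(n).sum();

    // f64 accumulator: each f32 addend converts exactly to f64, so the
    // total stays accurate enough to serve as a checksum.
    let double: f64 = std::iter::repeat(0.1f32).take(n).map(f64::from).sum();

    println!("f32 sum = {single}, f64 sum = {double}");
}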

@pedrocr

pedrocr commented Jun 8, 2017

@pornel thanks, fixed here:

pedrocr/rustc-math-bench@8169fa3

The benchmark results are fine though; the sum is only used as a checksum. Here are the averages of three runs, in ms/megapixel:

| Compiler | -O3 -march=native | -O3 -march=native -ffast-math |
| --- | --- | --- |
| clang 3.8.0-2ubuntu4 | 6.91 | 5.40 (-22%) |
| gcc 5.4.0-6ubuntu1~16.04.4 | 5.71 | 5.85 (+2%) |

So, as I mentioned before, clang/LLVM gets a good benefit from -ffast-math but GCC doesn't. I'd say making sure things like is_normal() still work is very important, but at least on LLVM it helps to be able to enable -ffast-math.

@pedrocr

pedrocr commented Jun 8, 2017

I've suggested it would make sense to expose -ffast-math using the target-feature mechanisms:

https://internals.rust-lang.org/t/pre-rfc-stabilization-of-target-feature/5176/23

@kornelski
Contributor

Rust has fast math intrinsics, so the fast math behavior could be limited to a specific type or selected functions, without forcing the whole program into it.

@pedrocr

pedrocr commented Jun 9, 2017

A usable solution for my use cases would probably be to have the vector types in the simd crate be the types that allow opting in to ffast-math. That way there's only one type I need to consciously convert the code to for speedups. As a general solution, though, having to swap types in normal code seems cumbersome. But maybe just doing return val as f32 when val is an f32fast type isn't that bad.

@Mark-Simulacrum Mark-Simulacrum added the C-feature-request Category: A feature request, i.e: not implemented / a PR. label Jul 22, 2017
@Mark-Simulacrum Mark-Simulacrum added C-enhancement Category: An issue proposing an enhancement or a PR with one. and removed C-enhancement Category: An issue proposing an enhancement or a PR with one. labels Jul 26, 2017
@pedrocr

pedrocr commented Aug 10, 2017

Created a pre-RFC thread on internals to try to get a discussion going on the best way to do this:

https://internals.rust-lang.org/t/pre-rfc-whats-the-best-way-to-implement-ffast-math/5740

@robsmith11

Is there a current recommended approach to using fast-math optimizations in rust nightly?

@jeffvandyke

jeffvandyke commented Oct 24, 2019

If it helps, a good benchmark article comparing C++ and Rust floating-point optimizations inside loops (link) was written recently (Oct 19), with a good Hacker News discussion exploring the concept.

Personally, I think the key point is that without specifying any (EDIT: floating-point-specific) flags (and after using iterators), clang and gcc by default do more optimizations on float math than Rust currently does.

(EDIT: It seems that -fvectorize -Ofast was specified for clang to get gcc-comparable results; see the following comment.)

Any discussion of optimized float math should keep in mind that vectorization isn't always less precise: a commenter pointed out that a vectorized floating-point sum is actually more accurate than the un-vectorized version. Also see Stack Overflow: https://stackoverflow.com/a/7455442

I'm curious what criteria clang (or gcc) uses to decide on floating-point vectorization. I'm not enough of an expert in these areas to know the specifics, though. I'm also not sure what precision guarantees Rust makes for floating-point math.

@pedrocr

pedrocr commented Oct 24, 2019

Personally, I think the key is that without specifying any flags (and after using iterators), by default clang and gcc do more optimizations on float math than Rust currently does.

That's not the case in the article: the clang compilation was using -Ofast, which enables -ffast-math.

@Centril Centril added the T-lang Relevant to the language team, which will review and decide on the PR/issue. label Oct 24, 2019
@JRF63
Contributor

JRF63 commented Nov 30, 2020

-ffast-math analysis also applies to vector intrinsics. I don't think we want to create a _fast version of every floating-point intrinsic for every architecture. Turns out this was clang lowering _mm_cmpord_ps to a native LLVM IR fcmp.

Relevant part of the LLVM docs. I read that as: function calls on floats or floating-point vectors (that return float types?) are amenable to fast-math optimizations.

@sollyucko
Contributor

How would the attribute work if with_attribute called with_slow_sqrt or vice versa?

@jgarvin

jgarvin commented Aug 25, 2021

@LiamTheProgrammer The set of all operations is vast because it includes intrinsics: anything here with __m512, __m512d, __m256, __m256d, __m128, __m128d, etc.

Anecdotally, in the C++ code bases I've worked on that used fast math, they've generally done the inverse: use fast math everywhere, then hand-annotate the specific functions where it would be problematic with pragmas that disable the optimization. This works in practice because if you care strongly about floating-point performance you're probably also using vector operations, and vector operations tend to mitigate precision problems (e.g. a vectorized dot product effectively has 8 separate accumulators instead of the one in a scalar implementation).
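A scalar sketch of that multi-accumulator mitigation; splitting the reduction into independent lanes is what a vectorized dot product does implicitly:

// Dot product with four independent accumulators. Reassociating the sum
// this way both enables vectorization and avoids one long serial chain
// of rounding error.
fn dot4(a: &[f32], b: &[f32]) -> f32 {
    assert_eq!(a.len(), b.len());
    let mut acc = [0.0f32; 4];
    let chunks = a.len() / 4;
    for i in 0..chunks {
        for lane in 0..4 {
            let j = i * 4 + lane;
            acc[lane] += a[j] * b[j];
        }
    }
    // Handle the remainder, then reduce the four lanes.
    let tail: f32 = (chunks * 4..a.len()).map(|j| a[j] * b[j]).sum();
    tail + acc.iter().sum::<f32>()
}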

It's almost guaranteed that in a system where you have to bless particular operations that somebody is going to ship a crate with a newtype wrapper for floats that just lists every single operation and that will become the way most people end up using the operations.

@Artoria2e5
Contributor

Artoria2e5 commented Sep 21, 2021

Adding a wishlist item for when this eventually happens: Rust should get a way to temporarily prevent reordering in a certain part of an expression. Fortran compilers usually reorder as they please, but avoid breaking up parenthesized expressions. (I mean, I would love to have it in C too…)

@workingjubilee
Member

However these optimizations are approached, the consequences can be quite dire: shared objects emitted by some compilers with this style of optimization can, upon being linked to, change the floating-point CSR for the entire program and leave it that way, altering results in unrelated code. Rust code is more commonly a standalone binary, largely in control of its own execution and not interfering with others, rather than a .so, .dll, or other dynamic library, but the day that people frequently link against Rust dylibs is not far off:

https://bugzilla.redhat.com/show_bug.cgi?id=1127544

@Artoria2e5
Contributor

Saw that thread too! Still not sure whether it affects Rust at all — crtfastmath.o is a linkage thing and the decision seems to be due to the compiler driver, not the backend codegen.

@thomcc
Member

thomcc commented Sep 21, 2021

Everywhere I've worked using Rust, people have been linking against Rust dylibs, so it's already fairly common in my experience, just not when the program itself is written in Rust. That said, I don't think this needs much worry so long as we don't do the wrong thing when calling the linker.

I think this is a very tricky problem, and the right solution might be to have different scope-enabled fast-math options. For a while I had an RFC I was working on that was going to propose enabling them in a manner similar to target_feature, which I think would be a good model for it. It explicitly allowed you to say "this function requires strict floating-point semantics, even if relaxed semantics are allowed in the caller", but by default you'd inherit them from the caller...

There are a lot of cases where this can benefit stuff in portable_simd, and without some way of doing this, portable handling of edge cases like NaNs could easily make things... substantially slower (for example, fixing rust-lang/stdarch#1155 sped up my code in a tight loop by 50%, and the code had... more to do than just min)

That said, I'm very sympathetic to the point made here: https://twitter.com/johnregehr/status/1440090854383706116, that defining "fast-math" in terms of optimizations performed rather than semantics is just going to be a mess.

@sollyucko
Contributor

Drawing some inspiration from the CakeML paper, perhaps we could have an annotation to mark possible values (ranges and inf/NaN), and an annotation to allow any value in the range spanned by every combination of real-number and floating-point evaluation. (This should allow widening and fma, I think? It might require some tolerance for rounding in the wrong direction; perhaps returning an adjacent floating-point value, 1 ulp away, should be allowed.) There could also be some way to specify looser error bounds (e.g. within some factor of either end of the range described).
Additionally, to deal with the issue of a subexpression being optimized differently in different places, perhaps it would have to be treated as a variable ("let binding") when optimizing the outer expression, while itself being optimized independently. Also, should calling the same function multiple times with the same value be guaranteed to return the same output? That would probably require more attention around function inlining.

@workingjubilee
Member

Still not sure whether it affects Rust at all — crtfastmath.o is a linkage thing and the decision seems to be due to the compiler driver, not the backend codegen.

Directly? No. I am noting it more as a potential consequence of missteps: we shouldn't allow Rust code to be a "bad citizen" that changes the results of unrelated programs, so we should be careful about e.g. allowing changes to the FPCSR as an optimization.

I agree with a scope-oriented model that would allow defining functions with stricter or looser FP semantics, broadly similar to target_feature. I envision something more like a particular kind of closure, but either way it would inherently describe the boundaries of such optimizations, and yes, it would be a massive boon to the portable SIMD planning, since such sites are likely to see such annotations anyway.

@RalfJung
Member

https://simonbyrne.github.io/notes/fastmath/ makes some good points about the perils of fast-math.

@JRF63
Contributor

JRF63 commented Nov 16, 2021

There are quite a lot of comments about rounding modes. Are they about GCC and other backends? I'm pretty sure LLVM's fast-math flags don't even touch rounding modes, so there shouldn't be any problem with fast-math-enabled Rust libraries messing up other code that links to them.

Besides, couldn't we already do the really dangerous floating-point environment stuff like

#![feature(link_llvm_intrinsics)]
extern "C" {
    // LLVM intrinsic that sets the current floating-point rounding mode.
    #[link_name = "llvm.set.rounding"]
    fn set_rounding(mode: i32);
}

and also via the CSR intrinsics in core::arch?

@stephanemagnenat

stephanemagnenat commented Jun 13, 2023

Just a note from a user here, attempting to make progress on Rust CV by optimizing the Akaze visual feature detector. The lack of even opt-in, location-specific fast math (such as fadd_fast and fmul_fast) on stable hinders the use of Rust in some key algorithms for computer vision, robotics and augmented reality applications. For example, in some cases the same simple filters are 5-7 times slower than they could be (see this comparison). An alternative is to use SIMD directly, but the portable SIMD initiative has not landed yet, and it is more work to rewrite the code in SIMD than to simply write reasonable loops that get auto-vectorized.

I hope that this user's perspective can be considered when discussing the dangers of fast math, because for language adoption in several modern fields there is also something to lose by not having something like fmul_fast and fadd_fast (as unsafe operations, for example) on stable.
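For concreteness, the kind of opt-in, location-specific use being asked for looks like this on nightly; the convolution kernel here is an illustrative stand-in, not the Akaze code:

#![feature(core_intrinsics)]
use std::intrinsics::{fadd_fast, fmul_fast};

// 1-D convolution inner loop. The `_fast` intrinsics let LLVM
// reassociate and vectorize the reduction; the caller must guarantee
// no NaN or infinite values, hence the `unsafe`.
fn convolve(signal: &[f32], kernel: &[f32], out: &mut [f32]) {
    for (i, o) in out.iter_mut().enumerate() {
        let mut acc = 0.0f32;
        for (k, &w) in kernel.iter().enumerate() {
            acc = unsafe { fadd_fast(acc, fmul_fast(signal[i + k], w)) };
        }
        *o = acc;
    }
}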

@jedbrown
Contributor

Are you able to probe (perhaps most easily in the C code since clang exposes flags for each) which of the flags are necessary? In particular, some like -ffinite-math-only must be unsafe in Rust, while others like -ffp-contract=fast can be made safe (with suitable intrinsics on stable).

@stephanemagnenat

stephanemagnenat commented Jun 13, 2023

OK, so using the C code in the comparison, enabling -Ofast leads to a factor-19 improvement over -O3! The smallest subset that produces the improvement is -fno-signed-zeros -fassociative-math together; removing either of them cancels it.

The comparison code is likely an extreme case as it looks like the compiler could really inline a lot.

@workingjubilee
Member

@stephanemagnenat I assume you're using x86? What about with RUSTFLAGS=-Ctarget-feature=+fma?

@Juke-Crusader

Does anyone know if the developers even remember that we need -ffast-math?

@stephanemagnenat

stephanemagnenat commented Jul 18, 2023

@workingjubilee yes, I'm using x86-64. Using that bench, I got worse results with -Ctarget-feature=+fma (except for the C version, which sees an 8% improvement):

cargo +nightly bench
[...]
test tests::bench_alice_convolution_parallel ... bench:      36,066 ns/iter (+/- 1,526)
test tests::bench_alice_convolution_serial   ... bench:       1,511 ns/iter (+/- 2)
test tests::bench_bjorn3_convolution         ... bench:       4,078 ns/iter (+/- 12)
test tests::bench_dodomorandi_convolution    ... bench:       3,459 ns/iter (+/- 17)
test tests::bench_pcpthm_convolution         ... bench:       1,507 ns/iter (+/- 5)
test tests::bench_zicog_convolution          ... bench:       1,595 ns/iter (+/- 3)
test tests::bench_zicog_convolution_fast     ... bench:       1,597 ns/iter (+/- 4)
test tests::bench_zicog_convolution_safe     ... bench:       3,456 ns/iter (+/- 12)
test tests::bench_zso_convolution            ... bench:      13,078 ns/iter (+/- 7)
test tests::bench_zso_convolution_ffi        ... bench:       1,209 ns/iter (+/- 48)

vs

RUSTFLAGS=-Ctarget-feature=+fma cargo +nightly bench
[...]
test tests::bench_alice_convolution_parallel ... bench:      36,376 ns/iter (+/- 1,729)
test tests::bench_alice_convolution_serial   ... bench:       6,499 ns/iter (+/- 74)
test tests::bench_bjorn3_convolution         ... bench:       4,057 ns/iter (+/- 24)
test tests::bench_dodomorandi_convolution    ... bench:       3,463 ns/iter (+/- 28)
test tests::bench_pcpthm_convolution         ... bench:       6,602 ns/iter (+/- 25)
test tests::bench_zicog_convolution          ... bench:       3,723 ns/iter (+/- 71)
test tests::bench_zicog_convolution_fast     ... bench:       6,787 ns/iter (+/- 40)
test tests::bench_zicog_convolution_safe     ... bench:       3,437 ns/iter (+/- 15)
test tests::bench_zso_convolution            ... bench:      13,073 ns/iter (+/- 86)
test tests::bench_zso_convolution_ffi        ... bench:       1,120 ns/iter (+/- 38)

This is on an AMD Ryzen 9 7950X CPU. The result is somewhat surprising: +fma seems to break std::simd vectorization.

@workingjubilee
Member

That's... Very Weird, given that usually it significantly improves it.

@stephanemagnenat

That's... Very Weird, given that usually it significantly improves it.

I fully agree. I don't think I made a mistake, but that's always a possibility. It would be interesting for others to try to replicate this little experiment; it is very easy to do: just clone the repo and run the benchmark with and without the +fma flag.

@Demindiro

@stephanemagnenat I observe the same slowdown with -C target-cpu=native (CPU is 5800X)

$ cargo bench
...
test tests::bench_alice_convolution_parallel ... bench:      16,976 ns/iter (+/- 2,312)
test tests::bench_alice_convolution_serial   ... bench:       1,803 ns/iter (+/- 83)
test tests::bench_bjorn3_convolution         ... bench:       5,006 ns/iter (+/- 183)
test tests::bench_dodomorandi_convolution    ... bench:       4,033 ns/iter (+/- 124)
test tests::bench_pcpthm_convolution         ... bench:       1,570 ns/iter (+/- 23)
test tests::bench_zicog_convolution          ... bench:       1,800 ns/iter (+/- 34)
test tests::bench_zicog_convolution_fast     ... bench:       1,803 ns/iter (+/- 41)
test tests::bench_zicog_convolution_safe     ... bench:       4,207 ns/iter (+/- 106)
test tests::bench_zso_convolution            ... bench:      17,090 ns/iter (+/- 525)
test tests::bench_zso_convolution_ffi        ... bench:       1,750 ns/iter (+/- 21)

RUSTFLAGS='-C target-cpu=native' cargo bench
...
test tests::bench_alice_convolution_parallel ... bench:      18,092 ns/iter (+/- 2,504)
test tests::bench_alice_convolution_serial   ... bench:       5,406 ns/iter (+/- 94)
test tests::bench_bjorn3_convolution         ... bench:       5,298 ns/iter (+/- 111)
test tests::bench_dodomorandi_convolution    ... bench:       4,285 ns/iter (+/- 19)
test tests::bench_pcpthm_convolution         ... bench:       7,642 ns/iter (+/- 73)
test tests::bench_zicog_convolution          ... bench:       5,214 ns/iter (+/- 42)
test tests::bench_zicog_convolution_fast     ... bench:       4,979 ns/iter (+/- 90)
test tests::bench_zicog_convolution_safe     ... bench:       4,447 ns/iter (+/- 129)
test tests::bench_zso_convolution            ... bench:      17,767 ns/iter (+/- 799)
test tests::bench_zso_convolution_ffi        ... bench:       1,613 ns/iter (+/- 14)

@Demindiro

... which in hindsight is obvious, as it enables the feature 🤦

@StHagel

StHagel commented Jun 14, 2024

While I do think that adding an option to enable fast-math in Rust is definitely desirable, I don't like the idea of making a new type for it.

I would rather make it an optional compiler flag that is not set by default in --release. That way I can run my existing code with fast-math enabled if I want to, and without it if I don't. A new type would require me to either change all f64s to f64fast in my entire codebase, or go through every function deciding whether f64fast makes sense there, adding var as f64 and var as f64fast all over the place.

@NobodyXu
Contributor

Putting it in the profile, allowing each crate to set it, and allowing the binary crate to override it per-crate seems to make sense.

You could then enable it for your library crate if you know it is safe, and binary crates could disable it if it turns out not to work, or enable it if they know what they are doing.

@RalfJung
Member

RalfJung commented Jun 14, 2024

Making it a compile flag that applies to other crates sounds like a terrible idea. When you download a crate from the internet, you can't know whether it was written in a way that is compatible with fast-math semantics. It is very important not to apply fast-math semantics to code that assumes IEEE semantics.

We could have a crate-level attribute meaning "all floating-point ops in this crate are fast-math", but under no circumstances should you be able to force fast-math on other people's code. That would ultimately even undermine Rust's safety promise.

@StHagel

StHagel commented Jun 14, 2024

We could have a crate-level attribute meaning "all floating-point ops in this crate are fast-math", but under no circumstances should you be able to force fast-math on other people's code.

That sounds like a good path to me: being able to set an attribute in the Cargo.toml which basically means "this crate is fast-math-safe". Compiling your code with fast-math on would then check every dependency for that attribute and compile it accordingly.

@usamoi
Contributor

usamoi commented Jun 14, 2024

I use core::intrinsics::f*_fast or core::intrinsics::f*_algebraic to hint the compiler toward auto-vectorization, and it totally works. The only thing I care about is that these functions are gated behind core_intrinsics, which seems quite awkward.
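For reference, a sketch with the algebraic variants, which permit reassociation without the undefined behavior of the _fast intrinsics; they are safe to call, though still nightly-only:

#![feature(core_intrinsics)]
use std::intrinsics::{fadd_algebraic, fmul_algebraic};

// Dot product that LLVM may reassociate and vectorize. Unlike the
// `_fast` intrinsics, NaN and infinite inputs are fine; the result may
// merely differ slightly from strict IEEE evaluation order.
fn dot(a: &[f32], b: &[f32]) -> f32 {
    a.iter()
        .zip(b)
        .fold(0.0f32, |acc, (&x, &y)| fadd_algebraic(acc, fmul_algebraic(x, y)))
}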
