LIBM - sin, cos, ln, exp and friends. #126

andy-thomason · 2021-05-31T18:13:27Z

This is a placeholder for implementing libm functions in stdsimd.

I would appreciate help and advice to integrate with existing work and create a wide range of tests.

programmerjake · 2021-05-31T20:02:19Z

We'd want to translate sin/cos/etc. to LLVM's built-in functions since there are some architectures with native instructions for those (Libre-SOC's SimpleV will have them, SPIR-V has them).

Also, mul_add should (arguably) translate to llvm's fma intrinsic: #102

andy-thomason · 2021-06-01T08:18:06Z

It is worth noting that LLVM's default behaviour is to call libm on each scalar element,
which is hundreds (possibly thousands) of times slower.

Whilst many cpus have had microcoded trig functions, they were not widely used by game
developers as they tend to have very low throughput. It would be interesting to compare the two.
It is very hard to beat densely packed FMAs.

Time will tell, I guess.

andy-thomason · 2021-06-01T08:20:48Z

Worth looking at the intel trig intrinsics.

https://software.intel.com/sites/landingpage/IntrinsicsGuide/#techs=SVML&expand=5236,5236&text=_mm256_sin_ps

If we are to have a race.

programmerjake · 2021-06-01T08:41:44Z

It is worth noting that LLVM's default behaviour is to call libm on each scalar element,
which is hundreds (possibly thousands) of times slower.

Yup, part of #109 is adding support for vector-math to llvm so llvm will generate calls to vector-math when there aren't good native instructions available.

Whilst many cpus have had microcoded trig functions, they were not widely used by game
developers as they tend to have very low throughput. It would be interesting to compare the two.

In the case of rustc targeting SPIR-V or Libre-SOC's SimpleV, using sin intrinsics is by far the best approach since many GPUs (including Libre-SOC's cores) do have high-performance hardware for calculating sin/cos/etc. faster than using a bunch of FMAs.

It is very hard to beat densely packed FMAs.

True.

Time will tell, I guess.

workingjubilee

Hello! Thank you for your patience with review!

While I agree that we would want a hardware-accelerable sin(), I feel doubtful that such is a priority: for the vast majority of architectures, we can better-specify what we need using software. As such, I would consider "and hardware-accelerated where possible" to be the optimization here, and so we can start with strictly software.

I am more interested in adding tests to this that check that we are within some measure of sanity of the current scalar sin implementation, using the existing proptest framework. Unlike most of those tests, a fairly fuzzy "within margins" check is fine here. You can find most of those in macro form in https://github.com/rust-lang/stdsimd/blob/7b66032ed511229410b0879d384b06cfcacf0538/crates/core_simd/tests/ops_macros.rs

I am also curious if this is more or less precise than such, on average.

workingjubilee · 2021-06-29T20:04:35Z

crates/core_simd/src/libmf32.rs

+    // This does not seem to be implemented, yet.
+    #[inline]
+    fn mul_add(self, mul: Self, add: Self) -> Self {
+        self * mul + add
+    }


Suggested change

// This does not seem to be implemented, yet.

#[inline]

fn mul_add(self, mul: Self, add: Self) -> Self {

self * mul + add

}

A hardware-accelerable version of this was implemented in #138.

Is that version compiled to libm calls when hardware support isn't guaranteed to be available? Feels like a dubious tradeoff for this if so, especially given that this is already computing an approximated fsin.

Yes, because it is IEEE-754 compliant.

In general since we have a path forward for fixing the libm problem in LLVM we shouldn't get hung up on it elsewhere.

workingjubilee · 2021-06-29T20:12:19Z

crates/core_simd/src/libmf32.rs

+    /// assert!((x.sin() - SimdF32::<8>::splat((1.0_f32).sin())).abs().horizontal_max() < 0.0000002);
+    /// ```
+    #[inline]
+    pub fn sin(&self) -> Self {


Hm. Would this implementation be shared for f64, same values but different types? If so, we could macro this for applying to both, though my math brain is feeling deeply skeptical that such is how it works at the moment. It's been A Minute since I did trigonometry.

calebzulawski · 2021-06-29T21:20:20Z

@workingjubilee I think I disagree with that sentiment--we already have sin and cos platform intrinsics, right? And we could choose to link against libmvec (which is limited but does provide trig) right now if we wanted to, which would support the vast majority of use cases. I'm not sure it makes sense when that seems to be a strictly better choice to me.

workingjubilee · 2021-06-29T21:53:55Z

@workingjubilee I think I disagree with that sentiment--we already have sin and cos platform intrinsics, right? And we could choose to link against libmvec (which is limited but does provide trig) right now if we wanted to, which would support the vast majority of use cases. I'm not sure it makes sense when that seems to be a strictly better choice to me.

Ah, well, an implementation of linkage against libmvec would also be fine then, I suppose?

andy-thomason · 2021-06-30T09:24:15Z

As an update.

Sorry about the lack of progress, but the day job is intervening.

I now have a full set of f32 and most of the f64 implementations.

I still need to wrap them in a trait and add optional limit checks and
low-value approximations (to make ULP fans happy).

Linking with a shared library (if it is) will work but is not desirable as the call overhead and
lack of inlinability will substantially reduce performance.

This is what happens currently with the scalar libm, and the performance is a couple of orders
of magnitude off optimal.

andy-thomason · 2021-07-01T17:26:38Z

Could I get some help creating a test strategy for this?

I now have a more or less complete set (draft) of all the libm functions use by std.

With a few exceptions (such as tan) they have about the same error profile as the
std (glibc) versions.

I have a number of tests which include tables of accurate results generated using BigDecimal
which we can compare the std functions and our functions for errors.

We would ideally iterate over the generated table comparing the vector results with the
accurate results and ensuring that the error falls within bounds.

Another test would be to compare most of the range of values with the std version to
check that this is in bounds, also.

I am hoping to publish two versions of this, a fast version and a conforming version.

The fast versions use single polynomial evaluations (like my stub example) and the
conforming versions use segmented evaluations and have extra checks for small
values and overflows. Other permutations are possible.

The current tests use the same function and test for f32 and f64, which is not the case
for libm functions. The f64 versions have more terms and may even have different modalities.

Any help would be welcome.

The current scalar generated functions are here:

https://github.com/extendr/doctor-syn/tree/accurate-evaluation/tests

workingjubilee · 2022-01-19T20:16:00Z

Hello! I am really sorry for not getting back to you earlier: we got sidetracked by "wait, how do we get things moving into std?" and "wait, how do we deal with any of the libm problem at all?" on our end. As a result, we have moved float functions that may be scalarized by LLVM to call libm (and thus depend on runtime support) into a new crate that implements a new trait, added in this commit:
ecc00ef

That trait will become available as std::simd::StdFloat with an upcoming (soon?) sync PR into libstd. Ideally we will find a more lasting solution, but if you want to call fn mul_add conditionally, that is where that function is currently exposed.

I will try to get back to you very shortly on the test strategy question.

andy-thomason · 2022-01-20T09:43:13Z

Thanks, @workingjubilee, I've been following your progress and you have been doing great things these last few months.

I now have the generator as a tool - https://github.com/doctor-syn/doctor-syn/tree/main/libmgen. This generates Rust and C
code, but could easilly also generate SIMD code.

The tool generates accurately generated tables of (x, y) values (to 40+ digits) for testing. Accuracy of f64 functions
are typically good to n * 10^-16 where n, a small integer, depends on the input value - floating point numbers
are more accurate when small. We could make more accurate versions at some performance cost - for example using
gather and look-up tables - and could make versions which give good accuracy for small input values (sin(x) = x for small x).

We don't handle range limits and NANs at the moment, but it is easy to add these. LLVM's round, min and max
are suboptimal, no doubt due to compatibility with libm, which has strange NAN behaviour - max(NAN, x) = x.

Use cases are quite variable from games which are usually latency sensitive to analysis which is throughput sensitive.

workingjubilee · 2022-02-24T23:41:05Z

Ah, apparently "try" did not happen very fast.

Interesting, the tests currently fail on my local in debug, but only for the asinh:

---- test_libm32::asinh_f32 stdout ----
thread 'test_libm32::asinh_f32' panicked at '
x     =[ -0.9978027343750000,  -0.9977722167968750,  -0.9977416992187500,  -0.9977111816406250]
e     =[  0.0000000000000000,  -0.0000001192092896,  -0.0000000596046448,   0.0000002980232239]
limit =[  0.0000002384185791,   0.0000002384185791,   0.0000002384185791,   0.0000002384185791]
vector=[ -0.8798189759254456,  -0.8797975182533264,  -0.8797759413719177,  -0.8797540068626404]
scalar=[ -0.8798189759254456,  -0.8797973990440369,  -0.8797758817672729,  -0.8797543048858643]
vector=[000000000000bf613bd1, 000000000000bf613a69, 000000000000bf6138ff, 000000000000bf61378f]
scalar=[000000000000bf613bd1, 000000000000bf613a67, 000000000000bf6138fe, 000000000000bf613794]
vector_fn=|x: f32x4| x.asinh()', crates/std_float/src/test_libm32.rs:651:5
stack backtrace:
   0: rust_begin_unwind
             at /rustc/532d3cda90b8a729cd982548649d32803d265052/library/std/src/panicking.rs:584:5
   1: core::panicking::panic_fmt
             at /rustc/532d3cda90b8a729cd982548649d32803d265052/library/core/src/panicking.rs:142:14
   2: std_float::test_libm32::asinh_f32
             at ./src/test_libm32.rs:651:5
   3: std_float::test_libm32::asinh_f32::{{closure}}
             at ./src/test_libm32.rs:647:1
   4: core::ops::function::FnOnce::call_once
             at /rustc/532d3cda90b8a729cd982548649d32803d265052/library/core/src/ops/function.rs:227:5
   5: core::ops::function::FnOnce::call_once
             at /rustc/532d3cda90b8a729cd982548649d32803d265052/library/core/src/ops/function.rs:227:5

Doesn't look much different on my Ryzen with optimizations on:

---- test_libm32::asinh_f32 stdout ----
thread 'test_libm32::asinh_f32' panicked at '
x     =[ -0.9978027343750000,  -0.9977722167968750,  -0.9977416992187500,  -0.9977111816406250]
e     =[  0.0000000000000000,  -0.0000001192092896,  -0.0000000596046448,   0.0000002980232239]
limit =[  0.0000002384185791,   0.0000002384185791,   0.0000002384185791,   0.0000002384185791]
vector=[ -0.8798189759254456,  -0.8797975182533264,  -0.8797759413719177,  -0.8797540068626404]
scalar=[ -0.8798189759254456,  -0.8797973990440369,  -0.8797758817672729,  -0.8797543048858643]
vector=[000000000000bf613bd1, 000000000000bf613a69, 000000000000bf6138ff, 000000000000bf61378f]
scalar=[000000000000bf613bd1, 000000000000bf613a67, 000000000000bf6138fe, 000000000000bf613794]
vector_fn=|x: f32x4| x.asinh()', crates/std_float/src/test_libm32.rs:651:5
stack backtrace:
   0: rust_begin_unwind
             at /rustc/532d3cda90b8a729cd982548649d32803d265052/library/std/src/panicking.rs:584:5
   1: core::panicking::panic_fmt
             at /rustc/532d3cda90b8a729cd982548649d32803d265052/library/core/src/panicking.rs:142:14
   2: std_float::test_libm32::asinh_f32
   3: core::ops::function::FnOnce::call_once
             at /rustc/532d3cda90b8a729cd982548649d32803d265052/library/core/src/ops/function.rs:227:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace.

andy-thomason · 2022-02-28T11:41:11Z

Thanks for looking at this @workingjubilee .

For the moment, I'll just push the limit up a bit. We can do some more work on accuracy once the complete set is in place.

workingjubilee · 2022-03-08T19:04:09Z

Thanks for looking at this @workingjubilee .

For the moment, I'll just push the limit up a bit. We can do some more work on accuracy once the complete set is in place.

Yeah, I agree that accuracy-tuning is not as important as having the option for an implementation that is "std-invariant".

Thank you for adjusting for formatting!

workingjubilee · 2022-03-08T19:06:03Z

---- test_libm::ln stdout ----
thread 'main' panicked at '
x     =[ -1.0000000000000000,  -1.0000000000000000,  -1.0000000000000000,  -1.0000000000000000]
e     =[                 NaN,                  NaN,                  NaN,                  NaN]
limit =[  0.0000002384185791,   0.0000002384185791,   0.0000002384185791,   0.0000002384185791]
vector=[                 NaN,                  NaN,                  NaN,                  NaN]
scalar=[                 NaN,                  NaN,                  NaN,                  NaN]
vector=[000000000000ffc00000, 000000000000ffc00000, 000000000000ffc00000, 000000000000ffc00000]
scalar=[0000000000007fbfffff, 0000000000007fbfffff, 0000000000007fbfffff, 0000000000007fbfffff]
vector_fn=|x: vector_type| x.ln()', crates/std_float/src/test_libm.rs:416:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

---- test_libm::ln stdout ----
thread 'main' panicked at '
x     =[ -1.0000000000000000,  -1.0000000000000000,  -1.0000000000000000,  -1.0000000000000000]
e     =[                 NaN,                  NaN,                  NaN,                  NaN]
limit =[  0.0000002384185791,   0.0000002384185791,   0.0000002384185791,   0.0000002384185791]
vector=[                 NaN,                  NaN,                  NaN,                  NaN]
scalar=[                 NaN,                  NaN,                  NaN,                  NaN]
vector=[000000000000ffc00000, 000000000000ffc00000, 000000000000ffc00000, 000000000000ffc00000]
scalar=[0000000000007fbfffff, 0000000000007fbfffff, 0000000000007fbfffff, 0000000000007fbfffff]
vector_fn=|x: vector_type| x.ln()', crates/std_float/src/test_libm.rs:416:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace

Hmm, @programmerjake is there anything weird about MIPS we should be alert to here? That's the only one we're failing this test on.

programmerjake · 2022-03-08T20:23:24Z

Hmm, @programmerjake is there anything weird about MIPS we should be alert to here? That's the only one we're failing this test on.

I'd guess it might have something to do with old MIPS having the signalling/quiet NaN bit inverted from what's in later IEEE 754 standards?

programmerjake · 2022-03-08T20:27:34Z

if we're checking for bit equality, floats should only check like so:

fn check_eq(a: f32, b: f32) -> bool {
    if a.is_nan() || b.is_nan() {
        a.is_nan() && b.is_nan()
    } else {
        a.to_bits() == b.to_bits()
    }
}

andy-thomason · 2022-03-09T10:47:46Z

I was hoping to replicate NaNs exactly, but there seem to be no standards in libm variants.

MIPs uses signalling NaNs and some libs use NaN in place of Inf.

For example:

log2(-1.0) Should be (non-signalling) NaN -> gnu uses -NaN
log2(0.0) Should be -Inf

So I have to agree with @programmerjake that we should just test for NaNness

We also have some issues with denormals on some CPUs which affect std::fxx::MAX and MIN_POSITIVE.
I generally prefer if libs give the same answer on all platforms, rather than specialising them.

workingjubilee · 2022-03-09T20:10:38Z

Hm. Which CPUs are having the denormal problems?

andy-thomason · 2022-03-12T00:36:48Z

Mostly i586 variants failing. Most likely a problem with the os libm, but could be using 80387 instructions.

workingjubilee · 2022-03-12T16:45:04Z

Oh, yes. Those are going to attempt to use the x87 FPU for f64 at the very least, maybe for f32s as well.

andy-thomason · 2022-03-14T12:21:03Z

I've added some initial benchmarks:

running 3 tests
test sin_f32 ... bench: 4,289 ns/iter (+/- 21)
test sin_f32x16 ... bench: 292 ns/iter (+/- 78)
test sin_f32x4 ... bench: 608 ns/iter (+/- 18)

These show about a 20x performance increase over libm.
Larger vector sizes allow interleaving of instructions even though we
only have f32x8 registers (ymms) on my laptop.

You need to build with:

RUSTFLAGS="-C target-cpu=native" cargo bench

On modern hardware or the code gen is apalling.

andy-thomason · 2022-03-14T12:25:07Z

I would request:

Simd::from(scalar) instead of splat -> this would enable functions to take Into<Simd> everywhere.
Much larger vector sizes f32x64 and f32x128 to gain more interleave.

workingjubilee · 2022-04-27T20:33:48Z

We discussed the idea of using Into<Simd> and we decided that, even if we expose a trait that offers such a behavior (splat scalars and the identity function on vectors), we don't necessarily want to conflate the usage of From::from (which is very highly general) with the "autosplat" implication like that.

andy-thomason · 2022-05-03T09:57:39Z

Hi @workingjubilee

Thanks for the feedback. I'll try to attend one of your Monday meetings soon, as it has been a while.

I'll try to dedicate some days for this, but my dance card is full of training and crypto!

burrbull · 2022-08-06T09:59:23Z

Hello.
burrbull/sleef-rs#30
I try to port sleef crate from packed_simd on portable_simd but I get error in release:

  Compiling sleef v0.1.0 (/home/runner/work/sleef-rs/sleef-rs)
error: failed to parse bitcode for LTO module: Explicit gep type does not match pointee type of pointer operand (Producer: 'LLVM14.0.6-rust-1.64.0-nightly' Reader: 'LLVM 14.0.6-rust-1.64.0-nightly')
Error: failed to parse bitcode for LTO module: Explicit gep type does not match pointee type of pointer operand (Producer: 'LLVM14.0.6-rust-1.64.0-nightly' Reader: 'LLVM 14.0.6-rust-1.64.0-nightly')
error: could not compile `sleef` due to previous error
Error: The process '/home/runner/.cargo/bin/cargo' failed with exit code 101

In dev mode all compiles and works.
No any unsafe in code.

UPD. Fixed. Looks like issues in gather_or_default

andy-thomason · 2022-08-07T19:03:56Z

It would be interesting to benchmark against sleef. I had some good conversations with the author some time ago,
but of course sleef is a very different thing!

It would also be useful to merge this branch and any help to do that would be welcome.

burrbull · 2022-08-09T05:13:51Z

https://github.com/burrbull/sleef-rs/runs/7741369313?check_suite_focus=true

"_core" are your impls

burrbull · 2022-08-09T05:15:19Z

Your implementation is very fast when "fma" is enabled, but what about accuracy?

andy-thomason · 2022-08-09T10:37:22Z

This implementation has the same accuracy as libm and several times the performance. The scalar version is
also quite significantly faster.

The tests test against libm implementations of various kinds which vary in accuracy. It is very hard to get
the last bit of accuracy because of the "Tablemakers paradox". At the expense of a table lookup - which
becomes exponentially more costly in multithreaded processes - you can get an extra bit of accuracy
but we choose not to do this.

As with all SIMD code, the only way to get it to run efficiently is to use modern hardware, ie. -C target-cpu=native.
Older CPUs do not fully support SIMD operations or have such short vectors that there is no benefit.

It is a shame that this library only supports up to 512 bit vectors as longer ones exist and cause better performance
even on shorter register machines due to instruction interleaving.

programmerjake · 2022-08-09T10:59:07Z

It is a shame that this library only supports up to 512 bit vectors

That's incorrect...it supports Simd<u64, 64> which is 4096 bits. it supports all element types up to length 64 (we're planning on increasing that later).
https://rust.godbolt.org/z/fa7PMvrza

andy-thomason · 2022-08-09T14:29:50Z

Excellent.

miguelraz · 2023-07-14T14:17:15Z

Bump - how can we push out this work over the finish line?

Additionally, I see we're missing a sincos implementation.

andy-thomason force-pushed the feature/libm branch from 2196eda to 7ca1bfb Compare May 31, 2021 18:15

andy-thomason mentioned this pull request May 31, 2021

Create a Vector Math library to allow SimdF32::sin and similar to work in core #109

Open

workingjubilee requested changes Jun 29, 2021

View reviewed changes

workingjubilee force-pushed the master branch 4 times, most recently from cc78a1c to ab8eec7 Compare October 22, 2021 01:21

andy-thomason force-pushed the feature/libm branch from b15503c to ca31165 Compare January 20, 2022 12:50

andy-thomason added 2 commits February 14, 2022 15:49

Rebase

f6a47fa

Fix cast issues.

e29bd9b

andy-thomason force-pushed the feature/libm branch from b674a39 to e29bd9b Compare February 14, 2022 16:15

andy-thomason added 3 commits February 14, 2022 16:50

Log2 test

6143e0f

Tweak log2 terms

be3b9a4

Nan and Inf checks and more.

497b5a1

andy-thomason added 4 commits March 7, 2022 14:24

32 bit and 64 bit impls passing.

0f51ca4

fmt

1e90dca

Supress clippy warnings.

245cb5f

Fix windows tests.

a1d0487

andy-thomason added 5 commits March 7, 2022 19:20

Improved atan2 with odd evaluation.

86d9a08

Experiments with CI failures.

ded1208

Ignore sign on NaN checks.

5a5472c

Loosen accuracy requirements for 586.

db6f458

Ignore NaN test in log2.

6939ee7

Update NaN check.

0bafe8c

Remove hypot and ln infinity tests.

621ecb7

andy-thomason mentioned this pull request Mar 12, 2022

Math support in core rust-lang/rfcs#2505

Open

Some initial benchmarks.

84a7d0f

More benches

29281fd

LIBM - sin, cos, ln, exp and friends. #126

Are you sure you want to change the base?

LIBM - sin, cos, ln, exp and friends. #126

Conversation

andy-thomason commented May 31, 2021

programmerjake commented May 31, 2021

andy-thomason commented Jun 1, 2021

andy-thomason commented Jun 1, 2021

programmerjake commented Jun 1, 2021

workingjubilee left a comment • edited Loading

Choose a reason for hiding this comment

workingjubilee Jun 29, 2021

Choose a reason for hiding this comment

thomcc Jun 29, 2021 • edited Loading

Choose a reason for hiding this comment

calebzulawski Jun 29, 2021

Choose a reason for hiding this comment

workingjubilee Jun 29, 2021

Choose a reason for hiding this comment

calebzulawski commented Jun 29, 2021

workingjubilee commented Jun 29, 2021

andy-thomason commented Jun 30, 2021

andy-thomason commented Jul 1, 2021

workingjubilee commented Jan 19, 2022 • edited Loading

andy-thomason commented Jan 20, 2022

workingjubilee commented Feb 24, 2022

andy-thomason commented Feb 28, 2022

workingjubilee commented Mar 8, 2022

workingjubilee commented Mar 8, 2022

programmerjake commented Mar 8, 2022

programmerjake commented Mar 8, 2022

andy-thomason commented Mar 9, 2022

workingjubilee commented Mar 9, 2022

andy-thomason commented Mar 12, 2022

workingjubilee commented Mar 12, 2022

andy-thomason commented Mar 14, 2022 • edited Loading

andy-thomason commented Mar 14, 2022

workingjubilee commented Apr 27, 2022

andy-thomason commented May 3, 2022

burrbull commented Aug 6, 2022 • edited Loading

andy-thomason commented Aug 7, 2022

burrbull commented Aug 9, 2022 • edited Loading

burrbull commented Aug 9, 2022

andy-thomason commented Aug 9, 2022

programmerjake commented Aug 9, 2022

andy-thomason commented Aug 9, 2022

miguelraz commented Jul 14, 2023

workingjubilee left a comment •

edited

Loading

thomcc Jun 29, 2021 •

edited

Loading

workingjubilee commented Jan 19, 2022 •

edited

Loading

andy-thomason commented Mar 14, 2022 •

edited

Loading

burrbull commented Aug 6, 2022 •

edited

Loading

burrbull commented Aug 9, 2022 •

edited

Loading