-
Notifications
You must be signed in to change notification settings - Fork 82
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
LIBM - sin, cos, ln, exp and friends. #126
base: master
Are you sure you want to change the base?
Conversation
2196eda
to
7ca1bfb
Compare
We'd want to translate Also, |
It is worth noting that LLVM's default behaviour is to call libm on each scalar element, Whilst many cpus have had microcoded trig functions, they were not widely used by game Time will tell, I guess. |
Worth looking at the intel trig intrinsics. If we are to have a race. |
Yup, part of #109 is adding support for vector-math to llvm so llvm will generate calls to vector-math when there aren't good native instructions available.
In the case of rustc targeting SPIR-V or Libre-SOC's SimpleV, using sin intrinsics is by far the best approach since many GPUs (including Libre-SOC's cores) do have high-performance hardware for calculating sin/cos/etc. faster than using a bunch of FMAs.
True.
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hello! Thank you for your patience with review!
While I agree that we would want a hardware-accelerable sin()
, I feel doubtful that such is a priority: for the vast majority of architectures, we can better-specify what we need using software. As such, I would consider "and hardware-accelerated where possible" to be the optimization here, and so we can start with strictly software.
I am more interested in adding tests to this that check that we are within some measure of sanity of the current scalar sin
implementation, using the existing proptest framework. Unlike most of those tests, a fairly fuzzy "within margins" check is fine here. You can find most of those in macro form in https://github.com/rust-lang/stdsimd/blob/7b66032ed511229410b0879d384b06cfcacf0538/crates/core_simd/tests/ops_macros.rs
I am also curious if this is more or less precise than such, on average.
crates/core_simd/src/libmf32.rs
Outdated
// This does not seem to be implemented, yet. | ||
#[inline] | ||
fn mul_add(self, mul: Self, add: Self) -> Self { | ||
self * mul + add | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// This does not seem to be implemented, yet. | |
#[inline] | |
fn mul_add(self, mul: Self, add: Self) -> Self { | |
self * mul + add | |
} |
A hardware-accelerable version of this was implemented in #138.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is that version compiled to libm calls when hardware support isn't guaranteed to be available? Feels like a dubious tradeoff for this if so, especially given that this is already computing an approximated fsin.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, because it is IEEE-754 compliant.
In general since we have a path forward for fixing the libm problem in LLVM we shouldn't get hung up on it elsewhere.
crates/core_simd/src/libmf32.rs
Outdated
/// assert!((x.sin() - SimdF32::<8>::splat((1.0_f32).sin())).abs().horizontal_max() < 0.0000002); | ||
/// ``` | ||
#[inline] | ||
pub fn sin(&self) -> Self { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm. Would this implementation be shared for f64, same values but different types? If so, we could macro this for applying to both, though my math brain is feeling deeply skeptical that such is how it works at the moment. It's been A Minute since I did trigonometry.
@workingjubilee I think I disagree with that sentiment--we already have |
Ah, well, an implementation of linkage against libmvec would also be fine then, I suppose? |
As an update. Sorry about the lack of progress, but the day job is intervening. I now have a full set of f32 and most of the f64 implementations. I still need to wrap them in a trait and add optional limit checks and Linking with a shared library (if it is) will work but is not desirable as the call overhead and This is what happens currently with the scalar libm, and the performance is a couple of orders |
Could I get some help creating a test strategy for this? I now have a more or less complete set (draft) of all the libm functions use by std. With a few exceptions (such as tan) they have about the same error profile as the I have a number of tests which include tables of accurate results generated using BigDecimal We would ideally iterate over the generated table comparing the vector results with the Another test would be to compare most of the range of values with the std version to I am hoping to publish two versions of this, a fast version and a conforming version. The fast versions use single polynomial evaluations (like my stub example) and the The current tests use the same function and test for f32 and f64, which is not the case Any help would be welcome. The current scalar generated functions are here: https://github.com/extendr/doctor-syn/tree/accurate-evaluation/tests |
cc78a1c
to
ab8eec7
Compare
Hello! I am really sorry for not getting back to you earlier: we got sidetracked by "wait, how do we get things moving into That trait will become available as I will try to get back to you very shortly on the test strategy question. |
Thanks, @workingjubilee, I've been following your progress and you have been doing great things these last few months. I now have the generator as a tool - https://github.com/doctor-syn/doctor-syn/tree/main/libmgen. This generates Rust and C The tool generates accurately generated tables of (x, y) values (to 40+ digits) for testing. Accuracy of f64 functions We don't handle range limits and NANs at the moment, but it is easy to add these. LLVM's round, min and max Use cases are quite variable from games which are usually latency sensitive to analysis which is throughput sensitive. |
b15503c
to
ca31165
Compare
b674a39
to
e29bd9b
Compare
Ah, apparently "try" did not happen very fast. Interesting, the tests currently fail on my local in debug, but only for the asinh: ---- test_libm32::asinh_f32 stdout ----
thread 'test_libm32::asinh_f32' panicked at '
x =[ -0.9978027343750000, -0.9977722167968750, -0.9977416992187500, -0.9977111816406250]
e =[ 0.0000000000000000, -0.0000001192092896, -0.0000000596046448, 0.0000002980232239]
limit =[ 0.0000002384185791, 0.0000002384185791, 0.0000002384185791, 0.0000002384185791]
vector=[ -0.8798189759254456, -0.8797975182533264, -0.8797759413719177, -0.8797540068626404]
scalar=[ -0.8798189759254456, -0.8797973990440369, -0.8797758817672729, -0.8797543048858643]
vector=[000000000000bf613bd1, 000000000000bf613a69, 000000000000bf6138ff, 000000000000bf61378f]
scalar=[000000000000bf613bd1, 000000000000bf613a67, 000000000000bf6138fe, 000000000000bf613794]
vector_fn=|x: f32x4| x.asinh()', crates/std_float/src/test_libm32.rs:651:5
stack backtrace:
0: rust_begin_unwind
at /rustc/532d3cda90b8a729cd982548649d32803d265052/library/std/src/panicking.rs:584:5
1: core::panicking::panic_fmt
at /rustc/532d3cda90b8a729cd982548649d32803d265052/library/core/src/panicking.rs:142:14
2: std_float::test_libm32::asinh_f32
at ./src/test_libm32.rs:651:5
3: std_float::test_libm32::asinh_f32::{{closure}}
at ./src/test_libm32.rs:647:1
4: core::ops::function::FnOnce::call_once
at /rustc/532d3cda90b8a729cd982548649d32803d265052/library/core/src/ops/function.rs:227:5
5: core::ops::function::FnOnce::call_once
at /rustc/532d3cda90b8a729cd982548649d32803d265052/library/core/src/ops/function.rs:227:5 Doesn't look much different on my Ryzen with optimizations on: ---- test_libm32::asinh_f32 stdout ----
thread 'test_libm32::asinh_f32' panicked at '
x =[ -0.9978027343750000, -0.9977722167968750, -0.9977416992187500, -0.9977111816406250]
e =[ 0.0000000000000000, -0.0000001192092896, -0.0000000596046448, 0.0000002980232239]
limit =[ 0.0000002384185791, 0.0000002384185791, 0.0000002384185791, 0.0000002384185791]
vector=[ -0.8798189759254456, -0.8797975182533264, -0.8797759413719177, -0.8797540068626404]
scalar=[ -0.8798189759254456, -0.8797973990440369, -0.8797758817672729, -0.8797543048858643]
vector=[000000000000bf613bd1, 000000000000bf613a69, 000000000000bf6138ff, 000000000000bf61378f]
scalar=[000000000000bf613bd1, 000000000000bf613a67, 000000000000bf6138fe, 000000000000bf613794]
vector_fn=|x: f32x4| x.asinh()', crates/std_float/src/test_libm32.rs:651:5
stack backtrace:
0: rust_begin_unwind
at /rustc/532d3cda90b8a729cd982548649d32803d265052/library/std/src/panicking.rs:584:5
1: core::panicking::panic_fmt
at /rustc/532d3cda90b8a729cd982548649d32803d265052/library/core/src/panicking.rs:142:14
2: std_float::test_libm32::asinh_f32
3: core::ops::function::FnOnce::call_once
at /rustc/532d3cda90b8a729cd982548649d32803d265052/library/core/src/ops/function.rs:227:5
note: Some details are omitted, run with `RUST_BACKTRACE=full` for a verbose backtrace. |
Thanks for looking at this @workingjubilee . For the moment, I'll just push the limit up a bit. We can do some more work on accuracy once the complete set is in place. |
Yeah, I agree that accuracy-tuning is not as important as having the option for an implementation that is "std-invariant". Thank you for adjusting for formatting! |
---- test_libm::ln stdout ----
thread 'main' panicked at '
x =[ -1.0000000000000000, -1.0000000000000000, -1.0000000000000000, -1.0000000000000000]
e =[ NaN, NaN, NaN, NaN]
limit =[ 0.0000002384185791, 0.0000002384185791, 0.0000002384185791, 0.0000002384185791]
vector=[ NaN, NaN, NaN, NaN]
scalar=[ NaN, NaN, NaN, NaN]
vector=[000000000000ffc00000, 000000000000ffc00000, 000000000000ffc00000, 000000000000ffc00000]
scalar=[0000000000007fbfffff, 0000000000007fbfffff, 0000000000007fbfffff, 0000000000007fbfffff]
vector_fn=|x: vector_type| x.ln()', crates/std_float/src/test_libm.rs:416:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace ---- test_libm::ln stdout ----
thread 'main' panicked at '
x =[ -1.0000000000000000, -1.0000000000000000, -1.0000000000000000, -1.0000000000000000]
e =[ NaN, NaN, NaN, NaN]
limit =[ 0.0000002384185791, 0.0000002384185791, 0.0000002384185791, 0.0000002384185791]
vector=[ NaN, NaN, NaN, NaN]
scalar=[ NaN, NaN, NaN, NaN]
vector=[000000000000ffc00000, 000000000000ffc00000, 000000000000ffc00000, 000000000000ffc00000]
scalar=[0000000000007fbfffff, 0000000000007fbfffff, 0000000000007fbfffff, 0000000000007fbfffff]
vector_fn=|x: vector_type| x.ln()', crates/std_float/src/test_libm.rs:416:5
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace Hmm, @programmerjake is there anything weird about MIPS we should be alert to here? That's the only one we're failing this test on. |
I'd guess it might have something to do with old MIPS having the signalling/quiet NaN bit inverted from what's in later IEEE 754 standards? |
if we're checking for bit equality, floats should only check like so: fn check_eq(a: f32, b: f32) -> bool {
if a.is_nan() || b.is_nan() {
a.is_nan() && b.is_nan()
} else {
a.to_bits() == b.to_bits()
}
} |
I was hoping to replicate NaNs exactly, but there seem to be no standards in libm variants. MIPs uses signalling NaNs and some libs use NaN in place of Inf. For example: log2(-1.0) Should be (non-signalling) NaN -> gnu uses -NaN So I have to agree with @programmerjake that we should just test for NaNness We also have some issues with denormals on some CPUs which affect std::fxx::MAX and MIN_POSITIVE. |
Hm. Which CPUs are having the denormal problems? |
Mostly i586 variants failing. Most likely a problem with the os libm, but could be using 80387 instructions. |
Oh, yes. Those are going to attempt to use the x87 FPU for f64 at the very least, maybe for f32s as well. |
I've added some initial benchmarks: running 3 tests These show about a 20x performance increase over libm. You need to build with:
On modern hardware or the code gen is apalling. |
I would request:
|
We discussed the idea of using |
Thanks for the feedback. I'll try to attend one of your Monday meetings soon, as it has been a while. I'll try to dedicate some days for this, but my dance card is full of training and crypto! |
Hello.
In dev mode all compiles and works. UPD. Fixed. Looks like issues in |
It would be interesting to benchmark against sleef. I had some good conversations with the author some time ago, It would also be useful to merge this branch and any help to do that would be welcome. |
https://github.com/burrbull/sleef-rs/runs/7741369313?check_suite_focus=true "_core" are your impls |
Your implementation is very fast when "fma" is enabled, but what about accuracy? |
This implementation has the same accuracy as libm and several times the performance. The scalar version is The tests test against libm implementations of various kinds which vary in accuracy. It is very hard to get As with all SIMD code, the only way to get it to run efficiently is to use modern hardware, ie. It is a shame that this library only supports up to 512 bit vectors as longer ones exist and cause better performance |
That's incorrect...it supports |
Excellent. |
Bump - how can we push out this work over the finish line? Additionally, I see we're missing a |
This is a placeholder for implementing libm functions in stdsimd.
I would appreciate help and advice to integrate with existing work and create a wide range of tests.