Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

runtime: new implementations for nearest lib calls #2171

Merged
merged 1 commit into from
Aug 31, 2020

Conversation

MaxGraey
Copy link
Contributor

@MaxGraey MaxGraey commented Aug 28, 2020

More efficient implementations for wasmtime_f32_nearest and wasmtime_f64_nearest based on musl's rint and rintf implementations.

new / old comparison: https://godbolt.org/z/Gxz3bP

Also instruction's metrics for new approach with if / else branch for handling -0.0:

Iterations:        100
Instructions:      1900
Total Cycles:      1611
Total uOps:        2900


Dispatch Width:    6
uOps Per Cycle:    1.80
IPC:               1.18
Block RThroughput: 4.8

and with new approach but using copysign at the end for handling -0.0:

Iterations:        100
Instructions:      1800
Total Cycles:      1308
Total uOps:        2200

Dispatch Width:    6
uOps Per Cycle:    1.68
IPC:               1.38
Block RThroughput: 3.7

Benchmark results

Upd So I chose the second approach. Also it branchless on ARM32

Upd 2
Another possible approach:

pub extern "C" fn nearest(x: f64) -> f64 {
    let i = x.to_bits();
    let e = i >> 52 & 0x7ff_u64;
    if e >= 0x3ff_u64 + 52 {
      x
    } else {
      (x.abs() + TOINT_64 - TOINT_64).copysign(x)
    }
}

But this approach has lower IPC

@MaxGraey MaxGraey changed the title New implementations for nearest lib calls runtime: new implementations for nearest lib calls Aug 28, 2020
@alexcrichton
Copy link
Member

Thanks for this! Do you have some wall-clock benchmarks as well which show the improvement?

@MaxGraey
Copy link
Contributor Author

no, I just compared metrics from llvm-mca. Btw it will be great to have some online benchmark tools like quick-bench.com which allow you bench on different architectures and version of compiler

@MaxGraey
Copy link
Contributor Author

MaxGraey commented Aug 29, 2020

Here benchmark results on MacBook Pro (15-inch 2019, 2,3 GHz 8-cores i9):

benchmark code

test nearest_abs_copysign ... bench:      29,973 ns/iter (+/- 9,049)
test nearest_branch       ... bench:      35,935 ns/iter (+/- 4,135)
test nearest_copysign     ... bench:      33,115 ns/iter (+/- 1,961)
test nearest_original     ... bench:     102,740 ns/iter (+/- 22,607)

nearest_original is original function in wasmtime
nearest_copysign is proposed in current PR.

So it seems new proposed approach in 3 times faster

And btw all this was expected from llvm-mca metrics. I choose nearest_copysign due to potentially it could have better performance on ARMs

@MaxGraey
Copy link
Contributor Author

MaxGraey commented Aug 29, 2020

@alexcrichton I'm wondering is it possible use sse4.1 _mm_round_pd(ps) intrinsic for this in wasmtime? And fallback to current polyfill for rest of architectures

@bjorn3
Copy link
Contributor

bjorn3 commented Aug 30, 2020

The NearestF32 and NearestF64 libcalls are already a fallback for when there is no hardware instruction to do this. The old style x86 backend already uses roundss and roundsd:

// Rounding. The recipe looks at the opcode to pick an immediate.
for inst in &[nearest, floor, ceil, trunc] {
e.enc_both_isap(inst.bind(F32), rec_furmi_rnd.opcodes(&ROUNDSS), use_sse41);
e.enc_both_isap(inst.bind(F64), rec_furmi_rnd.opcodes(&ROUNDSD), use_sse41);
}
The new style x86_64 backend has a todo for this:
// TODO use ROUNDSS/ROUNDSD after sse4.1.

@MaxGraey
Copy link
Contributor Author

@bjorn3 Good to know. Thanks!

@MaxGraey
Copy link
Contributor Author

MaxGraey commented Aug 30, 2020

Also added sse 4.1 intrinsic to gist.

Upd
cargo bench:

test nearest_abs_copysign ... bench:      31,500 ns/iter (+/- 1,883)
test nearest_branch       ... bench:      35,911 ns/iter (+/- 5,852)
test nearest_copysign     ... bench:      32,282 ns/iter (+/- 10,079)
test nearest_original     ... bench:     106,932 ns/iter (+/- 13,186)
test nearest_sse41        ... bench:      41,642 ns/iter (+/- 2,501)  ;; fallback to soft implementation

RUSTFLAGS='-C target-cpu=native' cargo bench:

test nearest_abs_copysign ... bench:      29,554 ns/iter (+/- 7,914)
test nearest_branch       ... bench:      44,846 ns/iter (+/- 4,056)
test nearest_copysign     ... bench:      33,609 ns/iter (+/- 3,196)
test nearest_original     ... bench:      52,212 ns/iter (+/- 6,702)
test nearest_sse41        ... bench:      19,542 ns/iter (+/- 1,766)  ;; real usage of `roundpd` on x86_64

Copy link
Member

@sunfishcode sunfishcode left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks good to me; just two possible optimization ideas:

crates/runtime/src/libcalls.rs Outdated Show resolved Hide resolved
crates/runtime/src/libcalls.rs Outdated Show resolved Hide resolved
@MaxGraey MaxGraey force-pushed the new-nearest-functions branch from 833c27f to 356dd1e Compare August 30, 2020 18:59
@MaxGraey
Copy link
Contributor Author

Squashed commits

use approach with copysign for handling negative zero


format


refactor for better branch prediction


move copysign back to internal branch


format


fix


use abs instead branches


better comments


switch arms for better branch prediction
@MaxGraey MaxGraey force-pushed the new-nearest-functions branch from a7512f0 to 9e58da4 Compare August 31, 2020 16:02
@sunfishcode
Copy link
Member

Great, thanks!

@sunfishcode sunfishcode merged commit a8f7041 into bytecodealliance:main Aug 31, 2020
@MaxGraey MaxGraey deleted the new-nearest-functions branch August 31, 2020 16:39
afonso360 added a commit to afonso360/wasmtime that referenced this pull request Jul 7, 2022
 As @MaxGraey pointed out (thanks!) in bytecodealliance#4397, `round` has different
 behavior from `nearest`. And it looks like the native rust
 implementation is still pending stabilization.

 Right now we duplicate the wasmtime implementation, merged in bytecodealliance#2171.

 However, we definitely should switch to the rust native version
 when it is available.
cfallin pushed a commit that referenced this pull request Jul 7, 2022
As @MaxGraey pointed out (thanks!) in #4397, `round` has different
 behavior from `nearest`. And it looks like the native rust
 implementation is still pending stabilization.

 Right now we duplicate the wasmtime implementation, merged in #2171.

 However, we definitely should switch to the rust native version
 when it is available.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants