runtime: new implementations for nearest lib calls #2171

MaxGraey · 2020-08-28T21:00:11Z

More efficient implementations for wasmtime_f32_nearest and wasmtime_f64_nearest based on musl's rint and rintf implementations.

new / old comparison: https://godbolt.org/z/Gxz3bP

Also instruction's metrics for new approach with if / else branch for handling -0.0:

Iterations:        100
Instructions:      1900
Total Cycles:      1611
Total uOps:        2900


Dispatch Width:    6
uOps Per Cycle:    1.80
IPC:               1.18
Block RThroughput: 4.8

and with new approach but using copysign at the end for handling -0.0:

Iterations:        100
Instructions:      1800
Total Cycles:      1308
Total uOps:        2200

Dispatch Width:    6
uOps Per Cycle:    1.68
IPC:               1.38
Block RThroughput: 3.7

Benchmark results

Upd So I chose the second approach. Also it branchless on ARM32

Upd 2
Another possible approach:

pub extern "C" fn nearest(x: f64) -> f64 {
    let i = x.to_bits();
    let e = i >> 52 & 0x7ff_u64;
    if e >= 0x3ff_u64 + 52 {
      x
    } else {
      (x.abs() + TOINT_64 - TOINT_64).copysign(x)
    }
}

But this approach has lower IPC

alexcrichton · 2020-08-29T17:54:03Z

Thanks for this! Do you have some wall-clock benchmarks as well which show the improvement?

MaxGraey · 2020-08-29T17:58:07Z

no, I just compared metrics from llvm-mca. Btw it will be great to have some online benchmark tools like quick-bench.com which allow you bench on different architectures and version of compiler

MaxGraey · 2020-08-29T20:37:30Z

Here benchmark results on MacBook Pro (15-inch 2019, 2,3 GHz 8-cores i9):

benchmark code

test nearest_abs_copysign ... bench:      29,973 ns/iter (+/- 9,049)
test nearest_branch       ... bench:      35,935 ns/iter (+/- 4,135)
test nearest_copysign     ... bench:      33,115 ns/iter (+/- 1,961)
test nearest_original     ... bench:     102,740 ns/iter (+/- 22,607)

nearest_original is original function in wasmtime
nearest_copysign is proposed in current PR.

So it seems new proposed approach in 3 times faster

And btw all this was expected from llvm-mca metrics. I choose nearest_copysign due to potentially it could have better performance on ARMs

MaxGraey · 2020-08-29T22:37:03Z

~~@alexcrichton I'm wondering is it possible use sse4.1 _mm_round_pd(ps) intrinsic for this in wasmtime? And fallback to current polyfill for rest of architectures~~

bjorn3 · 2020-08-30T08:12:43Z

The NearestF32 and NearestF64 libcalls are already a fallback for when there is no hardware instruction to do this. The old style x86 backend already uses roundss and roundsd:

wasmtime/cranelift/codegen/meta/src/isa/x86/encodings.rs

Lines 1341 to 1345 in 5c5a30f

    
           // Rounding. The recipe looks at the opcode to pick an immediate. 
        
           for inst in &[nearest, floor, ceil, trunc] { 
        
               e.enc_both_isap(inst.bind(F32), rec_furmi_rnd.opcodes(&ROUNDSS), use_sse41); 
        
               e.enc_both_isap(inst.bind(F64), rec_furmi_rnd.opcodes(&ROUNDSD), use_sse41); 
        
           }

The new style x86_64 backend has a todo for this:

wasmtime/cranelift/codegen/src/isa/x64/lower.rs

Line 1749 in 8ac4bd1

// TODO use ROUNDSS/ROUNDSD after sse4.1.

MaxGraey · 2020-08-30T08:16:04Z

@bjorn3 Good to know. Thanks!

MaxGraey · 2020-08-30T08:58:56Z

Also added sse 4.1 intrinsic to gist.

Upd
cargo bench:

test nearest_abs_copysign ... bench:      31,500 ns/iter (+/- 1,883)
test nearest_branch       ... bench:      35,911 ns/iter (+/- 5,852)
test nearest_copysign     ... bench:      32,282 ns/iter (+/- 10,079)
test nearest_original     ... bench:     106,932 ns/iter (+/- 13,186)
test nearest_sse41        ... bench:      41,642 ns/iter (+/- 2,501)  ;; fallback to soft implementation

RUSTFLAGS='-C target-cpu=native' cargo bench:

test nearest_abs_copysign ... bench:      29,554 ns/iter (+/- 7,914)
test nearest_branch       ... bench:      44,846 ns/iter (+/- 4,056)
test nearest_copysign     ... bench:      33,609 ns/iter (+/- 3,196)
test nearest_original     ... bench:      52,212 ns/iter (+/- 6,702)
test nearest_sse41        ... bench:      19,542 ns/iter (+/- 1,766)  ;; real usage of `roundpd` on x86_64

sunfishcode

This looks good to me; just two possible optimization ideas:

crates/runtime/src/libcalls.rs

MaxGraey · 2020-08-30T19:00:16Z

Squashed commits

use approach with copysign for handling negative zero format refactor for better branch prediction move copysign back to internal branch format fix use abs instead branches better comments switch arms for better branch prediction

sunfishcode · 2020-08-31T16:39:09Z

Great, thanks!

@MaxGraey

As @MaxGraey pointed out (thanks!) in bytecodealliance#4397, `round` has different behavior from `nearest`. And it looks like the native rust implementation is still pending stabilization. Right now we duplicate the wasmtime implementation, merged in bytecodealliance#2171. However, we definitely should switch to the rust native version when it is available.

@MaxGraey

As @MaxGraey pointed out (thanks!) in #4397, `round` has different behavior from `nearest`. And it looks like the native rust implementation is still pending stabilization. Right now we duplicate the wasmtime implementation, merged in #2171. However, we definitely should switch to the rust native version when it is available.

MaxGraey changed the title ~~New implementations for nearest lib calls~~ runtime: new implementations for nearest lib calls Aug 28, 2020

sunfishcode approved these changes Aug 30, 2020

View reviewed changes

crates/runtime/src/libcalls.rs Outdated Show resolved Hide resolved

crates/runtime/src/libcalls.rs Outdated Show resolved Hide resolved

MaxGraey force-pushed the new-nearest-functions branch from 833c27f to 356dd1e Compare August 30, 2020 18:59

new implementations for nearest lib calls

9e58da4

use approach with copysign for handling negative zero format refactor for better branch prediction move copysign back to internal branch format fix use abs instead branches better comments switch arms for better branch prediction

MaxGraey force-pushed the new-nearest-functions branch from a7512f0 to 9e58da4 Compare August 31, 2020 16:02

sunfishcode merged commit a8f7041 into bytecodealliance:main Aug 31, 2020

MaxGraey deleted the new-nearest-functions branch August 31, 2020 16:39

MaxGraey mentioned this pull request Jul 7, 2022

cranelift: Implement float rounding operations in interpreter #4397

Merged

afonso360 mentioned this pull request Jul 7, 2022

cranelift: Use round_ties_even for nearest in interpreter #4413

Merged

alexcrichton mentioned this pull request Jan 17, 2023

Resolve libcall relocations for older CPUs #5567

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

runtime: new implementations for nearest lib calls #2171

runtime: new implementations for nearest lib calls #2171

MaxGraey commented Aug 28, 2020 •

edited

Loading

alexcrichton commented Aug 29, 2020

MaxGraey commented Aug 29, 2020

MaxGraey commented Aug 29, 2020 •

edited

Loading

MaxGraey commented Aug 29, 2020 •

edited

Loading

bjorn3 commented Aug 30, 2020

MaxGraey commented Aug 30, 2020

MaxGraey commented Aug 30, 2020 •

edited

Loading

sunfishcode left a comment

MaxGraey commented Aug 30, 2020

sunfishcode commented Aug 31, 2020

runtime: new implementations for nearest lib calls #2171

runtime: new implementations for nearest lib calls #2171

Conversation

MaxGraey commented Aug 28, 2020 • edited Loading

alexcrichton commented Aug 29, 2020

MaxGraey commented Aug 29, 2020

MaxGraey commented Aug 29, 2020 • edited Loading

MaxGraey commented Aug 29, 2020 • edited Loading

bjorn3 commented Aug 30, 2020

MaxGraey commented Aug 30, 2020

MaxGraey commented Aug 30, 2020 • edited Loading

sunfishcode left a comment

Choose a reason for hiding this comment

MaxGraey commented Aug 30, 2020

sunfishcode commented Aug 31, 2020

MaxGraey commented Aug 28, 2020 •

edited

Loading

MaxGraey commented Aug 29, 2020 •

edited

Loading

MaxGraey commented Aug 29, 2020 •

edited

Loading

MaxGraey commented Aug 30, 2020 •

edited

Loading