Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Abnormal long execute time running wasm code with Wasmtime #7246

Closed
unicornt opened this issue Oct 16, 2023 · 18 comments
Closed

Abnormal long execute time running wasm code with Wasmtime #7246

unicornt opened this issue Oct 16, 2023 · 18 comments
Labels
bug Incorrect behavior in the current implementation that needs fixing

Comments

@unicornt
Copy link

Test Case

report.zip

Steps to Reproduce

time wasmtime case.wasm

Actual Results

Wasmtime takes an abnormal long time to execute this wasm file and no exception is thrown.

Expected Results

Other runtime use much less time or report an out of memory access exception

image

Versions and Environment

Wasmtime version or commit: 15.0.0

Operating system: Ubuntu 20.04

Architecture: x86_64

@unicornt unicornt added the bug Incorrect behavior in the current implementation that needs fixing label Oct 16, 2023
@unicornt unicornt changed the title Abnormal long execute time using Wasmtime Abnormal long execute time running wasm code with Wasmtime Oct 16, 2023
@afonso360
Copy link
Contributor

afonso360 commented Oct 16, 2023

It looks like wasmtime wasn't compiled in release mode, would you be able to re-test using the --release flag when compiling wasmtime?

@unicornt
Copy link
Author

It looks like wasmtime wasn't compiled in release mode, would you be able to re-test using the --release flag when compiling wasmtime?

Thank you for the reply! I tried to build wasmtime with --release flag, but it turned out to be useless.

image

@alexcrichton
Copy link
Member

Would you be able to help diagnose what's happening locally? I'm unable to reproduce this and the runtime for this module isn't too bad locally. If you could run

perf record -k mono wasmtime run --profile jitdump -C case=no case.wasm
perf inject --jit --input perf.data --output perf.jit.data
perf report --input perf.jit.data --no-children

and then copy/paste the top 10-20 functions into this issue that'll help explain where all the time is going.

One thing I've noticed is that this module is trying to write to file descriptor 0 inside the wasm itself. Wasmtime is returning EBADF because of that (0 is stdin, not stdout). Other runtimes likely aren't performing such a check in their WASI implementations. This provides a point of divergence between runtimes which may explain why runtimes are behaving differently.

Nevertheless though I'm not sure what's taking 20+ seconds for you in Wasmtime. Locally this module executes in 1-3 seconds for me. I unfortunately don't have access to native x86_64 hardware though and so this could be specific to that.

@afonso360
Copy link
Contributor

Nevertheless though I'm not sure what's taking 20+ seconds for you in Wasmtime. Locally this module executes in 1-3 seconds for me. I unfortunately don't have access to native x86_64 hardware though and so this could be specific to that.

I'm also unable to reproduce this on x86, it takes ~3 seconds for me. With 98.86% of the time being spent on wasm[0]::function[3].

@unicornt
Copy link
Author

Nevertheless though I'm not sure what's taking 20+ seconds for you in Wasmtime. Locally this module executes in 1-3 seconds for me. I unfortunately don't have access to native x86_64 hardware though and so this could be specific to that.

I'm also unable to reproduce this on x86, it takes ~3 seconds for me. With 98.86% of the time being spent on wasm[0]::function[3].

I got similar output that 99.7% of the time is spent on function[3]. However, the total time is still 20+ seconds.

image

Here is perf stat output of wasmtime and wasmedge(AOT mod). Hope it can help to find out why it happened!

image

@bjorn3
Copy link
Contributor

bjorn3 commented Oct 17, 2023

An IPC of 0.08 is abnornally low. Could you show the annotated assembly of wasm[0]::function[3] as reported by perf report?

@unicornt
Copy link
Author

An IPC of 0.08 is abnornally low. Could you show the annotated assembly of wasm[0]::function[3] as reported by perf report?

Is this what you want? Sorry I'm not familiar with perf tools.

image

@bjorn3
Copy link
Contributor

bjorn3 commented Oct 17, 2023

Yes. The amount of time spent in what should be trivial instructions (mov between registers and xor) is weird.

@unicornt
Copy link
Author

Yes. The amount of time spent in what should be trivial instructions (mov between registers and xor) is weird.

Do you have any idea about why it happened? Anything I can do to figure it out?

@alexcrichton
Copy link
Member

Could you share the output of lscpu? This might be specific to your particular CPU and some edge case related to cache sizes or something like that perhaps.

@unicornt
Copy link
Author

Here is my output of lspcpu

➜  ~ lscpu
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Byte Order:                         Little Endian
Address sizes:                      39 bits physical, 48 bits virtual
CPU(s):                             16
On-line CPU(s) list:                0-15
Thread(s) per core:                 2
Core(s) per socket:                 8
Socket(s):                          1
NUMA node(s):                       1
Vendor ID:                          GenuineIntel
CPU family:                         6
Model:                              165
Model name:                         Intel(R) Core(TM) i7-10700F CPU @ 2.90GHz
Stepping:                           5
CPU MHz:                            2900.000
CPU max MHz:                        4800.0000
CPU min MHz:                        800.0000
BogoMIPS:                           5799.77
Virtualization:                     VT-x
L1d cache:                          256 KiB
L1i cache:                          256 KiB
L2 cache:                           2 MiB
L3 cache:                           16 MiB
NUMA node0 CPU(s):                  0-15
Vulnerability Gather data sampling: Mitigation; Microcode
Vulnerability Itlb multihit:        KVM: Mitigation: VMX disabled
Vulnerability L1tf:                 Not affected
Vulnerability Mds:                  Not affected
Vulnerability Meltdown:             Not affected
Vulnerability Mmio stale data:      Mitigation; Clear CPU buffers; SMT vulnerable
Vulnerability Retbleed:             Mitigation; Enhanced IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; Enhanced IBRS, IBPB conditional, RSB filling, PBRSB-eIBRS SW sequence
Vulnerability Srbds:                Mitigation; Microcode
Vulnerability Tsx async abort:      Not affected
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscal
                                    l nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulq
                                    dq dtes64 monitor ds_cpl vmx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_timer a
                                    es xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault invpcid_single ssbd ibrs ibpb stibp ibrs_enhanced tpr_shadow vn
                                    mi flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt x
                                    saveopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp pku ospke md_clear flush_l1d arch_
                                    capabilities

@unicornt
Copy link
Author

Hi, I tried the same code on another machine. The result is almost the same, including the weird time of mov and xor.

The OS of this machine is ubuntu 20.04 and version of wasmtime is 15.0.0.

Here is the output of lscpu on this machine.

Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Byte Order:                         Little Endian
Address sizes:                      39 bits physical, 48 bits virtual
CPU(s):                             6
On-line CPU(s) list:                0-5
Thread(s) per core:                 1
Core(s) per socket:                 6
Socket(s):                          1
NUMA node(s):                       1
Vendor ID:                          GenuineIntel
CPU family:                         6
Model:                              158
Model name:                         Intel(R) Core(TM) i5-9500T CPU @ 2.20GHz
Stepping:                           10
CPU MHz:                            2200.000
CPU max MHz:                        3700.0000
CPU min MHz:                        800.0000
BogoMIPS:                           4399.99
Virtualization:                     VT-x
L1d cache:                          192 KiB
L1i cache:                          192 KiB
L2 cache:                           1.5 MiB
L3 cache:                           9 MiB
NUMA node0 CPU(s):                  0-5
Vulnerability Gather data sampling: Mitigation; Microcode
Vulnerability Itlb multihit:        KVM: Mitigation: VMX disabled
Vulnerability L1tf:                 Mitigation; PTE Inversion; VMX conditional cache flushes, SMT disabled
Vulnerability Mds:                  Mitigation; Clear CPU buffers; SMT disabled
Vulnerability Meltdown:             Mitigation; PTI
Vulnerability Mmio stale data:      Mitigation; Clear CPU buffers; SMT disabled
Vulnerability Retbleed:             Mitigation; IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; IBRS, IBPB conditional, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                Mitigation; Microcode
Vulnerability Tsx async abort:      Mitigation; TSX disabled
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx
                                     fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs 
                                    bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulqdq dtes64 monitor ds_cpl
                                     vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_de
                                    adline_timer aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_sin
                                    gle pti ssbd ibrs ibpb stibp tpr_shadow vnmi flexpriority ept vpid ept_ad fsgsbase tsc_adjus
                                    t bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsaveopt xsavec x
                                    getbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l
                                    1d arch_capabilities

@alexcrichton
Copy link
Member

Ok thanks again for both reporting this and helping us follow-up! @elliottt was able to reproduce on hardware he has which resulted in us bottoming out in #7283 as the cause of this issue. I'm going to close this in favor of that issue to keep the issue discussion a bit shorter, but if you'd like to confirm locally @unicornt you can comment out these lines in a custom Wasmtime build and the performance of this program should be better. Note though that it's just a proof-of-concept of what the slowdown is and is not a full complete fix, that'll get tracked by #7283

@yfaming
Copy link

yfaming commented Oct 19, 2023

I'm using a MacBook Pro with a 2.6 GHz 6-Core Intel Core i7 CPU. I attempted to run the .wasm module provided by @unicornt , and it was indeed quite slow, taking around 27s. I also tried wasmer, and it took about 7s.

Following @alexcrichton's advice, I commented out the last 4 lines in cranelift/codegen/src/opts/remat.isle, recompiled wasmtime, and ran it again. The execution time dropped to less than 1s. Amazing!

@hungryzzz
Copy link
Contributor

hungryzzz commented Oct 20, 2023

@alexcrichton Hi, I try to reduce the above wasm bytecode to get a buggy case as shortest as possible, the reduced results are attached. The execute time in Wasmer and Wasmtime still have a big difference:

  • Wasmer: 0.67s
  • Wasmtime: 12.24s

But I find that the performance regression is not only related to the #7283, because if I change the constant in f64.const 5.43231e-312 to f64.const 5.4, the execution time will back to normal(but indeed, the loop-invariant code will be sinked into loop in the corresponding binary generated by Wasmtime):

  • Wasmer: 0.70s
  • Wasmtime: 0.34s

By the way, I also check the binary generated by LLVM, I find that the whole computation will be optimized to empty(I think because there is nothing depends on the result of the loop).

I think it is also a strange case I cannot understand and explain...

;; reduce.wat
(module
  (type (;0;) (func (param i32)))
  (type (;1;) (func))
  (type (;2;) (func (result i32)))
  (type (;3;) (func (param i32) (result i32)))
  (import "wasi_snapshot_preview1" "proc_exit" (func (;0;) (type 0)))
  (func (;2;) (type 1)
    call 2
    i32.const 0
    call 0
    unreachable)
  (func (;2;) (type 1)
    (local f64 f64 i32)
    f64.const 5.43231e-312 ;; Magic Number!!!  
    f64.const 1.0
    f64.mul
    local.set 1
    loop  ;; label = @1
      local.get 1
      local.get 0
      f64.add
      local.set 0
      local.get 2   ;; condition variable i
      i32.const 1    
      i32.add        
      local.tee 2   ;; i++
      i32.const 312500000
      i32.ne
      br_if 0 (;@1;)
    end
  )
  (table (;0;) 2 2 funcref)
  (memory (;0;) 8192 8192)
  (global (;0;) (mut i32) (i32.const 66576))
  (export "memory" (memory 0))
  (export "__indirect_function_table" (table 0))
  (export "_start" (func 1)))

Environment Version

  • Wasmer: wasmer 4.1.2 (98188df 2023-09-19)
    binary: wasmer-cli
    commit-hash: 98188df80096dfc7134ce202731878cf9d8f6f4e
    commit-date: 2023-09-19
    host: x86_64-unknown-linux-gnu
    compiler: singlepass,cranelift,llvm
  • Wasmtime: build from 8e00cc2 (release version)
  • CPU
➜  ~ lscpu
Architecture:                       x86_64
CPU op-mode(s):                     32-bit, 64-bit
Byte Order:                         Little Endian
Address sizes:                      39 bits physical, 48 bits virtual
CPU(s):                             6
On-line CPU(s) list:                0-5
Thread(s) per core:                 1
Core(s) per socket:                 6
Socket(s):                          1
NUMA node(s):                       1
Vendor ID:                          GenuineIntel
CPU family:                         6
Model:                              158
Model name:                         Intel(R) Core(TM) i5-9500T CPU @ 2.20GHz
Stepping:                           10
CPU MHz:                            2200.000
CPU max MHz:                        3700.0000
CPU min MHz:                        800.0000
BogoMIPS:                           4399.99
Virtualization:                     VT-x
L1d cache:                          192 KiB
L1i cache:                          192 KiB
L2 cache:                           1.5 MiB
L3 cache:                           9 MiB
NUMA node0 CPU(s):                  0-5
Vulnerability Gather data sampling: Mitigation; Microcode
Vulnerability Itlb multihit:        KVM: Mitigation: VMX disabled
Vulnerability L1tf:                 Mitigation; PTE Inversion; VMX conditional cache flushes, SMT disabled
Vulnerability Mds:                  Mitigation; Clear CPU buffers; SMT disabled
Vulnerability Meltdown:             Mitigation; PTI
Vulnerability Mmio stale data:      Mitigation; Clear CPU buffers; SMT disabled
Vulnerability Retbleed:             Mitigation; IBRS
Vulnerability Spec rstack overflow: Not affected
Vulnerability Spec store bypass:    Mitigation; Speculative Store Bypass disabled via prctl and seccomp
Vulnerability Spectre v1:           Mitigation; usercopy/swapgs barriers and __user pointer sanitization
Vulnerability Spectre v2:           Mitigation; IBRS, IBPB conditional, STIBP disabled, RSB filling, PBRSB-eIBRS Not affected
Vulnerability Srbds:                Mitigation; Microcode
Vulnerability Tsx async abort:      Mitigation; TSX disabled
Flags:                              fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscal
                                    l nx pdpe1gb rdtscp lm constant_tsc art arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc cpuid aperfmperf pni pclmulq
                                    dq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 sdbg fma cx16 xtpr pdcm pcid sse4_1 sse4_2 x2apic movbe popcnt tsc_deadline_tim
                                    er aes xsave avx f16c rdrand lahf_lm abm 3dnowprefetch cpuid_fault epb invpcid_single pti ssbd ibrs ibpb stibp tpr_shadow vnmi
                                     flexpriority ept vpid ept_ad fsgsbase tsc_adjust bmi1 avx2 smep bmi2 erms invpcid mpx rdseed adx smap clflushopt intel_pt xsa
                                    veopt xsavec xgetbv1 xsaves dtherm ida arat pln pts hwp hwp_notify hwp_act_window hwp_epp md_clear flush_l1d arch_capabilities

Also, I use the inline assembly to run the binary generated by Wasmer and Wasmtime and compare the execution time, I also find the difference in execution time. Maybe it is helpful to your debugging so I paste it here.

  • test_wasmtime: 12.59s
  • test_wasmer: 0.35s
#include <stdint.h>


double X = 1.0;

uint64_t test_wasmtime() {
  __asm__ __volatile__(
"vxorpd %%xmm5,%%xmm5,%%xmm6\n\t"
"xor    %%r11d,%%r11d\n\t"
"START1%=:\n\t"
"add    $0x1,%%r11d\n\t"
"movabs $0x100000264d8,%%rsi\n\t"
"vmovq  %%rsi,%%xmm0\n\t"
"vmulsd %0,%%xmm0,%%xmm7\n\t"
"vaddsd %%xmm6,%%xmm7,%%xmm6\n\t"
"cmp    $0x12a05f20,%%r11d\n\t"
"jne    START1%=\n\t":"=m"(X)::"rsi","r11");
  return 0;
}

uint64_t test_wasmer() {
  __asm__ __volatile__(
"movabs $0x100000264d8,%%r11\n\t"
"movq   %%r11,%%xmm9\n\t"
"mulsd  %0,%%xmm9\n\t"
"xorpd  %%xmm7,%%xmm7\n\t"
"xor    %%r8d,%%r8d\n\t"
"START2%=:\n\t"
"movdqa %%xmm9,%%xmm15\n\t"
"addsd  %%xmm7,%%xmm15\n\t"
"add    $0x1,%%r8d\n\t"
"cmp    $0x12a05f20,%%r8d\n\t"
"setne  %%dil\n\t"
"movzbl %%dil,%%edi\n\t"
"test   %%edi,%%edi\n\t"
"je     EXIT2%=\n\t"
"movdqa %%xmm15,%%xmm7\n\t"
"jmp   START2%=\n\t"
"EXIT2%=:\n\t":"=m"(X)::"r11","r8","rdi");
  return 0;
}

int main() {
  test_wasmtime();
  // test_wasmer();
  return 0;
}

reduce.zip

@alexcrichton
Copy link
Member

Thanks for the follow-up @hungryzzz, although I've got a few questions for you. It's known that Wasmtime is slow with respect to this computation on its main branch (as of right now and the commit you said you were using). #7306 fixes the problem though by keeping one of the computations here out of the loop. In your test_wasmtime function, for example, the loop has both a vmulsd and a vaddsd, but the vmulsd computation is loop-invariant and shouldn't be in the loop.

Generating the x64 assembly with #7306 I get:

0000000000000040 <wasm[0]::function[2]>:
      40: 55                            pushq   %rbp
      41: 48 89 e5                      movq    %rsp, %rbp
      44: 4c 8b 57 08                   movq    8(%rdi), %r10
      48: 4d 8b 12                      movq    (%r10), %r10
      4b: 49 39 e2                      cmpq    %rsp, %r10
      4e: 0f 87 38 00 00 00             ja      0x8c <wasm[0]::function[2]+0x4c>
      54: c5 d1 57 e5                   vxorpd  %xmm5, %xmm5, %xmm4
      58: 45 31 c9                      xorl    %r9d, %r9d
      5b: 48 ba d8 64 02 00 00 01 00 00 movabsq $1099511784664, %rdx    # imm = 0x100000264D8
      65: c4 e1 f9 6e fa                vmovq   %rdx, %xmm7
      6a: c5 c3 59 35 1e 00 00 00       vmulsd  30(%rip), %xmm7, %xmm6  # 0x90 <wasm[0]::function[2]+0x50>
      72: 41 83 c1 01                   addl    $1, %r9d
      76: c5 cb 58 e4                   vaddsd  %xmm4, %xmm6, %xmm4
      7a: 41 81 f9 20 5f a0 12          cmpl    $312500000, %r9d        # imm = 0x12A05F20
      81: 0f 85 eb ff ff ff             jne     0x72 <wasm[0]::function[2]+0x32>
      87: 48 89 ec                      movq    %rbp, %rsp
      8a: 5d                            popq    %rbp
      8b: c3                            retq
      8c: 0f 0b                         ud2

where here the conditional jump at 0x81 goes to 0x72 meaning that the vmulsd is lifted outside of the loop as expected. I unfortunately do not have an x64 machine to test the relative performance before and after #7306, but I wanted to ask whether you tested this? When testing the performance of this program I'd recommend using #7306 as the version of Wasmtime to use. If you're using that though and still seeing performance problems let us know!

@hungryzzz
Copy link
Contributor

hungryzzz commented Oct 22, 2023

@alexcrichton Hi, the performance regression disappeared in the latest Wasmtime.

But I think the performance regression in the old version Wasmtime is not only related to the wrong code sinking, because if I change the magic number 5.43231e-312 to 5.4(or other normal floating point numbers), the execution time will back to normal. So I think the performance regression is the result of both wrong code sinking and this magic number 5.43231e-312.

After searching, we find the reason from this paper On Subnormal Floating Point and Abnormal Timing.

We identify a timing channel in the floating point instructions of modern x86 processors: the running time of floating point addition and multiplication instructions can vary by two orders of magnitude depending on their operands.

This paper finds that floating point operands ranging from 4.9e-324 ~ 2.23e-308 will result in execution times of two orders of magnitude, so if I use other floating point numbers, the wrong code sinking to loop does not bring such a big performance regression. But you're right, #7306 will fix this performance regression.

;; reduce.wat
(module
  (type (;0;) (func (param i32)))
  (type (;1;) (func))
  (type (;2;) (func (result i32)))
  (type (;3;) (func (param i32) (result i32)))
  (import "wasi_snapshot_preview1" "proc_exit" (func (;0;) (type 0)))
  (func (;2;) (type 1)
    call 2
    i32.const 0
    call 0
    unreachable)
  (func (;2;) (type 1)
    (local f64 f64 i32)
    f64.const 5.43231e-312 ;; Magic Number!!!  
    f64.const 1.0
    f64.mul
    local.set 1
    loop  ;; label = @1
      local.get 1
      local.get 0
      f64.add
      local.set 0
      local.get 2   ;; condition variable i
      i32.const 1    
      i32.add        
      local.tee 2   ;; i++
      i32.const 312500000
      i32.ne
      br_if 0 (;@1;)
    end
  )
  (table (;0;) 2 2 funcref)
  (memory (;0;) 8192 8192)
  (global (;0;) (mut i32) (i32.const 66576))
  (export "memory" (memory 0))
  (export "__indirect_function_table" (table 0))
  (export "_start" (func 1)))

@alexcrichton
Copy link
Member

Fascinating! Not what I would have expected but helps explain the results for sure!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Incorrect behavior in the current implementation that needs fixing
Projects
None yet
Development

No branches or pull requests

6 participants