Is the compiler optimizing out some of the benchmarks? #667

elichai · 2019-09-27T17:41:47Z

I've been playing with the benchmarks and I started thinking about the optimizer.
Now most of the benchmarks reuse the variables from past iterations(thanks @apoelstra for showing me that:) ) but the compiler might still be able to figure out it can remove at least the last iteration.

another thing is the jacobi symbol calculation, that has 0 side effect, so I would assume the optimizer will just remove it (which it seems to do so).
I'm not sure what to make out of these results and I think I'll need to try and read the ASM itself to figure out what's going on.

But before:

$ ./bench_internal 
scalar_add: min 0.00771us / avg 0.00811us / max 0.00934us
scalar_negate: min 0.00274us / avg 0.00277us / max 0.00280us
scalar_sqr: min 0.0296us / avg 0.0301us / max 0.0308us
scalar_mul: min 0.0305us / avg 0.0311us / max 0.0323us
scalar_inverse: min 8.65us / avg 8.76us / max 9.03us
scalar_inverse_var: min 2.08us / avg 2.14us / max 2.21us
field_normalize: min 0.00718us / avg 0.00744us / max 0.00797us
field_normalize_weak: min 0.00294us / avg 0.00316us / max 0.00360us
field_sqr: min 0.0140us / avg 0.0147us / max 0.0153us
field_mul: min 0.0185us / avg 0.0189us / max 0.0195us
field_inverse: min 4.09us / avg 4.12us / max 4.17us
field_inverse_var: min 2.10us / avg 2.11us / max 2.15us
field_sqrt: min 4.07us / avg 4.09us / max 4.14us
group_double_var: min 0.123us / avg 0.124us / max 0.126us
group_add_var: min 0.283us / avg 0.284us / max 0.287us
group_add_affine: min 0.242us / avg 0.244us / max 0.249us
group_add_affine_var: min 0.203us / avg 0.204us / max 0.208us
group_jacobi_var: min 0.179us / avg 0.186us / max 0.197us
wnaf_const: min 0.118us / avg 0.120us / max 0.124us
ecmult_wnaf: min 0.447us / avg 0.454us / max 0.465us
hash_sha256: min 0.252us / avg 0.257us / max 0.265us
hash_hmac_sha256: min 1.01us / avg 1.02us / max 1.06us
hash_rfc6979_hmac_sha256: min 5.56us / avg 5.72us / max 6.10us
context_verify: min 2652us / avg 2732us / max 2909us
context_sign: min 28.1us / avg 31.1us / max 33.2us
num_jacobi: min 0.0881us / avg 0.0893us / max 0.0913us

$ ./bench_internal 
scalar_add: min 0.00793us / avg 0.00834us / max 0.00901us
scalar_negate: min 0.00285us / avg 0.00290us / max 0.00294us
scalar_sqr: min 0.0302us / avg 0.0310us / max 0.0321us
scalar_mul: min 0.0312us / avg 0.0318us / max 0.0323us
scalar_inverse: min 8.99us / avg 9.21us / max 9.47us
scalar_inverse_var: min 2.16us / avg 2.26us / max 2.39us
field_normalize: min 0.00753us / avg 0.00767us / max 0.00792us
field_normalize_weak: min 0.00305us / avg 0.00313us / max 0.00319us
field_sqr: min 0.0147us / avg 0.0152us / max 0.0158us
field_mul: min 0.0194us / avg 0.0196us / max 0.0200us
field_inverse: min 4.20us / avg 4.25us / max 4.34us
field_inverse_var: min 2.15us / avg 2.19us / max 2.26us
field_sqrt: min 4.16us / avg 4.19us / max 4.26us
group_double_var: min 0.126us / avg 0.128us / max 0.130us
group_add_var: min 0.291us / avg 0.293us / max 0.300us
group_add_affine: min 0.250us / avg 0.251us / max 0.252us
group_add_affine_var: min 0.208us / avg 0.213us / max 0.223us
group_jacobi_var: min 0.182us / avg 0.196us / max 0.204us
wnaf_const: min 0.124us / avg 0.126us / max 0.129us
ecmult_wnaf: min 0.456us / avg 0.464us / max 0.476us
hash_sha256: min 0.261us / avg 0.266us / max 0.274us
hash_hmac_sha256: min 1.04us / avg 1.06us / max 1.08us
hash_rfc6979_hmac_sha256: min 5.72us / avg 5.76us / max 5.87us
context_verify: min 2720us / avg 2789us / max 2922us
context_sign: min 28.4us / avg 29.6us / max 31.9us
num_jacobi: min 0.0867us / avg 0.0882us / max 0.0906us

After:

$ ./bench_internal 
scalar_add: min 0.00950us / avg 0.00985us / max 0.0104us
scalar_negate: min 0.00338us / avg 0.00343us / max 0.00365us
scalar_sqr: min 0.0303us / avg 0.0311us / max 0.0325us
scalar_mul: min 0.0319us / avg 0.0325us / max 0.0335us
scalar_inverse: min 9.05us / avg 9.15us / max 9.39us
scalar_inverse_var: min 2.13us / avg 2.20us / max 2.35us
field_normalize: min 0.00853us / avg 0.00864us / max 0.00879us
field_normalize_weak: min 0.00394us / avg 0.00401us / max 0.00405us
field_sqr: min 0.0151us / avg 0.0154us / max 0.0161us
field_mul: min 0.0192us / avg 0.0198us / max 0.0204us
field_inverse: min 4.20us / avg 4.24us / max 4.34us
field_inverse_var: min 2.17us / avg 2.19us / max 2.22us
field_sqrt: min 4.18us / avg 4.25us / max 4.36us
group_double_var: min 0.128us / avg 0.129us / max 0.131us
group_add_var: min 0.294us / avg 0.298us / max 0.303us
group_add_affine: min 0.254us / avg 0.259us / max 0.268us
group_add_affine_var: min 0.212us / avg 0.215us / max 0.220us
group_jacobi_var: min 0.187us / avg 0.196us / max 0.205us
wnaf_const: min 0.124us / avg 0.126us / max 0.129us
ecmult_wnaf: min 0.469us / avg 0.475us / max 0.482us
hash_sha256: min 0.262us / avg 0.267us / max 0.272us
hash_hmac_sha256: min 1.05us / avg 1.06us / max 1.08us
hash_rfc6979_hmac_sha256: min 5.84us / avg 5.88us / max 5.97us
context_verify: min 2832us / avg 2899us / max 2999us
context_sign: min 29.9us / avg 31.5us / max 32.9us
num_jacobi: min 0.670us / avg 0.684us / max 0.699us

$ ./bench_internal 
scalar_add: min 0.00910us / avg 0.00965us / max 0.0108us
scalar_negate: min 0.00326us / avg 0.00331us / max 0.00337us
scalar_sqr: min 0.0297us / avg 0.0304us / max 0.0316us
scalar_mul: min 0.0309us / avg 0.0316us / max 0.0320us
scalar_inverse: min 8.87us / avg 8.97us / max 9.08us
scalar_inverse_var: min 2.06us / avg 2.17us / max 2.27us
field_normalize: min 0.00828us / avg 0.00843us / max 0.00863us
field_normalize_weak: min 0.00383us / avg 0.00388us / max 0.00393us
field_sqr: min 0.0144us / avg 0.0148us / max 0.0153us
field_mul: min 0.0185us / avg 0.0189us / max 0.0197us
field_inverse: min 4.15us / avg 4.17us / max 4.26us
field_inverse_var: min 2.09us / avg 2.11us / max 2.15us
field_sqrt: min 4.09us / avg 4.11us / max 4.14us
group_double_var: min 0.125us / avg 0.126us / max 0.128us
group_add_var: min 0.284us / avg 0.286us / max 0.287us
group_add_affine: min 0.245us / avg 0.246us / max 0.249us
group_add_affine_var: min 0.205us / avg 0.207us / max 0.210us
group_jacobi_var: min 0.185us / avg 0.192us / max 0.198us
wnaf_const: min 0.119us / avg 0.121us / max 0.124us
ecmult_wnaf: min 0.456us / avg 0.462us / max 0.478us
hash_sha256: min 0.248us / avg 0.257us / max 0.261us
hash_hmac_sha256: min 1.02us / avg 1.05us / max 1.09us
hash_rfc6979_hmac_sha256: min 5.58us / avg 5.63us / max 5.72us
context_verify: min 2652us / avg 2668us / max 2730us
context_sign: min 27.7us / avg 29.0us / max 31.4us
num_jacobi: min 0.650us / avg 0.656us / max 0.663us

It does seem like there's a big difference in num_jacobi (from 0.0882us before to 0.656us)

jonasnick · 2019-09-28T12:49:58Z

Interesting. field_normalize_weak seems to be significantly affected as well.

elichai · 2019-09-30T19:56:11Z

So I ran perf only on the jacobi benchmarks. with and without the ASM escaping and these are the results:
with escaping:

+   57.16%    57.08%  bench_internal  libgmp.so.10.3.2  [.] __gmpn_hgcd2_jacobi                                                                                                                                    
+   20.99%     0.00%  bench_internal  [unknown]         [.] 0x495641000143c33d                                                                                                                                     
+   20.99%     0.00%  bench_internal  libc-2.29.so      [.] __libc_start_main                                                                                                                                      
+   20.99%     0.00%  bench_internal  bench_internal    [.] main                                                                                                                                                   
+   20.99%     0.00%  bench_internal  bench_internal    [.] run_benchmark                                                                                                                                          
+   17.73%     0.57%  bench_internal  bench_internal    [.] bench_num_jacobi                                                                                                                                       
+   16.67%     1.29%  bench_internal  libgmp.so.10.3.2  [.] __gmpz_jacobi                                                                                                                                          
+   10.45%    10.41%  bench_internal  libgmp.so.10.3.2  [.] __gmpn_jacobi_base                                                                                                                                     
+    3.49%     3.24%  bench_internal  libgmp.so.10.3.2  [.] __gmpn_jacobi_n                                                                                                                                        
+    3.38%     3.30%  bench_internal  libgmp.so.10.3.2  [.] __gmpn_jacobi_2                                                                                                                                        
+    3.02%     2.76%  bench_internal  libgmp.so.10.3.2  [.] __gmpn_submul_1_coreisbr                                                                                                                               
+    2.63%     2.41%  bench_internal  libgmp.so.10.3.2  [.] __gmpn_matrix22_mul1_inverse_vector                                                                                                                    
+    2.47%     2.33%  bench_internal  libgmp.so.10.3.2  [.] __gmpz_import                                                                                                                                          
+    2.33%     2.33%  bench_internal  libc-2.29.so      [.] _int_free                                                                                                                                              
+    2.31%     2.07%  bench_internal  libgmp.so.10.3.2  [.] __gmpn_mul_1_coreisbr                                                                                                                                  
+    2.11%     2.09%  bench_internal  libc-2.29.so      [.] _int_malloc                                                                                                                                            
+    2.10%     2.00%  bench_internal  libc-2.29.so      [.] realloc                                                                                                                                                
+    1.78%     0.00%  bench_internal  [unknown]         [.] 0x0000000000000031                                                                                                                                     
+    1.70%     0.00%  bench_internal  [unknown]         [.] 0x34d15553d182f5f5                                                                                                                                     
+    1.48%     1.36%  bench_internal  libc-2.29.so      [.] _int_realloc                                                                                                                                           
+    1.42%     0.00%  bench_internal  [unknown]         [.] 0xdbdde3e7e9eff3f9                                                                                                                                     
+    1.42%     0.00%  bench_internal  [unknown]         [.] 0x301c1e78e20023fb                                                                                                                                     
+    1.18%     0.00%  bench_internal  [unknown]         [.] 0x5b0274c085480008                                                                                                                                     
+    1.18%     0.00%  bench_internal  [unknown]         [.] 0x0406020604020402                                                                                                                                     
+    1.15%     1.05%  bench_internal  libgmp.so.10.3.2  [.] __gmpn_copyi_core2                                                                                                                                     
+    1.15%     1.03%  bench_internal  libgmp.so.10.3.2  [.] __gmpz_inits                                                                                                                                           
+    1.14%     1.10%  bench_internal  libc-2.29.so      [.] malloc                                                                                                                                                 
+    1.03%     0.99%  bench_internal  libgmp.so.10.3.2  [.] __gmpz_clears                                                                                                                                          
+    0.89%     0.75%  bench_internal  libgmp.so.10.3.2  [.] __gmpz_realloc                                                                                                                                         
+    0.87%     0.00%  bench_internal  [unknown]         [.] 0x3b3d4347494f5359                                                                                                                                     
+    0.81%     0.00%  bench_internal  [unknown]         [.] 0000000000000000                                                                                                                                       
+    0.73%     0.00%  bench_internal  [unknown]         [.] 0x450b2616e79bd99c                                                                                                                                     
+    0.57%     0.00%  bench_internal  [unknown]         [.] 0x34d12f0e848242bb                                                                                                                                     
+    0.53%     0.39%  bench_internal  libc-2.29.so      [.] cfree@GLIBC_2.2.5                                                                                                                                      
+    0.47%     0.00%  bench_internal  [unknown]         [.] 0x0000000000000021                                                                                                                                     
+    0.42%     0.00%  bench_internal  [unknown]         [.] 0x6b87c1525966af45                                                                                                                                     
+    0.40%     0.00%  bench_internal  [unknown]         [.] 0x000000000000025c                                                                                                                                     
+    0.38%     0.38%  bench_internal  libc-2.29.so      [.] malloc_consolidate                                                                                                                                     
+    0.36%     0.24%  bench_internal  libgmp.so.10.3.2  [.] __gmp_default_allocate                                                                                                                                 
+    0.36%     0.26%  bench_internal  libgmp.so.10.3.2  [.] __gmp_default_reallocate                                                                                                                               
+    0.25%     0.00%  bench_internal  [unknown]         [.] 0x00005609b0ef0010                                                                                                                                     
+    0.18%     0.10%  bench_internal  bench_internal    [.] __gmpz_import@plt                                                                                                                                      
+    0.16%     0.12%  bench_internal  libc-2.29.so      [.] __memmove_avx_unaligned_erms                                                                                                                           
+    0.14%     0.14%  bench_internal  libgmp.so.10.3.2  [.] __gmp_default_free                                                                                                                                     
+    0.06%     0.06%  bench_internal  bench_internal    [.] __gmpz_jacobi@plt                                                                                                                                      
+    0.06%     0.06%  bench_internal  bench_internal    [.] __gmpz_inits@plt                                                                                                                                       
+    0.04%     0.04%  bench_internal  libc-2.29.so      [.] 0x00000000000254c4                                                                                                                                     
+    0.02%     0.02%  bench_internal  bench_internal    [.] __gmpz_clears@plt                                                                                                                                      
+    0.02%     0.00%  bench_internal  libc-2.29.so      [.] 0x00007f386150f4c0                                                                                                                                     
+    0.01%     0.01%  bench_internal  ld-2.29.so        [.] do_lookup_x                                                                                                                                            
+    0.01%     0.00%  bench_internal  [unknown]         [.] 0x00007f38617a7c70

without:

+   20.14%     0.00%  bench_internal  [unknown]         [.] 0x495641000144033d                                                                                                                                     
+   20.14%     0.00%  bench_internal  libc-2.29.so      [.] __libc_start_main                                                                                                                                      
+   20.14%     0.00%  bench_internal  bench_internal    [.] main                                                                                                                                                   
+   20.14%     0.00%  bench_internal  bench_internal    [.] run_benchmark                                                                                                                                          
+   15.52%    15.52%  bench_internal  libc-2.29.so      [.] _int_malloc                                                                                                                                            
+   14.86%    14.86%  bench_internal  libc-2.29.so      [.] _int_free                                                                                                                                              
+   12.65%    10.90%  bench_internal  libgmp.so.10.3.2  [.] __gmpz_import                                                                                                                                          
+   12.65%     0.00%  bench_internal  [unknown]         [.] 0x0000000000000031                                                                                                                                     
+   10.55%     9.59%  bench_internal  libc-2.29.so      [.] realloc                                                                                                                                                
+    8.48%     8.48%  bench_internal  libc-2.29.so      [.] _int_realloc                                                                                                                                           
+    6.11%     6.11%  bench_internal  libc-2.29.so      [.] malloc                                                                                                                                                 
+    5.78%     0.00%  bench_internal  [unknown]         [.] 0x3b3d4347494f5359                                                                                                                                     
+    5.76%     5.76%  bench_internal  libgmp.so.10.3.2  [.] __gmpn_copyi_core2                                                                                                                                     
+    5.29%     4.50%  bench_internal  libgmp.so.10.3.2  [.] __gmpz_inits                                                                                                                                           
+    5.29%     4.96%  bench_internal  libgmp.so.10.3.2  [.] __gmpz_realloc                                                                                                                                         
+    4.77%     4.60%  bench_internal  libgmp.so.10.3.2  [.] __gmpz_clears                                                                                                                                          
+    4.31%     2.54%  bench_internal  libc-2.29.so      [.] cfree@GLIBC_2.2.5                                                                                                                                      
+    4.16%     0.00%  bench_internal  [unknown]         [.] 0x5b0274c085480008                                                                                                                                     
+    4.16%     0.00%  bench_internal  [unknown]         [.] 0x0406020604020402                                                                                                                                     
+    3.94%     0.00%  bench_internal  [unknown]         [.] 0000000000000000                                                                                                                                       
+    3.20%     3.20%  bench_internal  bench_internal    [.] bench_num_jacobi                                                                                                                                       
+    3.19%     0.00%  bench_internal  [unknown]         [.] 0x0000000000000021                                                                                                                                     
+    2.86%     1.77%  bench_internal  libgmp.so.10.3.2  [.] __gmp_default_free                                                                                                                                     
+    2.26%     1.94%  bench_internal  libgmp.so.10.3.2  [.] __gmp_default_reallocate                                                                                                                               
+    1.75%     1.75%  bench_internal  libc-2.29.so      [.] malloc_consolidate                                                                                                                                     
+    1.44%     1.44%  bench_internal  libc-2.29.so      [.] __memmove_avx_unaligned_erms                                                                                                                           
+    1.26%     0.79%  bench_internal  libgmp.so.10.3.2  [.] __gmp_default_allocate                                                                                                                                 
+    0.96%     0.32%  bench_internal  bench_internal    [.] __gmpz_import@plt                                                                                                                                      
+    0.95%     0.17%  bench_internal  bench_internal    [.] __gmpz_clears@plt                                                                                                                                      
+    0.91%     0.00%  bench_internal  [unknown]         [.] 0x000055f5626b3010                                                                                                                                     
+    0.32%     0.32%  bench_internal  bench_internal    [.] __gmpz_inits@plt                                                                                                                                       
+    0.32%     0.32%  bench_internal  libc-2.29.so      [.] 0x00000000000254c0                                                                                                                                     
+    0.32%     0.00%  bench_internal  libc-2.29.so      [.] 0x00007f50fca414c4                                                                                                                                     
+    0.12%     0.12%  bench_internal  ld-2.29.so        [.] do_lookup_x                                                                                                                                            
+    0.12%     0.00%  bench_internal  [unknown]         [.] 0x00007f50fccd9c70                                                                                                                                     
+    0.05%     0.05%  bench_internal  ld-2.29.so        [.] strcmp                                                                                                                                                 
+    0.05%     0.00%  bench_internal  [unknown]         [.] 0x54495f0030312e6f                                                                                                                                     
+    0.01%     0.01%  bench_internal  [unknown]         [k] 0xffffffffb8e00b07                                                                                                                                     
+    0.01%     0.00%  bench_internal  ld-2.29.so        [.] _dl_start_user                                                                                                                                         
+    0.01%     0.00%  bench_internal  ld-2.29.so        [.] _dl_sysdep_start                                                                                                                                       
+    0.00%     0.00%  bench_internal  ld-2.29.so        [.] _start

It doesn't seem like it's doing anything except allocating memory

elichai · 2019-09-30T20:18:01Z

Looking at the assembly of field_normalize_weak. it generates quite a different instructions which i'm not sure what to make of them.
hypothetically it should work the same as we are using the output of the function (as the output for the next call).
I think that it might be trying to unroll some of the loop when the escaping isn't there (it's harder for it to unroll the loop when there's stuff there that it doesn't understand that can do anything it wants to the memory)

jonasnick

I'm getting similar numbers to @elichai. Interestingly with clang the jacobi computation isn't optimized out on master.

Unfortunately this PR doesn't solve the problem completely because the memory barrier as implemented here doesn't work with all compilers. In particular, it works with gcc and clang but not with Microsoft Visual C afaik. Ideally we would check the results of the individual benchmark computations in bench_teardown similar to how the other bench binaries work. Perhaps bench_internal could even print a note saying that the results are to be taken with a grain of salt if not on gcc/clang.

Why didn't you add the memory barrier to context_verify/sign? Doesn't seem to make a difference though.

jonasnick · 2019-10-07T12:31:08Z

src/bench_internal.c

@@ -27,6 +27,7 @@ typedef struct {
    int wnaf[256];
 } bench_inv;

+


nit: unnecessary newline?

jonasnick · 2019-10-07T12:32:05Z

src/bench.h

@@ -12,6 +12,10 @@
 #include <math.h>
 #include "sys/time.h"

+static void escape(void *p) {


nit: Perhaps rename to memory fence/barrier to make this more clear and add a comment what this is for?

Yeah i'll rename and add a comment, maybe even an inline hint?

elichai · 2019-10-07T12:50:57Z

Unfortunately this PR doesn't solve the problem completely because the memory barrier as implemented here doesn't work with all compilers. In particular, it works with gcc and clang but not with Microsoft Visual C afaik. Ideally we would check the results of the individual benchmark computations in bench_teardown similar to how the other bench binaries work. Perhaps bench_internal could even print a note saying that the results are to be taken with a grain of salt if not on gcc/clang.

I know almost nothing about windows/Microsoft's compiler.

Why didn't you add the memory barrier to context_verify/sign? Doesn't seem to make a difference though.

Wanted to get some feedback before I started going over the rest of the benchmarks, mostly wanted to make sure this doesn't affect the conclusions for BIP schnorr (I assume the jacobi symbol was tested as part of a full verification so the internal_benchmarks didn't affect those decisions)

real-or-random · 2019-10-07T14:34:47Z

Unfortunately this PR doesn't solve the problem completely because the memory barrier as implemented here doesn't work with all compilers. In particular, it works with gcc and clang but not with Microsoft Visual C afaik. Ideally we would check the results of the individual benchmark computations in bench_teardown similar to how the other bench binaries work. Perhaps bench_internal could even print a note saying that the results are to be taken with a grain of salt if not on gcc/clang.

I know almost nothing about windows/Microsoft's compiler.

https://github.com/google/benchmark/blob/7d97a057e16597e8020d0aca110480fe82c9ca67/include/benchmark/benchmark.h#L307 and the following lines may help

gmaxwell · 2019-10-08T01:48:19Z

elichai: the internal benchmarks are just that, internal benchmarks... no one should be using them for systems comparisons; this isn't an issue for bip schnorr.

I don't like this approach, it's very compiler specific and doesn't obviously preclude all sufficiently advanced optimizations-- and some platforms the memory barrier might do something stupidly expensive like flush the TLBs or something. It's better to set things up so that that data gets used.

I've sometimes seen microbenchmarks print a final value, which can sometimes help catch other bugs or changes. (e.g. you make some optimizations that shouldn't change behaviour and the final value changes). If that was required in order to get the results to print, it wouldn't be the end of the word.

For Jacobi simply carrying a running sum of the results would probably be fine.

As a side, the jacobi symbol in this code is currently GMP only, the QR test without GMP (e.g. as bitcoin ships) is done with a grievously slow constant time exponentiation ladder. ... that should probably be someone's priority, even if it's not an outright deployment blocker.

elichai · 2019-10-08T01:54:55Z

@gmaxwell I assumed so, just wanted to mention it just in case (bip schnorr doesn't specify a specific benchmark)

I'm not sure about the TLB thing because this memory barrier results in 0 more instructions. It's just to make the optimizer not optimize the result out (and by that also maybe the whole call graph)

About a different approach I can try and rewrite those benchmarks so that they all use the output from the function they're benchmarking and in the end of the loop assert on it in some way (so that it won't optimize from the end)

gmaxwell · 2019-10-08T17:38:30Z

Aside, http://sape.inf.usi.ch/publications/asplos09 should be required reading for anyone attempting to use microbenchmarks like these. (and perhaps https://github.com/ccurtsinger/stabilizer )

elichai · 2019-10-21T21:23:43Z

Ok, I tried to remove as much fences as possible and use simple logic to do the same.
in functions that the inputs and outputs were the same I added one fence at the end instead of every iteration (because at the end of the iterations the output isn't used).
Also added an __always_inline__ to the fence function.
if people prefer to just print the whole data structure at the end of the benchmarks this might allow us to remove most of the fences.

Now the stats look like this:

scalar_add: min 0.00757us / avg 0.00800us / max 0.00820us
scalar_negate: min 0.00288us / avg 0.00290us / max 0.00293us
scalar_sqr: min 0.0303us / avg 0.0307us / max 0.0312us
scalar_mul: min 0.0312us / avg 0.0316us / max 0.0321us
scalar_inverse: min 8.80us / avg 8.95us / max 9.07us
scalar_inverse_var: min 2.10us / avg 2.15us / max 2.31us
field_normalize: min 0.00745us / avg 0.00772us / max 0.00846us
field_normalize_weak: min 0.00299us / avg 0.00303us / max 0.00313us
field_sqr: min 0.0145us / avg 0.0149us / max 0.0153us
field_mul: min 0.0185us / avg 0.0189us / max 0.0193us
field_inverse: min 4.18us / avg 4.21us / max 4.26us
field_inverse_var: min 2.11us / avg 2.15us / max 2.18us
field_sqrt: min 4.07us / avg 4.10us / max 4.16us
group_double_var: min 0.124us / avg 0.127us / max 0.131us
group_add_var: min 0.290us / avg 0.292us / max 0.298us
group_add_affine: min 0.244us / avg 0.248us / max 0.252us
group_add_affine_var: min 0.203us / avg 0.206us / max 0.209us
group_jacobi_var: min 0.752us / avg 0.759us / max 0.797us
wnaf_const: min 0.116us / avg 0.120us / max 0.122us
ecmult_wnaf: min 0.443us / avg 0.453us / max 0.467us
hash_sha256: min 0.249us / avg 0.251us / max 0.255us
hash_hmac_sha256: min 1.01us / avg 1.02us / max 1.05us
hash_rfc6979_hmac_sha256: min 5.56us / avg 5.59us / max 5.67us
context_verify: min 2635us / avg 2646us / max 2660us
context_sign: min 27.3us / avg 27.9us / max 28.1us
num_jacobi: min 0.631us / avg 0.637us / max 0.646us
  Time (mean ± σ):      8.334 s ±  0.059 s    [User: 8.327 s, System: 0.000 s]
  Range (min … max):    8.185 s …  8.387 s    10 runs

elichai · 2019-10-21T21:28:03Z

src/bench_internal.c

-        secp256k1_scalar_split_lambda(&l, &r, &data->scalar_x);
-        secp256k1_scalar_add(&data->scalar_x, &data->scalar_x, &data->scalar_y);
+        secp256k1_scalar_split_lambda(&data->scalar_x, &data->scalar_y, &data->scalar_x);
+        j += secp256k1_scalar_add(&data->scalar_x, &data->scalar_x, &data->scalar_y);


I think this should be fine because it solves for x=a-y*lambda and y=(a-x)/lambda so adding them up again every time should not result back in the same a and should give a different result.

But please tell me if there's something here I'm missing(maybe it will make it go to zero very quickly? (though I don't think so because this is all in Fp so it overflows) but if it does I can replace the addition with multiplication if it's any better)

apoelstra · 2019-10-22T21:08:19Z

Worth revisiting #290 in light of this PR.

Perhaps my frustration at GMP being impossibly faster than my own implementation was just an illusion?

elichai · 2019-10-22T21:19:31Z

Maybe. I can try to rebase and see how the benchmarks compare (actually your implementation has side effect. so it makes sense (you use the CHECK macro in some of the functions))

apoelstra · 2019-10-22T21:34:41Z

I would be extremely curious to see how my implementation compares, assuming that a rebase is feasible. (The bulk of the code is in the new num_* files as I recall, but the integration with the rest of the library may need to be rewritten at this point..)

elichai · 2019-10-22T21:43:18Z

Will give it a try tomorrow. If the speed does come close it would be pretty awesome for the future schnorr verification (we need it to at least be better than secp256k1_fe_sqrt)

elichai · 2019-10-23T13:37:31Z

@apoelstra
https://github.com/elichai/secp256k1/tree/rebased-jacobi
Used hyperfine -w 5 -m 20 --show-output './bench_internal'
yours 64bit:

Benchmark #1: ./bench_internal
num_jacobi: min 4.12us / avg 4.16us / max 4.20us
num_jacobi: min 3.99us / avg 4.08us / max 4.10us
num_jacobi: min 4.11us / avg 4.12us / max 4.13us
num_jacobi: min 4.27us / avg 4.30us / max 4.36us
num_jacobi: min 4.18us / avg 4.19us / max 4.21us
num_jacobi: min 4.18us / avg 4.19us / max 4.19us
num_jacobi: min 4.28us / avg 4.33us / max 4.58us
num_jacobi: min 4.33us / avg 4.47us / max 4.62us
num_jacobi: min 4.26us / avg 4.32us / max 4.54us
num_jacobi: min 4.37us / avg 4.39us / max 4.41us
num_jacobi: min 4.15us / avg 4.30us / max 4.43us
num_jacobi: min 4.22us / avg 4.38us / max 4.56us
num_jacobi: min 4.34us / avg 4.46us / max 4.52us
num_jacobi: min 4.35us / avg 4.37us / max 4.39us
num_jacobi: min 4.17us / avg 4.23us / max 4.42us
num_jacobi: min 4.15us / avg 4.16us / max 4.18us
num_jacobi: min 4.09us / avg 4.13us / max 4.21us
num_jacobi: min 4.22us / avg 4.25us / max 4.27us
num_jacobi: min 4.17us / avg 4.22us / max 4.26us
num_jacobi: min 4.19us / avg 4.23us / max 4.28us
num_jacobi: min 4.19us / avg 4.22us / max 4.29us
num_jacobi: min 4.18us / avg 4.20us / max 4.26us
num_jacobi: min 4.18us / avg 4.21us / max 4.30us
num_jacobi: min 4.17us / avg 4.19us / max 4.22us
num_jacobi: min 4.17us / avg 4.18us / max 4.19us
  Time (mean ± σ):      8.542 s ±  0.199 s    [User: 8.533 s, System: 0.001 s]
  Range (min … max):    8.251 s …  8.934 s    20 runs

GMP:

Benchmark #1: ./bench_internal
num_jacobi: min 0.613us / avg 0.628us / max 0.636us
num_jacobi: min 0.600us / avg 0.612us / max 0.624us
num_jacobi: min 0.609us / avg 0.641us / max 0.708us
num_jacobi: min 0.644us / avg 0.676us / max 0.730us
num_jacobi: min 0.658us / avg 0.670us / max 0.705us
num_jacobi: min 0.627us / avg 0.635us / max 0.648us
num_jacobi: min 0.647us / avg 0.677us / max 0.739us
num_jacobi: min 0.657us / avg 0.679us / max 0.691us
num_jacobi: min 0.647us / avg 0.673us / max 0.700us
num_jacobi: min 0.638us / avg 0.680us / max 0.718us
num_jacobi: min 0.644us / avg 0.702us / max 0.784us
num_jacobi: min 0.647us / avg 0.713us / max 0.780us
num_jacobi: min 0.640us / avg 0.699us / max 0.792us
num_jacobi: min 0.647us / avg 0.689us / max 0.749us
num_jacobi: min 0.684us / avg 0.742us / max 0.819us
num_jacobi: min 0.672us / avg 0.691us / max 0.711us
num_jacobi: min 0.639us / avg 0.752us / max 0.806us
num_jacobi: min 0.679us / avg 0.719us / max 0.788us
num_jacobi: min 0.673us / avg 0.698us / max 0.749us
num_jacobi: min 0.668us / avg 0.684us / max 0.722us
num_jacobi: min 0.668us / avg 0.686us / max 0.724us
num_jacobi: min 0.639us / avg 0.652us / max 0.665us
num_jacobi: min 0.648us / avg 0.663us / max 0.706us
num_jacobi: min 0.649us / avg 0.658us / max 0.671us
num_jacobi: min 0.649us / avg 0.670us / max 0.698us
  Time (mean ± σ):      1.377 s ±  0.058 s    [User: 1.374 s, System: 0.000 s]
  Range (min … max):    1.270 s …  1.506 s    20 runs

apoelstra · 2019-10-23T21:07:01Z

Ok, 8x is better than 50x I suppose :)

jonasnick · 2019-10-28T10:31:24Z

Before progress on this PR stalls because of the the memory fences, how about we add just the num_jacobi fix for now (can be separate PR)?

real-or-random · 2019-10-28T10:38:58Z

Before progress on this PR stalls because of the the memory fences, how about we add just the num_jacobi fix for now (can be separate PR)?

I agree. Let's get this part merged and we can still have a discussion about memory fences in another PR maybe or on IRC.

elichai · 2019-10-28T14:31:28Z

Done #678

… memory fence 362bb25 Modified bench_scalar_split so it won't get optimized out (Elichai Turkel) 73a30c6 Added accumulators and checks on benchmarks so they won't get optimized out (Elichai Turkel) Pull request description: As asked #667 (comment) this is the parts of #667 that don't require an assembly memory fence. I splitted them to 2 commits, one with obvious easy ones. and another that changes the logic a bit to achieve this (See #667 (comment) ) ACKs for top commit: jonasnick: ACK 362bb25 real-or-random: ACK 362bb25 I read the diff and I ran the benchmarks Tree-SHA512: d5e47f5d64c3b035155276f057671ceb7f5852f24c7102fee4d0141aabebf882039f3eae0d152bae89d0603bc09fa6ad9f7bc6b8c0f74a668ee252c727517804

real-or-random · 2019-11-25T10:29:17Z

Now after #678 has been merged, this here is mostly related to memory fences. Let me note that my plan was to add a fence in #636 anyway for clearing memory ( with a #ifdef though).

By the way, the fence here is:

     __asm__ __volatile__("": : "g"(p) : "memory");

whereas the one in #636 (that I also added to Bitcoin Core in bitcoin/bitcoin#16158) is

    __asm__ __volatile__("" : : "r"(p) : "memory");

As far as I can tell from the docs, those should be equivalent but I'm really not an expert here.

real-or-random · 2020-04-27T14:52:18Z

As far as I can tell from the docs, those should be equivalent but I'm really not an expert here.

It seems that the g version is slighter cleaner because it does not require the compiler to allocate a register for a ASM noop, see https://bugs.llvm.org/show_bug.cgi?id=15495#c11.

edit: actually not sure in practice. I guess the pointer is usually in a register anywhere, and r/g doesn't make a difference then for GCC. LLVM will always pass the pointer via memory if I use g, so this will always need a useless memory access. For this PR, this is anyway not important but I would like to be careful in #636. I think keeping the r variant is better, also because this seems more commonly used, e.g., in the Linux kernel and in Core. So the change is higher that someone will notice if this breaks in the future.

sipa · 2023-05-08T15:11:01Z

I'm not sure there is an issue anymore; the jacobi benchmark was bad, but it has been fixed (and removed, and re-added, ...).

For added confidence, I think there are possibilities to avoiding the compiler optimizing some things away that don't actually need a memory fence (e.g. printing a value that's a function of all benchmark's success counters, or comparing success counters after reading through volatile like CHECK(*((volatile int*)&count) <= iters);).

elichai force-pushed the 2019-09-bench branch from 28a460e to 2bf75d4 Compare September 27, 2019 17:43

jonasnick reviewed Oct 7, 2019

View reviewed changes

Add memory fence to bench_internal

f423d2e

elichai force-pushed the 2019-09-bench branch from 2bf75d4 to 79dbc56 Compare October 21, 2019 21:19

elichai commented Oct 21, 2019

View reviewed changes

Remove memory fences where we can do better

a5198c5

elichai force-pushed the 2019-09-bench branch from 79dbc56 to a5198c5 Compare October 21, 2019 21:30

elichai mentioned this pull request Oct 23, 2019

Add native num implementation; modular inverse and Jacobi symbol without GMP #290

Closed

elichai mentioned this pull request Oct 28, 2019

Preventing compiler optimizations in benchmarks without a memory fence #678

Merged

elichai mentioned this pull request Mar 1, 2020

Implement go bindings to secp256k1 kaspanet/go-secp256k1#1

Merged

jonasnick mentioned this pull request Mar 24, 2020

Consider removing x86_64 field assembly #726

Closed

sipa closed this May 8, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Is the compiler optimizing out some of the benchmarks? #667

Is the compiler optimizing out some of the benchmarks? #667

elichai commented Sep 27, 2019 •

edited

Loading

jonasnick commented Sep 28, 2019

elichai commented Sep 30, 2019

elichai commented Sep 30, 2019 •

edited

Loading

jonasnick left a comment

jonasnick Oct 7, 2019

jonasnick Oct 7, 2019

elichai Oct 7, 2019 •

edited

Loading

elichai commented Oct 7, 2019

real-or-random commented Oct 7, 2019

gmaxwell commented Oct 8, 2019 •

edited

Loading

elichai commented Oct 8, 2019

gmaxwell commented Oct 8, 2019 •

edited

Loading

elichai commented Oct 21, 2019

elichai Oct 21, 2019

apoelstra commented Oct 22, 2019

elichai commented Oct 22, 2019

apoelstra commented Oct 22, 2019

elichai commented Oct 22, 2019 •

edited

Loading

elichai commented Oct 23, 2019

apoelstra commented Oct 23, 2019

jonasnick commented Oct 28, 2019

real-or-random commented Oct 28, 2019

elichai commented Oct 28, 2019

real-or-random commented Nov 25, 2019

real-or-random commented Apr 27, 2020 •

edited

Loading

sipa commented May 8, 2023

		@@ -27,6 +27,7 @@ typedef struct {
		int wnaf[256];
		} bench_inv;

Is the compiler optimizing out some of the benchmarks? #667

Is the compiler optimizing out some of the benchmarks? #667

Conversation

elichai commented Sep 27, 2019 • edited Loading

jonasnick commented Sep 28, 2019

elichai commented Sep 30, 2019

elichai commented Sep 30, 2019 • edited Loading

jonasnick left a comment

Choose a reason for hiding this comment

jonasnick Oct 7, 2019

Choose a reason for hiding this comment

jonasnick Oct 7, 2019

Choose a reason for hiding this comment

elichai Oct 7, 2019 • edited Loading

Choose a reason for hiding this comment

elichai commented Oct 7, 2019

real-or-random commented Oct 7, 2019

gmaxwell commented Oct 8, 2019 • edited Loading

elichai commented Oct 8, 2019

gmaxwell commented Oct 8, 2019 • edited Loading

elichai commented Oct 21, 2019

elichai Oct 21, 2019

Choose a reason for hiding this comment

apoelstra commented Oct 22, 2019

elichai commented Oct 22, 2019

apoelstra commented Oct 22, 2019

elichai commented Oct 22, 2019 • edited Loading

elichai commented Oct 23, 2019

apoelstra commented Oct 23, 2019

jonasnick commented Oct 28, 2019

real-or-random commented Oct 28, 2019

elichai commented Oct 28, 2019

real-or-random commented Nov 25, 2019

real-or-random commented Apr 27, 2020 • edited Loading

sipa commented May 8, 2023

elichai commented Sep 27, 2019 •

edited

Loading

elichai commented Sep 30, 2019 •

edited

Loading

elichai Oct 7, 2019 •

edited

Loading

gmaxwell commented Oct 8, 2019 •

edited

Loading

gmaxwell commented Oct 8, 2019 •

edited

Loading

elichai commented Oct 22, 2019 •

edited

Loading

real-or-random commented Apr 27, 2020 •

edited

Loading