-
Notifications
You must be signed in to change notification settings - Fork 1k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Is the compiler optimizing out some of the benchmarks? #667
Conversation
28a460e
to
2bf75d4
Compare
Interesting. |
So I ran perf only on the jacobi benchmarks. with and without the ASM escaping and these are the results:
without:
It doesn't seem like it's doing anything except allocating memory |
Looking at the assembly of |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm getting similar numbers to @elichai. Interestingly with clang the jacobi computation isn't optimized out on master.
Unfortunately this PR doesn't solve the problem completely because the memory barrier as implemented here doesn't work with all compilers. In particular, it works with gcc and clang but not with Microsoft Visual C afaik. Ideally we would check the results of the individual benchmark computations in bench_teardown
similar to how the other bench
binaries work. Perhaps bench_internal
could even print a note saying that the results are to be taken with a grain of salt if not on gcc/clang.
Why didn't you add the memory barrier to context_verify/sign
? Doesn't seem to make a difference though.
src/bench_internal.c
Outdated
@@ -27,6 +27,7 @@ typedef struct { | |||
int wnaf[256]; | |||
} bench_inv; | |||
|
|||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: unnecessary newline?
src/bench.h
Outdated
@@ -12,6 +12,10 @@ | |||
#include <math.h> | |||
#include "sys/time.h" | |||
|
|||
static void escape(void *p) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Perhaps rename to memory fence/barrier to make this more clear and add a comment what this is for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah i'll rename and add a comment, maybe even an inline hint?
I know almost nothing about windows/Microsoft's compiler.
Wanted to get some feedback before I started going over the rest of the benchmarks, mostly wanted to make sure this doesn't affect the conclusions for BIP schnorr (I assume the jacobi symbol was tested as part of a full verification so the internal_benchmarks didn't affect those decisions) |
https://github.com/google/benchmark/blob/7d97a057e16597e8020d0aca110480fe82c9ca67/include/benchmark/benchmark.h#L307 and the following lines may help |
elichai: the internal benchmarks are just that, internal benchmarks... no one should be using them for systems comparisons; this isn't an issue for bip schnorr. I don't like this approach, it's very compiler specific and doesn't obviously preclude all sufficiently advanced optimizations-- and some platforms the memory barrier might do something stupidly expensive like flush the TLBs or something. It's better to set things up so that that data gets used. I've sometimes seen microbenchmarks print a final value, which can sometimes help catch other bugs or changes. (e.g. you make some optimizations that shouldn't change behaviour and the final value changes). If that was required in order to get the results to print, it wouldn't be the end of the word. For Jacobi simply carrying a running sum of the results would probably be fine. As a side, the jacobi symbol in this code is currently GMP only, the QR test without GMP (e.g. as bitcoin ships) is done with a grievously slow constant time exponentiation ladder. ... that should probably be someone's priority, even if it's not an outright deployment blocker. |
@gmaxwell I assumed so, just wanted to mention it just in case (bip schnorr doesn't specify a specific benchmark) I'm not sure about the TLB thing because this memory barrier results in 0 more instructions. It's just to make the optimizer not optimize the result out (and by that also maybe the whole call graph) About a different approach I can try and rewrite those benchmarks so that they all use the output from the function they're benchmarking and in the end of the loop assert on it in some way (so that it won't optimize from the end) |
Aside, http://sape.inf.usi.ch/publications/asplos09 should be required reading for anyone attempting to use microbenchmarks like these. (and perhaps https://github.com/ccurtsinger/stabilizer ) |
2bf75d4
to
79dbc56
Compare
Ok, I tried to remove as much fences as possible and use simple logic to do the same. Now the stats look like this:
|
secp256k1_scalar_split_lambda(&l, &r, &data->scalar_x); | ||
secp256k1_scalar_add(&data->scalar_x, &data->scalar_x, &data->scalar_y); | ||
secp256k1_scalar_split_lambda(&data->scalar_x, &data->scalar_y, &data->scalar_x); | ||
j += secp256k1_scalar_add(&data->scalar_x, &data->scalar_x, &data->scalar_y); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think this should be fine because it solves for x=a-y*lambda
and y=(a-x)/lambda
so adding them up again every time should not result back in the same a
and should give a different result.
But please tell me if there's something here I'm missing(maybe it will make it go to zero very quickly? (though I don't think so because this is all in Fp so it overflows) but if it does I can replace the addition with multiplication if it's any better)
79dbc56
to
a5198c5
Compare
Worth revisiting #290 in light of this PR. Perhaps my frustration at GMP being impossibly faster than my own implementation was just an illusion? |
Maybe. I can try to rebase and see how the benchmarks compare (actually your implementation has side effect. so it makes sense (you use the CHECK macro in some of the functions)) |
I would be extremely curious to see how my implementation compares, assuming that a rebase is feasible. (The bulk of the code is in the new |
Will give it a try tomorrow. If the speed does come close it would be pretty awesome for the future schnorr verification (we need it to at least be better than |
@apoelstra
GMP:
|
Ok, 8x is better than 50x I suppose :) |
Before progress on this PR stalls because of the the memory fences, how about we add just the num_jacobi fix for now (can be separate PR)? |
I agree. Let's get this part merged and we can still have a discussion about memory fences in another PR maybe or on IRC. |
Done #678 |
… memory fence 362bb25 Modified bench_scalar_split so it won't get optimized out (Elichai Turkel) 73a30c6 Added accumulators and checks on benchmarks so they won't get optimized out (Elichai Turkel) Pull request description: As asked #667 (comment) this is the parts of #667 that don't require an assembly memory fence. I splitted them to 2 commits, one with obvious easy ones. and another that changes the logic a bit to achieve this (See #667 (comment) ) ACKs for top commit: jonasnick: ACK 362bb25 real-or-random: ACK 362bb25 I read the diff and I ran the benchmarks Tree-SHA512: d5e47f5d64c3b035155276f057671ceb7f5852f24c7102fee4d0141aabebf882039f3eae0d152bae89d0603bc09fa6ad9f7bc6b8c0f74a668ee252c727517804
Now after #678 has been merged, this here is mostly related to memory fences. Let me note that my plan was to add a fence in #636 anyway for clearing memory ( with a #ifdef though). By the way, the fence here is:
whereas the one in #636 (that I also added to Bitcoin Core in bitcoin/bitcoin#16158) is
As far as I can tell from the docs, those should be equivalent but I'm really not an expert here. |
It seems that the edit: actually not sure in practice. I guess the pointer is usually in a register anywhere, and |
I'm not sure there is an issue anymore; the jacobi benchmark was bad, but it has been fixed (and removed, and re-added, ...). For added confidence, I think there are possibilities to avoiding the compiler optimizing some things away that don't actually need a memory fence (e.g. printing a value that's a function of all benchmark's success counters, or comparing success counters after reading through volatile like |
I've been playing with the benchmarks and I started thinking about the optimizer.
Now most of the benchmarks reuse the variables from past iterations(thanks @apoelstra for showing me that:) ) but the compiler might still be able to figure out it can remove at least the last iteration.
another thing is the jacobi symbol calculation, that has 0 side effect, so I would assume the optimizer will just remove it (which it seems to do so).
I'm not sure what to make out of these results and I think I'll need to try and read the ASM itself to figure out what's going on.
But before:
After:
It does seem like there's a big difference in
num_jacobi
(from 0.0882us before to 0.656us)