
Cost analysis: Remove "Unacceptable" hack #6782

Merged 15 commits into WebAssembly:main on Jul 25, 2024
Conversation

@kripken (Member) commented on Jul 23, 2024:

We marked various expressions as having cost "Unacceptable", fixed at 100, to
ensure we never moved them out from an If arm, etc. Giving them such a high
cost avoids that problem - the cost is higher than the limit we have for moving
code from conditional to unconditional execution - but it also means the total
cost is unrealistic. For example, a function with one such instruction + an add
(cost 1) would end up with cost 101, and removing the add would look
insignificant, which causes issues for things that want to compare costs
(like Monomorphization).

To fix this, adjust some costs. The main change here is to give casts a cost of 5.
I measured this in depth (see the attached benchmark scripts), and it looks
clear that in both V8 and SpiderMonkey the cost of a cast is high enough that it
is not worth turning an if with a ref.test arm into a select (which would
always execute the test).

Other costs adjusted here matter a lot less, because they are on operations
that have side effects and so the optimizer will anyhow not move them from
conditional to unconditional execution, but I tried to make them a bit more
realistic while I was removing "Unacceptable":

  • Give most atomic operations the 10 cost we've been using for atomic loads/
    stores. Perhaps wait and notify should cost more, but assuming fast thread
    switching seems more relevant.
  • Give growth operations a cost of 20, and throw operations a cost of 10. These
    numbers are entirely made up as I am not even sure how to measure them in
    a useful way (but, again, this should not matter much as they have side
    effects).

I verified that building a large Java program with this PR causes 0 changes
to code, so this should not risk regressions, though in principle this is not NFC
(and it unlocks Monomorphization work because of that).
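
For concreteness, here is a minimal sketch of the kind of constants described
above. Only ThrowCost (and its comment) is quoted from the diff discussed later
in this thread; the other names, types, and exact values are illustrative, not
a verbatim copy of the patch.

// Sketch only, not the actual Binaryen source.
#include <cstdint>

using CostType = uint32_t;

// Casts such as ref.test/ref.cast: measured as costly enough in V8 and
// SpiderMonkey that we should not hoist them out of an if arm into a select.
static const CostType CastCost = 5;

// Most atomic operations, matching the existing cost for atomic loads/stores.
static const CostType AtomicCost = 10;

// Growth operations (memory.grow, table.grow): a rough guess; these have side
// effects, so the optimizer will not make them unconditional anyway.
static const CostType GrowthCost = 20;

// The cost of throwing a wasm exception. This does not include the cost of
// catching it (which might be in another function than the one we are
// considering).
static const CostType ThrowCost = 10;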

@kripken requested review from tlively and aheejin on July 23, 2024 20:05
@aheejin (Member) commented on Jul 23, 2024:

Would considering something like https://github.com/dfinity/ic/blob/master/rs/execution_environment/benches/wasm_instructions/WASM_BENCHMARKS.md be more informative than some made-up numbers?

@kripken (Member, Author) commented on Jul 23, 2024:

Interesting, thanks. Hmm, e.g. memory.fill having a cost of "98" there surely depends on the size of the fill they are doing... Though if they had numbers on casts, those could be very useful, but I don't see any, unfortunately. I also don't see atomics. (Edit: or throw.)

But separately it might make sense to update all our basic math costs at some point, maybe using their numbers in part. Another source of info could be the benchmark framework begun in this PR - the cost of 5 for casts is from there.

@tlively (Member) commented on Jul 23, 2024:

The dfinity people did warn us to take their numbers with a grain of salt. In particular, their runtime does NaN canonicalization, so their floating point operations are much more expensive than we would expect.

Creating our own benchmarks to get better informed numbers makes sense to me.

@kripken (Member, Author) commented on Jul 23, 2024:

Btw, here is the output of the script:

len time:          2489.531000003106

and time:          2947.7890000008583
iff-both time:     3103.6149999974114

or time:           2935.7989999977317
iff-either time:   3241.3820000004575

select time:       2698.045000000748
iff-nextor time:   2607.062000000701

select-three time: 4385.273999998848
iff-three time:    3377.1729999980234

Details of what those are can be found in the benchmark file, but overall: the first just computes the length of a linked list (a baseline to see the overhead of memory and the export call), and the rest are pairs of a non-if and an if, that is, patterns that either execute a ref.test unconditionally or only conditionally. And/Or are perhaps a bit faster than an If, but a Select might be slower even in the best case (the pair before last), while a worst-case Select is a lot worse (the last pair).

Overall these justify not executing a ref.test unconditionally even if that would shrink code size, hence the cost of 5 (the cost at and above which we don't do that).

(Numbers are on V8, but the pattern is similar in SpiderMonkey too.)

Comment on lines 67 to 68
// We'll call the benchmark functions in random orders.
function makeOrders(prefix) {
A reviewer (Member):

Is it standard benchmarking practice to run the benchmarks in random orders? Is this meant to defeat unwanted optimizations? Generating every possible order up front is not going to scale to a larger number of benchmarks. Is there something simpler and more scalable we can do?

kripken (Member, Author):

> Is it standard benchmarking practice to run the benchmarks in random orders?

There are other ways to deal with benchmark interactions, like running all the tests for A first, then all the tests for B, etc., rather than interleaving. But interleaving actually makes it more realistic since real-world code is mixed in with other stuff, and it's simple enough to handle here, so it seems appropriate to me.

> Is this meant to defeat unwanted optimizations?

Mainly to avoid order being an issue. Imagine that running A, B, C happens to have A warm up the cache for B, or B reset the branch predictor for C. Random orders avoid that.

> Generating every possible order up front is not going to scale to a larger number of benchmarks. Is there something simpler and more scalable we can do?

Yeah, past some point it can't work, but so long as we don't hit that limit it is faster to do it this way. The other way would be to generate an unbiased random order on the fly each time, which is not hard, but just takes more work.
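
For reference, the on-the-fly alternative mentioned above is just an unbiased
(Fisher-Yates) shuffle of the benchmark indices, so no list of all permutations
is ever materialized. A sketch in C++ (the actual script is JavaScript, and the
names here are illustrative):

#include <algorithm>
#include <numeric>
#include <random>
#include <vector>

// Return a uniformly random order of the benchmarks 0..numBenchmarks-1,
// generated fresh each time instead of enumerating every permutation.
std::vector<size_t> makeOrder(size_t numBenchmarks, std::mt19937& rng) {
  std::vector<size_t> order(numBenchmarks);
  std::iota(order.begin(), order.end(), size_t(0)); // 0, 1, 2, ...
  std::shuffle(order.begin(), order.end(), rng);    // unbiased Fisher-Yates
  return order;
}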

Comment on $makeC:

(func $makeC (export "makeC") (param $next (ref null $A)) (result anyref)
A reviewer (Member):

This is never called from the benchmarking script. Intentional?

kripken (Member, Author):

Yeah, having $C prevents the optimizer from thinking $B could be final. And so far there isn't a benchmark that uses $C. I'll add a comment.

Comment on lines +37 to +40
// The cost of throwing a wasm exception. This does not include the cost of
// catching it (which might be in another function than the one we are
// considering).
static const CostType ThrowCost = 10;
A reviewer (Member):
Even when pulling numbers out of nowhere, how do you divide the total cost between the catch and the throw? Neither executes without the other (assuming throws are caught at all).

kripken (Member, Author):

Yeah, I'm not sure about Throw. My intuition is this cost would be the total cost (including the catch), since we don't add any cost in Try.

Comment on lines -91 to -92
static_assert(TooCostlyToRunUnconditionally < CostAnalyzer::Unacceptable,
"We never run code unconditionally if it has unacceptable cost");
A reviewer (Member):

There are still instructions we know should not be run unconditionally if it can be avoided; is there something else we can replace this assertion with?

kripken (Member, Author):

Good idea, done.
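
For illustration, a replacement along these lines could tie the threshold to a
cost we still never want to run unconditionally, such as casts. This is a
hypothetical sketch that reuses the names from the removed assertion above; it
is not necessarily the exact assertion that landed in the PR.

// Hypothetical: ensure casts still clear the threshold for moving code from
// conditional to unconditional execution, mirroring the intent of the removed
// "Unacceptable" assertion.
static_assert(TooCostlyToRunUnconditionally <= CostAnalyzer::CastCost,
              "We never run casts unconditionally if we can avoid it");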

@kripken (Member, Author) commented on Jul 25, 2024:

All comments should be addressed - @tlively did you have anything else?

@tlively (Member) commented on Jul 25, 2024:

Nope, LGTM

@kripken merged commit 9cc1cb1 into WebAssembly:main on Jul 25, 2024
13 checks passed
@kripken deleted the newcost branch on July 25, 2024 18:16
@gkdn mentioned this pull request on Aug 31, 2024