Avoid quadratic behavior in pathological label-alias case in MachBuffer. #3469
Conversation
I don't know enough about `MachBuffer` per se to review this in isolation, but I trust you and your knowledge of `MachBuffer`, and that the invariants of `MachBuffer` are checked thoroughly enough internally, so this seems fine by me. Thanks for taking a look at the performance here!

Out of further curiosity: even 70ms for a function like this seems somewhat high. Is that still due to `MachBuffer` things, or is it a general "too much elbow grease is needed to bring that down further"?

I think it is mostly in the middle-end (analyses and optimizations), which will see the huge CFG with all the loops before it's reduced. The backend stages that are specifically broken out in the …
Fixes #3468.
If a program has many instances of the pattern "goto next; next:" in a
row (i.e., no-op branches to the fallthrough address), the branch
simplification in `MachBuffer` would remove them all, as expected.

However, in order to work correctly, the algorithm needs to track all
labels that alias the current buffer tail, so that they can be adjusted
later if another branch chomp occurs.
When many thousands of instances of this branch-to-next pattern occur,
many thousands of labels will reference the current buffer tail, and
this list of thousands of labels will be shuffled between the branch
metadata struct and the "labels at tail" struct as branches are appended
and then chomped immediately.
It's possible that with smarter data structure design, we could somehow
share the list of labels -- e.g., a single array of all labels, in order
they are bound, with ranges of indices in this array used to represent
lists of labels (actually, that seems like a better design in general);
but let's leave that to future optimization work.
For now, we can avoid the quadratic behavior by just "giving up" if the
list is too long; it's always valid to not optimize a branch. It is very
unlikely that the "normal" case will have more than 100 "goto next"
branches in a row, so this should not have any perf impact; even when it
does occur, we leave only 1 out of every 100 such branches un-optimized
in a long sequence of thousands.
This takes total compilation time down on my machine from ~300ms to
~72ms for the `foo.wasm` case in #3441. For reference, the old backend
(now removed), built from the arbitrarily-chosen, one-year-old commit
`c7fcc344`, takes 158ms, so we're ~twice as fast, which is what I would
expect.
(This PR also switches a few `static`s to `const`s just above where I
added a `const`, as a drive-by change.)