Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid quadratic behavior in pathological label-alias case in MachBuffer. #3469

Merged
merged 1 commit into from
Oct 21, 2021

Conversation

cfallin
Copy link
Member

@cfallin cfallin commented Oct 21, 2021

Fixes #3468.

If a program has many instances of the pattern "goto next; next:" in a
row (i.e., no-op branches to the fallthrough address), the branch
simplification in MachBuffer would remove them all, as expected.
However, in order to work correctly, the algorithm needs to track all
labels that alias the current buffer tail, so that they can be adjusted
later if another branch chomp occurs.

When many thousands of this branch-to-next pattern occur, many thousands
of labels will reference the current buffer tail, and this list of
thousands of labels will be shuffled between the branch metadata struct
and the "labels at tail" struct as branches are appended and then
chomped immediately.

It's possible that with smarter data structure design, we could somehow
share the list of labels -- e.g., a single array of all labels, in order
they are bound, with ranges of indices in this array used to represent
lists of labels (actually, that seems like a better design in general);
but let's leave that to future optimization work.

For now, we can avoid the quadratic behavior by just "giving up" if the
list is too long; it's always valid to not optimize a branch. It is very
unlikely that the "normal" case will have more than 100 "goto next"
branches in a row, so this should not have any perf impact; if it does,
we will leave 1 out of every 100 such branches un-optimized in a long
sequence of thousands.

This takes total compilation time down on my machine from ~300ms to
~72ms for the foo.wasm case in #3441. For reference, the old backend
(now removed), built from arbitrarily-chosen-1-year-old commit
c7fcc344, takes 158ms, so we're ~twice as fast, which is what I would
expect.

(This PR also switches a few statics to consts just above where I added
a const, as s drive-by change.)

Fixes bytecodealliance#3468.

If a program has many instances of the pattern "goto next; next:" in a
row (i.e., no-op branches to the fallthrough address), the branch
simplification in `MachBuffer` would remove them all, as expected.
However, in order to work correctly, the algorithm needs to track all
labels that alias the current buffer tail, so that they can be adjusted
later if another branch chomp occurs.

When many thousands of this branch-to-next pattern occur, many thousands
of labels will reference the current buffer tail, and this list of
thousands of labels will be shuffled between the branch metadata struct
and the "labels at tail" struct as branches are appended and then
chomped immediately.

It's possible that with smarter data structure design, we could somehow
share the list of labels -- e.g., a single array of all labels, in order
they are bound, with ranges of indices in this array used to represent
lists of labels (actually, that seems like a better design in general);
but let's leave that to future optimization work.

For now, we can avoid the quadratic behavior by just "giving up" if the
list is too long; it's always valid to not optimize a branch. It is very
unlikely that the "normal" case will have more than 100 "goto next"
branches in a row, so this should not have any perf impact; if it does,
we will leave 1 out of every 100 such branches un-optimized in a long
sequence of thousands.

This takes total compilation time down on my machine from ~300ms to
~72ms for the `foo.wasm` case in bytecodealliance#3441. For reference, the old backend
(now removed), built from arbitrarily-chosen-1-year-old commit
`c7fcc344`, takes 158ms, so we're ~twice as fast, which is what I would
expect.
@cfallin cfallin requested a review from alexcrichton October 21, 2021 19:17
@github-actions github-actions bot added cranelift Issues related to the Cranelift code generator cranelift:area:machinst Issues related to instruction selection and the new MachInst backend. labels Oct 21, 2021
Copy link
Member

@alexcrichton alexcrichton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't know enough about MachBuffer per se to review this in isolation, but I trust you and your knowledge of MachBuffer and that the invariants of MachBuffer are checked thoroughly enough internally so this seems fine by me. Thanks for taking a look at the performance here!

Out of further curiosity, even 70ms for a function like this seems somewhat high, is that still due to MachBuffer things or is it general "too much elbow grease is needed to bring that down further"

@cfallin
Copy link
Member Author

cfallin commented Oct 21, 2021

Out of further curiosity, even 70ms for a function like this seems somewhat high, is that still due to MachBuffer things or is it general "too much elbow grease is needed to bring that down further"

I think it is mostly in the middle-end (analyses and optimizations), which will see the huge CFG with all the loops before it's reduced. The backend stages that are specifically broken out in the clif-util wasm -T output show: 3ms in CLIF -> VCode lowering; 4ms in regalloc; and 4ms in binary emission (MachBuffer + cpu-specific instruction encoding code). So only 11ms (EDIT: I can add I promise) in the "backend" and the rest in attempted optimization.

A perf profile of the compilation shows a lot of time in the kernel's pagefault path, so I think that just writing out the data structures has some overhead (for the large function body). I imagine we could probably be smarter about early optimizations that cut down the amount of work the later stages have to do; but nothing immediately obviously or anomalously bad is happening here, I think.

@cfallin cfallin merged commit 54896ac into bytecodealliance:main Oct 21, 2021
@cfallin cfallin deleted the machbuffer-quadratic-labels branch October 21, 2021 20:37
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cranelift:area:machinst Issues related to instruction selection and the new MachInst backend. cranelift Issues related to the Cranelift code generator
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Suboptimal behavior in branch simplification (MachBuffer) when thousands of labels alias the same offset
3 participants