Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Limit split count per original bundle with fallback 1-to-N split. #59

Merged
merged 2 commits into from
Jun 27, 2022

Conversation

cfallin
Copy link
Member

@cfallin cfallin commented Jun 26, 2022

Right now, splitting a bundle produces two halves. Furthermore, it has
cost linear in the length of the bundle, because the resulting
half-bundles have their requirements recomputed with a new scan, and
because we copy half the use-list over to the tail end sub-bundle.

This works fine when a bundle has a handful of splits overall, but not
when an input has a systematic pattern of conflicts that will require
O(|bundle|) splits (e.g., every Use is constrained to a different fixed
register than the last one). In such a case, we get quadratic behavior.

This PR adds a per-spillset (so, per-original-bundle) counter for
splits, and when it reaches a preset threshold (10 for now), we instead
split directly into minimal bundles along the whole length of the
bundle, putting the regions without uses in the spill bundle.

This basically approximates what a non-splitting allocator would do: it
"spills" the whole bundle to possibly a stackslot, or a second-chance
register allocation at best, via the spill bundle; and then does minimal
reservations of registers just at uses/defs and moves the "spilled"
value into/out of them immediately.

Together with another small optimization, this PR results in a 4x
compilation speedup and 24x memory use reduction on one particularly bad
case with alternating conflicting requirements on a vreg (see
bytecodealliance/wasmtime#4291 for details).

@cfallin cfallin requested review from alexcrichton and fitzgen June 26, 2022 18:01
@cfallin
Copy link
Member Author

cfallin commented Jun 26, 2022

Apologies to whomever reviews this for the hairy innards of the LR/bundle representation and splitting!

It's been fuzzing overnight on my 12-core system without finding anything (after a few initial fuzzbugs yesterday that guided some corner-cases here) so it seems relatively stable...

cfallin added a commit to cfallin/wasmtime that referenced this pull request Jun 26, 2022
- Handle call instructions' clobbers with the clobbers API, using RA2's
  clobbers bitmask (bytecodealliance/regalloc2#58) rather than clobbers
  list;

- Pull in changes from bytecodealliance/regalloc2#59 for much more sane
  edge-case behavior w.r.t. liverange splitting.

Currently refers to local path for RA2 so will fail CI; I will update
this once RA2 PRs 58 and 59 are merged, version-bumped and released.

Fixes bytecodealliance#4291.
Copy link
Member

@alexcrichton alexcrichton left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Those are some awesome wins! Unfortunately though I'm pretty unfamiliar with the implementation here so it would take me quite some time to get up to speed and able to review this. I had a random few stylistic comments and while at a high level this seems reasonable I'll leave it to @fitzgen to review the internals.

src/ion/process.rs Outdated Show resolved Hide resolved
let mut last_inst: Option<Inst> = None;
let mut last_vreg: Option<VRegIndex> = None;

for entry_idx in 0..self.bundles[bundle.index()].ranges.len() {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Random stylistic comment but you can impl<T> Index<MyLocalType> for [T] which could help change all the indexing here to self.bundles[bundle] (or even self[bundle]) rather than self.bundles[bundle.index()]

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I briefly played with this but it seems we would need wrapper types around the Vec<T> to make this work properly. I'd love to have this eventually but it's a bigger refactor (all use-sites across the allocator) than I want to bite off right now; I'll file an issue!

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

#62

src/ion/process.rs Outdated Show resolved Hide resolved
Comment on lines +776 to +805
for use_idx in 0..self.ranges[entry.index.index()].uses.len() {
let u = self.ranges[entry.index.index()].uses[use_idx];
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could also do the impl Index thing here as well.

cfallin added 2 commits June 27, 2022 12:29
Right now, splitting a bundle produces two halves. Furthermore, it has
cost linear in the length of the bundle, because the resulting
half-bundles have their requirements recomputed with a new scan, and
because we copy half the use-list over to the tail end sub-bundle.

This works fine when a bundle has a handful of splits overall, but not
when an input has a systematic pattern of conflicts that will require
O(|bundle|) splits (e.g., every Use is constrained to a different fixed
register than the last one). In such a case, we get quadratic behavior.

This PR adds a per-spillset (so, per-original-bundle) counter for
splits, and when it reaches a preset threshold (10 for now), we instead
split directly into minimal bundles along the whole length of the
bundle, putting the regions without uses in the spill bundle.

This basically approximates what a non-splitting allocator would do: it
"spills" the whole bundle to possibly a stackslot, or a second-chance
register allocation at best, via the spill bundle; and then does minimal
reservations of registers just at uses/defs and moves the "spilled"
value into/out of them immediately.

Together with another small optimization, this PR results in a 4x
compilation speedup and 24x memory use reduction on one particularly bad
case with alternating conflicting requirements on a vreg (see
bytecodealliance/wasmtime#4291 for details).
@cfallin cfallin force-pushed the only-a-little-bit-of-splitting branch from 527f024 to ac04044 Compare June 27, 2022 20:01
@cfallin cfallin merged commit 4eb2a25 into bytecodealliance:main Jun 27, 2022
@cfallin cfallin deleted the only-a-little-bit-of-splitting branch June 27, 2022 20:23
cfallin added a commit to cfallin/regalloc2 that referenced this pull request Jun 27, 2022
Includes improvements in splitting performance (bytecodealliance#59) and more efficient
handling of clobber lists (bytecodealliance#58).

Semver break because of API change in bytecodealliance#58.
@cfallin cfallin mentioned this pull request Jun 27, 2022
cfallin added a commit that referenced this pull request Jun 27, 2022
Includes improvements in splitting performance (#59) and more efficient
handling of clobber lists (#58).

Semver break because of API change in #58.
cfallin added a commit to cfallin/wasmtime that referenced this pull request Jun 27, 2022
- Handle call instructions' clobbers with the clobbers API, using RA2's
  clobbers bitmask (bytecodealliance/regalloc2#58) rather than clobbers
  list;

- Pull in changes from bytecodealliance/regalloc2#59 for much more sane
  edge-case behavior w.r.t. liverange splitting.
cfallin added a commit to cfallin/wasmtime that referenced this pull request Jun 27, 2022
- Handle call instructions' clobbers with the clobbers API, using RA2's
  clobbers bitmask (bytecodealliance/regalloc2#58) rather than clobbers
  list;

- Pull in changes from bytecodealliance/regalloc2#59 for much more sane
  edge-case behavior w.r.t. liverange splitting.
cfallin added a commit to bytecodealliance/wasmtime that referenced this pull request Jun 28, 2022
- Handle call instructions' clobbers with the clobbers API, using RA2's
  clobbers bitmask (bytecodealliance/regalloc2#58) rather than clobbers
  list;

- Pull in changes from bytecodealliance/regalloc2#59 for much more sane
  edge-case behavior w.r.t. liverange splitting.
afonso360 pushed a commit to afonso360/wasmtime that referenced this pull request Jun 30, 2022
- Handle call instructions' clobbers with the clobbers API, using RA2's
  clobbers bitmask (bytecodealliance/regalloc2#58) rather than clobbers
  list;

- Pull in changes from bytecodealliance/regalloc2#59 for much more sane
  edge-case behavior w.r.t. liverange splitting.
cfallin added a commit to cfallin/regalloc2 that referenced this pull request Sep 9, 2022
…clobbers with constraints.

- Allow early-pos uses with fixed regs that conflict with
  clobbers (which happen at late-pos), in addition to the
  existing logic for conflicts with late-pos defs with fixed
  regs.

  This is a pretty subtle issue that was uncovered in bytecodealliance#53 for the def
  case, and the fix here is the mirror of that fix for clobbers. The
  root cause for all this complexity is that we can't split in the
  middle of an instruction (because there's no way to insert a move
  there!) so if a use is live-downward, we can't let it live in preg A
  at early-pos and preg B != A at late-pos; instead we need to rewrite
  the constraints and use a fixup move.

  The earlier change to fix bytecodealliance#53 was actually a bit too conservative in
  that it always applied when such conflicts existed, even if the
  downward arg was not live. This PR fixes that (it's fine for the
  early-use and late-def to be fixed to the same reg if the use's
  liverange ends after early-pos) and adapts the same flexibility to
  the clobbers case as well.

- Handles safepoints that happen at call instructions where the above
  rewrite occurred.

  The above case is handled by moving the fixed-reg def to early-pos,
  to (still) reserve the fixed reg for the whole instruction, and
  allow the fixup move to actually put the used vreg in the reg until
  the def overwrites it.

  However if a def defines a reffy vreg (that is, one that is
  reference-typed), the reffy value is falsely seen as *live* by the
  safepoint at the early-pos. It's not actually live until the
  late-pos, so we must exclude it. We can detect this case by looking
  for the `StartsAtDef` flag on the LR; actually even for early-pos
  defs this logic is still valid, because the safepoint logically sees
  a snapshot of state just *before* the instruction (so before the
  early-pos).

- Fixes the last-resort split-bundle-into-minimal-pieces logic from
  bytecodealliance#59 to properly limit a minimal bundle piece to end after the
  early-pos, rather than cover the entire instruction. This was causing
  artificial overlaps between args that end after early-pos and defs
  that start at late-pos when one of the vregs hit the fallback split
  behavior.
cfallin added a commit to cfallin/regalloc2 that referenced this pull request Sep 12, 2022
…clobbers with constraints.

- Allow early-pos uses with fixed regs that conflict with
  clobbers (which happen at late-pos), in addition to the
  existing logic for conflicts with late-pos defs with fixed
  regs.

  This is a pretty subtle issue that was uncovered in bytecodealliance#53 for the def
  case, and the fix here is the mirror of that fix for clobbers. The
  root cause for all this complexity is that we can't split in the
  middle of an instruction (because there's no way to insert a move
  there!) so if a use is live-downward, we can't let it live in preg A
  at early-pos and preg B != A at late-pos; instead we need to rewrite
  the constraints and use a fixup move.

  The earlier change to fix bytecodealliance#53 was actually a bit too conservative in
  that it always applied when such conflicts existed, even if the
  downward arg was not live. This PR fixes that (it's fine for the
  early-use and late-def to be fixed to the same reg if the use's
  liverange ends after early-pos) and adapts the same flexibility to
  the clobbers case as well.

- Reworks the fixups for the def case mentioned above to not shift the
  def to the Early point. Doing so causes issues when the def is to a
  reffy vreg: it can then be falsely included in a stackmap if the
  instruction containing this operand is a safepoint.

- Fixes the last-resort split-bundle-into-minimal-pieces logic from
  bytecodealliance#59 to properly limit a minimal bundle piece to end after the
  early-pos, rather than cover the entire instruction. This was causing
  artificial overlaps between args that end after early-pos and defs
  that start at late-pos when one of the vregs hit the fallback split
  behavior.
cfallin added a commit to cfallin/regalloc2 that referenced this pull request Sep 19, 2022
…clobbers with constraints.

- Allow early-pos uses with fixed regs that conflict with
  clobbers (which happen at late-pos), in addition to the
  existing logic for conflicts with late-pos defs with fixed
  regs.

  This is a pretty subtle issue that was uncovered in bytecodealliance#53 for the def
  case, and the fix here is the mirror of that fix for clobbers. The
  root cause for all this complexity is that we can't split in the
  middle of an instruction (because there's no way to insert a move
  there!) so if a use is live-downward, we can't let it live in preg A
  at early-pos and preg B != A at late-pos; instead we need to rewrite
  the constraints and use a fixup move.

  The earlier change to fix bytecodealliance#53 was actually a bit too conservative in
  that it always applied when such conflicts existed, even if the
  downward arg was not live. This PR fixes that (it's fine for the
  early-use and late-def to be fixed to the same reg if the use's
  liverange ends after early-pos) and adapts the same flexibility to
  the clobbers case as well.

- Reworks the fixups for the def case mentioned above to not shift the
  def to the Early point. Doing so causes issues when the def is to a
  reffy vreg: it can then be falsely included in a stackmap if the
  instruction containing this operand is a safepoint.

- Fixes the last-resort split-bundle-into-minimal-pieces logic from
  bytecodealliance#59 to properly limit a minimal bundle piece to end after the
  early-pos, rather than cover the entire instruction. This was causing
  artificial overlaps between args that end after early-pos and defs
  that start at late-pos when one of the vregs hit the fallback split
  behavior.
cfallin added a commit that referenced this pull request Sep 20, 2022
…clobbers with constraints. (#74)

* Some fixes to allow for call instructions to name args, returns, and clobbers with constraints.

- Allow early-pos uses with fixed regs that conflict with
  clobbers (which happen at late-pos), in addition to the
  existing logic for conflicts with late-pos defs with fixed
  regs.

  This is a pretty subtle issue that was uncovered in #53 for the def
  case, and the fix here is the mirror of that fix for clobbers. The
  root cause for all this complexity is that we can't split in the
  middle of an instruction (because there's no way to insert a move
  there!) so if a use is live-downward, we can't let it live in preg A
  at early-pos and preg B != A at late-pos; instead we need to rewrite
  the constraints and use a fixup move.

  The earlier change to fix #53 was actually a bit too conservative in
  that it always applied when such conflicts existed, even if the
  downward arg was not live. This PR fixes that (it's fine for the
  early-use and late-def to be fixed to the same reg if the use's
  liverange ends after early-pos) and adapts the same flexibility to
  the clobbers case as well.

- Reworks the fixups for the def case mentioned above to not shift the
  def to the Early point. Doing so causes issues when the def is to a
  reffy vreg: it can then be falsely included in a stackmap if the
  instruction containing this operand is a safepoint.

- Fixes the last-resort split-bundle-into-minimal-pieces logic from
  #59 to properly limit a minimal bundle piece to end after the
  early-pos, rather than cover the entire instruction. This was causing
  artificial overlaps between args that end after early-pos and defs
  that start at late-pos when one of the vregs hit the fallback split
  behavior.

* Fix fuzzbug: do not merge when a liverange has a fixed-reg def.

This can create impossible situations: e.g., if a vreg is constrained
to p0 as a late-def, and another, completely different vreg is
constrained to p0 as an early-use on the same instruction, and the
instruction also has a third vreg (early-use), we do not want to merge
the liverange for the third vreg with the first, because it would
result in an unsolveable conflict for p0 at the early-point.

* Review comments.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants