-
Notifications
You must be signed in to change notification settings - Fork 41
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Limit split count per original bundle with fallback 1-to-N split. #59
Limit split count per original bundle with fallback 1-to-N split. #59
Conversation
Apologies to whomever reviews this for the hairy innards of the LR/bundle representation and splitting! It's been fuzzing overnight on my 12-core system without finding anything (after a few initial fuzzbugs yesterday that guided some corner-cases here) so it seems relatively stable... |
- Handle call instructions' clobbers with the clobbers API, using RA2's clobbers bitmask (bytecodealliance/regalloc2#58) rather than clobbers list; - Pull in changes from bytecodealliance/regalloc2#59 for much more sane edge-case behavior w.r.t. liverange splitting. Currently refers to local path for RA2 so will fail CI; I will update this once RA2 PRs 58 and 59 are merged, version-bumped and released. Fixes bytecodealliance#4291.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Those are some awesome wins! Unfortunately though I'm pretty unfamiliar with the implementation here so it would take me quite some time to get up to speed and able to review this. I had a random few stylistic comments and while at a high level this seems reasonable I'll leave it to @fitzgen to review the internals.
let mut last_inst: Option<Inst> = None; | ||
let mut last_vreg: Option<VRegIndex> = None; | ||
|
||
for entry_idx in 0..self.bundles[bundle.index()].ranges.len() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Random stylistic comment but you can impl<T> Index<MyLocalType> for [T]
which could help change all the indexing here to self.bundles[bundle]
(or even self[bundle]
) rather than self.bundles[bundle.index()]
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I briefly played with this but it seems we would need wrapper types around the Vec<T>
to make this work properly. I'd love to have this eventually but it's a bigger refactor (all use-sites across the allocator) than I want to bite off right now; I'll file an issue!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
for use_idx in 0..self.ranges[entry.index.index()].uses.len() { | ||
let u = self.ranges[entry.index.index()].uses[use_idx]; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could also do the impl Index
thing here as well.
Right now, splitting a bundle produces two halves. Furthermore, it has cost linear in the length of the bundle, because the resulting half-bundles have their requirements recomputed with a new scan, and because we copy half the use-list over to the tail end sub-bundle. This works fine when a bundle has a handful of splits overall, but not when an input has a systematic pattern of conflicts that will require O(|bundle|) splits (e.g., every Use is constrained to a different fixed register than the last one). In such a case, we get quadratic behavior. This PR adds a per-spillset (so, per-original-bundle) counter for splits, and when it reaches a preset threshold (10 for now), we instead split directly into minimal bundles along the whole length of the bundle, putting the regions without uses in the spill bundle. This basically approximates what a non-splitting allocator would do: it "spills" the whole bundle to possibly a stackslot, or a second-chance register allocation at best, via the spill bundle; and then does minimal reservations of registers just at uses/defs and moves the "spilled" value into/out of them immediately. Together with another small optimization, this PR results in a 4x compilation speedup and 24x memory use reduction on one particularly bad case with alternating conflicting requirements on a vreg (see bytecodealliance/wasmtime#4291 for details).
527f024
to
ac04044
Compare
Includes improvements in splitting performance (bytecodealliance#59) and more efficient handling of clobber lists (bytecodealliance#58). Semver break because of API change in bytecodealliance#58.
- Handle call instructions' clobbers with the clobbers API, using RA2's clobbers bitmask (bytecodealliance/regalloc2#58) rather than clobbers list; - Pull in changes from bytecodealliance/regalloc2#59 for much more sane edge-case behavior w.r.t. liverange splitting.
- Handle call instructions' clobbers with the clobbers API, using RA2's clobbers bitmask (bytecodealliance/regalloc2#58) rather than clobbers list; - Pull in changes from bytecodealliance/regalloc2#59 for much more sane edge-case behavior w.r.t. liverange splitting.
- Handle call instructions' clobbers with the clobbers API, using RA2's clobbers bitmask (bytecodealliance/regalloc2#58) rather than clobbers list; - Pull in changes from bytecodealliance/regalloc2#59 for much more sane edge-case behavior w.r.t. liverange splitting.
- Handle call instructions' clobbers with the clobbers API, using RA2's clobbers bitmask (bytecodealliance/regalloc2#58) rather than clobbers list; - Pull in changes from bytecodealliance/regalloc2#59 for much more sane edge-case behavior w.r.t. liverange splitting.
…clobbers with constraints. - Allow early-pos uses with fixed regs that conflict with clobbers (which happen at late-pos), in addition to the existing logic for conflicts with late-pos defs with fixed regs. This is a pretty subtle issue that was uncovered in bytecodealliance#53 for the def case, and the fix here is the mirror of that fix for clobbers. The root cause for all this complexity is that we can't split in the middle of an instruction (because there's no way to insert a move there!) so if a use is live-downward, we can't let it live in preg A at early-pos and preg B != A at late-pos; instead we need to rewrite the constraints and use a fixup move. The earlier change to fix bytecodealliance#53 was actually a bit too conservative in that it always applied when such conflicts existed, even if the downward arg was not live. This PR fixes that (it's fine for the early-use and late-def to be fixed to the same reg if the use's liverange ends after early-pos) and adapts the same flexibility to the clobbers case as well. - Handles safepoints that happen at call instructions where the above rewrite occurred. The above case is handled by moving the fixed-reg def to early-pos, to (still) reserve the fixed reg for the whole instruction, and allow the fixup move to actually put the used vreg in the reg until the def overwrites it. However if a def defines a reffy vreg (that is, one that is reference-typed), the reffy value is falsely seen as *live* by the safepoint at the early-pos. It's not actually live until the late-pos, so we must exclude it. We can detect this case by looking for the `StartsAtDef` flag on the LR; actually even for early-pos defs this logic is still valid, because the safepoint logically sees a snapshot of state just *before* the instruction (so before the early-pos). - Fixes the last-resort split-bundle-into-minimal-pieces logic from bytecodealliance#59 to properly limit a minimal bundle piece to end after the early-pos, rather than cover the entire instruction. This was causing artificial overlaps between args that end after early-pos and defs that start at late-pos when one of the vregs hit the fallback split behavior.
…clobbers with constraints. - Allow early-pos uses with fixed regs that conflict with clobbers (which happen at late-pos), in addition to the existing logic for conflicts with late-pos defs with fixed regs. This is a pretty subtle issue that was uncovered in bytecodealliance#53 for the def case, and the fix here is the mirror of that fix for clobbers. The root cause for all this complexity is that we can't split in the middle of an instruction (because there's no way to insert a move there!) so if a use is live-downward, we can't let it live in preg A at early-pos and preg B != A at late-pos; instead we need to rewrite the constraints and use a fixup move. The earlier change to fix bytecodealliance#53 was actually a bit too conservative in that it always applied when such conflicts existed, even if the downward arg was not live. This PR fixes that (it's fine for the early-use and late-def to be fixed to the same reg if the use's liverange ends after early-pos) and adapts the same flexibility to the clobbers case as well. - Reworks the fixups for the def case mentioned above to not shift the def to the Early point. Doing so causes issues when the def is to a reffy vreg: it can then be falsely included in a stackmap if the instruction containing this operand is a safepoint. - Fixes the last-resort split-bundle-into-minimal-pieces logic from bytecodealliance#59 to properly limit a minimal bundle piece to end after the early-pos, rather than cover the entire instruction. This was causing artificial overlaps between args that end after early-pos and defs that start at late-pos when one of the vregs hit the fallback split behavior.
…clobbers with constraints. - Allow early-pos uses with fixed regs that conflict with clobbers (which happen at late-pos), in addition to the existing logic for conflicts with late-pos defs with fixed regs. This is a pretty subtle issue that was uncovered in bytecodealliance#53 for the def case, and the fix here is the mirror of that fix for clobbers. The root cause for all this complexity is that we can't split in the middle of an instruction (because there's no way to insert a move there!) so if a use is live-downward, we can't let it live in preg A at early-pos and preg B != A at late-pos; instead we need to rewrite the constraints and use a fixup move. The earlier change to fix bytecodealliance#53 was actually a bit too conservative in that it always applied when such conflicts existed, even if the downward arg was not live. This PR fixes that (it's fine for the early-use and late-def to be fixed to the same reg if the use's liverange ends after early-pos) and adapts the same flexibility to the clobbers case as well. - Reworks the fixups for the def case mentioned above to not shift the def to the Early point. Doing so causes issues when the def is to a reffy vreg: it can then be falsely included in a stackmap if the instruction containing this operand is a safepoint. - Fixes the last-resort split-bundle-into-minimal-pieces logic from bytecodealliance#59 to properly limit a minimal bundle piece to end after the early-pos, rather than cover the entire instruction. This was causing artificial overlaps between args that end after early-pos and defs that start at late-pos when one of the vregs hit the fallback split behavior.
…clobbers with constraints. (#74) * Some fixes to allow for call instructions to name args, returns, and clobbers with constraints. - Allow early-pos uses with fixed regs that conflict with clobbers (which happen at late-pos), in addition to the existing logic for conflicts with late-pos defs with fixed regs. This is a pretty subtle issue that was uncovered in #53 for the def case, and the fix here is the mirror of that fix for clobbers. The root cause for all this complexity is that we can't split in the middle of an instruction (because there's no way to insert a move there!) so if a use is live-downward, we can't let it live in preg A at early-pos and preg B != A at late-pos; instead we need to rewrite the constraints and use a fixup move. The earlier change to fix #53 was actually a bit too conservative in that it always applied when such conflicts existed, even if the downward arg was not live. This PR fixes that (it's fine for the early-use and late-def to be fixed to the same reg if the use's liverange ends after early-pos) and adapts the same flexibility to the clobbers case as well. - Reworks the fixups for the def case mentioned above to not shift the def to the Early point. Doing so causes issues when the def is to a reffy vreg: it can then be falsely included in a stackmap if the instruction containing this operand is a safepoint. - Fixes the last-resort split-bundle-into-minimal-pieces logic from #59 to properly limit a minimal bundle piece to end after the early-pos, rather than cover the entire instruction. This was causing artificial overlaps between args that end after early-pos and defs that start at late-pos when one of the vregs hit the fallback split behavior. * Fix fuzzbug: do not merge when a liverange has a fixed-reg def. This can create impossible situations: e.g., if a vreg is constrained to p0 as a late-def, and another, completely different vreg is constrained to p0 as an early-use on the same instruction, and the instruction also has a third vreg (early-use), we do not want to merge the liverange for the third vreg with the first, because it would result in an unsolveable conflict for p0 at the early-point. * Review comments.
Right now, splitting a bundle produces two halves. Furthermore, it has
cost linear in the length of the bundle, because the resulting
half-bundles have their requirements recomputed with a new scan, and
because we copy half the use-list over to the tail end sub-bundle.
This works fine when a bundle has a handful of splits overall, but not
when an input has a systematic pattern of conflicts that will require
O(|bundle|) splits (e.g., every Use is constrained to a different fixed
register than the last one). In such a case, we get quadratic behavior.
This PR adds a per-spillset (so, per-original-bundle) counter for
splits, and when it reaches a preset threshold (10 for now), we instead
split directly into minimal bundles along the whole length of the
bundle, putting the regions without uses in the spill bundle.
This basically approximates what a non-splitting allocator would do: it
"spills" the whole bundle to possibly a stackslot, or a second-chance
register allocation at best, via the spill bundle; and then does minimal
reservations of registers just at uses/defs and moves the "spilled"
value into/out of them immediately.
Together with another small optimization, this PR results in a 4x
compilation speedup and 24x memory use reduction on one particularly bad
case with alternating conflicting requirements on a vreg (see
bytecodealliance/wasmtime#4291 for details).