Limit split count per original bundle with fallback 1-to-N split. #59

cfallin · 2022-06-26T18:01:54Z

Right now, splitting a bundle produces two halves. Furthermore, it has
cost linear in the length of the bundle, because the resulting
half-bundles have their requirements recomputed with a new scan, and
because we copy half the use-list over to the tail end sub-bundle.

This works fine when a bundle has a handful of splits overall, but not
when an input has a systematic pattern of conflicts that will require
O(|bundle|) splits (e.g., every Use is constrained to a different fixed
register than the last one). In such a case, we get quadratic behavior.

This PR adds a per-spillset (so, per-original-bundle) counter for
splits, and when it reaches a preset threshold (10 for now), we instead
split directly into minimal bundles along the whole length of the
bundle, putting the regions without uses in the spill bundle.

This basically approximates what a non-splitting allocator would do: it
"spills" the whole bundle to possibly a stackslot, or a second-chance
register allocation at best, via the spill bundle; and then does minimal
reservations of registers just at uses/defs and moves the "spilled"
value into/out of them immediately.

Together with another small optimization, this PR results in a 4x
compilation speedup and 24x memory use reduction on one particularly bad
case with alternating conflicting requirements on a vreg (see
bytecodealliance/wasmtime#4291 for details).

cfallin · 2022-06-26T18:04:03Z

Apologies to whomever reviews this for the hairy innards of the LR/bundle representation and splitting!

It's been fuzzing overnight on my 12-core system without finding anything (after a few initial fuzzbugs yesterday that guided some corner-cases here) so it seems relatively stable...

- Handle call instructions' clobbers with the clobbers API, using RA2's clobbers bitmask (bytecodealliance/regalloc2#58) rather than clobbers list; - Pull in changes from bytecodealliance/regalloc2#59 for much more sane edge-case behavior w.r.t. liverange splitting. Currently refers to local path for RA2 so will fail CI; I will update this once RA2 PRs 58 and 59 are merged, version-bumped and released. Fixes bytecodealliance#4291.

alexcrichton

Those are some awesome wins! Unfortunately though I'm pretty unfamiliar with the implementation here so it would take me quite some time to get up to speed and able to review this. I had a random few stylistic comments and while at a high level this seems reasonable I'll leave it to @fitzgen to review the internals.

src/ion/process.rs

alexcrichton · 2022-06-27T14:04:02Z

src/ion/process.rs

+        let mut last_inst: Option<Inst> = None;
+        let mut last_vreg: Option<VRegIndex> = None;
+
+        for entry_idx in 0..self.bundles[bundle.index()].ranges.len() {


Random stylistic comment but you can impl<T> Index<MyLocalType> for [T] which could help change all the indexing here to self.bundles[bundle] (or even self[bundle]) rather than self.bundles[bundle.index()]

I briefly played with this but it seems we would need wrapper types around the Vec<T> to make this work properly. I'd love to have this eventually but it's a bigger refactor (all use-sites across the allocator) than I want to bite off right now; I'll file an issue!

src/ion/process.rs

fitzgen · 2022-06-27T16:31:52Z

src/ion/process.rs

+            for use_idx in 0..self.ranges[entry.index.index()].uses.len() {
+                let u = self.ranges[entry.index.index()].uses[use_idx];


Could also do the impl Index thing here as well.

Right now, splitting a bundle produces two halves. Furthermore, it has cost linear in the length of the bundle, because the resulting half-bundles have their requirements recomputed with a new scan, and because we copy half the use-list over to the tail end sub-bundle. This works fine when a bundle has a handful of splits overall, but not when an input has a systematic pattern of conflicts that will require O(|bundle|) splits (e.g., every Use is constrained to a different fixed register than the last one). In such a case, we get quadratic behavior. This PR adds a per-spillset (so, per-original-bundle) counter for splits, and when it reaches a preset threshold (10 for now), we instead split directly into minimal bundles along the whole length of the bundle, putting the regions without uses in the spill bundle. This basically approximates what a non-splitting allocator would do: it "spills" the whole bundle to possibly a stackslot, or a second-chance register allocation at best, via the spill bundle; and then does minimal reservations of registers just at uses/defs and moves the "spilled" value into/out of them immediately. Together with another small optimization, this PR results in a 4x compilation speedup and 24x memory use reduction on one particularly bad case with alternating conflicting requirements on a vreg (see bytecodealliance/wasmtime#4291 for details).

Includes improvements in splitting performance (bytecodealliance#59) and more efficient handling of clobber lists (bytecodealliance#58). Semver break because of API change in bytecodealliance#58.

Includes improvements in splitting performance (#59) and more efficient handling of clobber lists (#58). Semver break because of API change in #58.

- Handle call instructions' clobbers with the clobbers API, using RA2's clobbers bitmask (bytecodealliance/regalloc2#58) rather than clobbers list; - Pull in changes from bytecodealliance/regalloc2#59 for much more sane edge-case behavior w.r.t. liverange splitting.

…clobbers with constraints. - Allow early-pos uses with fixed regs that conflict with clobbers (which happen at late-pos), in addition to the existing logic for conflicts with late-pos defs with fixed regs. This is a pretty subtle issue that was uncovered in bytecodealliance#53 for the def case, and the fix here is the mirror of that fix for clobbers. The root cause for all this complexity is that we can't split in the middle of an instruction (because there's no way to insert a move there!) so if a use is live-downward, we can't let it live in preg A at early-pos and preg B != A at late-pos; instead we need to rewrite the constraints and use a fixup move. The earlier change to fix bytecodealliance#53 was actually a bit too conservative in that it always applied when such conflicts existed, even if the downward arg was not live. This PR fixes that (it's fine for the early-use and late-def to be fixed to the same reg if the use's liverange ends after early-pos) and adapts the same flexibility to the clobbers case as well. - Handles safepoints that happen at call instructions where the above rewrite occurred. The above case is handled by moving the fixed-reg def to early-pos, to (still) reserve the fixed reg for the whole instruction, and allow the fixup move to actually put the used vreg in the reg until the def overwrites it. However if a def defines a reffy vreg (that is, one that is reference-typed), the reffy value is falsely seen as *live* by the safepoint at the early-pos. It's not actually live until the late-pos, so we must exclude it. We can detect this case by looking for the `StartsAtDef` flag on the LR; actually even for early-pos defs this logic is still valid, because the safepoint logically sees a snapshot of state just *before* the instruction (so before the early-pos). - Fixes the last-resort split-bundle-into-minimal-pieces logic from bytecodealliance#59 to properly limit a minimal bundle piece to end after the early-pos, rather than cover the entire instruction. This was causing artificial overlaps between args that end after early-pos and defs that start at late-pos when one of the vregs hit the fallback split behavior.

…clobbers with constraints. - Allow early-pos uses with fixed regs that conflict with clobbers (which happen at late-pos), in addition to the existing logic for conflicts with late-pos defs with fixed regs. This is a pretty subtle issue that was uncovered in bytecodealliance#53 for the def case, and the fix here is the mirror of that fix for clobbers. The root cause for all this complexity is that we can't split in the middle of an instruction (because there's no way to insert a move there!) so if a use is live-downward, we can't let it live in preg A at early-pos and preg B != A at late-pos; instead we need to rewrite the constraints and use a fixup move. The earlier change to fix bytecodealliance#53 was actually a bit too conservative in that it always applied when such conflicts existed, even if the downward arg was not live. This PR fixes that (it's fine for the early-use and late-def to be fixed to the same reg if the use's liverange ends after early-pos) and adapts the same flexibility to the clobbers case as well. - Reworks the fixups for the def case mentioned above to not shift the def to the Early point. Doing so causes issues when the def is to a reffy vreg: it can then be falsely included in a stackmap if the instruction containing this operand is a safepoint. - Fixes the last-resort split-bundle-into-minimal-pieces logic from bytecodealliance#59 to properly limit a minimal bundle piece to end after the early-pos, rather than cover the entire instruction. This was causing artificial overlaps between args that end after early-pos and defs that start at late-pos when one of the vregs hit the fallback split behavior.

…clobbers with constraints. (#74) * Some fixes to allow for call instructions to name args, returns, and clobbers with constraints. - Allow early-pos uses with fixed regs that conflict with clobbers (which happen at late-pos), in addition to the existing logic for conflicts with late-pos defs with fixed regs. This is a pretty subtle issue that was uncovered in #53 for the def case, and the fix here is the mirror of that fix for clobbers. The root cause for all this complexity is that we can't split in the middle of an instruction (because there's no way to insert a move there!) so if a use is live-downward, we can't let it live in preg A at early-pos and preg B != A at late-pos; instead we need to rewrite the constraints and use a fixup move. The earlier change to fix #53 was actually a bit too conservative in that it always applied when such conflicts existed, even if the downward arg was not live. This PR fixes that (it's fine for the early-use and late-def to be fixed to the same reg if the use's liverange ends after early-pos) and adapts the same flexibility to the clobbers case as well. - Reworks the fixups for the def case mentioned above to not shift the def to the Early point. Doing so causes issues when the def is to a reffy vreg: it can then be falsely included in a stackmap if the instruction containing this operand is a safepoint. - Fixes the last-resort split-bundle-into-minimal-pieces logic from #59 to properly limit a minimal bundle piece to end after the early-pos, rather than cover the entire instruction. This was causing artificial overlaps between args that end after early-pos and defs that start at late-pos when one of the vregs hit the fallback split behavior. * Fix fuzzbug: do not merge when a liverange has a fixed-reg def. This can create impossible situations: e.g., if a vreg is constrained to p0 as a late-def, and another, completely different vreg is constrained to p0 as an early-use on the same instruction, and the instruction also has a third vreg (early-use), we do not want to merge the liverange for the third vreg with the first, because it would result in an unsolveable conflict for p0 at the early-point. * Review comments.

cfallin requested review from alexcrichton and fitzgen June 26, 2022 18:01

This was referenced Jun 26, 2022

Cranelift: update to latest regalloc2 for better clobber and splitting handling. bytecodealliance/wasmtime#4324

Merged

Bugfix: no hole in liveranges for pinned vreg move src. #60

Merged

alexcrichton reviewed Jun 27, 2022

View reviewed changes

fitzgen approved these changes Jun 27, 2022

View reviewed changes

cfallin added 2 commits June 27, 2022 12:29

Review comments.

ac04044

cfallin force-pushed the only-a-little-bit-of-splitting branch from 527f024 to ac04044 Compare June 27, 2022 20:01

cfallin merged commit 4eb2a25 into bytecodealliance:main Jun 27, 2022

cfallin deleted the only-a-little-bit-of-splitting branch June 27, 2022 20:23

cfallin mentioned this pull request Jun 27, 2022

Release 0.3.0. #63

Merged

cfallin added a commit that referenced this pull request Jun 27, 2022

Release 0.3.0. (#63)

8bede95

Includes improvements in splitting performance (#59) and more efficient handling of clobber lists (#58). Semver break because of API change in #58.

cfallin mentioned this pull request Sep 9, 2022

Some fixes to allow for call instructions to name args, returns, and clobbers with constraints. #74

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Limit split count per original bundle with fallback 1-to-N split. #59

Limit split count per original bundle with fallback 1-to-N split. #59

cfallin commented Jun 26, 2022

cfallin commented Jun 26, 2022

alexcrichton left a comment

alexcrichton Jun 27, 2022

cfallin Jun 27, 2022

cfallin Jun 27, 2022

fitzgen Jun 27, 2022

		for use_idx in 0..self.ranges[entry.index.index()].uses.len() {
		let u = self.ranges[entry.index.index()].uses[use_idx];

Limit split count per original bundle with fallback 1-to-N split. #59

Limit split count per original bundle with fallback 1-to-N split. #59

Conversation

cfallin commented Jun 26, 2022

cfallin commented Jun 26, 2022

alexcrichton left a comment

Choose a reason for hiding this comment

alexcrichton Jun 27, 2022

Choose a reason for hiding this comment

cfallin Jun 27, 2022

Choose a reason for hiding this comment

cfallin Jun 27, 2022

Choose a reason for hiding this comment

fitzgen Jun 27, 2022

Choose a reason for hiding this comment