Fix spillslot size bug in SIMD by removing type-dependent spillslot allocation. #3645

Merged (1 commit, Jan 4, 2022)

Conversation

@cfallin (Member) commented on Jan 4, 2022

Fixes a recently-received fuzzbug exposed by differential fuzzing against V8.

This patch makes spillslot allocation, spilling, and reloading all based
on register class only. Hence, when we have a 32- or 64-bit value in a
128-bit XMM register on x86-64 or a vector register on aarch64, the
result is larger spillslots and larger spills/reloads.

Why make this change, if it results in less efficient stack-frame usage?
Simply put, it is safer: there is always a risk when allocating
spillslots or spilling/reloading that we get the wrong type and make the
spillslot or the store/load too small. This was one contributing factor
to CVE-2021-32629, and is now the source of a fuzzbug in SIMD code that
puns an arbitrary user-controlled vector constant over another
stackslot. (If this were a pointer, that could result in RCE. SIMD is
not yet on by default in a release, fortunately.)

In particular, we have not been careful about moves between values of
different types, for example with raw_bitcast or with certain SIMD
operations. Such moves tell regalloc.rs that the vregs belong to a
single equivalence class, and an arbitrary vreg from that class is
consulted when allocating the spillslot or spilling/reloading. Since
regalloc.rs does not track the actual type, and since we haven't been
careful about moves, we can't trust this "arbitrary vreg in an
equivalence class" to provide accurate type information.
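The hazard described above can be sketched roughly as follows (hypothetical names and sizes for illustration only; this is not regalloc.rs code): once move coalescing has merged an f32 vreg and a v128 vreg into one equivalence class, sizing the shared spillslot from an arbitrary member can undersize it.

```rust
// Sketch: a merged equivalence class is a list of (vreg index, type name)
// pairs. The allocator doesn't track types itself, so imagine it consults
// whichever member happens to come first in the merged class.
fn spillslot_bytes_from_arbitrary_member(class_members: &[(u32, &str)]) -> usize {
    match class_members[0].1 {
        "f32" => 4,
        "f64" => 8,
        "v128" => 16,
        other => panic!("unknown type {other}"),
    }
}
```

If v0 (an f32) and v1 (a v128) were linked by a raw_bitcast-style move, this policy yields a 4-byte slot even though v1 needs 16 bytes: exactly the undersized-spillslot bug.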

In the fix to CVE-2021-32629 we fixed this for integer registers by
always spilling/reloading 64 bits; this fix can be seen as the analogous
change for FP/vector regs.
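The class-only sizing policy this patch adopts can be sketched as follows (hypothetical types; the sizes mirror the 64-bit integer and 128-bit FP/vector rules described above, but this is not the actual Cranelift code):

```rust
// Sketch: spillslot size keyed only on register class, never on IR type.
#[derive(Clone, Copy)]
enum RegClass {
    Int,    // general-purpose registers
    Vector, // XMM on x86-64 / vector registers on aarch64
}

/// A 32-bit float held in a 128-bit vector register still gets a full
/// 16-byte slot, so a spill/reload can never be too small for the
/// register's contents, regardless of what type information says.
fn spillslot_bytes(class: RegClass) -> usize {
    match class {
        RegClass::Int => 8,     // always spill/reload the full 64 bits
        RegClass::Vector => 16, // always spill/reload the full 128 bits
    }
}
```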

@cfallin cfallin requested review from alexcrichton and abrown January 4, 2022 18:54
@alexcrichton (Member) left a comment

This seems like a reasonable fix to me for now. Could an issue be filed about trying to do something better in the future? Also, could the sightglass benchmark suite be run to see if this has any impact on it?

I also think that it would be good to include the original wasm reproducer here as well:

(module
  (func (result f32 f32)
    i32.const 0
    f32.convert_i32_s
    v128.const i32x4 0 0 0 0
    data.drop 0
    f32x4.extract_lane 0
    data.drop 0)
  (export "" (func 0))
  (data ""))

(where the "" export is run)

I'm a little uneasy about applying a blanket fix like this, in the sense that it's making things "simpler" without any real path forward. For example, the underlying bug here is that f32x4.extract_lane 0 is simply a mov, so we could fix just that one instruction, but you're also worried about raw_bitcast. This is the part that feels odd to me: there doesn't seem to be any recourse for concluding that this doesn't affect raw_bitcast.

I don't mean to say we shouldn't land this fix, I think we should land this ASAP, but I still think there's some food-for-thought style things to think about here.

(Review thread on cranelift/filetests/filetests/runtests/simd-regtype.clif: outdated, resolved)
@cfallin force-pushed the fix-xmm-spillslot-fuzzbug branch from f2bbbcc to 7a0e1ef (January 4, 2022 19:37)
@cfallin (Member, Author) commented on Jan 4, 2022

Thanks! I ran a quick sanity-check with bz2 and found no perf regressions. A full sightglass run is usually an overnight ordeal if I give it enough iterations to be meaningful and last time I tried it there were still some weird bimodal-performance gremlins, so I'm hesitant to block anything on that. It's certainly the case that this change could regress FP-heavy code that spills registers, though; no benchmarks needed for that conclusion.

The reason I feel pretty strongly about doing it anyway is that there are "unknown unknowns": once I realized that regalloc.rs merges all vregs linked by move-elision into one equivalence class and gives us the type of an arbitrary one, it became clear that we need to carefully audit all of our uses of move and move-like instructions for type safety. raw_bitcast is one example, but there may be others. In short, it's an invariant we didn't stick to initially, so now we've lost it, and we need to do a lot of work before we can rely on it again.

This is a safety issue that @julian-seward1 actually pointed out back in Jan 2020 when we were designing the regalloc.rs API. I pushed for using the vreg type to allow smaller spillslots where possible, but given that this has now caused one real CVE and would have caused another here had we released SIMD support, I think that was a mistake, or at least that we didn't adequately consider how to maintain type safety in the IR-to-registers mapping.

One could reasonably ask: how do other compilers get this right, and use spillslots only as large as necessary? The answer is better discipline in all of the lowering patterns. I think one way of getting this would be to actually require an IR-level type to be passed down into the register allocator for each vreg; it would be opaque to the regalloc, but would be checked for every move (and would cause a panic if a cross-type move occurs). If we do that and fuzz all the panics out, we could probably move back to "spillslot only as large as actual type". I'll file an issue for this, and probably the right time to consider it is at the same time as we move over to regalloc2.
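The "opaque type tag checked on every move" idea above might look roughly like this minimal sketch (all names hypothetical; this is not a proposed regalloc API):

```rust
// Sketch: each vreg carries an opaque type tag that the allocator never
// interprets, but any move between vregs with different tags is rejected,
// so coalescing can never build a mixed-type equivalence class.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct TypeTag(u32); // opaque to the allocator; compared only for equality

#[derive(Clone, Copy, Debug)]
struct VReg {
    index: u32,
    ty: TypeTag,
}

/// Returns Ok(()) if the move may be coalesced. The real thing would
/// panic on mismatch instead, so that fuzzing flushes out every
/// cross-type move before "spillslot as large as actual type" is re-enabled.
fn check_move(dst: VReg, src: VReg) -> Result<(), String> {
    if dst.ty == src.ty {
        Ok(())
    } else {
        Err(format!("cross-type move: v{} <- v{}", dst.index, src.index))
    }
}
```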

Hopefully that seems reasonable; thoughts?

@github-actions bot added labels cranelift, cranelift:area:machinst, cranelift:area:aarch64, and cranelift:area:x64 (Jan 4, 2022)
@alexcrichton (Member) commented:

A full sightglass run is usually an overnight ordeal [..] so I'm hesitant to block anything on that

cc @fitzgen and @jlb6740: while not super relevant to this PR per se, this seems relevant to Sightglass.

Hopefully that seems reasonable; thoughts?

That all sounds reasonable to me, yeah; thanks for writing that up! I like the idea of passing opaque types to regalloc, and regalloc could also automatically determine that a move between two vregs with different types does not put them in one equivalence class, assigning them separate live ranges and such.

@fitzgen (Member) commented on Jan 4, 2022

It isn't an overnight thing for me (at least for the realistic benchmarks; I don't run the shootout ones); it's usually just half an hour or so.

@cfallin (Member, Author) commented on Jan 4, 2022

It isn't an overnight thing for me (at least for the realistic benchmarks, I don't run the shootout ones) usually just half an hour or so.

Ah, OK, it's certainly possible I was holding it wrong; I remember cranking up the iterations and letting it run for a long time to try to get reasonable confidence intervals, and I remember you mentioning something similar (... on some searching, this comment), but maybe it's improved. Anyway, that's a separate topic; I'd love to be able to just ping a GitHub bot for this :-)

@abrown (Contributor) commented on Jan 4, 2022

I'm not a big fan of using extra stack space, but I guess it makes sense as a temporary measure.

that means we need to carefully audit all of our uses of move and move-like instructions for type safety. raw_bitcast is one example, but there may be others.

I think the existence of raw_bitcast is an indicator of a leaky abstraction in the IR and APIs. Ideally we could resolve that issue and get rid of it. Having to carefully audit moves reinforces this impression for me: whoever is writing a lowering should not also need to worry about regalloc beyond reserving a temporary register once in a while.

One could reasonably ask: how do other compilers get this right, and use spillslots only as large as necessary? The answer is better discipline in all of the lowering patterns.

I don't know if it is "programmer discipline" as much as "abstraction discipline." In V8 there are surely ways to make a mistake during lowering, but my impression is that this area is designed to be easier to use (fewer invariants to remember).

I think one way of getting this would be to actually require an IR-level type to be passed down into the register allocator for each vreg; it would be opaque to the regalloc, but would be checked for every move (and would cause a panic if a cross-type move occurs).

An alternate suggestion: perhaps regalloc should be smarter about RegClass and have a way to express that, e.g., "RegClass::F64 fits in the lower bits of RegClass::V128" and the like. That way the IR types can map 1-to-1 to RegClass and the abstraction will not leak.
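That containment relation between register classes could be sketched like this (hypothetical class names and a hypothetical `fits_in` helper, purely to illustrate the suggestion):

```rust
// Sketch: the register-class lattice itself expresses containment, e.g.
// "F64 fits in the low bits of V128", so each IR type can map 1-to-1 to a
// class without leaking size information to the lowering code.
#[derive(Clone, Copy, PartialEq)]
enum RegClass {
    F32,
    F64,
    V128,
}

/// Whether a value of class `small` can live in (and be spilled to)
/// storage sized for class `big`: V128 storage covers everything here,
/// and F64 storage covers F32.
fn fits_in(small: RegClass, big: RegClass) -> bool {
    use RegClass::*;
    match (small, big) {
        (a, b) if a == b => true,
        (F32, F64) | (F32, V128) | (F64, V128) => true,
        _ => false,
    }
}
```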

@cfallin force-pushed the fix-xmm-spillslot-fuzzbug branch from 7a0e1ef to bed1d8e (January 4, 2022 20:42)
@cfallin (Member, Author) commented on Jan 4, 2022

I just ran a few Sightglass benchmarks (blake3, pulldown-cmark, bz2, meshoptimizer) and saw no difference in runtime. I suspect we'd see more if we had benchmarks with lots of live FP values, but we don't seem to have any like that at the moment.


@cfallin (Member, Author) commented on Jan 4, 2022

I don't know if it is "programmer discipline" as much as "abstraction discipline."

Yes, absolutely, "discipline in lowering patterns" was ambiguous, sorry! We should build guardrails into the abstractions that prevent us from doing the wrong thing by mistake or unknowingly; we clearly didn't do enough here.

@cfallin force-pushed the fix-xmm-spillslot-fuzzbug branch from bed1d8e to 833ebee (January 4, 2022 21:24)
@cfallin cfallin merged commit be24edf into bytecodealliance:main Jan 4, 2022
@cfallin cfallin deleted the fix-xmm-spillslot-fuzzbug branch January 4, 2022 22:12