Fix spillslot size bug in SIMD by removing type-dependent spillslot allocation. #3645

Merged (1 commit, Jan 4, 2022)

Conversation

@cfallin (Member) commented on Jan 4, 2022

Fixes a recently-received fuzzbug exposed by differential fuzzing against V8.

This patch makes spillslot allocation, spilling, and reloading all based
on register class only. Hence, when we have a 32- or 64-bit value in a
128-bit XMM register on x86-64 or a vector register on aarch64, the
result is larger spillslots and larger spills/reloads.

Why make this change, if it results in less efficient stack-frame usage?
Simply put, it is safer: there is always a risk when allocating
spillslots or spilling/reloading that we get the wrong type and make the
spillslot or the store/load too small. This was one contributing factor
to CVE-2021-32629, and is now the source of a fuzzbug in SIMD code that
puns an arbitrary user-controlled vector constant over another
stackslot. (If this were a pointer, that could result in RCE. SIMD is
not yet on by default in a release, fortunately.)

In particular, we have not been careful about moves between values of
different types, for example with raw_bitcast or with certain SIMD
operations. Such moves tell regalloc.rs that the vregs belong to a
single equivalence class, and an arbitrary vreg from that class is
consulted when allocating the spillslot or spilling/reloading. Since
regalloc.rs does not track the actual type, and since we haven't been
careful about moves, we can't trust this "arbitrary vreg in an
equivalence class" to provide accurate type information.
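The hazard described above can be sketched roughly as follows (hypothetical names and sizes for illustration only; this is not regalloc.rs code): once move coalescing has merged an f32 vreg and a v128 vreg into one equivalence class, sizing the shared spillslot from an arbitrary member can undersize it.

```rust
// Sketch: a merged equivalence class is a list of (vreg index, type name)
// pairs. The allocator doesn't track types itself, so imagine it consults
// whichever member happens to come first in the merged class.
fn spillslot_bytes_from_arbitrary_member(class_members: &[(u32, &str)]) -> usize {
    match class_members[0].1 {
        "f32" => 4,
        "f64" => 8,
        "v128" => 16,
        other => panic!("unknown type {other}"),
    }
}
```

If v0 (an f32) and v1 (a v128) were linked by a raw_bitcast-style move, this policy yields a 4-byte slot even though v1 needs 16 bytes: exactly the undersized-spillslot bug.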

In the fix to CVE-2021-32629 we fixed this for integer registers by
always spilling/reloading 64 bits; this fix can be seen as the analogous
change for FP/vector regs.
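The class-only sizing policy this patch adopts can be sketched as follows (hypothetical types; the sizes mirror the 64-bit integer and 128-bit FP/vector rules described above, but this is not the actual Cranelift code):

```rust
// Sketch: spillslot size keyed only on register class, never on IR type.
#[derive(Clone, Copy)]
enum RegClass {
    Int,    // general-purpose registers
    Vector, // XMM on x86-64 / vector registers on aarch64
}

/// A 32-bit float held in a 128-bit vector register still gets a full
/// 16-byte slot, so a spill/reload can never be too small for the
/// register's contents, regardless of what type information says.
fn spillslot_bytes(class: RegClass) -> usize {
    match class {
        RegClass::Int => 8,     // always spill/reload the full 64 bits
        RegClass::Vector => 16, // always spill/reload the full 128 bits
    }
}
```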

@cfallin cfallin requested review from alexcrichton and abrown January 4, 2022 18:54
@alexcrichton (Member) left a comment

This seems like a reasonable fix to me for now. Could an issue be filed about trying to do something better in the future? Also, could the sightglass benchmark suite be run to see if this has any impact on it?

I also think that it would be good to include the original wasm reproducer here as well:

(module
  (func (result f32 f32)
    i32.const 0
    f32.convert_i32_s
    v128.const i32x4 0 0 0 0
    data.drop 0
    f32x4.extract_lane 0
    data.drop 0)
  (export "" (func 0))
  (data ""))

(where the "" export is run)

I'm a little uneasy about applying a blanket fix like this, in the sense that it's making things "simpler" without any real path forward. For example, the underlying bug here is that f32x4.extract_lane 0 is simply a mov, so we could fix just that one instruction, but you're also worried about raw_bitcast. This is the part that feels odd to me: there doesn't seem to be any recourse for concluding that this doesn't affect raw_bitcast.

I don't mean to say we shouldn't land this fix, I think we should land this ASAP, but I still think there's some food-for-thought style things to think about here.

(Review thread on cranelift/filetests/filetests/runtests/simd-regtype.clif: outdated, resolved)
@cfallin force-pushed the fix-xmm-spillslot-fuzzbug branch from f2bbbcc to 7a0e1ef (January 4, 2022 19:37)
@cfallin (Member, Author) commented on Jan 4, 2022

Thanks! I ran a quick sanity-check with bz2 and found no perf regressions. A full sightglass run is usually an overnight ordeal if I give it enough iterations to be meaningful and last time I tried it there were still some weird bimodal-performance gremlins, so I'm hesitant to block anything on that. It's certainly the case that this change could regress FP-heavy code that spills registers, though; no benchmarks needed for that conclusion.

The reason I feel pretty strongly about doing it anyway is that there are "unknown unknowns": once I realized that regalloc.rs merges all vregs linked by move-elision into one equivalence class and gives us the type of an arbitrary one, it became clear that we need to carefully audit all of our uses of move and move-like instructions for type safety. raw_bitcast is one example, but there may be others. In short, it's an invariant we didn't stick to initially, so now we've lost it, and we need to do a lot of work before we can rely on it again.

This is a safety issue that @julian-seward1 actually pointed out back in Jan 2020 when we were designing the regalloc.rs API. I pushed for using the vreg type to allow smaller spillslots where possible, but given that this has now caused one real CVE and would have caused another here had we released SIMD support, I think that was a mistake, or at least that we didn't adequately consider how to maintain type safety in the IR-to-registers mapping.

One could reasonably ask: how do other compilers get this right, and use spillslots only as large as necessary? The answer is better discipline in all of the lowering patterns. I think one way of getting this would be to actually require an IR-level type to be passed down into the register allocator for each vreg; it would be opaque to the regalloc, but would be checked for every move (and would cause a panic if a cross-type move occurs). If we do that and fuzz all the panics out, we could probably move back to "spillslot only as large as actual type". I'll file an issue for this, and probably the right time to consider it is at the same time as we move over to regalloc2.
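The "opaque type tag checked on every move" idea above might look roughly like this minimal sketch (all names hypothetical; this is not a proposed regalloc API):

```rust
// Sketch: each vreg carries an opaque type tag that the allocator never
// interprets, but any move between vregs with different tags is rejected,
// so coalescing can never build a mixed-type equivalence class.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
struct TypeTag(u32); // opaque to the allocator; compared only for equality

#[derive(Clone, Copy, Debug)]
struct VReg {
    index: u32,
    ty: TypeTag,
}

/// Returns Ok(()) if the move may be coalesced. The real thing would
/// panic on mismatch instead, so that fuzzing flushes out every
/// cross-type move before "spillslot as large as actual type" is re-enabled.
fn check_move(dst: VReg, src: VReg) -> Result<(), String> {
    if dst.ty == src.ty {
        Ok(())
    } else {
        Err(format!("cross-type move: v{} <- v{}", dst.index, src.index))
    }
}
```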

Hopefully that seems reasonable; thoughts?

@github-actions bot added labels cranelift, cranelift:area:machinst, cranelift:area:aarch64, and cranelift:area:x64 (Jan 4, 2022)
@alexcrichton (Member) commented:

A full sightglass run is usually an overnight ordeal [..] so I'm hesitant to block anything on that

cc @fitzgen and @jlb6740: while not super relevant to this PR per se, this seems relevant to Sightglass.

Hopefully that seems reasonable; thoughts?

That all sounds reasonable to me, yeah; thanks for writing that up! I like the idea of passing opaque types to regalloc, and regalloc could also automatically determine that a move between two vregs with different types does not put them in one equivalence class, assigning them separate live ranges and such.

@fitzgen (Member) commented on Jan 4, 2022

It isn't an overnight thing for me (at least for the realistic benchmarks; I don't run the shootout ones); it's usually just half an hour or so.

@cfallin (Member, Author) commented on Jan 4, 2022

It isn't an overnight thing for me (at least for the realistic benchmarks, I don't run the shootout ones) usually just half an hour or so.

Ah, OK, it's certainly possible I was holding it wrong; I remember cranking up the iterations and letting it run for a long time to try to get reasonable confidence intervals, and I remember you mentioning something similar (... on some searching, this comment), but maybe it's improved. Anyway, that's a separate topic; I'd love to be able to just ping a GitHub bot for this :-)

@abrown (Contributor) commented on Jan 4, 2022

I'm not a big fan of using extra stack space, but I guess it makes sense as a temporary measure.

that means we need to carefully audit all of our uses of move and move-like instructions for type safety. raw_bitcast is one example, but there may be others.

I think the existence of raw_bitcast is an indicator of a leaky abstraction in the IR and APIs. Ideally we could resolve that issue and get rid of it. Having to carefully audit moves reinforces this impression for me: whoever is writing a lowering should not also need to worry about regalloc beyond reserving a temporary register once in a while.

One could reasonably ask: how do other compilers get this right, and use spillslots only as large as necessary? The answer is better discipline in all of the lowering patterns.

I don't know if it is "programmer discipline" as much as "abstraction discipline." In V8 there are surely ways to make a mistake during lowering, but my impression is that this area is designed to be easier to use (fewer invariants to remember).

I think one way of getting this would be to actually require an IR-level type to be passed down into the register allocator for each vreg; it would be opaque to the regalloc, but would be checked for every move (and would cause a panic if a cross-type move occurs).

An alternate suggestion: perhaps regalloc should be smarter about RegClass and have a way to express that, e.g., "RegClass::F64 fits in the lower bits of RegClass::V128" and the like. That way the IR types can map 1-to-1 to RegClass and the abstraction will not leak.
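That containment relation between register classes could be sketched like this (hypothetical class names and a hypothetical `fits_in` helper, purely to illustrate the suggestion):

```rust
// Sketch: the register-class lattice itself expresses containment, e.g.
// "F64 fits in the low bits of V128", so each IR type can map 1-to-1 to a
// class without leaking size information to the lowering code.
#[derive(Clone, Copy, PartialEq)]
enum RegClass {
    F32,
    F64,
    V128,
}

/// Whether a value of class `small` can live in (and be spilled to)
/// storage sized for class `big`: V128 storage covers everything here,
/// and F64 storage covers F32.
fn fits_in(small: RegClass, big: RegClass) -> bool {
    use RegClass::*;
    match (small, big) {
        (a, b) if a == b => true,
        (F32, F64) | (F32, V128) | (F64, V128) => true,
        _ => false,
    }
}
```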

@cfallin force-pushed the fix-xmm-spillslot-fuzzbug branch from 7a0e1ef to bed1d8e (January 4, 2022 20:42)
@cfallin (Member, Author) commented on Jan 4, 2022

I just ran a few Sightglass benchmarks (blake3, pulldown-cmark, bz2, meshoptimizer) and saw no difference in runtime. I suspect we'd see more if we had benchmarks with lots of live FP values, but we don't seem to have any like that at the moment.


@cfallin (Member, Author) commented on Jan 4, 2022

I don't know if it is "programmer discipline" as much as "abstraction discipline."

Yes, absolutely, "discipline in lowering patterns" was ambiguous, sorry! We should build guardrails into the abstractions that prevent us from doing the wrong thing by mistake or unknowingly; we clearly didn't do enough here.

@cfallin force-pushed the fix-xmm-spillslot-fuzzbug branch from bed1d8e to 833ebee (January 4, 2022 21:24)
@cfallin cfallin merged commit be24edf into bytecodealliance:main Jan 4, 2022
@cfallin cfallin deleted the fix-xmm-spillslot-fuzzbug branch January 4, 2022 22:12