Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Cranelift: Optimize op+splat into splat+op in the mid-end #6828

Closed
afonso360 opened this issue Aug 9, 2023 · 5 comments · Fixed by #6851
Closed

Cranelift: Optimize op+splat into splat+op in the mid-end #6828

afonso360 opened this issue Aug 9, 2023 · 5 comments · Fixed by #6851
Labels
cranelift:E-compiler Compiler issues. cranelift:E-easy Issues suitable for newcomers to investigate, including Rust newcomers! cranelift:goal:optimize-speed Focus area: the speed of the code produced by Cranelift. cranelift Issues related to the Cranelift code generator good first issue Issues that are good for new contributors to tackle!

Comments

@afonso360
Copy link
Contributor

👋 Hey,

Feature

This was pointed out by @jameysharp in #6815 (review)!

We should try to transform (op (splat x) (splat y) ...) into (splat (op x y ...)) for operations that support this.

Benefit

This transforms SIMD operations into their scalar counterpart which should be beneficial. We also have better constant propagation on scalars, so this is also an opportunity to do that further.

RISC-V specifically really benefits from this optimization since we have opcodes that can eat the splat on the transformed version.

Implementation

The e-graphs mid end is awesome for stuff like this! Here's one rule:

(rule (simplify (imul ty (splat ty x) (splat ty y)))
      (splat ty (imul (lane_type ty) x y)))

Pasting this into cranelift/codegen/src/opts/vector.isle makes this tranform work on imul!

You can try running this test to verify that it works:

test optimize
set opt_level=speed
target x86_64

function %imul_splat_into_splat_imul(i64, i64) -> i64x2 {
block0(v0: i64, v1: i64):
  v2 = splat.i64x2 v0
  v3 = splat.i64x2 v1
  v4 = imul.i64x2 v2, v3
  return v4
  ; check: v5 = imul v0, v1
  ; check: v6 = splat.i64x2 v5
  ; check: return v6
}

(Run this with cargo run -- test ./the-above.clif from the /cranelift directory)

This is also a good test to add to our testsuite in cranelift/filetests/filetests/egraph/....

There are so many opcodes that this optimization works for that I actually can't list them all. Here's a few: iadd,isub,imul,ineg,iabs,umulhi,smulhi,... Just look at our opcode list and a lot of them will work out.

Where it wont work:

Where it may not work:

  • Floating point operations. I'm not 100% sure about this one, it logically makes sense that it wouldn't change the result, but I'm just afraid of doing transforms on floats.

Alternatives

There are so many opcodes for which this rule can be implemented that it may be worth considering auto-generating it. However I doubt it would be worth the effort + maintenance complexity of that. Copy pasting a bunch of times is sometimes better!

@afonso360 afonso360 added good first issue Issues that are good for new contributors to tackle! cranelift Issues related to the Cranelift code generator cranelift:goal:optimize-speed Focus area: the speed of the code produced by Cranelift. cranelift:E-compiler Compiler issues. cranelift:E-easy Issues suitable for newcomers to investigate, including Rust newcomers! labels Aug 9, 2023
@gurry
Copy link
Contributor

gurry commented Aug 14, 2023

@afonso360 I'd like to pick this one up. Will be doing some reading on ISLE first.

@gurry
Copy link
Contributor

gurry commented Aug 16, 2023

@afonso360 How do I see the generated code after all the mid-end opts are applied to a func? Some of my tests are failing for some operators. For example for band_not for this simplify rule:

(rule (simplify (band_not ty (splat ty x) (splat ty y)))
      (splat ty (band_not (lane_type ty) x y)))

the following test fails:

function %band_not_splat_into_splat_band_not(i64, i64) -> i64x2 {
block0(v0: i64, v1: i64):
  v2 = splat.i64x2 v0
  v3 = splat.i64x2 v1
  v4 = band_not.i64x2 v2, v3
  return v4
  ; check: v5 = band_not v0, v1
  ; check: v6 = splat.i64x2 v5
  ; check: return v6
}

with the following error:

FAIL filetests\filetests\egraph\simd-splat-simplify.clif: optimize

Caused by:
    filecheck failed for function on line 38:
    #0 check: v5 = band_not v0, v1
    #1 check: v6 = splat.i64x2 v5
    #2 check: return v6
    > function %band_not_splat_into_splat_band_not(i64, i64) -> i64x2 fast {
    Missed #0: \bv5 = band_not v0, v1\b
    > block0(v0: i64, v1: i64):
    >     v2 = splat.i64x2 v0
    >     v3 = splat.i64x2 v1
    >     v5 = bnot v3
    >     v4 = band v2, v5
    >     return v4
    > }

1356 tests
Error: 1 failure
error: test failed, to rerun pass `--test filetests`

So I just wanted to debug and see how the generated code differs from the test expectation.

@afonso360
Copy link
Contributor Author

afonso360 commented Aug 16, 2023

Oh! Right I forgot that we have a few special opcodes like band_not that are transformed into multiple ops.

We should be able to ignore those since the mid-end will never see them (I think!) before they are expanded into multiple operations. With everything that you have on that PR, I think you are just missing the transform for bnot and we should be able to transform those two ops.


But to check the optimized code you can use the slightly complicated command:

cargo run -- compile --target x86_64 --set opt_level=speed -p ./the-file.clif

This runs the whole compile pipeline, and -p prints the clif file before being sent to the backends for converting into machine code.

@gurry
Copy link
Contributor

gurry commented Aug 16, 2023

Thanks for the info @afonso360 .

The test for bnot was failing for me for some reason. I'll investigate it now with the help of the command you shared.

What about the family of shift and rotate ops such as ishl and rotl? Will this optimization be applicable to them?

@afonso360
Copy link
Contributor Author

afonso360 commented Aug 16, 2023

What about the family of shift and rotate ops such as ishl and rotl? Will this optimization be applicable to them?

Yeah, they should be! However note that ishl and friends only support a scalar rhs (the number of bits to shift) and reject vectors in that position.

So the optimization is slightly different. We only need to ensure that the lhs is splatted. So it would look something like this
(ishl (splat lhs) rhs) -> (splat (ishl lhs rhs)).

It's also probably worth adding a comment with that reasoning in the code! I didn't remember it myself until I tried it.

The verifier shouldn't even let you build a ishl with a vector rhs. A good way to check what kinds of values are supported by each operation is to read the documentation for it. We'll usually have a remark stating: "this supports scalar or vectors", or this is scalar only. Something along those lines.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
cranelift:E-compiler Compiler issues. cranelift:E-easy Issues suitable for newcomers to investigate, including Rust newcomers! cranelift:goal:optimize-speed Focus area: the speed of the code produced by Cranelift. cranelift Issues related to the Cranelift code generator good first issue Issues that are good for new contributors to tackle!
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants