Strength reduction for add operations performed power of 2 times #34938

kunalspathak · 2020-04-14T08:22:07Z

static uint i4(uint i) {
	return i + i + i + i;
}

Above code should obviously be return 4 * i, but today if we see something like this, we don't optimize it that way and perform add operations. We could optimize it to lsl if adds are performed in power of 2.
Today we generate:

G_M48040_IG02:
        0B000001          add     w1, w0, w0
        0B000021          add     w1, w1, w0
        0B000020          add     w0, w1, w0

We should generate:

        531E7400          lsl     w0, w0, #2

We never optimize even for 8 or 16 operations and thus emitting series of add operations.

category:cq
theme:basic-cq
skill-level:intermediate
cost:medium

The text was updated successfully, but these errors were encountered:

kunalspathak · 2020-04-14T08:22:16Z

@BruceForstall

EgorBo · 2020-04-14T08:53:09Z

I guess it's not arm-specific (could generate a single lea on x86 but it does not)

kunalspathak · 2020-04-14T17:45:51Z

Yep, I agree. I have seen x86 too emitting lea r8d, [rcx+rcx]. Updated the title and label.

anthonycanino · 2021-11-09T18:50:25Z

@kunalspathak I'm interested in working on this issue and could use a little design advice if you don't mind.

My strategy to this peephole optimization is to reduce chains of add operations into a single mul operation, and then let the optimizer transform the mul operation into a more efficient one, as is done at: https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/morph.cpp#L12426

So essentially, I can do something like Add(Add(Add(V0, V0), V0), V0) => Mul(V0, 3) and even Add(Mul(V0, 3), V0) => Mul(V0, 4), where V0 is some local var, but it is done as an additional walk over the tree before the preorder processing of each operand, i.e., done at https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/morph.cpp#L11545, with something like

if (opts.OptimizationEnabled() && fgGlobalMorph)
{
  tree = gtReduceStrength(tree);
}

/*-------------------------------------------------------------------------
 * Process the first operand, if any
 */

if (op1)
{
  // ...

at which point the subsequent (https://github.com/dotnet/runtime/blob/main/src/coreclr/jit/morph.cpp#L12426) optimization will transform Mul(V0, 4) => Lsh(V0, 2) etc.

I wanted to fold this in during the postorder processing, but the multiplication to shift optimization is conflicts with the strength reduction, i.e., Add(Add(Add(V0, V0), V0), V0) becomes Add(Add(Lsh(V0, 1), V0), V0), which becomes Add(Add(LSH(V0, 1), V0), V0).

I see three solutions, and I'm curious how they might fit in to the design and looking for some feedback:

Keep the strength reduction peephole separate as I currently have done. But I think this conflicts with the natural order of the rest of the optimizations.
Pass additional context from parent to child, similar to MorphAddrContext, so we can known whether to leave Mul(V0, 2) alone if its parent is of the shape Add(Mul(V0, 2), V0) etc.
Handle cases like Add(Lsh(V0, 1), V0) , which would unwind the Lsh to a Mul, and eventually the outermost expression will stay in an optimized state.

kunalspathak · 2021-11-09T20:53:19Z

@kunalspathak I'm interested in working on this issue

Thank you for volunteering.

Keep the strength reduction peephole separate as I currently have done. But I think this conflicts with the natural order of the rest of the optimizations.

You mean, doing it using gtReduceStrength() before the preorder processing of each operand? This approach looks reasonable and feels natural because we won't have to add extra processing to handle cases of Add(Mul(V0, 2), V0) or Add(Lsh(V0, 1), V0). @EgorBo - do you have any different opinion?

anthonycanino · 2021-11-09T23:26:02Z

You mean, doing it using gtReduceStrength() before the preorder processing of each operand? This approach looks reasonable and feels natural because we won't have to add extra processing to handle cases of Add(Mul(V0, 2), V0) or Add(Lsh(V0, 1), V0). @EgorBo - do you have any different opinion?

Yes, that's what I meant --- use gtReduceStrength before preorder processing of each operand.

One thing though, is don't we want to handle cases like Add(Mul(V0, 2), V0), in the event that we have an expression like i * 2 + i (admittedly, it doesn't seem likely that would pop up), or are we strictly speaking about reducing add expressions in the shape of i + i + i + i etc.

kunalspathak · 2021-11-09T23:28:30Z

One thing though, is don't we want to handle cases like Add(Mul(V0, 2), V0), in the event that we have an expression like i * 2 + i

We need to see what we generate for those code patterns. This issue is for cases like i + i + .....

SingleAccretion · 2021-11-09T23:31:43Z

Yes, that's what I meant --- use gtReduceStrength before preorder processing of each operand.

Why before each operand? I would have expected it to live in the pre-order processing of GT_ADD (i. e. in the first big switch).

This is a tradeoff in complexity vs completeness. Pre-order will miss some things, but the code will be simpler. The way to evaluate this is to make both and check the diffs (which will also tell us how valuable this optimization is in general in real-world(ish) code).

anthonycanino · 2021-11-11T21:23:36Z

One thing though, is don't we want to handle cases like Add(Mul(V0, 2), V0), in the event that we have an expression like i * 2 + i

We need to see what we generate for those code patterns. This issue is for cases like i + i + .....

Understood. Will work on it as such.

Yes, that's what I meant --- use gtReduceStrength before preorder processing of each operand.

Why before each operand? I would have expected it to live in the pre-order processing of GT_ADD (i. e. in the first big switch).

I think we are thinking the same thing, I meant to say in the pre-order processing of GT_ADD (first big switch).

Thanks for the feedback folks, will work on it.

kunalspathak · 2022-12-23T06:08:18Z

I do see that we generate expected output.

G_M44653_IG02:              ;; offset=0008H
        531E7400          lsl     w0, w0, #2

G_M44653_IG02:              ;; offset=0000H
       8D048D00000000       lea      eax, [4*rcx]

kunalspathak added arch-arm64 area-CodeGen-coreclr CLR JIT compiler in src/coreclr/src/jit and related components such as SuperPMI labels Apr 14, 2020

kunalspathak self-assigned this Apr 14, 2020

Dotnet-GitSync-Bot added the untriaged New issue has not been triaged by the area owner label Apr 14, 2020

kunalspathak removed the arch-arm64 label Apr 14, 2020

kunalspathak changed the title ~~ARM64: Strength reduction for add operations performed power of 2 times~~ Strength reduction for add operations performed power of 2 times Apr 14, 2020

BruceForstall added this to the Future milestone Apr 20, 2020

BruceForstall added optimization and removed untriaged New issue has not been triaged by the area owner labels Apr 20, 2020

BruceForstall mentioned this issue Oct 17, 2020

Improve JIT loop optimizations (.NET 6) #43549

Closed

25 tasks

BruceForstall added the JitUntriaged CLR JIT issues needing additional triage label Oct 28, 2020

BruceForstall removed the JitUntriaged CLR JIT issues needing additional triage label Nov 25, 2020

BruceForstall mentioned this issue Jul 6, 2021

Improve JIT loop optimizations (.NET 7) #55235

Closed

5 tasks

This was referenced Nov 16, 2021

Add additional binary operations into the RangeCheck analysis. #61662

Merged

Reduce repeated variable add operations to a single multiply. #61663

Merged

BruceForstall mentioned this issue Feb 15, 2022

Improve JIT loop optimizations #65342

Open

20 tasks

kunalspathak closed this as completed Dec 23, 2022

ghost locked as resolved and limited conversation to collaborators Jan 22, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Strength reduction for add operations performed power of 2 times #34938

Strength reduction for add operations performed power of 2 times #34938

kunalspathak commented Apr 14, 2020 •

edited by BruceForstall

Loading

kunalspathak commented Apr 14, 2020

EgorBo commented Apr 14, 2020 •

edited

Loading

kunalspathak commented Apr 14, 2020

anthonycanino commented Nov 9, 2021

kunalspathak commented Nov 9, 2021

anthonycanino commented Nov 9, 2021

kunalspathak commented Nov 9, 2021

SingleAccretion commented Nov 9, 2021

anthonycanino commented Nov 11, 2021

kunalspathak commented Dec 23, 2022

Strength reduction for add operations performed power of 2 times #34938

Strength reduction for add operations performed power of 2 times #34938

Comments

kunalspathak commented Apr 14, 2020 • edited by BruceForstall Loading

kunalspathak commented Apr 14, 2020

EgorBo commented Apr 14, 2020 • edited Loading

kunalspathak commented Apr 14, 2020

anthonycanino commented Nov 9, 2021

kunalspathak commented Nov 9, 2021

anthonycanino commented Nov 9, 2021

kunalspathak commented Nov 9, 2021

SingleAccretion commented Nov 9, 2021

anthonycanino commented Nov 11, 2021

kunalspathak commented Dec 23, 2022

kunalspathak commented Apr 14, 2020 •

edited by BruceForstall

Loading

EgorBo commented Apr 14, 2020 •

edited

Loading