Skip to content

Commit

Permalink
[SLP]Reduce number of alternate instruction, where possible
Browse files Browse the repository at this point in the history
Patch tries to remove wide alternate operations.
Currently SLP vectorizer emits something like this:
```
%0 = add i32
%1 = sub i32
%2 = add i32
%3 = sub i32
%4 = add i32
%5 = sub i32
%6 = add i32
%7 = sub i32

transformes to

%v1 = add <8 x i32>
%v2 = sub <8 x i32>
%res = shuffle %v1, %v2, <0, 9, 2, 11, 4, 13, 6, 15>
```
i.e. half of the results are just unused. This leads to increased
register pressure and potentially doubles number of operations.

Patch introduces SplitVectorize mode, where it splits the operations by
opcodes and produces instead something like this:
```
%v1 = add <4 x i32>
%v2 = sub <4 x i32>
%res = shuffle %v1, %v2, <0, 4, 1, 5, 2, 6, 3, 7>
```
It allows to improve the performance by reducing number of ops. Also, it
turns on some other improvements, like improved graph reordering.

-O3+LTO, AVX512
Metric: size..text
Program                                                                         size..text
                                                                                            results     results0    diff
           test-suite :: MultiSource/Benchmarks/Prolangs-C/TimberWolfMC/timberwolfmc.test   277800.00   280536.00  1.0%
                          test-suite :: MultiSource/Benchmarks/FreeBench/pifft/pifft.test    81802.00    82426.00  0.8%
                        test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test   790552.00   790952.00  0.1%
                             test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test   383795.00   383987.00  0.1%
           test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test  2075541.00  2076501.00  0.0%
            test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test  2075541.00  2076501.00  0.0%
                                  test-suite :: MultiSource/Benchmarks/Bullet/bullet.test   312702.00   312766.00  0.0%
                 test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 12569783.00 12569751.00 -0.0%
                   test-suite :: External/SPEC/CFP2017rate/510.parest_r/510.parest_r.test  2049374.00  2049358.00 -0.0%
                    test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test  1091836.00  1091772.00 -0.0%
                             test-suite :: MultiSource/Applications/JM/lencod/lencod.test   852339.00   852211.00 -0.0%
                                test-suite :: MultiSource/Applications/oggenc/oggenc.test   190651.00   190523.00 -0.1%
                test-suite :: MultiSource/Benchmarks/DOE-ProxyApps-C/miniGMG/miniGMG.test    44203.00    44155.00 -0.1%
test-suite :: SingleSource/UnitTests/Vector/AVX512BWVL/Vector-AVX512BWVL-mask_set_bw.test    12997.00    12981.00 -0.1%
                     test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test   668971.00   658427.00 -1.6%
                      test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test   668971.00   658427.00 -1.6%

Prolangs-C/TimberWolfMC/timberwolfmc - small variations, some code not
inlined
FreeBench/pifft - extra stores <8 x double> vectorized, some other extra
vectorizations
CINT2006/464.h264ref - some smaller code + changes similar to x264
JM/ldecod - changes similar x264
CINT2017speed/600.perlbench_s
CINT2017rate/500.perlbench_r - significantly compact vector code
Benchmarks/Bullet - small variations
CFP2017rate/526.blender_r - small variations
CFP2017rate/510.parest_r - small variations
CINT2006/400.perlbench - extra vector code
JM/lencod - extra store <16 x i32> and other changes similar x264
Applications/oggenc - extra store <16 x i8>, small variations
DOE-ProxyApps-C/miniGMG - small variations
Vector/AVX512BWVL/Vector-AVX512BWVL-mask_set_bw - better vector code
CINT2017speed/625.x264_s
CINT2017rate/525.x264_r - the number of instructions increased, but
looks like they are more performant. E.g., for function
x264_pixel_satd_8x8, llvm-mca reports better throughput - 84 for the
current version and 59 for the new version.

-O3+LTO, march=rva32u64

CINT2017rate/525.x264_r - similar to x86, extra code in pixel_hadamard_ac
function vectorized, idct4x4dc stopped being vectorized (looks like
issue with shuffles cost)
CINT2006/400.perlbench - better vector code
CINT2006/445.gobmk - some variations in vector code
CINT2006/464.h264ref - extra code vectorized
CINT2017rate/500.perlbench_r - small variations

-O3+LTO, mcpu=sifive-p470

Metric: size..text

Program                                                                                                                                                size..text
                                                                               results    results0   diff
             test-suite :: External/SPEC/CINT2006/464.h264ref/464.h264ref.test  587336.00  587668.00  0.1%
                  test-suite :: MultiSource/Applications/JM/lencod/lencod.test  643308.00  643614.00  0.0%
                    test-suite :: MultiSource/Applications/d/make_dparser.test   79678.00   79710.00  0.0%
                       test-suite :: MultiSource/Benchmarks/Bullet/bullet.test  277322.00  277420.00  0.0%
         test-suite :: External/SPEC/CINT2006/400.perlbench/400.perlbench.test  933660.00  933682.00  0.0%
      test-suite :: External/SPEC/CFP2017rate/526.blender_r/526.blender_r.test 9497722.00 9497682.00 -0.0%
 test-suite :: External/SPEC/CINT2017rate/500.perlbench_r/500.perlbench_r.test 1767806.00 1767772.00 -0.0%
test-suite :: External/SPEC/CINT2017speed/600.perlbench_s/600.perlbench_s.test 1767806.00 1767772.00 -0.0%
 test-suite :: MultiSource/Benchmarks/MiBench/consumer-lame/consumer-lame.test  148038.00  148024.00 -0.0%
                  test-suite :: MultiSource/Applications/JM/ldecod/ldecod.test  283036.00  283008.00 -0.0%
   test-suite :: MultiSource/Benchmarks/mediabench/g721/g721encode/encode.test    4776.00    4772.00 -0.1%
           test-suite :: External/SPEC/CINT2017rate/525.x264_r/525.x264_r.test  540582.00  511772.00 -5.3%
          test-suite :: External/SPEC/CINT2017speed/625.x264_s/625.x264_s.test  540582.00  511772.00 -5.3%

CINT2006/464.h264ref - extra vector code in find_sad_16x16
JM/lencod - extra vector code in find_sad_16x16
d/make_dparser - smaller vector code
Benchmarks/Bullet - small variations
CINT2006/400.perlbench - smaller vector code
CFP2017rate/526.blender_r - small variations, extra store <8 x float> in
the loop, extra store <8 x i8> in loop
CINT2017rate/500.perlbench_r
CINT2017speed/600.perlbench_s - small variations
MiBench/consumer-lame - small variations
JM/ldecod - extra vector code
mediabench/g721/g721encode - small variations
CINT2017rate/525.x264_r
CINT2017speed/625.x264_s - reduced number of wide operations and
shuffles, saving the registers, similar to X86, extra code in
pixel_hadamard_ac vectorized, idct4x4dc not vectorized (issue with some
TTI costs)

Reviewers: RKSimon, hiraditya

Reviewed By: RKSimon

Pull Request: llvm#123360
  • Loading branch information
alexey-bataev committed Feb 1, 2025
1 parent 5cba1f1 commit d5a7a48
Show file tree
Hide file tree
Showing 22 changed files with 1,326 additions and 854 deletions.
8 changes: 8 additions & 0 deletions llvm/include/llvm/Analysis/TargetTransformInfo.h
Original file line number Diff line number Diff line change
Expand Up @@ -1759,6 +1759,10 @@ class TargetTransformInfo {
/// scalable version of the vectorized loop.
bool preferFixedOverScalableIfEqualCost() const;

/// \returns True if target prefers SLP vectorizer with altermate opcode
/// vectorization, false - otherwise.
bool preferAlternateOpcodeVectorization() const;

/// \returns True if the target prefers reductions in loop.
bool preferInLoopReduction(unsigned Opcode, Type *Ty,
ReductionFlags Flags) const;
Expand Down Expand Up @@ -2306,6 +2310,7 @@ class TargetTransformInfo::Concept {
unsigned ChainSizeInBytes,
VectorType *VecTy) const = 0;
virtual bool preferFixedOverScalableIfEqualCost() const = 0;
virtual bool preferAlternateOpcodeVectorization() const = 0;
virtual bool preferInLoopReduction(unsigned Opcode, Type *Ty,
ReductionFlags) const = 0;
virtual bool preferPredicatedReductionSelect(unsigned Opcode, Type *Ty,
Expand Down Expand Up @@ -3106,6 +3111,9 @@ class TargetTransformInfo::Model final : public TargetTransformInfo::Concept {
bool preferFixedOverScalableIfEqualCost() const override {
return Impl.preferFixedOverScalableIfEqualCost();
}
bool preferAlternateOpcodeVectorization() const override {
return Impl.preferAlternateOpcodeVectorization();
}
bool preferInLoopReduction(unsigned Opcode, Type *Ty,
ReductionFlags Flags) const override {
return Impl.preferInLoopReduction(Opcode, Ty, Flags);
Expand Down
2 changes: 2 additions & 0 deletions llvm/include/llvm/Analysis/TargetTransformInfoImpl.h
Original file line number Diff line number Diff line change
Expand Up @@ -989,6 +989,8 @@ class TargetTransformInfoImplBase {

bool preferFixedOverScalableIfEqualCost() const { return false; }

bool preferAlternateOpcodeVectorization() const { return true; }

bool preferInLoopReduction(unsigned Opcode, Type *Ty,
TTI::ReductionFlags Flags) const {
return false;
Expand Down
4 changes: 4 additions & 0 deletions llvm/lib/Analysis/TargetTransformInfo.cpp
Original file line number Diff line number Diff line change
Expand Up @@ -1360,6 +1360,10 @@ bool TargetTransformInfo::preferFixedOverScalableIfEqualCost() const {
return TTIImpl->preferFixedOverScalableIfEqualCost();
}

bool TargetTransformInfo::preferAlternateOpcodeVectorization() const {
return TTIImpl->preferAlternateOpcodeVectorization();
}

bool TargetTransformInfo::preferInLoopReduction(unsigned Opcode, Type *Ty,
ReductionFlags Flags) const {
return TTIImpl->preferInLoopReduction(Opcode, Ty, Flags);
Expand Down
2 changes: 2 additions & 0 deletions llvm/lib/Target/RISCV/RISCVTargetTransformInfo.h
Original file line number Diff line number Diff line change
Expand Up @@ -111,6 +111,8 @@ class RISCVTTIImpl : public BasicTTIImplBase<RISCVTTIImpl> {

unsigned getMaximumVF(unsigned ElemWidth, unsigned Opcode) const;

bool preferAlternateOpcodeVectorization() const { return false; }

bool preferEpilogueVectorization() const {
// Epilogue vectorization is usually unprofitable - tail folding or
// a smaller VF would have been better. This a blunt hammer - we
Expand Down
1 change: 1 addition & 0 deletions llvm/lib/Target/X86/X86TargetTransformInfo.h
Original file line number Diff line number Diff line change
Expand Up @@ -292,6 +292,7 @@ class X86TTIImpl : public BasicTTIImplBase<X86TTIImpl> {

TTI::MemCmpExpansionOptions enableMemCmpExpansion(bool OptSize,
bool IsZeroCmp) const;
bool preferAlternateOpcodeVectorization() const { return false; }
bool prefersVectorizedAddressing() const;
bool supportsEfficientVectorElementLoadStore() const;
bool enableInterleavedAccessVectorization();
Expand Down
Loading

0 comments on commit d5a7a48

Please sign in to comment.