[AMDGPU] Fix SDWA commuting #106920

yxsamliu · 2024-09-01T19:11:34Z

SDWA insts miss reverse opcode, which causes them to be treated as commutable with default reverse opcode i.e. their own opcode. As a result, SWDA F16 sub A, B and Sub B, A are merged by machine CSE. The correct behavior is to merged sub A, B and subrev B, A instead of sub B, A. This issues caused failures in rocFFT tests.

Another issue is that src0_sel and src1_sel are not swapped when SDWA insts are commuted.

Verified that this fixes rocFFT tests failure.

llvmbot · 2024-09-01T19:12:07Z

@llvm/pr-subscribers-backend-amdgpu

Author: Yaxun (Sam) Liu (yxsamliu)

Changes

SDWA insts miss reverse opcode, which causes them to be treated as commutable with default reverse opcode i.e. their own opcode. As a result, SWDA F16 sub A, B and Sub B, A are merged by machine CSE. The correct behavior is to merged sub A, B and subrev B, A instead of sub B, A. This issues caused failures in rocFFT tests.

Another issue is that src0_sel and src1_sel are not swapped when SDWA insts are commuted.

Verified that this fixes rocFFT tests failure.

Full diff: https://github.com/llvm/llvm-project/pull/106920.diff

3 Files Affected:

(modified) llvm/lib/Target/AMDGPU/SIInstrInfo.cpp (+4)
(modified) llvm/lib/Target/AMDGPU/VOP2Instructions.td (+7-4)
(added) llvm/test/CodeGen/AMDGPU/sdwa-cse.mir (+56)

diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
index a857bdba53c3e8..43ec2b4484ec28 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
@@ -2812,6 +2812,10 @@ MachineInstr *SIInstrInfo::commuteInstructionImpl(MachineInstr &MI, bool NewMI,
     swapSourceModifiers(MI, Src0, AMDGPU::OpName::src0_modifiers,
                         Src1, AMDGPU::OpName::src1_modifiers);
 
+    if (isSDWA(MI))
+      swapSourceModifiers(MI, Src0, AMDGPU::OpName::src0_sel, Src1,
+                          AMDGPU::OpName::src1_sel);
+
     CommutedMI->setDesc(get(CommutedOpcode));
   }
 
diff --git a/llvm/lib/Target/AMDGPU/VOP2Instructions.td b/llvm/lib/Target/AMDGPU/VOP2Instructions.td
index fccaa27f361381..852d434cf743e8 100644
--- a/llvm/lib/Target/AMDGPU/VOP2Instructions.td
+++ b/llvm/lib/Target/AMDGPU/VOP2Instructions.td
@@ -174,10 +174,12 @@ multiclass VOP2Inst_e64<string opName,
 
 multiclass VOP2Inst_sdwa<string opName,
                          VOPProfile P,
+                         string revOp = opName,
                          bit GFX9Renamed = 0> {
   let renamedInGFX9 = GFX9Renamed in {
     if P.HasExtSDWA then
-      def _sdwa : VOP2_SDWA_Pseudo <opName, P>;
+      def _sdwa : VOP2_SDWA_Pseudo <opName, P>,
+                  Commutable_REV<revOp#"_sdwa", !eq(revOp, opName)>;
   } // End renamedInGFX9 = GFX9Renamed
 }
 
@@ -188,7 +190,7 @@ multiclass VOP2Inst<string opName,
                     bit GFX9Renamed = 0> :
     VOP2Inst_e32<opName, P, node, revOp, GFX9Renamed>,
     VOP2Inst_e64<opName, P, node, revOp, GFX9Renamed>,
-    VOP2Inst_sdwa<opName, P, GFX9Renamed> {
+    VOP2Inst_sdwa<opName, P, revOp, GFX9Renamed> {
   let renamedInGFX9 = GFX9Renamed in {
     if P.HasExtDPP then
       def _dpp  : VOP2_DPP_Pseudo <opName, P>;
@@ -237,7 +239,7 @@ multiclass VOP2Inst_VOPD<string opName,
                          bit GFX9Renamed = 0> :
     VOP2Inst_e32_VOPD<opName, P, VOPDOp, VOPDName, node, revOp, GFX9Renamed>,
     VOP2Inst_e64<opName, P, node, revOp, GFX9Renamed>,
-    VOP2Inst_sdwa<opName, P, GFX9Renamed> {
+    VOP2Inst_sdwa<opName, P, revOp, GFX9Renamed> {
   let renamedInGFX9 = GFX9Renamed in {
     if P.HasExtDPP then
       def _dpp  : VOP2_DPP_Pseudo <opName, P>;
@@ -259,7 +261,8 @@ multiclass VOP2bInst <string opName,
         }
 
         if P.HasExtSDWA then
-          def _sdwa  : VOP2_SDWA_Pseudo <opName, P> {
+          def _sdwa  : VOP2_SDWA_Pseudo <opName, P>,
+                       Commutable_REV<revOp#"_sdwa", !eq(revOp, opName)> {
             let AsmMatchConverter = "cvtSdwaVOP2b";
           }
         if P.HasExtDPP then
diff --git a/llvm/test/CodeGen/AMDGPU/sdwa-cse.mir b/llvm/test/CodeGen/AMDGPU/sdwa-cse.mir
new file mode 100644
index 00000000000000..1c12812cbcf527
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/sdwa-cse.mir
@@ -0,0 +1,56 @@
+# RUN: llc -mtriple=amdgcn -mcpu=gfx1030 -run-pass=machine-cse -verify-machineinstrs %s -o - 2>&1 | FileCheck --check-prefix=GCN %s
+
+# GCN-LABEL: name: test_machine_cse_subtraction_sdwa_f16_no_merge
+# GCN:   %2:vgpr_32 = contract nofpexcept V_SUB_F16_sdwa 0, %0.sub0, 0, %1.sub0, 0, 0, 6, 0, 5, 5, implicit $mode, implicit $exec
+# GCN:   %3:vgpr_32 = contract nofpexcept V_SUB_F16_sdwa 0, %1.sub0, 0, %0.sub0, 0, 0, 6, 0, 5, 5, implicit $mode, implicit $exec
+# GCN:   %5:vgpr_32 = contract nofpexcept V_ADD_F16_e32 %2, %4, implicit $mode, implicit $exec
+# GCN:   %6:vgpr_32 = contract nofpexcept V_ADD_F16_e32 %3, %4, implicit $mode, implicit $exec
+# GCN:   DS_WRITE2_B32_gfx9 undef %7:vgpr_32, %5, %6, 0, 1, 0, implicit $exec
+---
+name: test_machine_cse_subtraction_sdwa_f16_no_merge
+body: |
+  bb.0:
+    %0:vreg_64 = IMPLICIT_DEF
+    %1:vreg_64 = IMPLICIT_DEF
+    %2:vgpr_32 = contract nofpexcept V_SUB_F16_sdwa 0, %0.sub0, 0, %1.sub0, 0, 0, 6, 0, 5, 5, implicit $mode, implicit $exec
+    %3:vgpr_32 = contract nofpexcept V_SUB_F16_sdwa 0, %1.sub0, 0, %0.sub0, 0, 0, 6, 0, 5, 5, implicit $mode, implicit $exec
+    %4:vgpr_32 = IMPLICIT_DEF
+    %5:vgpr_32 = contract nofpexcept V_ADD_F16_e32 %2, %4, implicit $mode, implicit $exec
+    %6:vgpr_32 = contract nofpexcept V_ADD_F16_e32 %3, %4, implicit $mode, implicit $exec
+    DS_WRITE2_B32_gfx9 undef %7:vgpr_32, %5, %6, 0, 1, 0, implicit $exec
+...
+
+# GCN-LABEL: name: test_machine_cse_subtraction_sdwa_f16_merge_same_src_sel
+# GCN:   %2:vgpr_32 = contract nofpexcept V_SUB_F16_sdwa 0, %0.sub0, 0, %1.sub0, 0, 0, 6, 0, 5, 5, implicit $mode, implicit $exec
+# GCN:   %5:vgpr_32 = contract nofpexcept V_ADD_F16_e32 %2, %4, implicit $mode, implicit $exec
+# GCN:   DS_WRITE2_B32_gfx9 undef %7:vgpr_32, %5, %5, 0, 1, 0, implicit $exec
+---
+name: test_machine_cse_subtraction_sdwa_f16_merge_same_src_sel
+body: |
+  bb.0:
+    %0:vreg_64 = IMPLICIT_DEF
+    %1:vreg_64 = IMPLICIT_DEF
+    %2:vgpr_32 = contract nofpexcept V_SUB_F16_sdwa 0, %0.sub0, 0, %1.sub0, 0, 0, 6, 0, 5, 5, implicit $mode, implicit $exec
+    %3:vgpr_32 = contract nofpexcept V_SUBREV_F16_sdwa 0, %1.sub0, 0, %0.sub0, 0, 0, 6, 0, 5, 5, implicit $mode, implicit $exec
+    %4:vgpr_32 = IMPLICIT_DEF
+    %5:vgpr_32 = contract nofpexcept V_ADD_F16_e32 %2, %4, implicit $mode, implicit $exec
+    %6:vgpr_32 = contract nofpexcept V_ADD_F16_e32 %3, %4, implicit $mode, implicit $exec
+    DS_WRITE2_B32_gfx9 undef %7:vgpr_32, %5, %6, 0, 1, 0, implicit $exec
+...
+
+# GCN-LABEL: name: test_machine_cse_subtraction_sdwa_f16_merge_diff_src_sel
+# GCN:   %2:vgpr_32 = contract nofpexcept V_SUB_F16_sdwa 0, %0.sub0, 0, %1.sub0, 0, 0, 6, 0, 6, 5, implicit $mode, implicit $exec
+# GCN:   %5:vgpr_32 = contract nofpexcept V_ADD_F16_e32 %2, %4, implicit $mode, implicit $exec
+# GCN:   DS_WRITE2_B32_gfx9 undef %7:vgpr_32, %5, %5, 0, 1, 0, implicit $exec
+---
+name: test_machine_cse_subtraction_sdwa_f16_merge_diff_src_sel
+body: |
+  bb.0:
+    %0:vreg_64 = IMPLICIT_DEF
+    %1:vreg_64 = IMPLICIT_DEF
+    %2:vgpr_32 = contract nofpexcept V_SUB_F16_sdwa 0, %0.sub0, 0, %1.sub0, 0, 0, 6, 0, 6, 5, implicit $mode, implicit $exec
+    %3:vgpr_32 = contract nofpexcept V_SUBREV_F16_sdwa 0, %1.sub0, 0, %0.sub0, 0, 0, 6, 0, 5, 6, implicit $mode, implicit $exec
+    %4:vgpr_32 = IMPLICIT_DEF
+    %5:vgpr_32 = contract nofpexcept V_ADD_F16_e32 %2, %4, implicit $mode, implicit $exec
+    %6:vgpr_32 = contract nofpexcept V_ADD_F16_e32 %3, %4, implicit $mode, implicit $exec
+    DS_WRITE2_B32_gfx9 undef %7:vgpr_32, %5, %6, 0, 1, 0, implicit $exec

arsenm

An end to end IR test wouldn't hurt either, there are other contexts that commute

llvm/test/CodeGen/AMDGPU/sdwa-cse.mir

bfavela · 2024-10-02T16:48:35Z

llvm/lib/Target/AMDGPU/SIInstrInfo.cpp

@@ -2812,6 +2812,10 @@ MachineInstr *SIInstrInfo::commuteInstructionImpl(MachineInstr &MI, bool NewMI,
    swapSourceModifiers(MI, Src0, AMDGPU::OpName::src0_modifiers,
                        Src1, AMDGPU::OpName::src1_modifiers);

+    if (isSDWA(MI))


It's not just SDWA - VOP3 instructions with OPSEL also use src0/1_sel

arsenm

Fix end of line, add test of op_sel case

yxsamliu · 2024-10-03T17:14:45Z

An end to end IR test wouldn't hurt either, there are other contexts that commute

will add IR-to-ISA test

arsenm · 2024-10-03T17:42:08Z

llvm/test/CodeGen/AMDGPU/sdwa-commute.ll

@@ -0,0 +1,46 @@
+; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
+; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1030 -verify-machineinstrs < %s | FileCheck %s


Suggested change

; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1030 -verify-machineinstrs < %s | FileCheck %s

; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1030 < %s | FileCheck %s

arsenm · 2024-10-03T17:42:59Z

llvm/test/CodeGen/AMDGPU/sdwa-cse.mir

+    %5:vgpr_32 = contract nofpexcept V_ADD_F16_e32 %2, %4, implicit $mode, implicit $exec
+    %6:vgpr_32 = contract nofpexcept V_ADD_F16_e32 %3, %4, implicit $mode, implicit $exec
+    DS_WRITE2_B32_gfx9 undef %7:vgpr_32, %5, %6, 0, 1, 0, implicit $exec
+...


Test an op_sel case too

since op_sel is not in SDWA instructions, added another test commute-op-sel.mir

SDWA insts miss reverse opcode, which causes them to be treated as commutable with default reverse opcode i.e. their own opcode. As a result, SWDA F16 sub A, B and Sub B, A are merged by machine CSE. The correct behavior is to merged sub A, B and subrev B, A instead of sub B, A. This issues caused failures in rocFFT tests. Another issue is that src0_sel and src1_sel are not swapped when SDWA insts are commuted. Verified that this fixes rocFFT tests failure.

arsenm · 2024-10-04T18:02:18Z

llvm/test/CodeGen/AMDGPU/commute-op-sel.mir

+# GCN: %2:vgpr_32 = V_ADD_NC_U16_e64 0, %0, 0, %1, 1, 0, implicit $mode, implicit $exec
+# GCN: %3:vgpr_32 = V_ADD_NC_U16_e64 0, %1, 0, %0, 1, 0, implicit $mode, implicit $exec


No commute happened, and the modifiers aren't set?

It seems we do not define VOP3 instructions as commutable. Even if both op_sel are 0, the two V_ADD_NC_U16_e64 are not merged.

That's a bug that should be fixed

Can it be done with separate PR? This PR is for SDWA commute issue which causes rocFFT to fail. Whereas VOP3 not commute is missed performance opportunity. The current issue is kind of urgent.

OK I just noticed you have approved this PR. I will commit it first and will look into VOP3 commuting opportunity later.

open an issue to track this #111205

SDWA insts miss reverse opcode, which causes them to be treated as commutable with default reverse opcode i.e. their own opcode. As a result, SWDA F16 sub A, B and Sub B, A are merged by machine CSE. The correct behavior is to merged sub A, B and subrev B, A instead of sub B, A. This issues caused failures in rocFFT tests. Another issue is that src0_sel and src1_sel are not swapped when SDWA insts are commuted. Verified that this fixes rocFFT tests failure. Change-Id: I55189f27749c7ea5c4bea55013b91fe1672deeb8

yxsamliu requested review from jayfoad, arsenm and Sisyph September 1, 2024 19:11

llvmbot added the backend:AMDGPU label Sep 1, 2024

arsenm reviewed Sep 4, 2024

View reviewed changes

llvm/test/CodeGen/AMDGPU/sdwa-cse.mir Show resolved Hide resolved

arsenm mentioned this pull request Sep 20, 2024

[AMDGPU] fix SIPeepholeSDWA optimization for fp16 #109395

Closed

arsenm requested a review from PankajDwivedi-25 September 20, 2024 13:51

PankajDwivedi-25 approved these changes Sep 20, 2024

View reviewed changes

bfavela reviewed Oct 2, 2024

View reviewed changes

arsenm requested changes Oct 2, 2024

View reviewed changes

yxsamliu force-pushed the sdwa-commute branch from 46b12a2 to 9281934 Compare October 3, 2024 17:36

arsenm reviewed Oct 3, 2024

View reviewed changes

yxsamliu force-pushed the sdwa-commute branch 2 times, most recently from 157094a to ccc3eca Compare October 4, 2024 17:57

yxsamliu force-pushed the sdwa-commute branch from ccc3eca to e4db378 Compare October 4, 2024 18:00

arsenm reviewed Oct 4, 2024

View reviewed changes

arsenm approved these changes Oct 4, 2024

View reviewed changes

yxsamliu merged commit 3b88805 into llvm:main Oct 4, 2024
9 checks passed

yxsamliu mentioned this pull request Oct 4, 2024

[amdgpu] some VOP3 instructions not defined as commutable #111205

Closed

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[AMDGPU] Fix SDWA commuting #106920

[AMDGPU] Fix SDWA commuting #106920

yxsamliu commented Sep 1, 2024

llvmbot commented Sep 1, 2024

arsenm left a comment

bfavela Oct 2, 2024 •

edited

Loading

yxsamliu Oct 3, 2024

arsenm left a comment

yxsamliu commented Oct 3, 2024

arsenm Oct 3, 2024

arsenm Oct 3, 2024

yxsamliu Oct 4, 2024

arsenm Oct 4, 2024

yxsamliu Oct 4, 2024

arsenm Oct 4, 2024

yxsamliu Oct 4, 2024

yxsamliu Oct 4, 2024

yxsamliu Oct 4, 2024

		@@ -0,0 +1,46 @@
		; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
		; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1030 -verify-machineinstrs < %s \| FileCheck %s

		# GCN: %2:vgpr_32 = V_ADD_NC_U16_e64 0, %0, 0, %1, 1, 0, implicit $mode, implicit $exec
		# GCN: %3:vgpr_32 = V_ADD_NC_U16_e64 0, %1, 0, %0, 1, 0, implicit $mode, implicit $exec

[AMDGPU] Fix SDWA commuting #106920

[AMDGPU] Fix SDWA commuting #106920

Conversation

yxsamliu commented Sep 1, 2024

llvmbot commented Sep 1, 2024

arsenm left a comment

Choose a reason for hiding this comment

bfavela Oct 2, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

arsenm left a comment

Choose a reason for hiding this comment

yxsamliu commented Oct 3, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

bfavela Oct 2, 2024 •

edited

Loading