Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[AMDGPU] Fix SDWA commuting #106920

Merged
merged 1 commit into from
Oct 4, 2024
Merged

[AMDGPU] Fix SDWA commuting #106920

merged 1 commit into from
Oct 4, 2024

Conversation

yxsamliu
Copy link
Collaborator

@yxsamliu yxsamliu commented Sep 1, 2024

SDWA insts miss reverse opcode, which causes them to be treated as commutable with default reverse opcode i.e. their own opcode. As a result, SWDA F16 sub A, B and Sub B, A are merged by machine CSE. The correct behavior is to merged sub A, B and subrev B, A instead of sub B, A. This issues caused failures in rocFFT tests.

Another issue is that src0_sel and src1_sel are not swapped when SDWA insts are commuted.

Verified that this fixes rocFFT tests failure.

@llvmbot
Copy link
Member

llvmbot commented Sep 1, 2024

@llvm/pr-subscribers-backend-amdgpu

Author: Yaxun (Sam) Liu (yxsamliu)

Changes

SDWA insts miss reverse opcode, which causes them to be treated as commutable with default reverse opcode i.e. their own opcode. As a result, SWDA F16 sub A, B and Sub B, A are merged by machine CSE. The correct behavior is to merged sub A, B and subrev B, A instead of sub B, A. This issues caused failures in rocFFT tests.

Another issue is that src0_sel and src1_sel are not swapped when SDWA insts are commuted.

Verified that this fixes rocFFT tests failure.


Full diff: https://github.com/llvm/llvm-project/pull/106920.diff

3 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/SIInstrInfo.cpp (+4)
  • (modified) llvm/lib/Target/AMDGPU/VOP2Instructions.td (+7-4)
  • (added) llvm/test/CodeGen/AMDGPU/sdwa-cse.mir (+56)
diff --git a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
index a857bdba53c3e8..43ec2b4484ec28 100644
--- a/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
+++ b/llvm/lib/Target/AMDGPU/SIInstrInfo.cpp
@@ -2812,6 +2812,10 @@ MachineInstr *SIInstrInfo::commuteInstructionImpl(MachineInstr &MI, bool NewMI,
     swapSourceModifiers(MI, Src0, AMDGPU::OpName::src0_modifiers,
                         Src1, AMDGPU::OpName::src1_modifiers);
 
+    if (isSDWA(MI))
+      swapSourceModifiers(MI, Src0, AMDGPU::OpName::src0_sel, Src1,
+                          AMDGPU::OpName::src1_sel);
+
     CommutedMI->setDesc(get(CommutedOpcode));
   }
 
diff --git a/llvm/lib/Target/AMDGPU/VOP2Instructions.td b/llvm/lib/Target/AMDGPU/VOP2Instructions.td
index fccaa27f361381..852d434cf743e8 100644
--- a/llvm/lib/Target/AMDGPU/VOP2Instructions.td
+++ b/llvm/lib/Target/AMDGPU/VOP2Instructions.td
@@ -174,10 +174,12 @@ multiclass VOP2Inst_e64<string opName,
 
 multiclass VOP2Inst_sdwa<string opName,
                          VOPProfile P,
+                         string revOp = opName,
                          bit GFX9Renamed = 0> {
   let renamedInGFX9 = GFX9Renamed in {
     if P.HasExtSDWA then
-      def _sdwa : VOP2_SDWA_Pseudo <opName, P>;
+      def _sdwa : VOP2_SDWA_Pseudo <opName, P>,
+                  Commutable_REV<revOp#"_sdwa", !eq(revOp, opName)>;
   } // End renamedInGFX9 = GFX9Renamed
 }
 
@@ -188,7 +190,7 @@ multiclass VOP2Inst<string opName,
                     bit GFX9Renamed = 0> :
     VOP2Inst_e32<opName, P, node, revOp, GFX9Renamed>,
     VOP2Inst_e64<opName, P, node, revOp, GFX9Renamed>,
-    VOP2Inst_sdwa<opName, P, GFX9Renamed> {
+    VOP2Inst_sdwa<opName, P, revOp, GFX9Renamed> {
   let renamedInGFX9 = GFX9Renamed in {
     if P.HasExtDPP then
       def _dpp  : VOP2_DPP_Pseudo <opName, P>;
@@ -237,7 +239,7 @@ multiclass VOP2Inst_VOPD<string opName,
                          bit GFX9Renamed = 0> :
     VOP2Inst_e32_VOPD<opName, P, VOPDOp, VOPDName, node, revOp, GFX9Renamed>,
     VOP2Inst_e64<opName, P, node, revOp, GFX9Renamed>,
-    VOP2Inst_sdwa<opName, P, GFX9Renamed> {
+    VOP2Inst_sdwa<opName, P, revOp, GFX9Renamed> {
   let renamedInGFX9 = GFX9Renamed in {
     if P.HasExtDPP then
       def _dpp  : VOP2_DPP_Pseudo <opName, P>;
@@ -259,7 +261,8 @@ multiclass VOP2bInst <string opName,
         }
 
         if P.HasExtSDWA then
-          def _sdwa  : VOP2_SDWA_Pseudo <opName, P> {
+          def _sdwa  : VOP2_SDWA_Pseudo <opName, P>,
+                       Commutable_REV<revOp#"_sdwa", !eq(revOp, opName)> {
             let AsmMatchConverter = "cvtSdwaVOP2b";
           }
         if P.HasExtDPP then
diff --git a/llvm/test/CodeGen/AMDGPU/sdwa-cse.mir b/llvm/test/CodeGen/AMDGPU/sdwa-cse.mir
new file mode 100644
index 00000000000000..1c12812cbcf527
--- /dev/null
+++ b/llvm/test/CodeGen/AMDGPU/sdwa-cse.mir
@@ -0,0 +1,56 @@
+# RUN: llc -mtriple=amdgcn -mcpu=gfx1030 -run-pass=machine-cse -verify-machineinstrs %s -o - 2>&1 | FileCheck --check-prefix=GCN %s
+
+# GCN-LABEL: name: test_machine_cse_subtraction_sdwa_f16_no_merge
+# GCN:   %2:vgpr_32 = contract nofpexcept V_SUB_F16_sdwa 0, %0.sub0, 0, %1.sub0, 0, 0, 6, 0, 5, 5, implicit $mode, implicit $exec
+# GCN:   %3:vgpr_32 = contract nofpexcept V_SUB_F16_sdwa 0, %1.sub0, 0, %0.sub0, 0, 0, 6, 0, 5, 5, implicit $mode, implicit $exec
+# GCN:   %5:vgpr_32 = contract nofpexcept V_ADD_F16_e32 %2, %4, implicit $mode, implicit $exec
+# GCN:   %6:vgpr_32 = contract nofpexcept V_ADD_F16_e32 %3, %4, implicit $mode, implicit $exec
+# GCN:   DS_WRITE2_B32_gfx9 undef %7:vgpr_32, %5, %6, 0, 1, 0, implicit $exec
+---
+name: test_machine_cse_subtraction_sdwa_f16_no_merge
+body: |
+  bb.0:
+    %0:vreg_64 = IMPLICIT_DEF
+    %1:vreg_64 = IMPLICIT_DEF
+    %2:vgpr_32 = contract nofpexcept V_SUB_F16_sdwa 0, %0.sub0, 0, %1.sub0, 0, 0, 6, 0, 5, 5, implicit $mode, implicit $exec
+    %3:vgpr_32 = contract nofpexcept V_SUB_F16_sdwa 0, %1.sub0, 0, %0.sub0, 0, 0, 6, 0, 5, 5, implicit $mode, implicit $exec
+    %4:vgpr_32 = IMPLICIT_DEF
+    %5:vgpr_32 = contract nofpexcept V_ADD_F16_e32 %2, %4, implicit $mode, implicit $exec
+    %6:vgpr_32 = contract nofpexcept V_ADD_F16_e32 %3, %4, implicit $mode, implicit $exec
+    DS_WRITE2_B32_gfx9 undef %7:vgpr_32, %5, %6, 0, 1, 0, implicit $exec
+...
+
+# GCN-LABEL: name: test_machine_cse_subtraction_sdwa_f16_merge_same_src_sel
+# GCN:   %2:vgpr_32 = contract nofpexcept V_SUB_F16_sdwa 0, %0.sub0, 0, %1.sub0, 0, 0, 6, 0, 5, 5, implicit $mode, implicit $exec
+# GCN:   %5:vgpr_32 = contract nofpexcept V_ADD_F16_e32 %2, %4, implicit $mode, implicit $exec
+# GCN:   DS_WRITE2_B32_gfx9 undef %7:vgpr_32, %5, %5, 0, 1, 0, implicit $exec
+---
+name: test_machine_cse_subtraction_sdwa_f16_merge_same_src_sel
+body: |
+  bb.0:
+    %0:vreg_64 = IMPLICIT_DEF
+    %1:vreg_64 = IMPLICIT_DEF
+    %2:vgpr_32 = contract nofpexcept V_SUB_F16_sdwa 0, %0.sub0, 0, %1.sub0, 0, 0, 6, 0, 5, 5, implicit $mode, implicit $exec
+    %3:vgpr_32 = contract nofpexcept V_SUBREV_F16_sdwa 0, %1.sub0, 0, %0.sub0, 0, 0, 6, 0, 5, 5, implicit $mode, implicit $exec
+    %4:vgpr_32 = IMPLICIT_DEF
+    %5:vgpr_32 = contract nofpexcept V_ADD_F16_e32 %2, %4, implicit $mode, implicit $exec
+    %6:vgpr_32 = contract nofpexcept V_ADD_F16_e32 %3, %4, implicit $mode, implicit $exec
+    DS_WRITE2_B32_gfx9 undef %7:vgpr_32, %5, %6, 0, 1, 0, implicit $exec
+...
+
+# GCN-LABEL: name: test_machine_cse_subtraction_sdwa_f16_merge_diff_src_sel
+# GCN:   %2:vgpr_32 = contract nofpexcept V_SUB_F16_sdwa 0, %0.sub0, 0, %1.sub0, 0, 0, 6, 0, 6, 5, implicit $mode, implicit $exec
+# GCN:   %5:vgpr_32 = contract nofpexcept V_ADD_F16_e32 %2, %4, implicit $mode, implicit $exec
+# GCN:   DS_WRITE2_B32_gfx9 undef %7:vgpr_32, %5, %5, 0, 1, 0, implicit $exec
+---
+name: test_machine_cse_subtraction_sdwa_f16_merge_diff_src_sel
+body: |
+  bb.0:
+    %0:vreg_64 = IMPLICIT_DEF
+    %1:vreg_64 = IMPLICIT_DEF
+    %2:vgpr_32 = contract nofpexcept V_SUB_F16_sdwa 0, %0.sub0, 0, %1.sub0, 0, 0, 6, 0, 6, 5, implicit $mode, implicit $exec
+    %3:vgpr_32 = contract nofpexcept V_SUBREV_F16_sdwa 0, %1.sub0, 0, %0.sub0, 0, 0, 6, 0, 5, 6, implicit $mode, implicit $exec
+    %4:vgpr_32 = IMPLICIT_DEF
+    %5:vgpr_32 = contract nofpexcept V_ADD_F16_e32 %2, %4, implicit $mode, implicit $exec
+    %6:vgpr_32 = contract nofpexcept V_ADD_F16_e32 %3, %4, implicit $mode, implicit $exec
+    DS_WRITE2_B32_gfx9 undef %7:vgpr_32, %5, %6, 0, 1, 0, implicit $exec

Copy link
Contributor

@arsenm arsenm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

An end to end IR test wouldn't hurt either, there are other contexts that commute

llvm/test/CodeGen/AMDGPU/sdwa-cse.mir Show resolved Hide resolved
@@ -2812,6 +2812,10 @@ MachineInstr *SIInstrInfo::commuteInstructionImpl(MachineInstr &MI, bool NewMI,
swapSourceModifiers(MI, Src0, AMDGPU::OpName::src0_modifiers,
Src1, AMDGPU::OpName::src1_modifiers);

if (isSDWA(MI))
Copy link
Contributor

@bfavela bfavela Oct 2, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not just SDWA - VOP3 instructions with OPSEL also use src0/1_sel

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

will fix

Copy link
Contributor

@arsenm arsenm left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fix end of line, add test of op_sel case

@yxsamliu
Copy link
Collaborator Author

yxsamliu commented Oct 3, 2024

An end to end IR test wouldn't hurt either, there are other contexts that commute

will add IR-to-ISA test

@@ -0,0 +1,46 @@
; NOTE: Assertions have been autogenerated by utils/update_llc_test_checks.py UTC_ARGS: --version 5
; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1030 -verify-machineinstrs < %s | FileCheck %s
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1030 -verify-machineinstrs < %s | FileCheck %s
; RUN: llc -mtriple=amdgcn-amd-amdhsa -mcpu=gfx1030 < %s | FileCheck %s

%5:vgpr_32 = contract nofpexcept V_ADD_F16_e32 %2, %4, implicit $mode, implicit $exec
%6:vgpr_32 = contract nofpexcept V_ADD_F16_e32 %3, %4, implicit $mode, implicit $exec
DS_WRITE2_B32_gfx9 undef %7:vgpr_32, %5, %6, 0, 1, 0, implicit $exec
...
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Test an op_sel case too

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

since op_sel is not in SDWA instructions, added another test commute-op-sel.mir

@yxsamliu yxsamliu force-pushed the sdwa-commute branch 2 times, most recently from 157094a to ccc3eca Compare October 4, 2024 17:57
SDWA insts miss reverse opcode, which causes them to be treated as commutable
with default reverse opcode i.e. their own opcode. As a result, SWDA F16
sub A, B and Sub B, A are merged by machine CSE. The correct behavior is
to merged sub A, B and subrev B, A instead of sub B, A. This issues caused
failures in rocFFT tests.

Another issue is that src0_sel and src1_sel are not swapped when SDWA insts
are commuted.

Verified that this fixes rocFFT tests failure.
Comment on lines +4 to +5
# GCN: %2:vgpr_32 = V_ADD_NC_U16_e64 0, %0, 0, %1, 1, 0, implicit $mode, implicit $exec
# GCN: %3:vgpr_32 = V_ADD_NC_U16_e64 0, %1, 0, %0, 1, 0, implicit $mode, implicit $exec
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

No commute happened, and the modifiers aren't set?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems we do not define VOP3 instructions as commutable. Even if both op_sel are 0, the two V_ADD_NC_U16_e64 are not merged.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a bug that should be fixed

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can it be done with separate PR? This PR is for SDWA commute issue which causes rocFFT to fail. Whereas VOP3 not commute is missed performance opportunity. The current issue is kind of urgent.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OK I just noticed you have approved this PR. I will commit it first and will look into VOP3 commuting opportunity later.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

open an issue to track this #111205

@yxsamliu yxsamliu merged commit 3b88805 into llvm:main Oct 4, 2024
9 checks passed
searlmc1 pushed a commit to ROCm/llvm-project that referenced this pull request Nov 22, 2024
SDWA insts miss reverse opcode, which causes them to be treated as
commutable with default reverse opcode i.e. their own opcode. As a
result, SWDA F16 sub A, B and Sub B, A are merged by machine CSE. The
correct behavior is to merged sub A, B and subrev B, A instead of sub B,
A. This issues caused failures in rocFFT tests.

Another issue is that src0_sel and src1_sel are not swapped when SDWA
insts are commuted.

Verified that this fixes rocFFT tests failure.

Change-Id: I55189f27749c7ea5c4bea55013b91fe1672deeb8
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants