AMDGPU lowering of abs is bad for i16 vectors with more than 2 elements #94606

Closed
arsenm opened this issue Jun 6, 2024 · 5 comments · Fixed by #95413
Labels
backend:AMDGPU, good first issue, missed-optimization

Comments

@arsenm
Contributor

arsenm commented Jun 6, 2024

We currently get a simple 2-instruction expansion in the v2i16 case, but larger vectors are scalarized and produce bad code instead of being split into v2i16 pieces.

; RUN: llc -march=amdgcn -mcpu=gfx900 < %s

; v_pk_sub_i16 v1, 0, v0
; v_pk_max_i16 v0, v0, v1
define <2 x i16> @v_abs_v2i16(<2 x i16> %arg) {
  %res = call <2 x i16> @llvm.abs.v2i16(<2 x i16> %arg, i1 false)
  ret <2 x i16> %res
}

; This should decompose into 2 x v2i16 operations (4 instructions total), but instead we get this long sequence:

;	v_lshrrev_b32_e32 v2, 16, v1
;	v_sub_u16_e32 v3, 0, v2
;	v_max_i16_e32 v2, v2, v3
;	v_lshrrev_b32_e32 v3, 16, v0
;	v_sub_u16_e32 v4, 0, v3
;	v_max_i16_e32 v3, v3, v4
;	v_sub_u16_e32 v4, 0, v1
;	v_max_i16_e32 v1, v1, v4
;	v_sub_u16_e32 v4, 0, v0
;	v_max_i16_e32 v0, v0, v4
;	s_mov_b32 s4, 0x5040100
;	v_perm_b32 v0, v3, v0, s4
;	v_perm_b32 v1, v2, v1, s4
define <4 x i16> @v_abs_v4i16(<4 x i16> %arg) {
  %res = call <4 x i16> @llvm.abs.v4i16(<4 x i16> %arg, i1 false)
  ret <4 x i16> %res
}
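
For reference, splitting into two v2i16 operations should yield roughly this 4-instruction sequence, following the same pattern as the v2i16 case above (a sketch only; the exact register assignment will differ):

;	v_pk_sub_i16 v2, 0, v0
;	v_pk_max_i16 v0, v0, v2
;	v_pk_sub_i16 v2, 0, v1
;	v_pk_max_i16 v1, v1, v2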
arsenm added the backend:AMDGPU, good first issue, and missed-optimization labels on Jun 6, 2024
@llvmbot
Member

llvmbot commented Jun 6, 2024

Hi!

This issue may be a good introductory issue for people new to working on LLVM. If you would like to work on this issue, your first steps are:

  1. Check that no other contributor has already been assigned to this issue. If you believe that no one is actually working on it despite an assignment, ping the person. After one week without a response, the assignee may be changed.
  2. In the comments of this issue, ask for it to be assigned to you, or simply create a pull request after following the steps below. Mention this issue in the description of the pull request.
  3. Fix the issue locally.
  4. Run the test suite locally. Remember that the subdirectories under test/ create fine-grained testing targets, so you can e.g. use make check-clang-ast to only run Clang's AST tests.
  5. Create a Git commit.
  6. Run git clang-format HEAD~1 to format your changes.
  7. Open a pull request to the upstream repository on GitHub. Detailed instructions can be found in GitHub's documentation. Mention this issue in the description of the pull request.

If you have any further questions about this issue, don't hesitate to ask via a comment in the thread below.

@llvmbot
Member

llvmbot commented Jun 6, 2024

@llvm/issue-subscribers-good-first-issue

Author: Matt Arsenault (arsenm)

(The bot repeats the issue description above.)
@llvmbot
Member

llvmbot commented Jun 6, 2024

@llvm/issue-subscribers-backend-amdgpu

Author: Matt Arsenault (arsenm)

(The bot repeats the issue description above.)
@Rajveer100
Contributor

@arsenm
I believe the IR above computes the absolute value of each element of the vector. Could you provide some pointers on the optimisation part?

Also, does the vector size affect the sequence size?

@arsenm
Contributor Author

arsenm commented Jun 10, 2024

@arsenm I believe the IR above computes the absolute value of each element of the vector. Could you provide some pointers on the optimisation part?

SelectionDAG makes this difficult because of how it assumes vectors work. If the wider vector types were illegal, the vector legalizer would produce the correct code. We usually work around this by custom lowering wider vector operations and then doing the split ourselves.

Also, does the vector size affect the sequence size?

No, any 16-bit vector should be decomposed into 2-element pieces: 6 x i16 and 8 x i16 should just be broken into 2 x i16 chunks, and ideally the 3 x i16 case would be a 2 x i16 plus a scalar.
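
A minimal sketch of that custom-lowering approach (hypothetical code, not the exact change from #95413; lowerWideVectorABS is an invented helper name, and the real backend may route this through an existing split helper instead):

// In the SITargetLowering constructor: request custom lowering for wide i16 ABS.
setOperationAction(ISD::ABS, MVT::v4i16, Custom);

// Hypothetical helper, dispatched from LowerOperation: split the wide ABS into
// two <2 x i16> halves, emit ABS on each half (the v2i16 case already lowers to
// the packed v_pk_sub_i16/v_pk_max_i16 pair), and concatenate the results.
static SDValue lowerWideVectorABS(SDValue Op, SelectionDAG &DAG) {
  SDLoc SL(Op);
  EVT VT = Op.getValueType();                       // e.g. v4i16
  auto [Lo, Hi] = DAG.SplitVectorOperand(Op.getNode(), 0);
  EVT HalfVT = Lo.getValueType();                   // v2i16
  SDValue AbsLo = DAG.getNode(ISD::ABS, SL, HalfVT, Lo);
  SDValue AbsHi = DAG.getNode(ISD::ABS, SL, HalfVT, Hi);
  return DAG.getNode(ISD::CONCAT_VECTORS, SL, VT, AbsLo, AbsHi);
}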

arsenm pushed a commit that referenced this issue on Jun 14, 2024 (#95413)

Fixes #94606

Expansion of `ABS` for `i16` vectors with more than 2 elements currently falls back to scalarizing the vector.
This PR adds a custom lowering for `ABS` on `i16` vectors that splits the vector into multiple `<2 x i16>` vectors.
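
To verify the new lowering, a lit-style test along these lines would work (the CHECK lines are a hypothetical sketch, deliberately order-insensitive, and not necessarily the ones added by the PR):

; RUN: llc -march=amdgcn -mcpu=gfx900 < %s | FileCheck %s

; CHECK-LABEL: v_abs_v4i16:
; CHECK-DAG: v_pk_sub_i16
; CHECK-DAG: v_pk_max_i16
; CHECK-DAG: v_pk_sub_i16
; CHECK-DAG: v_pk_max_i16
define <4 x i16> @v_abs_v4i16(<4 x i16> %arg) {
  %res = call <4 x i16> @llvm.abs.v4i16(<4 x i16> %arg, i1 false)
  ret <4 x i16> %res
}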