AMDGPU lowering of abs is bad for i16 vectors with more than 2 elements #94606

Closed
arsenm opened this issue Jun 6, 2024 · 5 comments · Fixed by #95413
Labels
backend:AMDGPU, good first issue, missed-optimization

Comments

@arsenm
Contributor

arsenm commented Jun 6, 2024

We currently get a simple 2-instruction expansion in the v2i16 case, but larger vectors are scalarized and produce bad code instead of being split into v2i16 pieces.

; RUN: llc -march=amdgcn -mcpu=gfx900 < %s

; v_pk_sub_i16 v1, 0, v0
; v_pk_max_i16 v0, v0, v1
define <2 x i16> @v_abs_v2i16(<2 x i16> %arg) {
  %res = call <2 x i16> @llvm.abs.v2i16(<2 x i16> %arg, i1 false)
  ret <2 x i16> %res
}

; This should decompose into 2 x v2i16 operations (4 instructions total), but instead we get this long sequence:

;	v_lshrrev_b32_e32 v2, 16, v1
;	v_sub_u16_e32 v3, 0, v2
;	v_max_i16_e32 v2, v2, v3
;	v_lshrrev_b32_e32 v3, 16, v0
;	v_sub_u16_e32 v4, 0, v3
;	v_max_i16_e32 v3, v3, v4
;	v_sub_u16_e32 v4, 0, v1
;	v_max_i16_e32 v1, v1, v4
;	v_sub_u16_e32 v4, 0, v0
;	v_max_i16_e32 v0, v0, v4
;	s_mov_b32 s4, 0x5040100
;	v_perm_b32 v0, v3, v0, s4
;	v_perm_b32 v1, v2, v1, s4
define <4 x i16> @v_abs_v4i16(<4 x i16> %arg) {
  %res = call <4 x i16> @llvm.abs.v4i16(<4 x i16> %arg, i1 false)
  ret <4 x i16> %res
}
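
For reference, splitting into two v2i16 operations should yield roughly this 4-instruction sequence, following the same pattern as the v2i16 case above (a sketch only; the exact register assignment will differ):

;	v_pk_sub_i16 v2, 0, v0
;	v_pk_max_i16 v0, v0, v2
;	v_pk_sub_i16 v2, 0, v1
;	v_pk_max_i16 v1, v1, v2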
arsenm added the backend:AMDGPU, good first issue, and missed-optimization labels on Jun 6, 2024
@llvmbot
Member

llvmbot commented Jun 6, 2024

Hi!

This issue may be a good introductory issue for people new to working on LLVM. If you would like to work on this issue, your first steps are:

  1. Check that no other contributor has already been assigned to this issue. If you believe that no one is actually working on it despite an assignment, ping the person. After one week without a response, the assignee may be changed.
  2. In the comments of this issue, ask for it to be assigned to you, or simply create a pull request after following the steps below. Mention this issue in the description of the pull request.
  3. Fix the issue locally.
  4. Run the test suite locally. Remember that the subdirectories under test/ create fine-grained testing targets, so you can e.g. use make check-clang-ast to only run Clang's AST tests.
  5. Create a Git commit.
  6. Run git clang-format HEAD~1 to format your changes.
  7. Open a pull request to the upstream repository on GitHub. Detailed instructions can be found in GitHub's documentation. Mention this issue in the description of the pull request.

If you have any further questions about this issue, don't hesitate to ask via a comment in the thread below.

@llvmbot
Member

llvmbot commented Jun 6, 2024

@llvm/issue-subscribers-good-first-issue

Author: Matt Arsenault (arsenm)

(The bot repeats the issue description above.)
@llvmbot
Member

llvmbot commented Jun 6, 2024

@llvm/issue-subscribers-backend-amdgpu

Author: Matt Arsenault (arsenm)

(The bot repeats the issue description above.)
@Rajveer100
Contributor

@arsenm
I believe the IR above computes the absolute value of each element of the vector. Could you provide some pointers on the optimisation part?

Also, does the vector size affect the sequence size?

@arsenm
Contributor Author

arsenm commented Jun 10, 2024

@arsenm I believe the IR above computes the absolute value of each element of the vector. Could you provide some pointers on the optimisation part?

SelectionDAG makes this difficult because of how it assumes vectors work. If the wider vector types were illegal, the vector legalizer would produce the correct code. We usually work around this by custom lowering wider vector operations and then doing the split ourselves.

Also, does the vector size affect the sequence size?

No, any 16-bit vector should be decomposed into 2-element pieces: 6 x i16 and 8 x i16 should just be broken into 2 x i16 chunks, and ideally the 3 x i16 case would be a 2 x i16 plus a scalar.
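
A minimal sketch of that custom-lowering approach (hypothetical code, not the exact change from #95413; lowerWideVectorABS is an invented helper name, and the real backend may route this through an existing split helper instead):

// In the SITargetLowering constructor: request custom lowering for wide i16 ABS.
setOperationAction(ISD::ABS, MVT::v4i16, Custom);

// Hypothetical helper, dispatched from LowerOperation: split the wide ABS into
// two <2 x i16> halves, emit ABS on each half (the v2i16 case already lowers to
// the packed v_pk_sub_i16/v_pk_max_i16 pair), and concatenate the results.
static SDValue lowerWideVectorABS(SDValue Op, SelectionDAG &DAG) {
  SDLoc SL(Op);
  EVT VT = Op.getValueType();                       // e.g. v4i16
  auto [Lo, Hi] = DAG.SplitVectorOperand(Op.getNode(), 0);
  EVT HalfVT = Lo.getValueType();                   // v2i16
  SDValue AbsLo = DAG.getNode(ISD::ABS, SL, HalfVT, Lo);
  SDValue AbsHi = DAG.getNode(ISD::ABS, SL, HalfVT, Hi);
  return DAG.getNode(ISD::CONCAT_VECTORS, SL, VT, AbsLo, AbsHi);
}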

arsenm pushed a commit that referenced this issue on Jun 14, 2024 (#95413)

Fixes #94606

Expansion of `ABS` for `i16` vectors with more than 2 elements currently falls back to scalarizing the vector.
This PR adds a custom lowering for `ABS` on `i16` vectors that splits the vector into multiple `<2 x i16>` vectors.
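
To verify the new lowering, a lit-style test along these lines would work (the CHECK lines are a hypothetical sketch, deliberately order-insensitive, and not necessarily the ones added by the PR):

; RUN: llc -march=amdgcn -mcpu=gfx900 < %s | FileCheck %s

; CHECK-LABEL: v_abs_v4i16:
; CHECK-DAG: v_pk_sub_i16
; CHECK-DAG: v_pk_max_i16
; CHECK-DAG: v_pk_sub_i16
; CHECK-DAG: v_pk_max_i16
define <4 x i16> @v_abs_v4i16(<4 x i16> %arg) {
  %res = call <4 x i16> @llvm.abs.v4i16(<4 x i16> %arg, i1 false)
  ret <4 x i16> %res
}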