[AMDGPU][PromoteAlloca] Support memsets to ptr allocas #80678

Merged: 3 commits into llvm:main on Feb 5, 2024

Conversation

Pierre-vh (Contributor):

Fixes #80366

llvmbot (Member) commented on Feb 5, 2024:

@llvm/pr-subscribers-backend-amdgpu

Author: Pierre van Houtryve (Pierre-vh)

Changes

Fixes #80366


Full diff: https://github.com/llvm/llvm-project/pull/80678.diff

2 Files Affected:

  • (modified) llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp (+12-4)
  • (modified) llvm/test/CodeGen/AMDGPU/promote-alloca-memset.ll (+12)
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp b/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
index 5e73411cae9b70..c1b244f50d93f8 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
@@ -521,10 +521,18 @@ static Value *promoteAllocaUserToVector(
       // For memset, we don't need to know the previous value because we
       // currently only allow memsets that cover the whole alloca.
       Value *Elt = MSI->getOperand(1);
-      if (DL.getTypeStoreSize(VecEltTy) > 1) {
-        Value *EltBytes =
-            Builder.CreateVectorSplat(DL.getTypeStoreSize(VecEltTy), Elt);
-        Elt = Builder.CreateBitCast(EltBytes, VecEltTy);
+      const unsigned BytesPerElt = DL.getTypeStoreSize(VecEltTy);
+      if (BytesPerElt > 1) {
+        Value *EltBytes = Builder.CreateVectorSplat(BytesPerElt, Elt);
+
+        // If the element type of the vector is a pointer, we need to first cast
+        // to an integer, then use a PtrCast.
+        if (VecEltTy->isPointerTy()) {
+          Type *PtrInt = Builder.getIntNTy(BytesPerElt * 8);
+          Elt = Builder.CreateBitCast(EltBytes, PtrInt);
+          Elt = Builder.CreateIntToPtr(Elt, VecEltTy);
+        } else
+          Elt = Builder.CreateBitCast(EltBytes, VecEltTy);
       }
 
       return Builder.CreateVectorSplat(VectorTy->getElementCount(), Elt);
diff --git a/llvm/test/CodeGen/AMDGPU/promote-alloca-memset.ll b/llvm/test/CodeGen/AMDGPU/promote-alloca-memset.ll
index 15af1f17e230ec..829e7a1b84e90c 100644
--- a/llvm/test/CodeGen/AMDGPU/promote-alloca-memset.ll
+++ b/llvm/test/CodeGen/AMDGPU/promote-alloca-memset.ll
@@ -84,4 +84,16 @@ entry:
   ret void
 }
 
+define amdgpu_kernel void @memset_ptr_alloca(ptr %out) {
+; CHECK-LABEL: @memset_ptr_alloca(
+; CHECK-NEXT:    store i64 0, ptr [[OUT:%.*]], align 8
+; CHECK-NEXT:    ret void
+;
+  %alloca = alloca [6 x ptr], align 16, addrspace(5)
+  call void @llvm.memset.p5.i64(ptr addrspace(5) %alloca, i8 0, i64 48, i1 false)
+  %load = load i64, ptr addrspace(5) %alloca
+  store i64 %load, ptr %out
+  ret void
+}
+
 declare void @llvm.memset.p5.i64(ptr addrspace(5) nocapture writeonly, i8, i64, i1 immarg)
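For readers skimming the diff, here is a rough, hand-written IR illustration of the instruction sequence the new pointer path conceptually builds. It is not taken from the patch or its tests: the function name is made up, the <6 x ptr> width mirrors the new test, and it assumes a 64-bit element pointer so BytesPerElt is 8. CreateVectorSplat is shown expanded into its insertelement/shufflevector form.

; Hypothetical illustration only: what the memset handling conceptually emits
; when the promoted vector's element type is a pointer. The byte value is
; splatted, reassembled into a pointer-sized integer, converted with inttoptr,
; and then splatted across the whole promoted vector.
define <6 x ptr> @illustrate_memset_to_ptr_vector(i8 %val) {
  ; Splat the memset byte into BytesPerElt (here 8) lanes.
  %byte.ins = insertelement <8 x i8> poison, i8 %val, i64 0
  %bytes    = shufflevector <8 x i8> %byte.ins, <8 x i8> poison, <8 x i32> zeroinitializer
  ; Bitcast the bytes to an integer of pointer width, then inttoptr to the
  ; element type (a direct bitcast <8 x i8> -> ptr would be invalid IR).
  %as.int   = bitcast <8 x i8> %bytes to i64
  %elt      = inttoptr i64 %as.int to ptr
  ; Finally splat the element across the vector the alloca was promoted to.
  %elt.ins  = insertelement <6 x ptr> poison, ptr %elt, i64 0
  %vec      = shufflevector <6 x ptr> %elt.ins, <6 x ptr> poison, <6 x i32> zeroinitializer
  ret <6 x ptr> %vec
}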


Inline review thread on these lines of AMDGPUPromoteAlloca.cpp:

// If the element type of the vector is a pointer, we need to first cast
// to an integer, then use a PtrCast.
if (VecEltTy->isPointerTy()) {
Contributor:
Will fail the same way for a vector of pointers.

Pierre-vh (author):
What do you mean by a vector of pointers?

Pierre-vh (author):
This only looks at the element type of the vector type we're promoting to, which is always a primitive type.

Contributor:
I mean [6 x <2 x ptr>]
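(For illustration, a hypothetical test in the style of promote-alloca-memset.ll showing the shape being discussed; the actual test added in the PR may differ. The function name is made up, and 96 bytes covers 6 x 2 x 8-byte pointers.)

; Hypothetical example of the [6 x <2 x ptr>] case; such an alloca is
; currently left alone rather than promoted.
define amdgpu_kernel void @memset_vector_of_ptr_alloca(ptr %out) {
  %alloca = alloca [6 x <2 x ptr>], align 16, addrspace(5)
  call void @llvm.memset.p5.i64(ptr addrspace(5) %alloca, i8 0, i64 96, i1 false)
  %load = load i64, ptr addrspace(5) %alloca
  store i64 %load, ptr %out
  ret void
}

declare void @llvm.memset.p5.i64(ptr addrspace(5) nocapture writeonly, i8, i64, i1 immarg)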

Pierre-vh (author):
Added the test as well; that isn't promoted either.

On a side note, PromoteAlloca still needs some love. I'm wondering if it's worth the effort to support things like nested arrays. I'd assume we'd hit the upper limit very fast with X * Y elements, but maybe not?

Contributor:
It's still not trying to intelligently choose which allocas are the most profitable to promote.

Pierre-vh (author):
@mariusz-sikora-at-amd I'm not sure; no strong opinion. I was thinking of doing it by flattening arrays (e.g. [2 x [3 x float]] becomes [6 x float]). I think the tricky part is resolving the GEPs correctly; it might be a bigger refactoring than it looks at first glance.

One alternative may be to have some kind of "alloca canonicalization" pass run earlier that does the flattening for us, to enable PromoteAlloca to do a better job.
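(A rough, hypothetical IR sketch of the flattening idea, not something the pass does today; the function and value names are made up. The tricky part mentioned above is exactly the index math: a GEP with indices i, j into [2 x [3 x float]] becomes a single index i*3 + j into [6 x float].)

define float @flatten_sketch(i32 %i, i32 %j) {
  ; Before: nested array with a two-index GEP.
  %nested  = alloca [2 x [3 x float]], align 4
  %p       = getelementptr [2 x [3 x float]], ptr %nested, i32 0, i32 %i, i32 %j
  %v       = load float, ptr %p

  ; After the hypothetical rewrite: a flat array and a linearized index.
  %flat    = alloca [6 x float], align 4
  %row     = mul i32 %i, 3
  %idx     = add i32 %row, %j
  %p.flat  = getelementptr [6 x float], ptr %flat, i32 0, i32 %idx
  %v.flat  = load float, ptr %p.flat

  %sum = fadd float %v, %v.flat
  ret float %sum
}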

@arsenm I haven't lost track of that, but I also haven't found the time for it yet :/
Last time I thought about it, I considered changing the pass so that it collects allocas, sorts them by profitability (number of users plus whether there are uses in loops), and then greedily promotes them one by one until it runs out of budget. Would that be good?

Contributor:
Yes, that's what I was thinking for a first step.

Contributor:
Also, I thought SROA already tried to flatten nested arrays.

Contributor:
I was thinking about splitting a two-dimensional array into multiple one-dimensional arrays (see the sketch after this list) because:

  • (this may be graphics specific) only a single component / column of the array is used, and by splitting we can remove the unused columns / components.
  • some shaders have very low VGPR usage but contain "big" (more than 32 elements) arrays like [10 x [4 x float]]. If we split that into four one-dimensional arrays, then we will be able to put everything into VGPRs.
  • the movrel instruction (which I'm aiming to generate) supports up to 32 elements.
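(Again a hypothetical IR sketch with made-up names, illustrating the column split from the first bullet: only column 2 of a [10 x [4 x float]] matrix is live, so after splitting it can live in its own [10 x float] alloca and be promoted on its own.)

define float @column_split_sketch(i32 %i) {
  ; Before: a two-dimensional alloca; only column 2 is ever accessed.
  %mat    = alloca [10 x [4 x float]], align 4
  %p      = getelementptr [10 x [4 x float]], ptr %mat, i32 0, i32 %i, i32 2
  %v      = load float, ptr %p

  ; After the hypothetical split: the live column becomes its own 1-D alloca
  ; (and the three unused columns simply disappear).
  %col2   = alloca [10 x float], align 4
  %p.col  = getelementptr [10 x float], ptr %col2, i32 0, i32 %i
  %v.col  = load float, ptr %p.col

  %sum = fadd float %v, %v.col
  ret float %sum
}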

Pierre-vh requested a review from arsenm on February 5, 2024 at 13:30.
Pierre-vh merged commit 4e958ab into llvm:main on February 5, 2024 (4 of 5 checks passed).
Pierre-vh deleted the memset-ptr branch on February 5, 2024 at 13:36.
llvmbot pushed commits to llvmbot/llvm-project that referenced this pull request on Feb 5, 2024.
agozillon pushed a commit to agozillon/llvm-project that referenced this pull request on Feb 5, 2024.
tstellar pushed commits to tstellar/llvm-project that referenced this pull request on Feb 14, 2024.
pointhex mentioned this pull request on May 7, 2024.
Successfully merging this pull request may close this issue:

  • #80366: AMDGPUPromoteAlloca triggered assert when attempting to cast the alloca user from <8 x i8> to ptr
4 participants