[AMDGPU][PromoteAlloca] Support memsets to ptr allocas #80678
Conversation
@llvm/pr-subscribers-backend-amdgpu

Author: Pierre van Houtryve (Pierre-vh)

Changes

Fixes #80366

Full diff: https://github.com/llvm/llvm-project/pull/80678.diff

2 Files Affected:
diff --git a/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp b/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
index 5e73411cae9b70..c1b244f50d93f8 100644
--- a/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
+++ b/llvm/lib/Target/AMDGPU/AMDGPUPromoteAlloca.cpp
@@ -521,10 +521,18 @@ static Value *promoteAllocaUserToVector(
// For memset, we don't need to know the previous value because we
// currently only allow memsets that cover the whole alloca.
Value *Elt = MSI->getOperand(1);
- if (DL.getTypeStoreSize(VecEltTy) > 1) {
- Value *EltBytes =
- Builder.CreateVectorSplat(DL.getTypeStoreSize(VecEltTy), Elt);
- Elt = Builder.CreateBitCast(EltBytes, VecEltTy);
+ const unsigned BytesPerElt = DL.getTypeStoreSize(VecEltTy);
+ if (BytesPerElt > 1) {
+ Value *EltBytes = Builder.CreateVectorSplat(BytesPerElt, Elt);
+
+ // If the element type of the vector is a pointer, we need to first cast
+ // to an integer, then use a PtrCast.
+ if (VecEltTy->isPointerTy()) {
+ Type *PtrInt = Builder.getIntNTy(BytesPerElt * 8);
+ Elt = Builder.CreateBitCast(EltBytes, PtrInt);
+ Elt = Builder.CreateIntToPtr(Elt, VecEltTy);
+ } else
+ Elt = Builder.CreateBitCast(EltBytes, VecEltTy);
}
return Builder.CreateVectorSplat(VectorTy->getElementCount(), Elt);
diff --git a/llvm/test/CodeGen/AMDGPU/promote-alloca-memset.ll b/llvm/test/CodeGen/AMDGPU/promote-alloca-memset.ll
index 15af1f17e230ec..829e7a1b84e90c 100644
--- a/llvm/test/CodeGen/AMDGPU/promote-alloca-memset.ll
+++ b/llvm/test/CodeGen/AMDGPU/promote-alloca-memset.ll
@@ -84,4 +84,16 @@ entry:
ret void
}
+define amdgpu_kernel void @memset_ptr_alloca(ptr %out) {
+; CHECK-LABEL: @memset_ptr_alloca(
+; CHECK-NEXT: store i64 0, ptr [[OUT:%.*]], align 8
+; CHECK-NEXT: ret void
+;
+ %alloca = alloca [6 x ptr], align 16, addrspace(5)
+ call void @llvm.memset.p5.i64(ptr addrspace(5) %alloca, i8 0, i64 48, i1 false)
+ %load = load i64, ptr addrspace(5) %alloca
+ store i64 %load, ptr %out
+ ret void
+}
+
declare void @llvm.memset.p5.i64(ptr addrspace(5) nocapture writeonly, i8, i64, i1 immarg)
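To illustrate what the patched code computes, here is a small Python model (not LLVM code; the helper name is made up) of splatting the memset byte across one element's bytes and reinterpreting them as a single integer, which `inttoptr` then turns into the pointer stored in each lane:

```python
def splat_memset_byte(byte: int, elt_size: int) -> int:
    """Model of CreateVectorSplat over elt_size bytes followed by a
    bitcast to an (elt_size * 8)-bit integer, little-endian."""
    return int.from_bytes(bytes([byte]) * elt_size, "little")

# memset with 0 over an 8-byte ptr lane yields the integer 0,
# which inttoptr converts to a null pointer.
print(splat_memset_byte(0, 8))          # 0
print(hex(splat_memset_byte(0xAB, 4)))  # 0xabababab
```

This matches the test above: a zero memset of the `[6 x ptr]` alloca makes every lane a null pointer, so the `i64` load folds to `store i64 0`.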
> // If the element type of the vector is a pointer, we need to first cast
> // to an integer, then use a PtrCast.
> if (VecEltTy->isPointerTy()) {
Will fail the same way for vector of pointers
What do you mean vector of pointers?
This only looks at the element type of the vector type we're promoting to, which is always a primitive.
I mean [6 x <2 x ptr>]
Added the test as well; that case isn't promoted either.

On a side note, PromoteAlloca still needs some love. I'm wondering if it's worth the effort to support things like nested arrays. I'd assume we'd hit the upper limit very fast with X * Y elements, but maybe not?
It's still not trying to intelligently choose which allocas are the most profitable to promote
@mariusz-sikora-at-amd I'm not sure, no strong opinion. I was thinking of doing it by flattening arrays (e.g. [2 x [3 x float]] becomes [6 x float]). I think the tricky part is resolving the GEPs correctly; it might be a bigger refactoring than it looks at first glance.
One alternative may be to have some kind of "alloca canonicalization" pass earlier that does the flattening for us to enable PromoteAlloca better.
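The index math such a flattening would need can be sketched as follows (a hypothetical helper, not part of the patch): an access [i][j] into [N x [M x T]] maps to index i * M + j in the flattened [N*M x T], which is the rewrite the GEPs would have to undergo:

```python
def flatten_gep_index(i: int, j: int, inner_len: int) -> int:
    # [2 x [3 x float]] becomes [6 x float]; element [i][j]
    # lands at index i * inner_len + j in the flat array.
    return i * inner_len + j

print(flatten_gep_index(1, 2, 3))  # 5, the last element of [2 x [3 x float]]
```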
@arsenm I haven't lost track of that but I also didn't find the time for it yet :/
Last time I thought about it I thought about changing the pass so it collects allocas, then sorts them by profitability (number of users + whether there are uses in loops), then just greedily promotes them one by one until it runs out of budget. Would that be good?
Yes, that's what I was thinking for a first step
Also, I thought SROA did try to flatten nested arrays already
I was thinking about splitting a two-dimensional array into multiple one-dimensional arrays because:
- (this may be graphics specific) often only a single component / column of the array is used, and by splitting we can remove the unused columns / components.
- some shaders have very low VGPR usage but contain "big" (more than 32 elements) arrays like [10 x [4 x float]]. If we split it into four one-dimensional arrays, then we will be able to put everything into VGPRs.
- the movrel instruction (which I'm aiming to generate) supports up to 32 elements.
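The column split described above can be modeled with a small hypothetical helper: element [i][c] of [N x [4 x float]] moves to index i of the c-th one-dimensional array, so each column can become its own [N x float] alloca:

```python
def split_columns(rows):
    """Turn an N x M nested array into M flat column arrays,
    one per component, each of length N."""
    return [list(col) for col in zip(*rows)]

print(split_columns([[1, 2], [3, 4], [5, 6]]))  # [[1, 3, 5], [2, 4, 6]]
```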
Fixes llvm#80366 (cherry picked from commit 4e958ab)