[PaddlePaddle Hackathon 5 No.48] CPU and GPU performance optimization for the ContiguousKernel and StridedCopyKernel operators -part #57835
Your PR was submitted successfully. Thank you for contributing to this open-source project!
Sorry to inform you that 5360925's CIs have passed for more than 7 days. To prevent PR conflicts, you need to re-run all CIs manually.
python/paddle/signal.py
Outdated
@@ -631,6 +631,8 @@ def istft(
'Abort istft because Nonzero Overlap Add (NOLA) condition failed. For more information about NOLA constraint please see `scipy.signal.check_NOLA`(https://docs.scipy.org/doc/scipy/reference/generated/scipy.signal.check_NOLA.html).'
)

print(out)
Why does this file need to be changed?
Great work! I have 1 question and 2 suggestions:
Hello!
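Suggestion 1 is essentially about creating more shape scenarios: at the moment one case improves by 20% and the other by 100%, which does not show the effect of the optimization very well. Make numel as large as possible in the new test cases. You don't have to use OPBenchmark; you can simply test directly, using the following demo as a reference: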
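@wanghuancoder Hello, CI has passed for both GPU-side optimizations. If you have time, could you review the code first? The test data and the CPU-side optimization are still being completed and will be uploaded soon.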
LGTM
@wanghuancoder With the time.time()-based approach, actual testing showed very large errors across repeated runs. The current data is based on OP Benchmark.
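A minimal sketch of one way to reduce the run-to-run noise mentioned above when timing by hand: discard a few warm-up calls, repeat the measurement, and report the median using `time.perf_counter` rather than a single `time.time()` difference. This is an illustration only, not the OP Benchmark procedure used for the data in this PR; `measure` and the placeholder workload are hypothetical names.

```python
import statistics
import time


def measure(fn, warmup=5, repeats=30):
    """Return (median, standard deviation) of wall-clock seconds per call to fn.

    Assumption: fn blocks until its work is finished. GPU kernel launches are
    asynchronous, so a real GPU measurement would need a device
    synchronization inside fn before it returns.
    """
    # Warm-up calls absorb one-time costs such as allocation and caching.
    for _ in range(warmup):
        fn()

    samples = []
    for _ in range(repeats):
        start = time.perf_counter()  # monotonic, high-resolution clock
        fn()
        samples.append(time.perf_counter() - start)

    # The median is less sensitive to occasional outliers than the mean.
    return statistics.median(samples), statistics.stdev(samples)


if __name__ == "__main__":
    # Placeholder CPU workload standing in for the operator under test.
    median_s, spread_s = measure(lambda: sum(range(100_000)))
    print(f"median {median_s:.6f}s, stdev {spread_s:.6f}s")
```

Reporting the spread alongside the median makes it visible when a measurement is too noisy to support a claimed speedup.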
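Case one can in fact be optimized further by splitting it into case two (c<=64&&abc<=1024&&rank>6), case three (!(c<=64&&abc<=1024)&&rank<=6), and case four (all other situations); the performance gains of case two and case three fall between those of case zero and case four.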
Sounds good! I have gone ahead and merged this PR.
…PU, GPU performance optimization (PaddlePaddle#57835) * speed up ContiguousKernel * fix bugs * fix bugs * test origin code * fix bugs
PR types
Performance optimization
PR changes
OPs
Description
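Currently, both the ContiguousKernel and StridedCopyKernel kernels compute each element's data offset from its numel index, which requires a for loop. Computing offsets this way is inefficient and leads to poor kernel performance.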
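To illustrate the index-based addressing described above, here is a plain-Python sketch (not Paddle's kernel code; `offset_from_linear_index` and `contiguous_copy` are hypothetical names). Every output element pays for a loop over all dimensions, with one modulo and one division per dimension, just to locate its source offset.

```python
def offset_from_linear_index(index, dims, strides):
    """Map a row-major linear (numel) index to an offset in strided storage."""
    offset = 0
    # Peel coordinates off from the innermost dimension outward.
    for dim, stride in zip(reversed(dims), reversed(strides)):
        offset += (index % dim) * stride  # coordinate along this dim * stride
        index //= dim
    return offset


def contiguous_copy(src, dims, strides):
    """Gather a strided view of src into a new contiguous (row-major) list."""
    numel = 1
    for dim in dims:
        numel *= dim
    # One offset computation, and therefore one inner loop, per element.
    return [src[offset_from_linear_index(i, dims, strides)] for i in range(numel)]


if __name__ == "__main__":
    # A 2x3 row-major buffer viewed as its 3x2 transpose: strides (1, 3).
    print(contiguous_copy([0, 1, 2, 3, 4, 5], (3, 2), (1, 3)))
    # prints [0, 3, 1, 4, 2, 5]
```

The cost of this scheme grows with both the element count and the tensor rank, which is why the per-element loop dominates for large, high-rank inputs.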
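After the optimization, performance improves substantially.
At present, the original case one can still be split further to speed up the kernel and to reduce the occurrence of extreme thread-configuration parameters. This work is not difficult, but it is tedious and the workload is large. I will finish all of it after the hackathon ends.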