[OpenCL]optimize conv3x3 when group==1 #5618
Thanks for your contribution!
int in_w_id2 = in_w_id1 + item_w * stride;
int in_w_id3 = in_w_id2 + item_w * stride;
int in_w_id4 = in_w_id3 + item_w * stride;
int in_h_id = mad24((item_h_id % out_h), stride, (-pad));
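Writing the multiply-add out directly versus explicitly calling mad24: how much performance does this kind of change gain on its own? Have you measured it?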
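I measured this in the unit test and saw no particularly noticeable change. I have not measured it separately on models, so I will test again. The manual recommends preferring mad24 when performance is a concern.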
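For readers less familiar with mad24, here is a minimal host-side sketch in plain C of what the `in_h_id` line computes. `mad24_emulated` and `input_row` are hypothetical names introduced only for illustration and are not part of the kernel. The sketch assumes the OpenCL rule that signed `mad24` operands should lie in [-2^23, 2^23 - 1]. It models only the arithmetic result and says nothing about speed on any device.

```c
#include <assert.h>
#include <limits.h>

/* Hypothetical host-side stand-in for OpenCL's mad24(x, y, z) = x * y + z.
 * OpenCL leaves the result implementation-defined when x or y falls outside
 * the signed 24-bit range, so this sketch rejects such inputs instead. */
int mad24_emulated(int x, int y, int z) {
    assert(x >= -(1 << 23) && x < (1 << 23));
    assert(y >= -(1 << 23) && y < (1 << 23));
    /* Use a 64-bit intermediate so the C arithmetic cannot overflow. */
    long long wide = (long long)x * (long long)y + (long long)z;
    assert(wide >= INT_MIN && wide <= INT_MAX);
    return (int)wide;
}

/* Mirrors the reviewed line:
 *   int in_h_id = mad24((item_h_id % out_h), stride, (-pad));
 * that is, the output row times the stride, shifted up by the padding. */
int input_row(int item_h_id, int out_h, int stride, int pad) {
    return mad24_emulated(item_h_id % out_h, stride, -pad);
}
```

With `out_h = 4`, `stride = 2`, `pad = 1`, work-item row 5 maps to output row 1 and therefore to input row 1, while row 0 maps to input row -1, which lies in the padding.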
for (int w = 0; w < 3; w++) {
  int in_w_val0 = select(in_w_base_id + in_w_id0 + w,
                         -1,
-                        (in_w_id0 + w < 0 || in_w_id0 + w >= in_w));
+                        (in_w_id0 + w < 0 | in_w_id0 + w >= in_w));
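Same as above: how much performance does the bitwise operation gain over the logical OR? Please measure the effect of changing only this spot. If it helps, every select call can be changed the same way.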
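OK, I will measure again directly on models. The models I tested last time were all untuned, so this time I will also add the performance change after tuning. I originally rewrote this as int in_w_val0 = ((in_w_base_id + in_w_id0 + w + 1) & -(in_w_id0 + w >= 0 & in_w_id0 + w < in_w)) - 1. Without the change to the filter implementation, that form improved performance; with the filter change applied, adding it made performance worse.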
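The select form and the mask form quoted above should yield the same index. A minimal plain-C sketch of that equivalence follows. The function names are hypothetical; `base` stands in for `in_w_base_id` and `pos` for `in_w_id0 + w`. It assumes two's-complement integers, so that `-1` is an all-ones mask, and it relies on scalar comparisons evaluating to 0 or 1, which holds in both C and OpenCL C. It checks correctness only, not performance.

```c
#include <assert.h>

/* Select form: OpenCL's scalar select(a, b, c) returns b when c is nonzero,
 * so an out-of-range column maps to the sentinel index -1. */
int index_select(int base, int pos, int width) {
    int out_of_range = (pos < 0) | (pos >= width);
    return out_of_range ? -1 : base + pos;
}

/* Mask form from the comment: in_range is 0 or 1, so -in_range is either 0
 * or an all-ones mask. The +1 / -1 pair turns a masked-out value into -1. */
int index_mask(int base, int pos, int width) {
    int in_range = (pos >= 0) & (pos < width);
    return ((base + pos + 1) & -in_range) - 1;
}

/* Returns 1 when both forms agree for every column position from a few
 * steps left of the row to a few steps right of it, and 0 otherwise. */
int forms_agree(int base, int width) {
    for (int pos = -4; pos < width + 4; ++pos) {
        if (index_select(base, pos, width) != index_mask(base, pos, width)) {
            return 0;
        }
    }
    return 1;
}
```

For example, with `base = 100` and `width = 8`, column 3 gives 103 in both forms, and columns -1 and 8 both give -1.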
Two comments.
test
LGTM
LGTM
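This implementation mainly rearranges the filter layout and makes some other changes. The figure below compares performance before and after the optimization:
All data in the figure was measured on armv7 builds.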