-
Notifications
You must be signed in to change notification settings - Fork 3.5k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[Hexagon] Implement fixed_point_multiply op through intrinsics. #12659
[Hexagon] Implement fixed_point_multiply op through intrinsics. #12659
Conversation
This commit adds high-performance implementation of fixed_point_multiply operation based on Hexagon intrinsics for vmpye/vmpyo instructions. Benchmarking of 'fixed_point_multiply' op with (1,8,56,56,32) input tensor on Qualcomm SM8350: * default implementation: 10.06 ms * optimized implementation: 1.42 ms * speedup: 7x times (!!!) Please note that this is introducing a small round-up error for some corner cases with negative shift argument (The same as for ARM CPU, see PR#5980). This is because we are rounding twice instead than only once: * original q_multiply_shift: round(x*y*2^-s) * hexagon q_multiply_shift: round(round(x*y)*2^-s)
Excellent, thank you for the contribution @ibsidorenko ! |
) | ||
|
||
# Select depending on the shift | ||
return tvm.tir.Select(shift < 0, out_negative_shift, mul_o_2) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The mul_o_2
is just round(x*y)
. There is no shift in it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
hm... Maybe I misunderstood the question, but I put shift separately before mul_o_2:
x = x * (1 << (shift))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry, I misread it as a part of the comment...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks great! Thanks.
…he#12659) This commit adds high-performance implementation of fixed_point_multiply operation based on Hexagon intrinsics for vmpye/vmpyo instructions. Benchmarking of 'fixed_point_multiply' op with (1,8,56,56,32) input tensor on Qualcomm SM8350: * default implementation: 10.06 ms * optimized implementation: 1.42 ms * speedup: 7x times (!!!) Please note that this is introducing a small round-up error for some corner cases with negative shift argument (The same as for ARM CPU, see PR#5980). This is because we are rounding twice instead than only once: * original q_multiply_shift: round(x*y*2^-s) * hexagon q_multiply_shift: round(round(x*y)*2^-s)
This commit adds high-performance implementation of
fixed_point_multiply
operation based on Hexagon intrinsics for vmpye/vmpyo instructions.
Benchmarking of
fixed_point_multiply
op with (1,8,56,56,32) inputtensor on Qualcomm SM8350:
Please note that this is introducing a small round-up error for some
corner cases with negative shift argument (The same as for ARM CPU, see
PR#5980). This is because we are rounding twice instead than only once:
cc @mehrdadh