-
Notifications
You must be signed in to change notification settings - Fork 741
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[SYCL] Improve performance of generic shuffles #3815
Conversation
Previous upper bound considered only the offset, allowing a memcpy for the final chunk to walk off the end of the byte array. Signed-off-by: John Pennycook <john.pennycook@intel.com>
sycl::detail::memcpy is implemented as a loop, resulting in different optimizations than std::memcpy. Signed-off-by: John Pennycook <john.pennycook@intel.com>
Opening this for review now that #3879 has been merged. |
@AlexeySachkov ping. Also, is there anybody else you think should review this? |
@@ -153,9 +153,9 @@ EnableIfGenericBroadcast<T, IdT> GroupBroadcast(T x, IdT local_id) { | |||
char *ResultBytes = reinterpret_cast<char *>(&Result); | |||
auto BroadcastBytes = [=](size_t Offset, size_t Size) { | |||
uint64_t BroadcastX, BroadcastResult; | |||
detail::memcpy(&BroadcastX, XBytes + Offset, Size); | |||
std::memcpy(&BroadcastX, XBytes + Offset, Size); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It look like this breaks the tests due to missing #include <cstring>
. See pre-commit failures for #4153.
@Pennycook, could you fix this ASAP, please?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Sorry about that. Opened #4157 to (hopefully) fix it.
The changes in #3815 introduced calls to std::memcpy without including <cstring>, causing some tests to fail. Signed-off-by: John Pennycook <john.pennycook@intel.com>
…ackend_plugin * upstream/sycl: (26 commits) [SPIR-V][NFC] Move non-upstreamed FuncParam decorations into internal:: (intel#4138) [SYCL] Move free function queries to experimental namespace (intel#4090) [SYCL][XPTI] Enable PI calls notifications with arguments (intel#4148) [SYCL] Revert queue::wait() to its old behaviour with Level Zero (intel#4153) [SYCL] Add missing <cstring> header to spirv.hpp (intel#4157) [SYCL] Adds info query for atomic_memory_order_capabilities on device and context (intel#4105) [SYCL] Improve performance of generic shuffles (intel#3815) [SYCL] Fix the error with namespaces caused during rebase of intel#4014 (intel#4151) [ESIMD] Fix 'ambiguous operator' error with length 1 simd operands (intel#4149) [libdevice][NFC] Fix libdevice dependencies list (intel#4130) [SPIR-V] Reland Encode debug info producer in SPIR-V (intel#4082) [SYCL][ROCm] Add ROCm support to get_device_count_by_type (intel#4113) [SYCL] Fix sRGB device info (intel#4145) [SYCL][ROCm] Fix kernel launch with multiple dimensions (intel#4063) [SYCL][ROCm] Fix compilation for AMD GPU with -fsycl-dead-args-optimization (intel#4126) [SYCL][Level Zero] Enable multi-CCS support. (intel#4038) [SYCL] Pass bound arch to unbundler (intel#4112) [ESIMD][doc] Added documentation for some ESIMD math APIs (intel#3995) [ESIMD] rename gather4/scatter4 to gather_rgba/scatter_rgba (intel#4120) [SYCL][NFC] Remove unused variable. (intel#4131) ...
The previous upper bound of the loop in GenericCall considered only the offset,
allowing a memcpy for the final chunk to walk off the end of the byte array.
sycl::detail::memcpy is implemented as a loop, resulting in different
optimizations than std::memcpy.
These two changes improve the performance of generic shuffles by approximately
2x for common use-cases, such as a struct of 3 floats.
Signed-off-by: John Pennycook john.pennycook@intel.com