[SYCL] Improve performance of generic shuffles #3815

Pennycook · 2021-05-25T17:43:56Z

The previous upper bound of the loop in GenericCall considered only the offset,
allowing a memcpy for the final chunk to walk off the end of the byte array.
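
To illustrate the bound fix, here is a minimal host-only sketch, not the actual GenericCall code. `ChunkT`, `ForEachChunk`, and `ChunksFor` are hypothetical names, and the tail is handled as one piece for brevity. The point is only that the loop condition has to account for the chunk width as well as the offset.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Hypothetical chunk type; the real shuffle chunk width depends on the target.
using ChunkT = std::uint64_t;

// Calls Apply(Offset, Size) for each piece of a TotalSize-byte object.
// The loop condition checks Offset + sizeof(ChunkT), not just Offset, so a
// full-width chunk is only issued when it fits entirely inside the object.
template <typename Functor>
void ForEachChunk(std::size_t TotalSize, const Functor &Apply) {
  std::size_t Offset = 0;
  for (; Offset + sizeof(ChunkT) <= TotalSize; Offset += sizeof(ChunkT))
    Apply(Offset, sizeof(ChunkT));
  // Remaining tail (fewer than sizeof(ChunkT) bytes), if any.
  if (Offset < TotalSize)
    Apply(Offset, TotalSize - Offset);
}

// Records the (Offset, Size) pairs visited and checks that no piece runs
// past the end of the object.
inline std::vector<std::pair<std::size_t, std::size_t>>
ChunksFor(std::size_t TotalSize) {
  std::vector<std::pair<std::size_t, std::size_t>> Chunks;
  ForEachChunk(TotalSize, [&](std::size_t Offset, std::size_t Size) {
    assert(Offset + Size <= TotalSize);
    Chunks.emplace_back(Offset, Size);
  });
  return Chunks;
}
```

For a 12-byte struct of 3 floats, `ChunksFor(12)` visits (0, 8) and (8, 4). A condition of `Offset < 12` would instead issue a second 8-byte access at offset 8, reading 4 bytes past the end.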

sycl::detail::memcpy is implemented as a loop, resulting in different
optimizations than std::memcpy.
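
As a rough illustration of the difference, the sketch below contrasts a byte-by-byte copy loop with std::memcpy. `LoopMemcpy` is a hypothetical stand-in and is not the actual sycl::detail::memcpy; `LoadChunkLoop` and `LoadChunkStd` are likewise made up for this example. Both produce the same bytes; what changes is how easily the compiler can recognize the copy and lower it to a single load or store.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical loop-based copy, standing in for a hand-written memcpy.
inline void LoopMemcpy(void *Dst, const void *Src, std::size_t N) {
  auto *D = static_cast<unsigned char *>(Dst);
  const auto *S = static_cast<const unsigned char *>(Src);
  for (std::size_t I = 0; I < N; ++I)
    D[I] = S[I];
}

// Loads Size bytes at Bytes + Offset into a zeroed 64-bit chunk (loop copy).
inline std::uint64_t LoadChunkLoop(const char *Bytes, std::size_t Offset,
                                   std::size_t Size) {
  std::uint64_t Chunk = 0;
  assert(Size <= sizeof(Chunk));
  LoopMemcpy(&Chunk, Bytes + Offset, Size);
  return Chunk;
}

// Same operation using std::memcpy, which compilers treat as a builtin.
inline std::uint64_t LoadChunkStd(const char *Bytes, std::size_t Offset,
                                  std::size_t Size) {
  std::uint64_t Chunk = 0;
  assert(Size <= sizeof(Chunk));
  std::memcpy(&Chunk, Bytes + Offset, Size);
  return Chunk;
}
```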

These two changes improve the performance of generic shuffles by approximately
2x for common use-cases, such as a struct of 3 floats.

Signed-off-by: John Pennycook <john.pennycook@intel.com>

Pennycook · 2021-07-08T13:25:05Z

Opening this for review now that #3879 has been merged.

Pennycook · 2021-07-13T21:15:05Z

@AlexeySachkov ping. Also, is there anybody else you think should review this?

bader · 2021-07-21T16:46:28Z

sycl/include/CL/sycl/detail/spirv.hpp

@@ -153,9 +153,9 @@ EnableIfGenericBroadcast<T, IdT> GroupBroadcast(T x, IdT local_id) {
  char *ResultBytes = reinterpret_cast<char *>(&Result);
  auto BroadcastBytes = [=](size_t Offset, size_t Size) {
    uint64_t BroadcastX, BroadcastResult;
-    detail::memcpy(&BroadcastX, XBytes + Offset, Size);
+    std::memcpy(&BroadcastX, XBytes + Offset, Size);


It looks like this breaks the tests due to a missing #include <cstring>. See pre-commit failures for #4153.
@Pennycook, could you fix this ASAP, please?

Pennycook · 2021-07-21

Sorry about that. Opened #4157 to (hopefully) fix it.

The changes in #3815 introduced calls to std::memcpy without including <cstring>, causing some tests to fail. Signed-off-by: John Pennycook <john.pennycook@intel.com>
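
A minimal host-only sketch of the pattern involved, assuming nothing about the real spirv.hpp beyond the diff above: a header that calls std::memcpy should include <cstring> itself instead of relying on another header to pull it in. `Float3`, `FakeBroadcast`, and `RoundTrip` are hypothetical names; `FakeBroadcast` simply returns its input in place of a real group broadcast.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring> // declares std::memcpy; included directly, not transitively
#include <type_traits>

struct Float3 {
  float X, Y, Z;
};

// Hypothetical stand-in for a group broadcast: returns the value unchanged.
inline std::uint64_t FakeBroadcast(std::uint64_t V) { return V; }

// Copies In to a new object by passing it through 64-bit chunks, in the
// style of the BroadcastBytes lambda from the diff.
template <typename T> T RoundTrip(const T &In) {
  static_assert(std::is_trivially_copyable<T>::value,
                "byte-wise copies require a trivially copyable type");
  T Result;
  const char *InBytes = reinterpret_cast<const char *>(&In);
  char *ResultBytes = reinterpret_cast<char *>(&Result);
  for (std::size_t Offset = 0; Offset < sizeof(T);
       Offset += sizeof(std::uint64_t)) {
    // Clamp the final piece so the copy never runs past the object.
    std::size_t Remaining = sizeof(T) - Offset;
    std::size_t Size = Remaining < sizeof(std::uint64_t)
                           ? Remaining
                           : sizeof(std::uint64_t);
    std::uint64_t Chunk = 0;
    assert(Size <= sizeof(Chunk));
    std::memcpy(&Chunk, InBytes + Offset, Size);
    Chunk = FakeBroadcast(Chunk);
    std::memcpy(ResultBytes + Offset, &Chunk, Size);
  }
  return Result;
}
```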

Pennycook added 2 commits May 25, 2021 13:23

[SYCL] Fix upper bound in GenericCall

ed02536

Previous upper bound considered only the offset, allowing a memcpy for the final chunk to walk off the end of the byte array. Signed-off-by: John Pennycook <john.pennycook@intel.com>

[SYCL] Replace detail::memcpy with std::memcpy

54eace0

sycl::detail::memcpy is implemented as a loop, resulting in different optimizations than std::memcpy. Signed-off-by: John Pennycook <john.pennycook@intel.com>

Pennycook added the enhancement and performance labels May 25, 2021

Pennycook mentioned this pull request May 25, 2021

Support std::memcpy or improve detail::memcpy #3816

Closed

Pennycook marked this pull request as ready for review July 8, 2021 13:24

Pennycook requested a review from AlexeySachkov as a code owner July 8, 2021 13:24

AlexeySachkov approved these changes Jul 21, 2021

bader merged commit fb08adf into intel:sycl Jul 21, 2021

bader reviewed Jul 21, 2021

bader mentioned this pull request Jul 21, 2021

[SYCL] Revert queue::wait() to its old behaviour with Level Zero #4153

Merged

Pennycook mentioned this pull request Jul 21, 2021

[SYCL] Add missing <cstring> header to spirv.hpp #4157

Merged
