Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[SYCL] Improve performance of generic shuffles #3815

Merged
merged 2 commits into from
Jul 21, 2021

Conversation

Pennycook
Copy link
Contributor

The previous upper bound of the loop in GenericCall considered only the offset,
allowing a memcpy for the final chunk to walk off the end of the byte array.

sycl::detail::memcpy is implemented as a loop, resulting in different
optimizations than std::memcpy.

These two changes improve the performance of generic shuffles by approximately
2x for common use-cases, such as a struct of 3 floats.

Signed-off-by: John Pennycook john.pennycook@intel.com

Previous upper bound considered only the offset, allowing a memcpy
for the final chunk to walk off the end of the byte array.

Signed-off-by: John Pennycook <john.pennycook@intel.com>
sycl::detail::memcpy is implemented as a loop, resulting in different
optimizations than std::memcpy.

Signed-off-by: John Pennycook <john.pennycook@intel.com>
@Pennycook Pennycook added enhancement New feature or request performance Performance related issues labels May 25, 2021
@Pennycook Pennycook marked this pull request as ready for review July 8, 2021 13:24
@Pennycook
Copy link
Contributor Author

Opening this for review now that #3879 has been merged.

@Pennycook
Copy link
Contributor Author

@AlexeySachkov ping. Also, is there anybody else you think should review this?

@bader bader merged commit fb08adf into intel:sycl Jul 21, 2021
@@ -153,9 +153,9 @@ EnableIfGenericBroadcast<T, IdT> GroupBroadcast(T x, IdT local_id) {
char *ResultBytes = reinterpret_cast<char *>(&Result);
auto BroadcastBytes = [=](size_t Offset, size_t Size) {
uint64_t BroadcastX, BroadcastResult;
detail::memcpy(&BroadcastX, XBytes + Offset, Size);
std::memcpy(&BroadcastX, XBytes + Offset, Size);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It look like this breaks the tests due to missing #include <cstring>. See pre-commit failures for #4153.
@Pennycook, could you fix this ASAP, please?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Sorry about that. Opened #4157 to (hopefully) fix it.

bader pushed a commit that referenced this pull request Jul 21, 2021
The changes in #3815 introduced calls to std::memcpy without including
<cstring>, causing some tests to fail.

Signed-off-by: John Pennycook <john.pennycook@intel.com>
alexbatashev added a commit to alexbatashev/llvm that referenced this pull request Jul 22, 2021
…ackend_plugin

* upstream/sycl: (26 commits)
  [SPIR-V][NFC] Move non-upstreamed FuncParam decorations into internal:: (intel#4138)
  [SYCL] Move free function queries to experimental namespace (intel#4090)
  [SYCL][XPTI] Enable PI calls notifications with arguments (intel#4148)
  [SYCL] Revert queue::wait() to its old behaviour with Level Zero (intel#4153)
  [SYCL] Add missing <cstring> header to spirv.hpp (intel#4157)
  [SYCL] Adds info query for atomic_memory_order_capabilities on device and context (intel#4105)
  [SYCL] Improve performance of generic shuffles (intel#3815)
  [SYCL] Fix the error with namespaces caused during rebase of intel#4014 (intel#4151)
  [ESIMD] Fix 'ambiguous operator' error with length 1 simd operands (intel#4149)
  [libdevice][NFC] Fix libdevice dependencies list (intel#4130)
  [SPIR-V] Reland Encode debug info producer in SPIR-V (intel#4082)
  [SYCL][ROCm] Add ROCm support to get_device_count_by_type (intel#4113)
  [SYCL] Fix sRGB device info (intel#4145)
  [SYCL][ROCm] Fix kernel launch with multiple dimensions (intel#4063)
  [SYCL][ROCm] Fix compilation for AMD GPU with -fsycl-dead-args-optimization (intel#4126)
  [SYCL][Level Zero] Enable multi-CCS support. (intel#4038)
  [SYCL] Pass bound arch to unbundler (intel#4112)
  [ESIMD][doc] Added documentation for some ESIMD math APIs (intel#3995)
  [ESIMD] rename gather4/scatter4 to gather_rgba/scatter_rgba (intel#4120)
  [SYCL][NFC] Remove unused variable. (intel#4131)
  ...
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request performance Performance related issues
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants