Copy (Test) Improvements, main branch (2024.02.29.) #270

krasznaa · 2024-02-29T13:07:06Z

Managed to hit "Enter" with my pinkie a bit too early... 😦

This is a follow-up from #268.

First off, I made the "SoA copy tests" behave in the same way as the "general copy tests" do since Copy Test(s) Rewrite, main branch (2024.02.21.) #268. Which also allowed me to start testing the asynchronous copy types in the tests.
After finding a bunch of new synchronization issues, and still having problems with finding the last problem(s), I made "copy events" a little smarter.
- Introduced the vecmem::abstract_event::ignore() function, to allow clients to explicitly ignore a given synchronization event in their code if they want to. (In some rare cases this can be useful.)
- Made both ::sycl_event and ::cuda_event implicitly wait for their underlying events in their destructors, if the user didn't do so explicitly.
- Since the latter (not explicitly waiting on synchronization events) can often be a bug in one's code, leading to performance degradation, introduced the VECMEM_FAIL_ON_ASYNC_ERRORS CMake option for building the project in a way that it would call std::terimate() in case it detects such an error. After considering throwing an exception at first, I went with calling std::terminate() in the end, because throwing exceptions in destructors "cleanly" is just a lot of hassle. I still needed to do something "drastic", since the way that we build the project, with pretty aggressive symbol hiding, there is no easy way of debugging such issues in a client project easily without the code doing something "drastic". 🤔
With all this in place, fixed all of the issues with asynchronous copying in the project that I could find.
- It lead to vecmem::copy becoming yet a bit more complicated. In order to avoid "internal waits" in its functions as much as possible.

With all of this in place, all copy related issues are gone from my tests now... 🤞

krasznaa

Unfortunately the code created a "SEH exception" in the CUDA tests on Windows. I'll have to hunt that down before this would go in... 🤔

cuda/src/utils/cuda/async_copy.cpp

krasznaa · 2024-03-01T12:19:32Z

The "SEH exceptions" on Windows turned out to come from this:

https://forums.developer.nvidia.com/t/accessing-managed-memory-during-asynchronous-copies/

So I just disabled some tests on Windows...

All of the "copy tests" are now instantiated in the same way, inside of single compilation units.

Unfortunately some issues still remain, which I'm in the process of finding / fixing.

Made the sycl_event and cuda_event types implicitly wait for their underlying events in their destructors, if the user did not wait for them explicitly. This is to avoid 99% of the asynchronous errors that I encountered during debugging. At the same time also made it possible to explicitly ignore such events, for the rare case where it may be needed. Finally, introduced VECMEM_FAIL_ON_ASYNC_ERRORS for building the project in a mode where asynchronous errors (users not explicitly ignoring or waiting for an event) would cause the program to terminate.

…ements".

It seems that the implementation of CUDA managed memory is a bit more fragile on Windows than on Linux.

krasznaa · 2024-03-04T09:34:24Z

With the conclusions coming out of

https://forums.developer.nvidia.com/t/accessing-managed-memory-during-asynchronous-copies/

I considered for a bit if the vecmem::copy code could potentially detect when it can and can't do certain copy operations asynchronously. But in the end, that would be super difficult to do. (The "copy code" should not know what type of memory resources it is dealing with.) So let's just stay with disabling some of the tests on Windows, and then remembering this limitation in case some user stumbles upon it in the future.

krasznaa changed the title ~~Copy (Test) Improvements, main branch~~ Copy (Test) Improvements, main branch (2024.02.29.) Feb 29, 2024

krasznaa added bug Something isn't working tests This issue or pull request is related to the test suite improvement Improve an existing feature labels Feb 29, 2024

krasznaa commented Mar 1, 2024

View reviewed changes

cuda/src/utils/cuda/async_copy.cpp Show resolved Hide resolved

krasznaa mentioned this pull request Mar 1, 2024

Add CMake Folders, main branch (2024.03.01.) #271

Merged

krasznaa added 5 commits March 4, 2024 10:08

Harmonized / merged the copy tests.

9b44464

All of the "copy tests" are now instantiated in the same way, inside of single compilation units.

Improved the support for asynchronous SoA operations in vecmem::copy.

6a4731d

Unfortunately some issues still remain, which I'm in the process of finding / fixing.

Fixed remaining synhronization issues identified by the "event improv…

d8083af

…ements".

Disabling some test setups on Windows.

1007af4

It seems that the implementation of CUDA managed memory is a bit more fragile on Windows than on Linux.

krasznaa force-pushed the CopyTestSync-main-20240222 branch from 8c237f1 to 1007af4 Compare March 4, 2024 09:30

krasznaa merged commit 12c6a59 into acts-project:main Mar 4, 2024
22 checks passed

krasznaa deleted the CopyTestSync-main-20240222 branch March 4, 2024 09:53

krasznaa mentioned this pull request Jul 16, 2024

Mark copy events as nodiscard #286

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Copy (Test) Improvements, main branch (2024.02.29.) #270

Copy (Test) Improvements, main branch (2024.02.29.) #270

krasznaa commented Feb 29, 2024 •

edited

Loading

krasznaa left a comment

krasznaa commented Mar 1, 2024

krasznaa commented Mar 4, 2024

Copy (Test) Improvements, main branch (2024.02.29.) #270

Copy (Test) Improvements, main branch (2024.02.29.) #270

Conversation

krasznaa commented Feb 29, 2024 • edited Loading

krasznaa left a comment

Choose a reason for hiding this comment

krasznaa commented Mar 1, 2024

krasznaa commented Mar 4, 2024

krasznaa commented Feb 29, 2024 •

edited

Loading