Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Copy (Test) Improvements, main branch (2024.02.29.) #270

Merged
merged 5 commits into from
Mar 4, 2024

Conversation

krasznaa
Copy link
Member

@krasznaa krasznaa commented Feb 29, 2024

Managed to hit "Enter" with my pinkie a bit too early... 😦

This is a follow-up from #268.

  • First off, I made the "SoA copy tests" behave in the same way as the "general copy tests" do since Copy Test(s) Rewrite, main branch (2024.02.21.) #268. Which also allowed me to start testing the asynchronous copy types in the tests.
  • After finding a bunch of new synchronization issues, and still having problems with finding the last problem(s), I made "copy events" a little smarter.
    • Introduced the vecmem::abstract_event::ignore() function, to allow clients to explicitly ignore a given synchronization event in their code if they want to. (In some rare cases this can be useful.)
    • Made both ::sycl_event and ::cuda_event implicitly wait for their underlying events in their destructors, if the user didn't do so explicitly.
    • Since the latter (not explicitly waiting on synchronization events) can often be a bug in one's code, leading to performance degradation, introduced the VECMEM_FAIL_ON_ASYNC_ERRORS CMake option for building the project in a way that it would call std::terimate() in case it detects such an error. After considering throwing an exception at first, I went with calling std::terminate() in the end, because throwing exceptions in destructors "cleanly" is just a lot of hassle. I still needed to do something "drastic", since the way that we build the project, with pretty aggressive symbol hiding, there is no easy way of debugging such issues in a client project easily without the code doing something "drastic". 🤔
  • With all this in place, fixed all of the issues with asynchronous copying in the project that I could find.
    • It lead to vecmem::copy becoming yet a bit more complicated. In order to avoid "internal waits" in its functions as much as possible.

With all of this in place, all copy related issues are gone from my tests now... 🤞

@krasznaa krasznaa changed the title Copy (Test) Improvements, main branch Copy (Test) Improvements, main branch (2024.02.29.) Feb 29, 2024
@krasznaa krasznaa added bug Something isn't working tests This issue or pull request is related to the test suite improvement Improve an existing feature labels Feb 29, 2024
Copy link
Member Author

@krasznaa krasznaa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unfortunately the code created a "SEH exception" in the CUDA tests on Windows. I'll have to hunt that down before this would go in... 🤔

cuda/src/utils/cuda/async_copy.cpp Show resolved Hide resolved
@krasznaa
Copy link
Member Author

krasznaa commented Mar 1, 2024

The "SEH exceptions" on Windows turned out to come from this:

https://forums.developer.nvidia.com/t/accessing-managed-memory-during-asynchronous-copies/

So I just disabled some tests on Windows...

krasznaa added 5 commits March 4, 2024 10:08
All of the "copy tests" are now instantiated in the same way,
inside of single compilation units.
Unfortunately some issues still remain, which I'm in the process of
finding / fixing.
Made the sycl_event and cuda_event types implicitly wait for their
underlying events in their destructors, if the user did not wait
for them explicitly. This is to avoid 99% of the asynchronous
errors that I encountered during debugging.

At the same time also made it possible to explicitly ignore such
events, for the rare case where it may be needed.

Finally, introduced VECMEM_FAIL_ON_ASYNC_ERRORS for building the
project in a mode where asynchronous errors (users not explicitly
ignoring or waiting for an event) would cause the program to
terminate.
It seems that the implementation of CUDA managed memory is
a bit more fragile on Windows than on Linux.
@krasznaa krasznaa force-pushed the CopyTestSync-main-20240222 branch from 8c237f1 to 1007af4 Compare March 4, 2024 09:30
@krasznaa
Copy link
Member Author

krasznaa commented Mar 4, 2024

With the conclusions coming out of

https://forums.developer.nvidia.com/t/accessing-managed-memory-during-asynchronous-copies/

I considered for a bit if the vecmem::copy code could potentially detect when it can and can't do certain copy operations asynchronously. But in the end, that would be super difficult to do. (The "copy code" should not know what type of memory resources it is dealing with.) So let's just stay with disabling some of the tests on Windows, and then remembering this limitation in case some user stumbles upon it in the future.

@krasznaa krasznaa merged commit 12c6a59 into acts-project:main Mar 4, 2024
22 checks passed
@krasznaa krasznaa deleted the CopyTestSync-main-20240222 branch March 4, 2024 09:53
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working improvement Improve an existing feature tests This issue or pull request is related to the test suite
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant