feat: Cuda Plugin Improvements, master branch (2020.08.26.) #398
Conversation
…points at once. This was done with a **lot** of different changes, which were developed in a separate branch. This is just a cleaned up version of all of those developments. The code now includes the ability to use CUDA streams, and now manages CUDA device memory using its own manager class (Acts::Cuda::MemoryManager).
Mainly to be able to specify which CUDA device to run on, and how much memory to use from that device.
Forgot to ping a few people... @czangela, @cgleggett, @vpascuzz, @beomki-yeo, @XiaocongAi.
Codecov Report

```diff
@@            Coverage Diff             @@
##           master     #398      +/-   ##
==========================================
- Coverage   49.18%   49.16%   -0.02%
==========================================
  Files         329      329
  Lines       16191    16179      -12
  Branches     7488     7484       -4
==========================================
- Hits         7964     7955       -9
+ Misses       2951     2950       -1
+ Partials     5276     5274       -2
```

Continue to review full report at Codecov.
I was hoping, when I opened this PR at first, that I would have more time to spend on this. But I didn't. 😦 I tried to make the code even smarter a number of weeks ago, but that just didn't go anywhere... So... Could we come back to merging this in? Unfortunately I'm not super familiar with the GitHub interface. Is my feature branch still compatible with the master branch, or do I need to resolve some conflicts by now? @XiaocongAi, whenever you organise the next Acts parallelisation meeting, I would be happy to talk about this code a bit.
Hi @krasznaa, I finally got to take a look at this PR. The memory manager looks great to me! I think we should get it in.
Regarding the talk, sure, I will put you in the list then. Thank you!
```cpp
};

/// Object holding information about memory allocations on all devices
std::vector<DeviceMemory> m_memory;
```
My understanding is that the vector index is implicitly used as the device id, right? Is it possible to use a `std::map` instead? But I guess if the device id is a small value, the current approach might be even better.
Indeed. Since CUDA assigns integer numbers starting from 0 to the available GPUs, this seemed the most appropriate. Note that the `DeviceMemory` object itself is tiny (as long as no memory allocation is made on a given GPU). So even if we have, let's say, 8 GPUs in a system and only use the 8th one for Acts, the overhead of creating 7 dummy `DeviceMemory` objects would be tiny.

Since the memory manager will use this variable very often, I would really want to avoid using an associative container if at all possible. So I'd really like to just stay with the `std::vector<...>` implementation. 😉
I hit the update button and that went through without conflicts. If the CI succeeds, I think we can merge.
```diff
@@ -30,10 +32,12 @@ template <typename external_spacepoint_t>
 SeedFinder<external_spacepoint_t>::SeedFinder(
     Acts::SeedfinderConfig<external_spacepoint_t> commonConfig,
     const SeedFilterConfig& seedFilterConfig,
-    const TripletFilterConfig& tripletFilterConfig)
+    const TripletFilterConfig& tripletFilterConfig, int device,
+    Acts::Logging::Level loggerLevel)
```
Can you consider accepting a logger instance here, and storing it as a member variable? You can then default it to `Acts::getDefaultLogger`. This way, other logging backends (like the Athena logging, for example) can potentially be passed in.
Sure thing. I'll do that later today.
If you look around, the pattern we usually use is to accept a `std::unique_ptr<Logger>`, store it as a member, and then provide a `const Logger& logger()` method that the macros call. But I'm sure you figured that out anyway 😉
Done. Please have a look. 😄
```cpp
MemoryManager& MemoryManager::instance() {
  static MemoryManager mm;
  return mm;
}
```
I assume there's no point in having multiple memory managers?
The reason I chose a singleton design is to not have to pass a memory manager object explicitly to every single `Acts::Cuda::device_array<...>(...)` object that the code creates. And yes, the idea would be that we should centralise all memory allocations/de-allocations. So the singleton design would seem to fit the use case quite nicely.

But I definitely don't claim that this would be some great implementation of a CUDA memory manager. I would like to make this code a lot more advanced later on. Using code inspired by (stolen from... 😛) Allen, for instance. 😉
Ok. I don't think the singleton pattern is problematic here, for now at least.
…ject#398)

* Made the CUDA seed finding process triplets for multiple middle spacepoints at once. This was done with a **lot** of different changes, which were developed in a separate branch. This is just a cleaned up version of all of those developments. The code now includes the ability to use CUDA streams, and now manages CUDA device memory using its own manager class (Acts::Cuda::MemoryManager).
* Taught the unit test about some of the new plugin features. Mainly to be able to specify which CUDA device to run on, and how much memory to use from that device.
* Updated Acts::Cuda::SeedFinder to allow the user to give it a custom logger.
* Added an explicit specification for which CUDA device should be used. It was left out of the code by mistake so far...

Co-authored-by: Andreas Salzburger <Andreas.Salzburger@cern.ch>
Co-authored-by: Paul Gessinger <paul.gessinger@cern.ch>
Co-authored-by: robertlangenberg <56069170+robertlangenberg@users.noreply.github.com>
As we discussed in #371, there was quite a lot of room for improvement in my implementation of the seed finding algorithm.
Let me jump right to the conclusion. Compared with the results of the current/old code (see #371 for those numbers), I can run the updated code with the following results:
As you can see, things turned out exactly how we imagined them. While for "small numbers" of spacepoints this updated algorithm became much faster than the current one in the repository, for large numbers its performance is actually slightly worse than with the current algorithm.
Now... this PR is more meant as a reference for now, as I plan to discuss some of my observations in either an Acts or an ATLAS meeting in the next weeks. But in summary, I did the following:

* Taught the `Acts::Cuda::device_array<T>` and `Acts::Cuda::host_array<T>` types not to use CUDA memory (de-)allocation calls directly anymore. I found that memory deallocation in particular was adding a huge overhead in my updated code.
* Introduced `Acts::Cuda::MemoryManager`, which allocates a big blob of memory in one go, and then hands out parts of that to `Acts::Cuda::make_device_array<...>(...)` calls. This "memory manager" is super trivial. It just assumes that for "one calculation" you need to keep adding memory, until you're done with the calculation. So it doesn't handle memory deallocations; it can just be told to start reusing its memory block from scratch.

But just to demonstrate the difference in the implementation a bit, this is what the current/old code does to find all the seeds in the first 2 spacepoint groups of my `sp2.txt` test file:

And this is how that looks with the new implementation:
But more discussion on this should really go to an actual meeting... 😉