
feat: Cuda Plugin Improvements, master branch (2020.08.26.) #398

Merged

Conversation

krasznaa
Member

As we discussed in #371, there was a lot of room for improvement in my implementation of the seed finding algorithm.

Let me jump right to the conclusion. Compared with the results of the current/old code (see #371 for those numbers), I can run the updated code with the following results:

[bash][Elrond]:build-new > ./bin/ActsUnitTestSeedfinderCuda2 -f ../seeds/sp1.txt  
Read 85808 spacepoints from file: ../seeds/sp1.txt
Allocating 1583 MB memory on device:
 /-- Device ID 0 -------------------------------\
 | Name: GeForce GTX 960                        |
 | Max. threads per block: 1024                 |
 | Concurrent kernels: true                     |
 | Total memory: 1979.69 MB                     |
 \----------------------------------------------/
Done with the seedfinding on the host
Done with the seedfinding on the device

-------------------------- Results ---------------------------
|          |     Host     |    Device    | Speedup/agreement |
--------------------------------------------------------------
| Time [s] |    0.783000  |       0.239  |       3.276151    |
|   Seeds  |       19759  |       19759  |      99.994939    |
--------------------------------------------------------------

[bash][Elrond]:build-new > ./bin/ActsUnitTestSeedfinderCuda2 -f ../seeds/sp2.txt 
Read 158051 spacepoints from file: ../seeds/sp2.txt
Allocating 1583 MB memory on device:
 /-- Device ID 0 -------------------------------\
 | Name: GeForce GTX 960                        |
 | Max. threads per block: 1024                 |
 | Concurrent kernels: true                     |
 | Total memory: 1979.69 MB                     |
 \----------------------------------------------/
Done with the seedfinding on the host
Done with the seedfinding on the device

-------------------------- Results ---------------------------
|          |     Host     |    Device    | Speedup/agreement |
--------------------------------------------------------------
| Time [s] |    4.918000  |       1.601  |       3.071830    |
|   Seeds  |       80638  |       80637  |      99.869788    |
--------------------------------------------------------------

[bash][Elrond]:build-new > ./bin/ActsUnitTestSeedfinderCuda2 -f ../seeds/sp5.txt -n 100
Read 342211 spacepoints from file: ../seeds/sp5.txt
Allocating 1583 MB memory on device:
 /-- Device ID 0 -------------------------------\
 | Name: GeForce GTX 960                        |
 | Max. threads per block: 1024                 |
 | Concurrent kernels: true                     |
 | Total memory: 1979.69 MB                     |
 \----------------------------------------------/
Done with the seedfinding on the host
Done with the seedfinding on the device

-------------------------- Results ---------------------------
|          |     Host     |    Device    | Speedup/agreement |
--------------------------------------------------------------
| Time [s] |   81.715000  |      43.668  |       1.871279    |
|   Seeds  |      124614  |      124614  |      99.744812    |
--------------------------------------------------------------

[bash][Elrond]:build-new >

As you can see, things turned out exactly as we imagined. While for "small numbers" of spacepoints this updated algorithm is much faster than the current one in the repository, for large numbers its performance is actually slightly worse than that of the current algorithm.

Now... this PR is meant more as a reference at this point, as I plan to discuss some of my observations in either an Acts or an ATLAS meeting in the coming weeks. But in summary, I did the following:

  • As discussed earlier, the triplet search and filtering is now done on as many middle spacepoints at once as can fit into the memory of the NVIDIA device in use. (The amount of memory used can be controlled from the command line of the test executable.)
  • I updated the implementation of the Acts::Cuda::device_array<T> and Acts::Cuda::host_array<T> types so that they no longer use CUDA memory (de-)allocation calls directly. I found that memory deallocation in particular was adding a huge overhead in my updated code.
    • I introduced a singleton class called Acts::Cuda::MemoryManager that allocates one big blob of memory in one go, and then hands out parts of it to Acts::Cuda::make_device_array<...>(...) calls. This "memory manager" is intentionally trivial: it assumes that during "one calculation" you only ever need to keep adding memory, until you are done with the calculation. So it doesn't handle memory deallocations; it can only be told to start reusing its memory block from scratch. (A minimal sketch of this idea follows the list.)
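
Just to make that design a bit more concrete, below is a minimal sketch of such a "bump allocator" style device memory manager. This is only an illustration of the idea described above; the class and function names are made up, and the actual Acts::Cuda::MemoryManager interface may differ.

#include <cuda_runtime.h>

#include <cstddef>
#include <new>
#include <stdexcept>

class BumpMemoryManager {
 public:
  /// Meyers singleton, so that callers don't need to pass the manager around
  static BumpMemoryManager& instance() {
    static BumpMemoryManager mm;
    return mm;
  }

  /// Allocate the one big blob of device memory up front
  void setMemorySize(std::size_t bytes) {
    if (m_begin != nullptr) {
      cudaFree(m_begin);
    }
    if (cudaMalloc(&m_begin, bytes) != cudaSuccess) {
      throw std::runtime_error("Failed to allocate device memory");
    }
    m_size = bytes;
    m_next = 0;
  }

  /// Hand out the next chunk of the blob. There is no per-allocation
  /// cudaMalloc, and no deallocation at all during a calculation.
  void* allocate(std::size_t bytes) {
    static constexpr std::size_t ALIGN = 256;  // keep all chunks well aligned
    const std::size_t offset = ((m_next + ALIGN - 1) / ALIGN) * ALIGN;
    if (offset + bytes > m_size) {
      throw std::bad_alloc();
    }
    m_next = offset + bytes;
    return static_cast<char*>(m_begin) + offset;
  }

  /// Start re-using the blob from scratch for the next calculation
  void reset() { m_next = 0; }

  ~BumpMemoryManager() {
    if (m_begin != nullptr) {
      cudaFree(m_begin);
    }
  }

 private:
  BumpMemoryManager() = default;

  void* m_begin = nullptr;  ///< Start of the device memory blob
  std::size_t m_size = 0;   ///< Total size of the blob in bytes
  std::size_t m_next = 0;   ///< Offset of the first unused byte
};

With something like this in place, a hypothetical make_device_array<T>(n) helper would just call BumpMemoryManager::instance().allocate(n * sizeof(T)) and wrap the returned pointer in a non-owning array type.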

But just to demonstrate the difference in the implementation a bit, this is what the current/old code does to find all the seeds in the first 2 spacepoint groups of my sp2.txt test file:

[Screenshot 2020-08-26 17:37:39]

And this is how that looks with the new implementation:

[Screenshot 2020-08-26 17:38:12]

But more discussion on this should really go to an actual meeting... 😉

Made the CUDA seed finding process triplets for multiple middle spacepoints at once.

This was done with a **lot** of different changes, which were developed in
a separate branch. This is just a cleaned up version of all of those
developments.

The code now includes the ability to use CUDA streams, and now manages
CUDA device memory using its own manager class (Acts::Cuda::MemoryManager).
Mainly to be able to specify which CUDA device to run on, and
how much memory to use from that device.
@krasznaa
Member Author

Forgot to ping a few people...

@czangela, @cgleggett, @vpascuzz, @beomki-yeo, @XiaocongAi.

@codecov

codecov bot commented Aug 26, 2020

Codecov Report

Merging #398 into master will decrease coverage by 0.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #398      +/-   ##
==========================================
- Coverage   49.18%   49.16%   -0.02%     
==========================================
  Files         329      329              
  Lines       16191    16179      -12     
  Branches     7488     7484       -4     
==========================================
- Hits         7964     7955       -9     
+ Misses       2951     2950       -1     
+ Partials     5276     5274       -2     
Impacted Files Coverage Δ
Core/include/Acts/EventData/MeasurementHelpers.hpp 50.00% <0.00%> (-3.34%) ⬇️
...nclude/Acts/TrackFinding/CKFSourceLinkSelector.hpp 40.00% <0.00%> (-1.94%) ⬇️
...ore/include/Acts/Geometry/GeometryHierarchyMap.hpp 58.57% <0.00%> (-0.59%) ⬇️
Core/include/Acts/TrackFitting/KalmanFitter.hpp 37.67% <0.00%> (-0.44%) ⬇️
...de/Acts/TrackFinding/CombinatorialKalmanFilter.hpp 28.52% <0.00%> (-0.22%) ⬇️
Core/src/Utilities/AnnealingUtility.cpp 100.00% <0.00%> (ø)
.../Acts/TrackFitting/detail/VoidKalmanComponents.hpp 100.00% <0.00%> (ø)
...re/include/Acts/Vertexing/GaussianTrackDensity.ipp 66.66% <0.00%> (+0.41%) ⬆️
Core/include/Acts/Utilities/AnnealingUtility.hpp 100.00% <0.00%> (+20.00%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 442545e...4345ec3. Read the comment docs.

@msmk0 msmk0 added Component - Plugins Affects one or more Plugins Improvement Changes to an existing feature labels Aug 26, 2020
@acts-issue-bot acts-issue-bot bot removed the Triage label Aug 26, 2020
@msmk0 msmk0 added this to the next milestone Aug 27, 2020
@asalzburger asalzburger self-requested a review September 22, 2020 15:11
@asalzburger asalzburger changed the title from "Cuda Plugin Improvements, master branch (2020.08.26.)" to "feat: Cuda Plugin Improvements, master branch (2020.08.26.)" Sep 22, 2020
@krasznaa
Member Author

krasznaa commented Oct 6, 2020

When I first opened this PR, I was hoping that I would have more time to spend on it. But I didn't. 😦 I tried to make the code even smarter a number of weeks ago, but that just didn't go anywhere...

So... Could we come back to merging this in? Unfortunately I'm not super familiar with the GitHub interface. Is my feature branch still compatible with the master branch, or do I need to resolve some conflicts by now?

@XiaocongAi, whenever you organise the next Acts parallelisation meeting, I would be happy to talk about this code a bit.

Contributor

@XiaocongAi XiaocongAi left a comment


Hi @krasznaa, I finally got a chance to take a look at this PR. The memory manager looks great to me! I think we should get it in.

Regarding the talk, sure, I will put you on the list then. Thank you!

};

/// Object holding information about memory allocations on all devices
std::vector<DeviceMemory> m_memory;
Contributor


My understanding is that the vector index is implicitly used as the device ID, right? Would it be possible to use a std::map instead? But I guess if the device ID is a small value, the current approach might be even better.

Member Author


Indeed. Since CUDA assigns integer device IDs starting from 0 to the available GPUs, this seemed the most appropriate choice. Note that the DeviceMemory object itself is tiny (as long as no memory allocation is made on a given GPU). So even if we had, say, 8 GPUs in a system and only used the 8th one for Acts, the overhead of creating 7 dummy DeviceMemory objects would be negligible.

Since the memory manager will use this variable very often, I would really like to avoid using an associative container if at all possible. So I'd prefer to just stick with the std::vector<...> implementation. 😉
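
Just to illustrate the point, the lookup under discussion boils down to something like the sketch below. The function name is made up for this example; m_memory is the std::vector<DeviceMemory> member shown above.

/// Hypothetical accessor; not the actual member function from the PR
DeviceMemory& memoryFor(int deviceId) {
  // Grow the vector lazily; untouched slots stay as cheap, empty objects
  if (static_cast<std::size_t>(deviceId) >= m_memory.size()) {
    m_memory.resize(deviceId + 1);
  }
  // O(1) indexing, with no tree traversal or hashing as with std::map
  return m_memory[deviceId];
}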

@paulgessinger
Member

I hit the update button and that went through without conflicts. If the CI succeeds, I think we can merge.

@@ -30,10 +32,12 @@ template <typename external_spacepoint_t>
 SeedFinder<external_spacepoint_t>::SeedFinder(
     Acts::SeedfinderConfig<external_spacepoint_t> commonConfig,
     const SeedFilterConfig& seedFilterConfig,
-    const TripletFilterConfig& tripletFilterConfig)
+    const TripletFilterConfig& tripletFilterConfig, int device,
+    Acts::Logging::Level loggerLevel)
Member


Can you consider accepting a logger instance here, and storing it as a member variable? You could then default it to Acts::getDefaultLogger. This way, other logging backends (like the Athena logging, for example) could potentially be passed in.

Member Author


Sure thing. I'll do that later today.

Member


If you look around, the pattern we usually use is to accept an std::unique_ptr<Logger>, store it as a member, and then provide a const Logger& logger() method that the macros call. But I'm sure you figured that out anyway 😉
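
For reference, a condensed sketch of that pattern is shown below. The constructor is simplified to just the logger argument here; the real SeedFinder constructor also takes the configuration objects shown in the diff above.

#include "Acts/Utilities/Logger.hpp"

#include <memory>
#include <utility>

class SeedFinder {
 public:
  /// Accept a logger instance, defaulting to the standard Acts logger
  SeedFinder(std::unique_ptr<const Acts::Logger> logger =
                 Acts::getDefaultLogger("SeedFinder", Acts::Logging::INFO))
      : m_logger(std::move(logger)) {}

  void find() {
    // The ACTS_* logging macros print through the logger() accessor below
    ACTS_INFO("Running the seed finding");
  }

 private:
  /// Accessor for the logger, used by the ACTS_* macros
  const Acts::Logger& logger() const { return *m_logger; }

  /// The logger object held by this class
  std::unique_ptr<const Acts::Logger> m_logger;
};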

Member Author


Done. Please have a look. 😄

Comment on lines +33 to +36
MemoryManager& MemoryManager::instance() {
static MemoryManager mm;
return mm;
}
Member


I assume there's no point in having multiple memory managers?

Member Author


The reason I chose a singleton design was to avoid having to pass a memory manager object explicitly to every single Acts::Cuda::device_array<...>(...) object that the code creates. And yes, the idea is that we should centralise all memory allocations/de-allocations. So the singleton design seemed to fit the use case quite nicely.

But I definitely don't claim that this is some great implementation of a CUDA memory manager. I would like to make this code a lot more advanced later on. Using code inspired by (stolen from... 😛) Allen, for instance. 😉

Member


Ok. I don't think the singleton pattern is problematic here, for now at least.

@robertlangenberg robertlangenberg self-requested a review October 13, 2020 15:29
@robertlangenberg robertlangenberg merged commit 02fd513 into acts-project:master Oct 13, 2020
osbornjd pushed a commit to osbornjd/acts that referenced this pull request Oct 13, 2020
…ject#398)

* Made the CUDA seed finding process triplets for multiple middle spacepoints at once.

This was done with a **lot** of different changes, which were developed in
a separate branch. This is just a cleaned up version of all of those
developments.

The code now includes the ability to use CUDA streams, and now manages
CUDA device memory using its own manager class (Acts::Cuda::MemoryManager).

* Taught the unit test about some of the new plugin features.

Mainly to be able to specify which CUDA device to run on, and
how much memory to use from that device.

* Updated Acts::Cuda::SeedFinder to allow the user to give it a custom logger.

* Added an explicit specification for which CUDA device should be used.

It had been left out of the code by mistake until now...

Co-authored-by: Andreas Salzburger <Andreas.Salzburger@cern.ch>
Co-authored-by: Paul Gessinger <paul.gessinger@cern.ch>
Co-authored-by: robertlangenberg <56069170+robertlangenberg@users.noreply.github.com>
@paulgessinger paulgessinger modified the milestones: next, v2.0.0 Nov 8, 2020