
feat: Cuda Plugin Improvements, master branch (2020.08.26.) #398

Merged

Conversation

krasznaa
Member

As we discussed in #371, there was a lot of room for improvement in my implementation of the seed finding algorithm.

Let me jump right to the conclusion. Compared with the results of the current/old code (see #371 for those numbers), I can run the updated code with the following results:

[bash][Elrond]:build-new > ./bin/ActsUnitTestSeedfinderCuda2 -f ../seeds/sp1.txt  
Read 85808 spacepoints from file: ../seeds/sp1.txt
Allocating 1583 MB memory on device:
 /-- Device ID 0 -------------------------------\
 | Name: GeForce GTX 960                        |
 | Max. threads per block: 1024                 |
 | Concurrent kernels: true                     |
 | Total memory: 1979.69 MB                     |
 \----------------------------------------------/
Done with the seedfinding on the host
Done with the seedfinding on the device

-------------------------- Results ---------------------------
|          |     Host     |    Device    | Speedup/agreement |
--------------------------------------------------------------
| Time [s] |    0.783000  |       0.239  |       3.276151    |
|   Seeds  |       19759  |       19759  |      99.994939    |
--------------------------------------------------------------

[bash][Elrond]:build-new > ./bin/ActsUnitTestSeedfinderCuda2 -f ../seeds/sp2.txt 
Read 158051 spacepoints from file: ../seeds/sp2.txt
Allocating 1583 MB memory on device:
 /-- Device ID 0 -------------------------------\
 | Name: GeForce GTX 960                        |
 | Max. threads per block: 1024                 |
 | Concurrent kernels: true                     |
 | Total memory: 1979.69 MB                     |
 \----------------------------------------------/
Done with the seedfinding on the host
Done with the seedfinding on the device

-------------------------- Results ---------------------------
|          |     Host     |    Device    | Speedup/agreement |
--------------------------------------------------------------
| Time [s] |    4.918000  |       1.601  |       3.071830    |
|   Seeds  |       80638  |       80637  |      99.869788    |
--------------------------------------------------------------

[bash][Elrond]:build-new > ./bin/ActsUnitTestSeedfinderCuda2 -f ../seeds/sp5.txt -n 100
Read 342211 spacepoints from file: ../seeds/sp5.txt
Allocating 1583 MB memory on device:
 /-- Device ID 0 -------------------------------\
 | Name: GeForce GTX 960                        |
 | Max. threads per block: 1024                 |
 | Concurrent kernels: true                     |
 | Total memory: 1979.69 MB                     |
 \----------------------------------------------/
Done with the seedfinding on the host
Done with the seedfinding on the device

-------------------------- Results ---------------------------
|          |     Host     |    Device    | Speedup/agreement |
--------------------------------------------------------------
| Time [s] |   81.715000  |      43.668  |       1.871279    |
|   Seeds  |      124614  |      124614  |      99.744812    |
--------------------------------------------------------------

[bash][Elrond]:build-new >

As you can see, things turned out exactly as we imagined. While for "small numbers" of spacepoints this updated algorithm is much faster than the current one in the repository, for large numbers its performance is actually slightly worse than that of the current algorithm.

Now... this PR is meant more as a reference at this point, as I plan to discuss some of my observations in either an Acts or an ATLAS meeting in the coming weeks. But in summary, I did the following:

  • As discussed earlier, the triplet search and filtering is now done on as many middle spacepoints at once as can fit into the memory of the NVIDIA device in use. (The amount of memory used can be controlled from the command line of the test executable.)
  • I updated the implementation of the Acts::Cuda::device_array<T> and Acts::Cuda::host_array<T> types so that they no longer use CUDA memory (de-)allocation calls directly. I found that memory deallocation in particular was adding a huge overhead in my updated code.
    • I introduced a singleton class called Acts::Cuda::MemoryManager that allocates one big blob of memory in one go, and then hands out parts of it to Acts::Cuda::make_device_array<...>(...) calls. This "memory manager" is intentionally trivial: it assumes that during "one calculation" you only ever need to keep adding memory, until you are done with the calculation. So it doesn't handle memory deallocations; it can only be told to start reusing its memory block from scratch. (A minimal sketch of this idea follows the list.)
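
Just to make that design a bit more concrete, below is a minimal sketch of such a "bump allocator" style device memory manager. This is only an illustration of the idea described above; the class and function names are made up, and the actual Acts::Cuda::MemoryManager interface may differ.

#include <cuda_runtime.h>

#include <cstddef>
#include <new>
#include <stdexcept>

class BumpMemoryManager {
 public:
  /// Meyers singleton, so that callers don't need to pass the manager around
  static BumpMemoryManager& instance() {
    static BumpMemoryManager mm;
    return mm;
  }

  /// Allocate the one big blob of device memory up front
  void setMemorySize(std::size_t bytes) {
    if (m_begin != nullptr) {
      cudaFree(m_begin);
    }
    if (cudaMalloc(&m_begin, bytes) != cudaSuccess) {
      throw std::runtime_error("Failed to allocate device memory");
    }
    m_size = bytes;
    m_next = 0;
  }

  /// Hand out the next chunk of the blob. There is no per-allocation
  /// cudaMalloc, and no deallocation at all during a calculation.
  void* allocate(std::size_t bytes) {
    static constexpr std::size_t ALIGN = 256;  // keep all chunks well aligned
    const std::size_t offset = ((m_next + ALIGN - 1) / ALIGN) * ALIGN;
    if (offset + bytes > m_size) {
      throw std::bad_alloc();
    }
    m_next = offset + bytes;
    return static_cast<char*>(m_begin) + offset;
  }

  /// Start re-using the blob from scratch for the next calculation
  void reset() { m_next = 0; }

  ~BumpMemoryManager() {
    if (m_begin != nullptr) {
      cudaFree(m_begin);
    }
  }

 private:
  BumpMemoryManager() = default;

  void* m_begin = nullptr;  ///< Start of the device memory blob
  std::size_t m_size = 0;   ///< Total size of the blob in bytes
  std::size_t m_next = 0;   ///< Offset of the first unused byte
};

With something like this in place, a hypothetical make_device_array<T>(n) helper would just call BumpMemoryManager::instance().allocate(n * sizeof(T)) and wrap the returned pointer in a non-owning array type.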

But just to demonstrate the difference in the implementation a bit, this is what the current/old code does to find all the seeds in the first 2 spacepoint groups of my sp2.txt test file:

[Screenshot 2020-08-26 17:37:39]

And this is how that looks with the new implementation:

[Screenshot 2020-08-26 17:38:12]

But more discussion on this should really go to an actual meeting... 😉

Made the CUDA seed finding process triplets for multiple middle spacepoints at once.

This was done with a **lot** of different changes, which were developed in
a separate branch. This is just a cleaned up version of all of those
developments.

The code now includes the ability to use CUDA streams, and now manages
CUDA device memory using its own manager class (Acts::Cuda::MemoryManager).
Mainly to be able to specify which CUDA device to run on, and
how much memory to use from that device.
@krasznaa
Member Author

Forgot to ping a few people...

@czangela, @cgleggett, @vpascuzz, @beomki-yeo, @XiaocongAi.

@codecov

codecov bot commented Aug 26, 2020

Codecov Report

Merging #398 into master will decrease coverage by 0.01%.
The diff coverage is n/a.

Impacted file tree graph

@@            Coverage Diff             @@
##           master     #398      +/-   ##
==========================================
- Coverage   49.18%   49.16%   -0.02%     
==========================================
  Files         329      329              
  Lines       16191    16179      -12     
  Branches     7488     7484       -4     
==========================================
- Hits         7964     7955       -9     
+ Misses       2951     2950       -1     
+ Partials     5276     5274       -2     
Impacted Files Coverage Δ
Core/include/Acts/EventData/MeasurementHelpers.hpp 50.00% <0.00%> (-3.34%) ⬇️
...nclude/Acts/TrackFinding/CKFSourceLinkSelector.hpp 40.00% <0.00%> (-1.94%) ⬇️
...ore/include/Acts/Geometry/GeometryHierarchyMap.hpp 58.57% <0.00%> (-0.59%) ⬇️
Core/include/Acts/TrackFitting/KalmanFitter.hpp 37.67% <0.00%> (-0.44%) ⬇️
...de/Acts/TrackFinding/CombinatorialKalmanFilter.hpp 28.52% <0.00%> (-0.22%) ⬇️
Core/src/Utilities/AnnealingUtility.cpp 100.00% <0.00%> (ø)
.../Acts/TrackFitting/detail/VoidKalmanComponents.hpp 100.00% <0.00%> (ø)
...re/include/Acts/Vertexing/GaussianTrackDensity.ipp 66.66% <0.00%> (+0.41%) ⬆️
Core/include/Acts/Utilities/AnnealingUtility.hpp 100.00% <0.00%> (+20.00%) ⬆️

Continue to review full report at Codecov.

Legend: Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 442545e...4345ec3. Read the comment docs.

@msmk0 msmk0 added Component - Plugins Affects one or more Plugins Improvement Changes to an existing feature labels Aug 26, 2020
@acts-issue-bot acts-issue-bot bot removed the Triage label Aug 26, 2020
@msmk0 msmk0 added this to the next milestone Aug 27, 2020
@asalzburger asalzburger self-requested a review September 22, 2020 15:11
@asalzburger asalzburger changed the title from "Cuda Plugin Improvements, master branch (2020.08.26.)" to "feat: Cuda Plugin Improvements, master branch (2020.08.26.)" Sep 22, 2020
@krasznaa
Member Author

krasznaa commented Oct 6, 2020

When I first opened this PR, I was hoping that I would have more time to spend on it. But I didn't. 😦 I tried to make the code even smarter a number of weeks ago, but that just didn't go anywhere...

So... Could we come back to merging this in? Unfortunately I'm not super familiar with the GitHub interface. Is my feature branch still compatible with the master branch, or do I need to resolve some conflicts by now?

@XiaocongAi, whenever you organise the next Acts parallelisation meeting, I would be happy to talk about this code a bit.

Contributor

@XiaocongAi XiaocongAi left a comment


Hi @krasznaa, I finally got a chance to take a look at this PR. The memory manager looks great to me! I think we should get it in.

Regarding the talk, sure, I will put you on the list then. Thank you!

};

/// Object holding information about memory allocations on all devices
std::vector<DeviceMemory> m_memory;
Contributor


My understanding is that the vector index is implicitly used as the device ID, right? Would it be possible to use a std::map instead? But I guess if the device ID is a small value, the current approach might be even better.

Member Author


Indeed. Since CUDA assigns integer device IDs starting from 0 to the available GPUs, this seemed the most appropriate choice. Note that the DeviceMemory object itself is tiny (as long as no memory allocation is made on a given GPU). So even if we had, say, 8 GPUs in a system and only used the 8th one for Acts, the overhead of creating 7 dummy DeviceMemory objects would be negligible.

Since the memory manager will use this variable very often, I would really like to avoid using an associative container if at all possible. So I'd prefer to just stick with the std::vector<...> implementation. 😉
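
Just to illustrate the point, the lookup under discussion boils down to something like the sketch below. The function name is made up for this example; m_memory is the std::vector<DeviceMemory> member shown above.

/// Hypothetical accessor; not the actual member function from the PR
DeviceMemory& memoryFor(int deviceId) {
  // Grow the vector lazily; untouched slots stay as cheap, empty objects
  if (static_cast<std::size_t>(deviceId) >= m_memory.size()) {
    m_memory.resize(deviceId + 1);
  }
  // O(1) indexing, with no tree traversal or hashing as with std::map
  return m_memory[deviceId];
}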

@paulgessinger
Member

I hit the update button and that went through without conflicts. If the CI succeeds, I think we can merge.

@@ -30,10 +32,12 @@ template <typename external_spacepoint_t>
 SeedFinder<external_spacepoint_t>::SeedFinder(
     Acts::SeedfinderConfig<external_spacepoint_t> commonConfig,
     const SeedFilterConfig& seedFilterConfig,
-    const TripletFilterConfig& tripletFilterConfig)
+    const TripletFilterConfig& tripletFilterConfig, int device,
+    Acts::Logging::Level loggerLevel)
Member


Can you consider accepting a logger instance here, and storing it as a member variable? You could then default it to Acts::getDefaultLogger. This way, other logging backends (like the Athena logging, for example) could potentially be passed in.

Member Author


Sure thing. I'll do that later today.

Member


If you look around, the pattern we usually use is to accept an std::unique_ptr<Logger>, store it as a member, and then provide a const Logger& logger() method that the macros call. But I'm sure you figured that out anyway 😉
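
For reference, a condensed sketch of that pattern is shown below. The constructor is simplified to just the logger argument here; the real SeedFinder constructor also takes the configuration objects shown in the diff above.

#include "Acts/Utilities/Logger.hpp"

#include <memory>
#include <utility>

class SeedFinder {
 public:
  /// Accept a logger instance, defaulting to the standard Acts logger
  SeedFinder(std::unique_ptr<const Acts::Logger> logger =
                 Acts::getDefaultLogger("SeedFinder", Acts::Logging::INFO))
      : m_logger(std::move(logger)) {}

  void find() {
    // The ACTS_* logging macros print through the logger() accessor below
    ACTS_INFO("Running the seed finding");
  }

 private:
  /// Accessor for the logger, used by the ACTS_* macros
  const Acts::Logger& logger() const { return *m_logger; }

  /// The logger object held by this class
  std::unique_ptr<const Acts::Logger> m_logger;
};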

Member Author


Done. Please have a look. 😄

Comment on lines +33 to +36
MemoryManager& MemoryManager::instance() {
static MemoryManager mm;
return mm;
}
Member


I assume there's no point in having multiple memory managers?

Member Author


The reason I chose a singleton design was to avoid having to pass a memory manager object explicitly to every single Acts::Cuda::device_array<...>(...) object that the code creates. And yes, the idea is that we should centralise all memory allocations/de-allocations. So the singleton design seemed to fit the use case quite nicely.

But I definitely don't claim that this is some great implementation of a CUDA memory manager. I would like to make this code a lot more advanced later on. Using code inspired by (stolen from... 😛) Allen, for instance. 😉

Member


Ok. I don't think the singleton pattern is problematic here, for now at least.

@robertlangenberg robertlangenberg self-requested a review October 13, 2020 15:29
@robertlangenberg robertlangenberg merged commit 02fd513 into acts-project:master Oct 13, 2020
osbornjd pushed a commit to osbornjd/acts that referenced this pull request Oct 13, 2020
…ject#398)

* Made the CUDA seed finding process triplets for multiple middle spacepoints at once.

This was done with a **lot** of different changes, which were developed in
a separate branch. This is just a cleaned up version of all of those
developments.

The code now includes the ability to use CUDA streams, and now manages
CUDA device memory using its own manager class (Acts::Cuda::MemoryManager).

* Taught the unit test about some of the new plugin features.

Mainly to be able to specify which CUDA device to run on, and
how much memory to use from that device.

* Updated Acts::Cuda::SeedFinder to allow the user to give it a custom logger.

* Added an explicit specification for which CUDA device should be used.

It had been left out of the code by mistake until now...

Co-authored-by: Andreas Salzburger <Andreas.Salzburger@cern.ch>
Co-authored-by: Paul Gessinger <paul.gessinger@cern.ch>
Co-authored-by: robertlangenberg <56069170+robertlangenberg@users.noreply.github.com>
@paulgessinger paulgessinger modified the milestones: next, v2.0.0 Nov 8, 2020