
Add a "second" CUDA based SeedFinder implementation, master branch (2020.08.11.) #371

Merged
19 commits merged into acts-project:master on Aug 20, 2020

Conversation

krasznaa
Member

This is meant as a replacement for #353. As discussed in that PR, it turned out that the "current" CUDA SeedFinder implementation really doesn't behave well on the test files @asalzburger gave me a while ago.

So I set out to write my own implementation of the seed finding with CUDA, to see what's what. This is the result. 😉 Note that I don't touch the existing CUDA implementation in this PR, mainly to allow for comparisons between the two codes, since (as you'll see further down) my implementation is definitely not perfect either.

To put this implementation in place, I first of all had to make some small changes in the core code.

  • I moved Acts::SeedFilterConfig into its own header, so that I could more easily use it independently of the Acts::SeedFilter class;
  • I added the Acts::BinnedSPGroupIterator::operator!=(...) operator to make it easier to iterate with such objects in a for(...) loop (see the sketch below).
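For illustration, this is the iteration pattern that the new operator enables; a sketch only, with spGroup standing in for an Acts::BinnedSPGroup instance:

```
// The loop condition below is exactly where operator!=() gets used;
// without it, iterating over the group needs index bookkeeping instead.
for (auto it = spGroup.begin(); it != spGroup.end(); ++it) {
  auto bin = *it;  // one bin's worth of spacepoints
  // ... run the seed finding on this bin ...
}
```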

Since the current CUDA code is compiled in a non-optimal way "CMake configuration-wise", I decided to introduce a separate shared library (ActsCuda2Plugin) for the new source files. It would take most of #353 to be able to merge all code into the same shared library in a nice way. It could very well be possible in the future, but in this PR I wanted to modify the existing code as little as possible.

I also introduced a new test executable for the new plugin, under Tests/UnitTests/Plugins/Cuda/Seeding2. This was necessary as setting up this new seedfinder class works a bit differently than the previous one. (And the build setup for the test's source files is also quite different.)

I'll discuss the implementation details in further comments on the PR, here let me just give some of the results that it can produce. I get the following results with it on my private desktop that sports an Intel i7-6700 CPU and an NVidia GTX 960 GPU. (So an "okay" CPU, and a fairly old/low-end GPU.)

[bash][Elrond]:build-opt > ./bin/SeedFinderCuda2Test -f ../seeds/sp1.txt
Read 85808 spacepoints from file: ../seeds/sp1.txt
Done with the seedfinding on the host
Done with the seedfinding on the device

-------------------------- Results ---------------------------
|          |     Host     |    Device    | Speedup/agreement |
--------------------------------------------------------------
| Time [s] |    0.761000  |       1.093  |       0.696249    |
|   Seeds  |       19759  |       19759  |      99.994939    |
--------------------------------------------------------------

[bash][Elrond]:build-opt > ./bin/SeedFinderCuda2Test -f ../seeds/sp2.txt
Read 158051 spacepoints from file: ../seeds/sp2.txt
Done with the seedfinding on the host
Done with the seedfinding on the device

-------------------------- Results ---------------------------
|          |     Host     |    Device    | Speedup/agreement |
--------------------------------------------------------------
| Time [s] |    4.881000  |       3.683  |       1.325278    |
|   Seeds  |       80638  |       80637  |      99.646569    |
--------------------------------------------------------------

[bash][Elrond]:build-opt > ./bin/SeedFinderCuda2Test -f ../seeds/sp5.txt -n 100
Read 342211 spacepoints from file: ../seeds/sp5.txt
Done with the seedfinding on the host
Done with the seedfinding on the device

-------------------------- Results ---------------------------
|          |     Host     |    Device    | Speedup/agreement |
--------------------------------------------------------------
| Time [s] |   81.315000  |      40.664  |       1.999680    |
|   Seeds  |      124614  |      124614  |      98.992890    |
--------------------------------------------------------------

[bash][Elrond]:build-opt >

The last test I ran on only part of the spacepoints, so that it wouldn't take too long. And note that I did not run the test on the largest test file that I have available, as that one does not fit into the 2 GB of memory on my GTX card. 😦

As you can see, for a small number of spacepoints it behaves worse than the CPU implementation. But this is not the end of the world. During development I managed to write an algorithm that was faster than the CPU even for small numbers of spacepoints; it just didn't allow me to run on larger numbers of spacepoints. So the idea would be to eventually set up the code to choose an algorithm based on the complexity of the event. But that's also something for a future development... 😉

About the non-perfect agreement: some of it is due to the floating point precision differences between my CPU and GPU. But some of it seems to be due to some weirdness in the test files that I use. (They seem to hold multiple spacepoints with the exact same x-y-z coordinates, making it impossible to reproduce the CPU results on the GPU.) Hopefully some new test files (already in the works...) will improve things later on.

Pinging @cgleggett, @vpascuzz, @beomki-yeo, @paulgessinger and @czangela.

Specifying explicitly which architectures to generate code for,
specifying that debug builds should attach debug symbols to both
the host and device CUDA code, and making all of these configurations
overridable from the command line.
In order not to interfere with the previously implemented
CUDA based seed finding, I've put my own implementation into
a separate plugin library.

The separate library is necessary because this library uses
a simpler build setup than ActsCudaPlugin.
This code is used to figure out the exact parallelisation for the
triplet finding code.
To make it easier to use it in the CUDA plugin code.
I forgot that the variable would be pre-defined as a cache variable
by CMake itself. So my setting was not doing anything.
As an alternative to Acts::Seedfinder<..., Acts::Cuda>. Unfortunately
the code still behaves a bit worryingly, but at least it more or less
works...

The setup is a little clumsy, as function pointers are a bit
difficult with CUDA, but this seemed to be the most reasonable setup.
Based very much on the existing seedfinder tests. Only modifying
it to make it a bit sleeker in some places, and to accommodate the
setup requirements of Acts::Cuda::SeedFinder.
For some reason function pointers do not work correctly with
optimised CUDA builds. (I developed the code on a debug build...)
The results become incorrect without any error from CUDA, in such
optimised builds. :-/
I forgot about improving the code this way after introducing the
Acts::Cuda::Details::DubletCounts::maxTriplets variable.
@cgleggett
Contributor

How are the spacepoint files you're using different from the ones we had before? Do you understand why performance is so much worse with them? Do we know if either of them is "realistic"?

@codecov

codecov bot commented Aug 11, 2020

Codecov Report

Merging #371 into master will not change coverage.
The diff coverage is n/a.

Impacted file tree graph

@@           Coverage Diff           @@
##           master     #371   +/-   ##
=======================================
  Coverage   49.06%   49.06%           
=======================================
  Files         325      325           
  Lines       16036    16036           
  Branches     7395     7395           
=======================================
  Hits         7868     7868           
  Misses       2957     2957           
  Partials     5211     5211           
Impacted Files Coverage Δ
Core/include/Acts/Seeding/BinnedSPGroup.hpp 0.00% <ø> (ø)
Core/include/Acts/Seeding/SeedFilter.hpp 0.00% <ø> (ø)

Continue to review full report at Codecov.

Legend
Δ = absolute <relative> (impact), ø = not affected, ? = missing data
Powered by Codecov. Last update 536c651...4d7adfe. Read the comment docs.

@krasznaa
Member Author

None of this is realistic in any way Charles. 😦

From what I can tell, Beomki was using files generated with the ATLAS geometry "back in the day", and adjusted the selection cuts accordingly.

I on the other hand received files generated with the TrackML geometry, while keeping all the test cuts exactly the same as Beomki had them. So the code is not doing anything nearly reasonable on my test files.

So I'm not trying to throw shade about the performance numbers that I saw with Beomki's code. If we use unrealistic selection cuts, it's fairly random whether the CPU or GPU implementation will start behaving better.

My big concern was that with these non-optimal cuts the current code (just so I wouldn't keep referring to it as "Beomki's code" all the time 😛) was not reproducing the CPU code's results all too well for high multiplicities. As I showed in:

It's the low matching rate that made me start these experiments in the first place. I very much believe that the current code makes some assumptions about the inputs that it receives, which may not hold for all inputs. (Again, the jobs that I'm running are not realistic. Still, the algorithm should either stop with an error message if one of its assumptions is not fulfilled, or provide the correct results, even if only slowly.)

@cgleggett
Contributor

thanks Attila - I'm not at all trying to question your motivations ;-) I'm just trying to understand the differences. Obviously it makes no sense to develop algorithms against unrealistic data sets and cuts. But if different experiments and geometries require very different algorithms to be performant, then that should be taken into consideration; it may not be possible to do this generically with a single implementation. It would be nice if we could classify the inputs/geometries so that we can decide: if it's like "this" (ie, a small number of spacepoints), do it on the CPU; if it's like "that", do it on the GPU with "this" implementation; if it's like "this other thing", do it on the GPU with yet another implementation.

@krasznaa
Member Author

Phew... I didn't actually test the code on our RTX 2060 card until now... But it does deliver. 😄

[bash][atspot01]:build > ./bin/SeedFinderCuda2Test -f ../seeds/sp1.txt 
Read 85808 spacepoints from file: ../seeds/sp1.txt
Done with the seedfinding on the host
Done with the seedfinding on the device

-------------------------- Results ---------------------------
|          |     Host     |    Device    | Speedup/agreement |
--------------------------------------------------------------
| Time [s] |    0.616000  |       0.744  |       0.827957    |
|   Seeds  |       19759  |       19759  |      99.994939    |
--------------------------------------------------------------

[bash][atspot01]:build > ./bin/SeedFinderCuda2Test -f ../seeds/sp2.txt 
Read 158051 spacepoints from file: ../seeds/sp2.txt
Done with the seedfinding on the host
Done with the seedfinding on the device

-------------------------- Results ---------------------------
|          |     Host     |    Device    | Speedup/agreement |
--------------------------------------------------------------
| Time [s] |    4.006000  |       2.018  |       1.985134    |
|   Seeds  |       80638  |       80637  |      99.671371    |
--------------------------------------------------------------

[bash][atspot01]:build > ./bin/SeedFinderCuda2Test -f ../seeds/sp5.txt -n 100
Read 342211 spacepoints from file: ../seeds/sp5.txt
Done with the seedfinding on the host
Done with the seedfinding on the device

-------------------------- Results ---------------------------
|          |     Host     |    Device    | Speedup/agreement |
--------------------------------------------------------------
| Time [s] |   62.044000  |      25.899  |       2.395614    |
|   Seeds  |      124614  |      124614  |      98.961593    |
--------------------------------------------------------------

[bash][atspot01]:build > ./bin/SeedFinderCuda2Test -f ../seeds/sp10.txt -n 50
Read 749018 spacepoints from file: ../seeds/sp10.txt
Done with the seedfinding on the host
Done with the seedfinding on the device

-------------------------- Results ---------------------------
|          |     Host     |    Device    | Speedup/agreement |
--------------------------------------------------------------
| Time [s] |   67.951000  |      17.531  |       3.876048    |
|   Seeds  |      128086  |      128086  |      99.094358    |
--------------------------------------------------------------

[bash][atspot01]:build >

As you can see, the implementation fits into the 6 GB of memory of that card with my "sp10.txt" file, even if it doesn't fit into the 2 GB of my personal card... (In this way it behaves exactly like Beomki's code, btw.)

@krasznaa
Member Author

So... A little bit about the implementation...

The code, of course, starts with setting up simple blobs of memory with the properties of the spacepoints, so that they can be moved to the GPU.

https://github.com/krasznaa/acts/blob/cuda2-plugin-20200810/Plugins/Cuda/include/Acts/Plugins/Cuda/Seeding2/SeedFinder.ipp#L64-L120

Note however that I don't just use generic float arrays for this, but rather created a custom struct (Acts::Cuda::Details::SpacePoint) that I make arrays out of.

https://github.com/krasznaa/acts/blob/cuda2-plugin-20200810/Plugins/Cuda/include/Acts/Plugins/Cuda/Seeding2/Details/Types.hpp#L18-L26
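To give an idea without clicking through, the struct is just a handful of plain floats, roughly along these lines. (Treat this as a sketch; the linked header is the authoritative version.)

```
// Approximate shape of Acts::Cuda::Details::SpacePoint: trivially
// copyable floats only, so arrays of it can be moved to the device
// as plain, typeless blobs of memory.
struct SpacePoint {
  float x = 0.0f;          // Cartesian coordinates
  float y = 0.0f;
  float z = 0.0f;
  float radius = 0.0f;     // transverse radius
  float varianceR = 0.0f;  // measurement variances
  float varianceZ = 0.0f;
};
```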

At this point I expect that a number of people will hiss, because I'm using an array of structs instead of a struct of arrays. But what I learned in the last weeks is that arrays of structs are not as evil with GPUs as I was led to believe...

First off, memory-copy-wise it makes no difference that I lay out the memory like this. The memory for these objects is allocated in a typeless way, and then the memory contents are copied in a typeless way. So no difference there.

But where I do think it makes a difference, is how efficiently the GPU threads can load the properties of spacepoints while the kernel is running. In my kernels I decided to load all properties of the spacepoints in single operations into the threads. Like I do here for instance:

https://github.com/krasznaa/acts/blob/cuda2-plugin-20200810/Plugins/Cuda/src/Seeding2/FindTriplets.cu#L281

Instead of loading all required properties of a single spacepoint with N separate operations. Since the seed finding algorithm doesn't simply loop linearly over the spacepoints, having the same kinds of coordinates of all the spacepoints tightly packed in memory just doesn't seem to make a difference.

And CUDA was able to produce much faster code when I implemented my kernels with these helper structs than when I was accessing individual elements of large float arrays. (The way I explained this to myself is that the compiler can probably hardcode information about the layout of my variables in memory better than when I index large arrays using a bunch of helper variables.)
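Schematically, the difference between the two access patterns looks like this (reusing the SpacePoint sketch from above; names and the toy calculation are illustrative):

```
// Array-of-structs access: one aggregate load brings in every property
// of the spacepoint that this thread works on.
__device__ float cotThetaToy(const SpacePoint* spacepoints,
                             unsigned int index) {
  const SpacePoint sp = spacepoints[index];  // single aggregate load
  // ...instead of N separate indexed loads from N distinct arrays:
  // const float z = spZ[index];
  // const float r = spRadius[index];
  return sp.z / sp.radius;  // toy calculation using two members
}
```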

Then... After copying the spacepoints to the device, I run "dublet finding" on them using the Acts::Cuda::Kernels::findDublets kernel.

https://github.com/krasznaa/acts/blob/cuda2-plugin-20200810/Plugins/Cuda/src/Seeding2/FindDublets.cu

I'm actually quite happy with how that code turned out. 😄 I start the kernel twice (per Acts::Cuda::SeedFinder::createSeedsForGroup(...) call, that is): once evaluating all the middle-bottom spacepoint pairs, and once evaluating all the middle-top pairs, using the 2-D iteration capabilities of CUDA itself. I use the same templated kernel in both cases, which I specialise with the type of "other" (bottom or top) spacepoint that it should be matching to the middle one. (This specialisation needs to trickle down all the way to the "deltaR" and "cotTheta" calculations.)

For the output of the dublet finding, I create 4 arrays (a sketch of how a kernel would fill them follows the list):

  • middleBottomCounts: A 1-D array of size "nMiddleSPs" storing the number of compatible bottom spacepoints found for each middle spacepoint.
  • middleBottomDublets: A 2-D matrix of size "nMiddleSPs" x "nBottomSPs" that stores the indices of all compatible bottom spacepoints for every middle spacepoint.
  • middleTopCounts and middleTopDublets in the same way, for the middle-top spacepoint combinations.
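A sketch of that recording step, assuming one atomic counter per middle spacepoint (the real code is in FindDublets.cu; names follow the description above):

```
// Called inside the dublet-finding kernel once a (middle, bottom) pair
// has passed the selection cuts: atomically reserve the next free slot
// in this middle spacepoint's row, and store the bottom index there.
__device__ void recordDublet(unsigned int middleIndex,
                             unsigned int bottomIndex,
                             unsigned int nBottomSPs,
                             unsigned int* middleBottomCounts,
                             unsigned int* middleBottomDublets) {
  const unsigned int slot =
      atomicAdd(&(middleBottomCounts[middleIndex]), 1u);
  middleBottomDublets[middleIndex * nBottomSPs + slot] = bottomIndex;
}
```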

Okay, let me stop here for now. 😄 I'll describe more about the triplet finding and filtering in subsequent posts.

@beomki-yeo
Contributor

beomki-yeo commented Aug 12, 2020

My big concern was that with these non-optimal cuts the current code (just so I wouldn't keep referring to it as "Beomki's code" all the time ) was not reproducing the CPU code's results all too well for high multiplicities. As I showed in:

It's the low matching rate that made me start these experiments in the first place. I very much believe that the current code makes some assumptions about the inputs that it receives, which may not hold for all inputs. (Again, the jobs that I'm running are not realistic. Still, the algorithm should either stop with an error message if one of its assumptions is not fulfilled, or provide the correct results, even if only slowly.)

For reproducibility, you have to give a "-fmad=false" flag in the CMake configuration. I clarified this in PR104.
The GPU adopts a fast math mode where multiplication and addition operations are fused together while the CPU's are not, and this is where their results diverge. What "-fmad=false" does is turn off that fast math mode so the two results become the same. I have checked 100% reproducibility only up to 200k-spacepoint data files, but I will be happy if you test it on larger data files.
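(For reference, one way to pass that flag through the CMake configuration, assuming the standard CMAKE_CUDA_FLAGS mechanism, is to configure with -DCMAKE_CUDA_FLAGS="-fmad=false".)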

For more information on the floating point operations of CPUs and GPUs, check here

@beomki-yeo
Contributor

I'm just trying to understand the differences. Obviously it makes no sense to try to develop algorithms against unrealistic data sets and cuts.

I totally agree on this. What I would have done, once I found that there is a difference in speedup between two different data sets, is look for the source of the difference. For example:

  1. At which step (doublet finding, coordinate transformation, or triplet finding) does the computation get slower?
  2. How many doublets/triplets are produced per group?
  3. Can differing spacepoint features, such as the variance of z and R, make any difference?
  4. How different is the spatial density of spacepoints in the two data sets, and can that make any difference in the speedup?

Unless we understand this clearly, we will encounter the same problem with other data sets in the future.

@krasznaa
Member Author

For reproducibility, you have to give a "-fmad=false" flag in the CMake configuration. I clarified this in PR104.
The GPU adopts a fast math mode where multiplication and addition operations are fused together while the CPU's are not, and this is where their results diverge. What "-fmad=false" does is turn off that fast math mode so the two results become the same. I have checked 100% reproducibility only up to 200k-spacepoint data files, but I will be happy if you test it on larger data files.

Hmm... 😕 I was absolutely convinced that I had tested your code in debug mode as well (-DCMAKE_BUILD_TYPE=Debug). But I may have just imagined it. 😕

Indeed, in a debug build (which also implies not using fast math) the differences in matching rate between the two algorithms are not that large. And on the sp1.txt file your algorithm also produces 100% agreement.

[bash][Elrond]:build > ./bin/SeedfinderCudaTest -f ../seeds/sp1.txt -q
read 85808 SP from file ../seeds/sp1.txt

 GPU Device 0: "GeForce GTX 960" with compute capability 5.2

Preprocess Time: 0.316429
Analyzed 260 groups for CPU
Analyzed 260 groups for CUDA

----------------------- Time Metric -----------------------
                       CPU          CUDA        Speedup 
Seedfinding_Time     9.998187      3.24665     3.079540
Wall_time           10.314616      3.56308     2.894861
-----------------------------------------------------------

Number of Seeds (CPU | CUDA): 19759 | 19759
19759 seeds are matched
Matching rate: 100%


[bash][Elrond]:build > ./bin/SeedFinderCuda2Test -f ../seeds/sp1.txt 
profiling:<master>:Version mismatch - expected A84* got A75*
profiling:/data/projects/acts/build/Tests/UnitTests/Plugins/Cuda/Seeding2/CMakeFiles/SeedFinderCuda2Test.dir/main.cpp.gcda:Version mismatch - expected A75* got A84*
profiling:/data/projects/acts/build/Tests/UnitTests/Plugins/Cuda/Seeding2/CMakeFiles/SeedFinderCuda2Test.dir/CommandLineArguments.cpp.gcda:Version mismatch - expected A75* got A84*
profiling:/data/projects/acts/build/Tests/UnitTests/Plugins/Cuda/Seeding2/CMakeFiles/SeedFinderCuda2Test.dir/ReadSeedFile.cpp.gcda:Version mismatch - expected A75* got A84*
profiling:/data/projects/acts/build/Tests/UnitTests/Plugins/Cuda/Seeding2/CMakeFiles/SeedFinderCuda2Test.dir/TestSpacePoint.cpp.gcda:Version mismatch - expected A75* got A84*
profiling:/data/projects/acts/build/Tests/UnitTests/Plugins/Cuda/Seeding2/CMakeFiles/SeedFinderCuda2Test.dir/TestHostCuts.cpp.gcda:Version mismatch - expected A75* got A84*
Read 85808 spacepoints from file: ../seeds/sp1.txt
Done with the seedfinding on the host
Done with the seedfinding on the device

-------------------------- Results ---------------------------
|          |     Host     |    Device    | Speedup/agreement |
--------------------------------------------------------------
| Time [s] |    9.995000  |       4.778  |       2.091879    |
|   Seeds  |       19759  |       19759  |     100.000000    |
--------------------------------------------------------------

[bash][Elrond]:build > ./bin/SeedfinderCudaTest -f ../seeds/sp2.txt -q 
read 158051 SP from file ../seeds/sp2.txt

 GPU Device 0: "GeForce GTX 960" with compute capability 5.2

Preprocess Time: 0.538892
Analyzed 260 groups for CPU
Analyzed 260 groups for CUDA

----------------------- Time Metric -----------------------
                       CPU          CUDA        Speedup 
Seedfinding_Time    47.872138      68.0168     0.703828
Wall_time           48.411030      68.5557     0.706156
-----------------------------------------------------------

Number of Seeds (CPU | CUDA): 80638 | 80638
79593 seeds are matched
Matching rate: 98.7041%


[bash][Elrond]:build > ./bin/SeedFinderCuda2Test -f ../seeds/sp2.txt 
profiling:<master>:Version mismatch - expected A84* got A75*
profiling:/data/projects/acts/build/Tests/UnitTests/Plugins/Cuda/Seeding2/CMakeFiles/SeedFinderCuda2Test.dir/main.cpp.gcda:Version mismatch - expected A75* got A84*
profiling:/data/projects/acts/build/Tests/UnitTests/Plugins/Cuda/Seeding2/CMakeFiles/SeedFinderCuda2Test.dir/CommandLineArguments.cpp.gcda:Version mismatch - expected A75* got A84*
profiling:/data/projects/acts/build/Tests/UnitTests/Plugins/Cuda/Seeding2/CMakeFiles/SeedFinderCuda2Test.dir/ReadSeedFile.cpp.gcda:Version mismatch - expected A75* got A84*
profiling:/data/projects/acts/build/Tests/UnitTests/Plugins/Cuda/Seeding2/CMakeFiles/SeedFinderCuda2Test.dir/TestSpacePoint.cpp.gcda:Version mismatch - expected A75* got A84*
profiling:/data/projects/acts/build/Tests/UnitTests/Plugins/Cuda/Seeding2/CMakeFiles/SeedFinderCuda2Test.dir/TestHostCuts.cpp.gcda:Version mismatch - expected A75* got A84*
Read 158051 spacepoints from file: ../seeds/sp2.txt
Done with the seedfinding on the host
Done with the seedfinding on the device

-------------------------- Results ---------------------------
|          |     Host     |    Device    | Speedup/agreement |
--------------------------------------------------------------
| Time [s] |   47.802000  |      27.431  |       1.742627    |
|   Seeds  |       80638  |       80638  |      99.673851    |
--------------------------------------------------------------

[bash][Elrond]:build >

(I also just wanted to show how interestingly the speed of the two algorithms differ in a debug build... 😄)

We should have a discussion at one point. Let's try to set it up through Charles, to discuss a bit about the implementation details of the algorithms. (I'll write some more "posts" about my "new" algorithm's inner workings today.)

Forgot to remove those assert statements when I removed the usage of
the user defined filter functions. (As these do not work correctly
in optimised builds for some reason. I'm following up about that
at the moment...)
@beomki-yeo
Contributor

/bin/SeedfinderCudaTest -f ../seeds/sp2.txt -q

Try the "-l 2", "-l 3" or "-l 4" options; "-l 1" is the default. This flag controls the data space in the triplet finding.

@krasznaa
Member Author

/bin/SeedfinderCudaTest -f ../seeds/sp2.txt -q

Try the "-l 2", "-l 3" or "-l 4" options; "-l 1" is the default. This flag controls the data space in the triplet finding.

It was a worthwhile check, but it only seems to make a very small difference in this particular case.

[bash][Elrond]:build > ./bin/SeedfinderCudaTest -f ../seeds/sp2.txt -q -l 4
read 158051 SP from file ../seeds/sp2.txt

 GPU Device 0: "GeForce GTX 960" with compute capability 5.2

Preprocess Time: 0.528106
Analyzed 260 groups for CPU
Analyzed 260 groups for CUDA

----------------------- Time Metric -----------------------
                       CPU          CUDA        Speedup 
Seedfinding_Time    48.098040      68.5208     0.701948
Wall_time           48.626146      69.0489     0.704227
-----------------------------------------------------------

Number of Seeds (CPU | CUDA): 80638 | 80638
79668 seeds are matched
Matching rate: 98.7971%


[bash][Elrond]:build > ./bin/SeedfinderCudaTest -f ../seeds/sp2.txt -q -l 3
read 158051 SP from file ../seeds/sp2.txt

 GPU Device 0: "GeForce GTX 960" with compute capability 5.2

Preprocess Time: 0.541842
Analyzed 260 groups for CPU
Analyzed 260 groups for CUDA

----------------------- Time Metric -----------------------
                       CPU          CUDA        Speedup 
Seedfinding_Time    47.956361      68.2705     0.702446
Wall_time           48.498202      68.8123     0.704789
-----------------------------------------------------------

Number of Seeds (CPU | CUDA): 80638 | 80638
79635 seeds are matched
Matching rate: 98.7562%


[bash][Elrond]:build > ./bin/SeedfinderCudaTest -f ../seeds/sp2.txt -q -l 2
read 158051 SP from file ../seeds/sp2.txt

 GPU Device 0: "GeForce GTX 960" with compute capability 5.2

Preprocess Time: 0.526624
Analyzed 260 groups for CPU
Analyzed 260 groups for CUDA

----------------------- Time Metric -----------------------
                       CPU          CUDA        Speedup 
Seedfinding_Time    47.884428      68.0517     0.703647
Wall_time           48.411052      68.5784     0.705923
-----------------------------------------------------------

Number of Seeds (CPU | CUDA): 80638 | 80638
79597 seeds are matched
Matching rate: 98.709%


[bash][Elrond]:build >

@beomki-yeo
Contributor

Thanks for checking. Since the matching rate does not converge, it seems that the expected number of triplets per bottom spacepoint is much larger than in the data files that I used.

It seems that some input files are not perfect: they contain multiple
spacepoints with the exact same spatial coordinates. This makes the
seed finding's reproducibility on a GPU pretty much impossible.

At the same time I added a print operator for TestSpacePoint, as I was
using it for debugging while writing the dublet searching code.
@krasznaa
Member Author

So... As I mentioned in my opening remarks as well, unfortunately the test files I use right now are very far from perfect. I have now added the ability to my test executable to look for, and remove, spacepoints that have the exact same x-y-z coordinates. And there are a good number of them... 😦

[bash][Elrond]:build-opt > ./bin/SeedFinderCuda2Test -f ../seeds/sp1.txt -d
2 duplicates found in: ../seeds/sp1.txt
Read 85808 spacepoints from file: ../seeds/sp1.txt
Done with the seedfinding on the host
Done with the seedfinding on the device

-------------------------- Results ---------------------------
|          |     Host     |    Device    | Speedup/agreement |
--------------------------------------------------------------
| Time [s] |    0.761000  |       1.103  |       0.689937    |
|   Seeds  |       19759  |       19759  |      99.994939    |
--------------------------------------------------------------

[bash][Elrond]:build-opt > ./bin/SeedFinderCuda2Test -f ../seeds/sp2.txt -d
1298 duplicates found in: ../seeds/sp2.txt
Read 158051 spacepoints from file: ../seeds/sp2.txt
Done with the seedfinding on the host
Done with the seedfinding on the device

-------------------------- Results ---------------------------
|          |     Host     |    Device    | Speedup/agreement |
--------------------------------------------------------------
| Time [s] |    4.898000  |       3.673  |       1.333515    |
|   Seeds  |       80638  |       80637  |      99.611846    |
--------------------------------------------------------------

[bash][Elrond]:build-opt > ./bin/SeedFinderCuda2Test -f ../seeds/sp5.txt -d
25228 duplicates found in: ../seeds/sp5.txt
Read 342211 spacepoints from file: ../seeds/sp5.txt
Done with the seedfinding on the host
Done with the seedfinding on the device

-------------------------- Results ---------------------------
|          |     Host     |    Device    | Speedup/agreement |
--------------------------------------------------------------
| Time [s] |  483.843000  |     238.468  |       2.028964    |
|   Seeds  |      358440  |      358440  |      98.595023    |
--------------------------------------------------------------

[bash][Elrond]:build-opt >

These results are from an optimised build, so 100% agreement is not expected. It just takes super long to count the duplicates on the host in debug mode, so I didn't want to run like that too much. Plus the one test that I ran with the duplicates removed didn't actually bring my algorithm's agreement to 100%, unfortunately.

Still, I would argue that not getting 100% agreement with these files is not the end of the world. It's only once the agreement drops very low that things become suspicious.

Hopefully I'll get some better files within the next week, then we could make some more meaningful comparisons between the algorithms. 😉

@paulgessinger paulgessinger added this to the Backlog milestone Aug 12, 2020
@krasznaa
Member Author

Let me continue my description from #371 (comment). (Not to point fingers, but being able to discuss in "threads" on GitLab does have its uses... 😛)

I described how the code forms dublet pairs, and what sort of data structures it stores the results of the dublet search in. In the next step the code needs to start executing on these dublets.

But to do that, it has to know how many dublets there are. Remember, at this point the GPU has just filled two nMiddleSPs-sized arrays with the numbers of middle-bottom and middle-top dublets that survive the initial selections. One possibility would be to just move those arrays back to the host and figure out from them what to do next, but I instead created an additional kernel, before the triplet search, which just summarises some information from those two arrays. That kernel creates an object of type Acts::Cuda::Details::DubletCounts (https://github.com/krasznaa/acts/blob/cuda2-plugin-20200810/Plugins/Cuda/include/Acts/Plugins/Cuda/Seeding2/Details/Types.hpp#L28-L40) using the following code:

https://github.com/krasznaa/acts/blob/cuda2-plugin-20200810/Plugins/Cuda/src/Seeding2/CountDublets.cu

What I need to do is a textbook "reduction" operation. I need to sum up some values, and find maxima for some others. So I just looked at the CUDA samples to figure out how to do this. 😛

https://docs.nvidia.com/cuda/cuda-samples/index.html#cuda-parallel-reduction

I went with the implementation requiring the fewest features from the GPU, as we really don't need this code to be super optimised. (In the end it accounts for O(0.1%) of the total GPU usage in my test jobs.)
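For illustration, the textbook shared-memory tree reduction that those samples demonstrate looks roughly like this; simplified here to a single summed value (the real kernel reduces several sums and maxima in one go), and assuming a power-of-two block size:

```
// Per-block tree reduction; launched with
// sharedMemSize = blockDim.x * sizeof(unsigned int).
__global__ void sumCounts(unsigned int n, const unsigned int* counts,
                          unsigned int* result) {
  extern __shared__ unsigned int sdata[];
  const unsigned int tid = threadIdx.x;
  const unsigned int i = blockIdx.x * blockDim.x + threadIdx.x;

  // Each thread loads one element, zero-padding past the end.
  sdata[tid] = (i < n) ? counts[i] : 0u;
  __syncthreads();

  // Halve the number of active threads in every step.
  for (unsigned int s = blockDim.x / 2; s > 0; s >>= 1) {
    if (tid < s) {
      sdata[tid] += sdata[tid + s];
    }
    __syncthreads();
  }

  // One atomic per block combines the per-block partial sums.
  if (tid == 0) {
    atomicAdd(result, sdata[0]);
  }
}
```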

Now, knowing the total number of dublets that were found, the total number of triplets that may be reconstructed out of them, and a couple more parameters (have a look at the declaration of Acts::Cuda::Details::DubletCounts...), the code can start looking for triplets.

But before doing that, just like Beomki's code, my code also creates Acts::Cuda::Details::LinCircle objects (https://github.com/krasznaa/acts/blob/cuda2-plugin-20200810/Plugins/Cuda/include/Acts/Plugins/Cuda/Seeding2/Details/Types.hpp#L42-L50) for every middle-bottom and middle-top dublet that was found earlier. The implementation is a bit tricky, in order to loop over every middle-bottom and middle-top index pair efficiently in a single kernel launch, but it works...

https://github.com/krasznaa/acts/blob/cuda2-plugin-20200810/Plugins/Cuda/src/Seeding2/FindTriplets.cu#L31-L167

This is where my implementation becomes non-optimal. 😦 In the next step, in order to be able to run the "2Sp" filtering on the created triplets efficiently later on, I need to create triplet candidates separately for every middle-bottom spacepoint combination. In total this would mean creating an integer (index) array of size nMiddleSPs x maxMBDublets x maxMTDublets. (The latter parameters come from Acts::Cuda::Details::DubletCounts.) With the non-optimal cuts that I have in my code at the moment, this can account for >1 GB of memory by itself. And at this point these are just integer indices; I also need to store the "actual triplet candidates" somewhere. (I decided from the start that I would do that in a 1-dimensional array of size maxTriplets, and in any 2/3-D matrix I would just store indices into that array.)
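(To put a rough number on that, with made-up but not implausible values for these loose cuts: 10^3 middle spacepoints with maxMBDublets ≈ maxMTDublets ≈ 500 would already mean 10^3 x 500 x 500 x 4 bytes = 1 GB for the index array alone, before a single actual triplet candidate is stored.)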

Long story short, I instead loop over the middle spacepoints on the host, in a for loop. 😦 This way I only need maxMBDublets x maxMTDublets worth of memory at any given time.

https://github.com/krasznaa/acts/blob/cuda2-plugin-20200810/Plugins/Cuda/src/Seeding2/FindTriplets.cu#L624-L636

Plus the 1-D arrays for the triplet objects.

https://github.com/krasznaa/acts/blob/cuda2-plugin-20200810/Plugins/Cuda/src/Seeding2/FindTriplets.cu#L617-L622

Since this "post" is becoming rather long, let me continue in a separate one...

@krasznaa
Member Author

Continuing from #371 (comment)...

For every middle spacepoint the code ends up first launching the Acts::Cuda::Kernels::findTriplets kernel.

https://github.com/krasznaa/acts/blob/cuda2-plugin-20200810/Plugins/Cuda/src/Seeding2/FindTriplets.cu#L664-L693

It does so in a 2-D loop, over nMiddleBottomDublets x nMiddleTopDublets elements. (Not the maxima, but the actual numbers for that particular middle spacepoint.) Remember, if I had enough memory, I would rather launch a single kernel over a 3-D nMiddleSPs x maxMBDublets x maxMTDublets grid. But c'est la vie...
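Schematically, one such per-middle-spacepoint launch looks like the following; the block shape is illustrative, and all kernel arguments beyond the two counts are elided:

```
// Kernel defined in FindTriplets.cu; x indexes this middle spacepoint's
// middle-bottom dublets, y its middle-top dublets.
__global__ void findTriplets(unsigned int nMBDublets,
                             unsigned int nMTDublets /* , ... */);

void launchFindTriplets(unsigned int nMBDublets, unsigned int nMTDublets) {
  const dim3 blockSize(32, 8);  // illustrative block shape
  // Round the grid up so that every (bottom, top) dublet pair is covered.
  const dim3 gridSize((nMBDublets + blockSize.x - 1) / blockSize.x,
                      (nMTDublets + blockSize.y - 1) / blockSize.y);
  findTriplets<<<gridSize, blockSize>>>(nMBDublets, nMTDublets);
}
```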

Similar to the dublet finding, this kernel also outputs its results in 2 variables.

  • tripletsPerBottomDublet: A 1-D array of size nMiddleBottomDublets, storing the number of triplets that were found for that middle-bottom pair.
  • tripletIndices: A 2-D matrix of size nMiddleBottomDublets x nMiddleTopDublets, storing the indices of all the triplet candidates that were found for a particular middle-bottom pair.

(It also stores the Acts::Cuda::Details::Triplet objects in a single array and sets up some counters in global memory through atomic operations, but that's a bit too much detail for this description.)

And next comes the final step that my implementation still executes on the GPU: the filtering of the triplets for every fixed middle-bottom spacepoint combination. This is done using a 2-D iteration over an nMiddleBottomDublets x maxTripletsPerSpB grid.

https://github.com/krasznaa/acts/blob/cuda2-plugin-20200810/Plugins/Cuda/src/Seeding2/FindTriplets.cu#L707-L735

The latter is a counter that the findTriplets kernel sets up in global memory while reconstructing all the triplets for that one middle spacepoint; it holds the maximal number of triplets found for any middle-bottom spacepoint pair. (This way, as always, some of the GPU threads will have nothing to do, but setting up iterations like this is just much, much easier than trying to avoid non-valid elements of a 2-D grid.)
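The "nothing to do" threads simply hit the usual early-exit guard at the top of the kernel; a sketch with illustrative names and elided arguments:

```
// Threads whose coordinates fall outside of the valid (dublet, triplet)
// index range return immediately. The y bound varies per dublet, which
// is why some threads necessarily idle.
__global__ void filterTriplets2Sp(
    unsigned int nMiddleBottomDublets,
    const unsigned int* tripletsPerBottomDublet /* , ... */) {
  const unsigned int dubletIndex = blockIdx.x * blockDim.x + threadIdx.x;
  const unsigned int tripletIndex = blockIdx.y * blockDim.y + threadIdx.y;
  if ((dubletIndex >= nMiddleBottomDublets) ||
      (tripletIndex >= tripletsPerBottomDublet[dubletIndex])) {
    return;
  }
  // ... compare this triplet against the others of the same dublet ...
}
```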

The filter kernel (Acts::Cuda::Kernels::filterTriplets2Sp) is the single biggest contributor to the execution time on the GPU. This is because inside the kernel it needs to loop over every other triplet that was found for the same middle-bottom spacepoint pair that it is running on. Note however that it's only doing a single loop!

https://github.com/krasznaa/acts/blob/cuda2-plugin-20200810/Plugins/Cuda/src/Seeding2/FindTriplets.cu#L489-L542

In Beomki's code, as far as I understood, this was implemented using a "triple" loop.

https://github.com/acts-project/acts/blob/master/Plugins/Cuda/src/Seeding/Kernels.cu#L721-L778

I instead left all but one loop up to CUDA to parallelise for me.

This filter outputs all the triplets surviving the selections into a single Acts::Cuda::Details::Triplet array. That array is then copied back to the host, to await the "1Sp" filtering once the triplets from every middle spacepoint have been created.

https://github.com/krasznaa/acts/blob/cuda2-plugin-20200810/Plugins/Cuda/include/Acts/Plugins/Cuda/Seeding2/SeedFinder.ipp#L173-L193

I just have one more post to write, then I'll leave you guys alone for now... 😛

@krasznaa
Member Author

Continuing from #371 (comment)...

The final filtering is home to another imperfection in my implementation. 😦 In the CPU implementation the triplet filtering is done using user-defined functions, with the user providing a class that implements the Acts::IExperimentCuts interface. Like this class does:

For the GPU code however we would need to let the user give us device code that would perform similar filtering.

And for a brief while on Monday-Tuesday I thought I had figured out how to do this: introducing the Acts::Cuda::TripletFilterConfig type, which we could use to point at the functions implemented by the user for the triplet filtering.

https://github.com/krasznaa/acts/blob/cuda2-plugin-20200810/Plugins/Cuda/include/Acts/Plugins/Cuda/Seeding2/TripletFilterConfig.hpp

The user would give such an object to Acts::Cuda::SeedFinder in its constructor, and the Acts::Cuda::Kernels::filterTriplets2Sp kernel would use those functions to perform the triplet filtering.

https://github.com/krasznaa/acts/blob/cuda2-plugin-20200810/Plugins/Cuda/src/Seeding2/FindTriplets.cu#L718-L719

The user would have to implement fairly complicated code for setting up these function pointers correctly, but it would at least be technically possible.

https://github.com/krasznaa/acts/blob/cuda2-plugin-20200810/Tests/UnitTests/Plugins/Cuda/Seeding2/TestDeviceCuts.cu
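The gist of that complicated setup, for the curious: the host cannot take the address of a __device__ function directly, so the user has to stash the pointer in a __device__ variable and copy its value out with cudaMemcpyFromSymbol. A sketch with illustrative names (including the config member); the real recipe is in the linked file:

```
// User-provided filter function, living in device code.
__device__ float testSeedWeight(const SpacePoint& bottom,
                                const SpacePoint& middle,
                                const SpacePoint& top) {
  return 0.0f;  // experiment-specific weighting would go here
}

// Device-side variable holding the function's address...
__device__ decltype(&testSeedWeight) testSeedWeightPtr = testSeedWeight;

// ...whose value the host copies into the filter configuration.
void setupFilter(Acts::Cuda::TripletFilterConfig& config) {
  cudaMemcpyFromSymbol(&(config.seedWeight), testSeedWeightPtr,
                       sizeof(testSeedWeightPtr));
}
```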

And as I wrote earlier, for a while I thought it would work. In my debug-build test runs this worked beautifully. But then I switched to running tests in an optimised build, and everything broke. 😦 Without any error message from CUDA, the results of my test jobs became very different from the CPU code's results. 😕

I'll be following up about that with the (one) NVidia developer that I know. But for now I had to give up on using user-defined functions in the GPU code, and just hardcoded the same settings that Beomki's code uses into the "central" CUDA code.

https://github.com/krasznaa/acts/blob/cuda2-plugin-20200810/Plugins/Cuda/src/Seeding2/FindTriplets.cu#L372-L399

In conclusion, the current code "sort of" works. But as we discussed on Monday, the final filtering step of the CPU code is very hard to do on a GPU. As long as I only have enough memory for a single middle spacepoint's data, it's a filtering that can only happen on a single thread, so in the current implementation it is done on the CPU. But for larger spacepoint multiplicities this makes it very clear that my GPU is not nearly 100% utilised, as doing the filtering on the CPU takes a noticeable amount of time between kernel launches. 😦

So there's definitely a lot to improve. But I do think that there are enough interesting ideas in this code for it to be added to the plugin like this for now.

This is necessary to make it possible to give user-defined filter
functions to the triplet finding code. (It's a long-standing
issue in CUDA that device code split across shared libraries does
not work well...)
@krasznaa
Member Author

😕 I'm becoming senile...

The issue with not being able to plug "user functions" into the triplet filtering code is something that I've been seeing for 2 years now... See:

https://github.com/krasznaa/CUDALinkTest

It's the same problem here. If the kernel is located in a shared library, and the user's filter function in a separate executable (or shared library), then they just cannot "find each other correctly" at runtime.

So I have now turned the plugin library into a static one. Unfortunately I don't see a way around this for now. It's not great to implement it like this, but it's also probably not the end of the world. We just need to make sure that the plugin gets linked into the final shared libraries and executables such that symbols are found correctly in larger jobs. (As long as glibc cooperates, this should not be an issue. But then again, CUDA should've handled symbol lookups correctly as well...)

@asalzburger
Contributor

Hi @beomki-yeo, could you please drop a line if you are fine with the suggested changes?

@beomki-yeo
Contributor

I am good with the changes

@paulgessinger
Member

Do we want to give @robertlangenberg a chance to also have a look before merging?

@asalzburger asalzburger modified the milestones: Backlog, v0.32.00 Aug 14, 2020
@asalzburger
Contributor

I think we definitely should target the next release for getting this in.

@robertlangenberg
Contributor

Hi @krasznaa,

thanks for the work you put into the PR and the detailed commentary! To avoid losing this somewhere in the PR archives, I think it would be great to have this in the documentation (in another PR, this one is big enough as is).

Long story short, I instead loop over the middle spacepoints on the host, in a for loop. 😦 This way I only need maxMBDublets x maxMTDublets amount of memory at any given time.

this would mean "number of middle SPs" kernel calls; does this cost a significant amount of time? Would it make a difference to call it for several (but not all) middle SPs at once?

As long as I only have enough memory for a single middle spacepoint's data, it's a filtering that can only happen on a single thread.

I assume you mean here "only one, if the only other option is ALL middle SPs"? Similarly to the case above, would it be possible to filter "many" at once?

Indeed, in a debug build (which also implies not using fast math) the differences in matching rate between the two algorithms are not that large. And for the sp1.txt file your algorithm also produces a 100% agreement.

I'm a little nervous to agree that having diverging algorithms would be harmless... Did you check that the agreement was NOT 100% between the CPU implementation and Beomki's implementation with YOUR test file minus the duplicate SPs? I can't tell which implementation you're testing in the timing output. I think it would make further development on the GPU and CPU code quite difficult if 100% agreement cannot be achieved.

@krasznaa
Member Author

Hi Robert,

Thanks for the comments! I agree that the current naive implementation should definitely still be improved.

I agree with you that a good algorithm for the triplet finding would probably be for the code to check how much memory it may use, figure out how many middle spacepoints it can process at once with that much memory, and then divide up the computation into that many steps, instead of processing every single middle spacepoint one-by-one.

Doing it one-by-one is definitely a significant overhead, 😦 and the current implementation is definitely pretty slow. My thinking was just that after this initial PR those improvements would come in additional PRs.

As for the agreement: my current understanding is that the triplet filtering unfortunately does depend on the order in which it sees the triplets. Just because I think the current filtering that the example does can easily assign the same weight to multiple triplets. (Beomki was also struggling with this, as far as I understand.)

My hope is that with more meaningful filtering the differences should disappear completely. If not, we should debug the issue at that point. With the current test setup I didn't think it would be useful to chase those tiny differences any further.

Cheers,
Attila

P.S. Your proposal about adding a description of the code to the documentation is a good one. We will do that for both the CUDA and the upcoming SYCL implementation.

@robertlangenberg
Contributor

robertlangenberg commented Aug 17, 2020

As for the agreement: my current understanding is that the triplet filtering unfortunately does depend on the order in which it sees the triplets.

This is why Beomki added an additional sorting criterion, to order by SP radius if two seeds otherwise have the exact same weight. This would only be expected to happen with the exact same impact parameters, which could occur when there's more than one SP at the exact same location, or when there's a perfectly homogeneous magnetic field and no material interaction. A third possibility may be that float isn't actually accurate enough to distinguish very close SPs: the z coordinate easily exceeds 1000 mm, and x and y can come close. I'm assuming GPU floats are similar to IEEE 754 single-precision floats, for which Wikipedia claims there are between 6 and 9 significant decimals.
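To put a number on that last point (my own back-of-the-envelope arithmetic, for scale): near z = 1024 mm, adjacent IEEE 754 single-precision values are 2^(10-23) = 2^-13 ≈ 0.00012 mm apart, so two spacepoints would have to sit within roughly a tenth of a micron of each other before float resolution alone made them indistinguishable.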

@beomki-yeo
Contributor

A third possibility may be that float isn't actually accurate enough to distinguish very close SPs: the z coordinate easily exceeds 1000 mm, and x and y can come close. I'm assuming GPU floats are similar to IEEE 754 single-precision floats, for which Wikipedia claims there are between 6 and 9 significant decimals.

This might not be the culprit, as he used a debug build where the fast math mode is turned off.

This would only be expected to happen with the exact same impact parameters, which could occur when there's more than one SP at the exact same location, or when there's a perfectly homogeneous magnetic field and no material interaction.

This could be right. But we won't know until we check it thoroughly. As far as I know, @krasznaa removed SPs with the same location?

@robertlangenberg
Contributor

robertlangenberg commented Aug 18, 2020

This might not be the culprit, as he used a debug build where the fast math mode is turned off.

This is just a property of the data type (i.e. it is independent of the math involved).

One more thing that came to my mind about why there may be a difference between the CPU and GPU implementations: a bug fix for the CPU which I didn't implement in the GPU code, which would lead to a small difference between the two! #358

@krasznaa
Member Author

This is just a property of the data type (i.e. it is independent of the math involved). One more thing that came to my mind about why there may be a difference between the CPU and GPU implementations: a bug fix which I didn't implement in the GPU code, which would lead to a small difference between the two! #358

It may play a role. (I'll test that tomorrow.) But I doubt it, since I never observed any count differences between the CPU and (my) GPU code after the "dublet search".

As I wrote before, I really don't want to debug the differences in detail on these very imperfect tests. 😦 (For Christ's sake, even the selection of bottom/middle/top spacepoints is ridiculously busted in my test.) I'd much rather wait for the proper testing setup that we've been promised, and do the "serious debugging" with that.

Turns out that the code was only counting the number of spacepoint
duplicates so far; it was not actually discarding them...

At the same time I updated the name of the executable to be in line with
Andi's "new naming".
@krasznaa
Member Author

The update in #358 didn't make much of a difference, as I suspected. But I thought for a moment that I had found the reason for my issues.

Previously the code was just complaining about spacepoint duplicates coming from the input file, but wasn't actually removing any of them from the test. It was yet another instance of exiting for-loops with continue/break incorrectly. 😦

Plus I also gave in, and started comparing the spacepoints at the end of the seed finding not by pointers, but using their equality operators. (Previously I used pointer comparisons mainly for speed, and to test that the two algorithms grouped the exact same objects in memory into triplets.)

Unfortunately after all of this, the agreement is still not perfect. 😦 In a debug build I see this:

[bash][Elrond]:build-dbg > time ./bin/ActsUnitTestSeedfinderCuda2 -f ../seeds/sp2.txt -d
profiling:/data/projects/acts/build-dbg/Tests/UnitTests/Plugins/Cuda/Seeding2/CMakeFiles/ActsUnitTestSeedfinderCuda2.dir/main.cpp.gcda:Version mismatch - expected A75* got A84*
profiling:/data/projects/acts/build-dbg/Tests/UnitTests/Plugins/Cuda/Seeding2/CMakeFiles/ActsUnitTestSeedfinderCuda2.dir/CommandLineArguments.cpp.gcda:Version mismatch - expected A75* got A84*
profiling:/data/projects/acts/build-dbg/Tests/UnitTests/Plugins/Cuda/Seeding2/CMakeFiles/ActsUnitTestSeedfinderCuda2.dir/ReadSeedFile.cpp.gcda:Version mismatch - expected A75* got A84*
profiling:/data/projects/acts/build-dbg/Tests/UnitTests/Plugins/Cuda/Seeding2/CMakeFiles/ActsUnitTestSeedfinderCuda2.dir/TestSpacePoint.cpp.gcda:Version mismatch - expected A75* got A84*
profiling:/data/projects/acts/build-dbg/Tests/UnitTests/Plugins/Cuda/Seeding2/CMakeFiles/ActsUnitTestSeedfinderCuda2.dir/TestHostCuts.cpp.gcda:Version mismatch - expected A75* got A84*
1275 duplicates found in: ../seeds/sp2.txt
Read 156776 spacepoints from file: ../seeds/sp2.txt
Done with the seedfinding on the host
Done with the seedfinding on the device

-------------------------- Results ---------------------------
|          |     Host     |    Device    | Speedup/agreement |
--------------------------------------------------------------
| Time [s] |   46.488000  |      26.468  |       1.756385    |
|   Seeds  |       79798  |       79798  |      99.877190    |
--------------------------------------------------------------


real    11m4.112s
user    11m2.647s
sys     0m1.256s
[bash][Elrond]:build-dbg > time ./bin/ActsUnitTestSeedfinderCuda2 -f ../seeds/sp2.txt
profiling:/data/projects/acts/build-dbg/Tests/UnitTests/Plugins/Cuda/Seeding2/CMakeFiles/ActsUnitTestSeedfinderCuda2.dir/main.cpp.gcda:Version mismatch - expected A75* got A84*
profiling:/data/projects/acts/build-dbg/Tests/UnitTests/Plugins/Cuda/Seeding2/CMakeFiles/ActsUnitTestSeedfinderCuda2.dir/CommandLineArguments.cpp.gcda:Version mismatch - expected A75* got A84*
profiling:/data/projects/acts/build-dbg/Tests/UnitTests/Plugins/Cuda/Seeding2/CMakeFiles/ActsUnitTestSeedfinderCuda2.dir/ReadSeedFile.cpp.gcda:Version mismatch - expected A75* got A84*
profiling:/data/projects/acts/build-dbg/Tests/UnitTests/Plugins/Cuda/Seeding2/CMakeFiles/ActsUnitTestSeedfinderCuda2.dir/TestSpacePoint.cpp.gcda:Version mismatch - expected A75* got A84*
profiling:/data/projects/acts/build-dbg/Tests/UnitTests/Plugins/Cuda/Seeding2/CMakeFiles/ActsUnitTestSeedfinderCuda2.dir/TestHostCuts.cpp.gcda:Version mismatch - expected A75* got A84*
Read 158051 spacepoints from file: ../seeds/sp2.txt
Done with the seedfinding on the host
Done with the seedfinding on the device

-------------------------- Results ---------------------------
|          |     Host     |    Device    | Speedup/agreement |
--------------------------------------------------------------
| Time [s] |   47.768000  |      27.333  |       1.747631    |
|   Seeds  |       80638  |       80638  |      99.890870    |
--------------------------------------------------------------


real    1m16.932s
user    1m15.664s
sys     0m1.172s
[bash][Elrond]:build-dbg >

While in an optimised build:

[bash][Elrond]:build > ./bin/ActsUnitTestSeedfinderCuda2 -f ../seeds/sp2.txt -d
1275 duplicates found in: ../seeds/sp2.txt
Read 156776 spacepoints from file: ../seeds/sp2.txt
Done with the seedfinding on the host
Done with the seedfinding on the device

-------------------------- Results ---------------------------
|          |     Host     |    Device    | Speedup/agreement |
--------------------------------------------------------------
| Time [s] |    4.757000  |       3.606  |       1.319190    |
|   Seeds  |       79798  |       79797  |      99.883456    |
--------------------------------------------------------------

[bash][Elrond]:build > ./bin/ActsUnitTestSeedfinderCuda2 -f ../seeds/sp2.txt
Read 158051 spacepoints from file: ../seeds/sp2.txt
Done with the seedfinding on the host
Done with the seedfinding on the device

-------------------------- Results ---------------------------
|          |     Host     |    Device    | Speedup/agreement |
--------------------------------------------------------------
| Time [s] |    4.898000  |       3.676  |       1.332427    |
|   Seeds  |       80638  |       80637  |      99.880949    |
--------------------------------------------------------------

[bash][Elrond]:build >

(I only used time in the debug test to show how ridiculously long it takes to look for spacepoint duplicates in a debug build...)

At this point I'm really out of ideas. But just as before, I would really prefer to do the rest of the debugging on better defined/implemented tests. So... any chance that this could go in at some point? At this stage I don't want to make any large changes to the PR anymore. But there are a couple of things that I could do in future PRs.

@robertlangenberg
Contributor

While some debugging is still required, we agreed to push this in as it doesn't affect any other code.

@robertlangenberg robertlangenberg merged commit d7fa3c8 into acts-project:master Aug 20, 2020
msmk0 pushed a commit that referenced this pull request Aug 21, 2020
This adds two things:
1) Change the equal operator (again...) to consider data at the same location, which happened in [PR371](#371)
2) Fix a bug in Kernels.cu by putting syncthreads() in the right place. 
Without it, the code freezes with the data file provided at [PR371](#371)

I also tried to check briefly when the agreement reaches 100% by increasing the limits for the number of triplets stored
```
./ActsUnitTestSeedfinderCuda -f sp2.txt -q -l 2 -m 10    #-> 2.6 speedup | 72.6% matching rate
./ActsUnitTestSeedfinderCuda -f sp2.txt -q -l 2 -m 100   #-> 1.3 speedup | 98.6% matching rate  
./ActsUnitTestSeedfinderCuda -f sp2.txt -q -l 10 -m 500  #-> 0.7 speedup | 99.9% matching rate
./ActsUnitTestSeedfinderCuda -f sp2.txt -q -l 20 -m 500  #-> 0.6 speedup | 100% matching rate
```
`-l` is a limit on the average number of triplets per bottom spacepoint. This is used for determining the matrix size for triplets per middle spacepoint.
`-m` is a limit on the number of triplets per bottom spacepoint.

Giving large values to those options slows the process in several ways: (1) large CPU & GPU DRAM buffers are allocated, and (2) large shared memory is declared, which reduces the number of GPU blocks launched in parallel.

Since the number of triplets per bottom spacepoint varies a lot, a large value must be given to the `-m` flag to get a good matching rate. Counting their numbers in advance, before launching the kernel function, would probably be better for this kind of data distribution. However, doing that efficiently would be another task.
@krasznaa krasznaa deleted the cuda2-plugin-20200810 branch August 27, 2020 14:22
Labels
Component - Plugins (Affects one or more Plugins) · Feature (Development to integrate a new feature)