Add a "second" CUDA based SeedFinder implementation, master branch (2020.08.11.) #371
Add a "second" CUDA based SeedFinder implementation, master branch (2020.08.11.) #371
Conversation
Specifying explicitly which architectures to generate code for, specifying that debug builds should attach debug symbols to both the host and device CUDA code, and making all of these configurations overridable from the command line.
In order not to interfere with the previously implemented CUDA based seed finding, I've put my own implementation into a separate plugin library. The separate library is necessary because this library uses a simpler build setup than ActsCudaPlugin.
This code is used to figure out the exact parallelisation for the triplet finding code.
To make it easier to use it in the CUDA plugin code.
I forgot that the variable would be pre-defined as a cache variable by CMake itself. So my setting was not doing anything.
As an alternative to Acts::Seedfinder<..., Acts::Cuda>. Unfortunately the code still behaves a bit worryingly, but at least it more or less works...
…ode. The setup is a little clumsy, as function pointers are a bit difficult with CUDA, but this seemed to be the most reasonable setup.
Based very much on the existing seedfinder tests. Only modifying it to make it a bit sleeker in some places, and to accommodate the setup requirements of Acts::Cuda::SeedFinder.
For some reason function pointers do not work correctly with optimised CUDA builds. (I developed the code on a debug build...) The results become incorrect without any error from CUDA, in such optimised builds. :-/
I forgot about improving the code this way after introducing the Acts::Cuda::Details::DubletCounts::maxTriplets variable.
How are the spacepoint files you're using different from the ones we had before? Do you understand why performance is so much worse with them? Do we know if either of them is "realistic"?
Codecov Report

```
@@           Coverage Diff           @@
##           master     #371   +/-  ##
=======================================
  Coverage   49.06%   49.06%
=======================================
  Files         325      325
  Lines       16036    16036
  Branches     7395     7395
=======================================
  Hits         7868     7868
  Misses       2957     2957
  Partials     5211     5211
```

Continue to review the full report at Codecov.
None of this is realistic in any way, Charles. 😦 From what I can tell, Beomki was using files generated with the ATLAS geometry "back in the day", and adjusted the selection cuts accordingly.
I, on the other hand, received files generated with the TrackML geometry, while keeping all the test cuts exactly the same as Beomki had them. So the code is not doing anything nearly reasonable on my test files, and I'm not trying to throw shade at the performance numbers that I saw with Beomki's code. If we use unrealistic selection cuts, it's fairly random whether the CPU or the GPU implementation will behave better. My big concern was that with these non-optimal cuts the current code (just so I wouldn't keep referring to it as "Beomki's code" all the time 😛) was not reproducing the CPU code's results all too well at high multiplicities. As I showed in:
It's the low matching rate that made me start my experiments. I very much believe that the current code makes some assumptions about the inputs it receives, which may not hold for all inputs. (Again, the jobs that I'm running are not realistic. Still, the algorithm should either stop with an error message if one of its assumptions is not fulfilled, or provide the correct results. Even if only slowly.)
Thanks Attila - I'm not at all trying to question your motivations ;-) I'm just trying to understand the differences. Obviously it makes no sense to try to develop algorithms against unrealistic data sets and cuts. But if different experiments and geometries require very different algorithms to be performant, then that should be taken into consideration. It may not be possible to do this generically with a single implementation. It would be nice if we could classify the inputs / geometries so that we can decide: if it's like "this", do it on the CPU (i.e. small number of spacepoints). If it's like "that", do it on the GPU with "this" implementation. If it's like "this other thing", do it on the GPU with yet another implementation.
Phew... I didn't actually test the code on our RTX 2060 card until now... But it does deliver. 😄
As you can see, the implementation fits into the 6 GB of memory of that card with my "
So... A little bit about the implementation...

The code, of course, starts with setting up simple blobs of memory with the properties of the spacepoints, so that they can be moved to the GPU. Note however that I don't just use generic float arrays for this, but rather created a custom struct (

At this point I expect that a number of people will hiss that I'm using an array of structs instead of a struct of arrays. But what I learned in the last weeks is that arrays of structs are not as evil with GPUs as I was led to believe... First off, memory-copy-wise it makes no difference that I lay out the memory like this. The memory for these objects is allocated in a typeless way, and then the memory contents are copied in a typeless way. So no difference there.

But where I do think it makes a difference is in how efficiently the GPU threads can load the properties of spacepoints while the kernel is running. In my kernels I decided to load all properties of a spacepoint into the thread in a single operation, like I do here for instance:

instead of loading all required properties of a single spacepoint with N operations. Since the seed finding algorithm doesn't simply loop linearly over spacepoints, having the same kinds of coordinates of all the spacepoints tightly packed in memory just doesn't seem to make a difference. And CUDA was able to produce much faster code when I implemented my kernels with these helper structs than when I was accessing individual elements of large float arrays. (The way that I explained this to myself is that the compiler can probably hardcode information about the layout of my variables in memory better than when I index large arrays using a bunch of helper variables.)

Then... After copying the spacepoints to the device, I run "dublet finding" on them using the

I'm actually quite happy with how that code turned out. 😄 I start the kernel twice (per

For the output of the dublet finding, I create 4 arrays:
Okay, let me stop here for now. 😄 I'll describe more about the triplet finding and filtering in subsequent posts.
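Purely to illustrate the array-of-structs point above, here is a minimal sketch. The struct name, its members, and the kernel are my assumptions, not the actual plugin code:

```
#include <cstddef>

// Illustrative sketch only: a hypothetical spacepoint record. The actual
// struct in the plugin has a different name and more members.
struct Spacepoint {
  float x, y, z;  // global position
  float radius;   // transverse radius
};

// Each thread loads all properties of "its" spacepoint with one struct
// read, instead of N separate reads from N separate float arrays.
__global__ void dubletFindKernel(std::size_t nSpacepoints,
                                 const Spacepoint* spacepoints) {
  const std::size_t i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i >= nSpacepoints) {
    return;
  }
  const Spacepoint sp = spacepoints[i];
  // ... use sp.x, sp.y, sp.z and sp.radius in the dublet selection ...
}
```

The host-to-device copy itself is unaffected by such a layout: the memory is allocated and copied as raw bytes either way.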
For reproducibility, you have to pass the "-fmad=false" flag in the CMake configuration. I clarified this in PR104. For more information on the floating point behaviour of CPUs and GPUs, check here.
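For reference, one way of passing that flag through the CMake configuration could look like the line below; whether this exact form matches this project's build setup is my assumption:

```
cmake -DCMAKE_CUDA_FLAGS="-fmad=false" <source-directory>
```

The flag stops nvcc from contracting multiplications and additions into fused multiply-add instructions, which is one of the usual sources of CPU/GPU floating point differences.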
I totally agree with this. What I would have done, once I found that there is a difference in speedup between two different data sets, is to look for the source of the difference. For example:
Unless we understand this clearly, we will encounter the same problem with other data sets in the future.
Hmm... 😕 I was absolutely convinced that I had tested your code in debug mode as well. (

Indeed, in a debug build (which also implies not using fast math) the differences in matching rate between the two algorithms are not that large. And for the
(I also just wanted to show how interestingly the speeds of the two algorithms differ in a debug build... 😄) We should have a discussion at one point; let's try to set it up through Charles, to discuss the implementation details of the algorithms a bit. (I'll write some more "posts" about my "new" algorithm's inner workings today.)
Forgot to remove those assert statements when I removed the usage of the user defined filter functions. (As these do not work correctly in optimised builds for some reason. I'm following up about that at the moment...)
Try with "-l 2" or "-l 3" or "-l 4" option where "-l 1" is the default. This flag controls the data space in triplet finding |
It was a worthwhile check, but it only seems to make a very small difference in this particular case.
Thanks for checking. Since the matching rate does not converge, it seems that the expected number of triplets per bottom spacepoint is much larger than in the data files that I used.
It seems that some input files are not perfect: they contain multiple spacepoints with the exact same spatial coordinates. This makes the seed finding's reproducibility on a GPU pretty much impossible. At the same time, added a print operator for TestSpacePoint, as I was using it for debugging while writing the dublet searching code.
So... As I mentioned in my opening remarks as well, unfortunately the test files I use right now are very far from perfect. I have now added the ability to my test executable to look for, and remove, spacepoints that have the exact same x-y-z coordinates. And there are a good number of them... 😦
These results are from an optimised build, so 100% agreement is not expected. It just takes super long to count the duplicates on the host in debug mode, so I didn't want to run like that too much. Plus the one test that I ran with the duplicates removed didn't actually bring my algorithm's agreement to 100%, unfortunately. Still, I would argue that not getting 100% agreement with these files is not the end of the world; it's only once the agreement is very small that things become suspicious. Hopefully I'll get some better files within the next week, then we can make some more meaningful comparisons between the algorithms. 😉
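A minimal sketch of such a duplicate-removal step; `TestSpacePoint` is mentioned in the thread, but its members and this particular implementation are my assumptions:

```
#include <algorithm>
#include <tuple>
#include <vector>

struct TestSpacePoint {
  float x, y, z;
};

// Remove spacepoints that share the exact same x-y-z coordinates.
std::vector<TestSpacePoint> removeDuplicates(std::vector<TestSpacePoint> sps) {
  std::sort(sps.begin(), sps.end(), [](const auto& a, const auto& b) {
    return std::tie(a.x, a.y, a.z) < std::tie(b.x, b.y, b.z);
  });
  sps.erase(std::unique(sps.begin(), sps.end(),
                        [](const auto& a, const auto& b) {
                          return a.x == b.x && a.y == b.y && a.z == b.z;
                        }),
            sps.end());
  return sps;
}
```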
Let me continue my description from #371 (comment). (Not to point fingers, but being able to discuss in "threads" on GitLab does have its uses... 😛)

I described how the code would form dublet pairs, and what sort of data structures it would store the results of the dublet search in. In the next step the code needs to start executing code on these dublets. But to do that, it has to know how many dublets there are. Remember, the GPU at this point has just filled 2

What I need to do is a textbook "reduction" operation. I need to sum up some values, and find maxima for some others. So I just looked at the CUDA samples to figure out how to do this. 😛

https://docs.nvidia.com/cuda/cuda-samples/index.html#cuda-parallel-reduction

I went with the implementation requiring the fewest features from the GPU, as we really don't need this code to be super optimised. (In the end it accounts for O(0.1%) of the total GPU usage in my test jobs.)

Now, knowing the total number of dublets that were found, the total number of triplets that may be reconstructed out of them, and a couple of more parameters (have a look at the declaration of

But before doing that, just like Beomki's code, my code also creates

This is where my implementation becomes non-optimal. 😦 In the next step, in order to be able to run the "Sp2" filtering on the created triplets efficiently later on, I need to create triplet candidates separately for every middle-bottom spacepoint combination. In total this would mean creating an integer (index) array of size

Long story short, I instead loop over the middle spacepoints on the host, in a

Plus the 1-D arrays for the triplet objects.

Since this "post" is becoming rather long, let me continue in a separate one...
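For illustration, a "textbook" shared-memory reduction of the kind described above could look roughly like this; the names are mine, modelled on the CUDA samples rather than taken from this PR. It assumes a power-of-two block size, and `2 * blockDim.x * sizeof(unsigned int)` bytes of dynamic shared memory at launch:

```
#include <cstddef>

// Sum all per-spacepoint dublet counts, and find their maximum, in one pass.
__global__ void reduceCounts(std::size_t n, const unsigned int* counts,
                             unsigned int* sum, unsigned int* maximum) {
  extern __shared__ unsigned int shared[];
  unsigned int* sSum = shared;               // blockDim.x elements
  unsigned int* sMax = shared + blockDim.x;  // blockDim.x elements

  const std::size_t i =
      static_cast<std::size_t>(blockIdx.x) * blockDim.x + threadIdx.x;
  const unsigned int value = (i < n) ? counts[i] : 0u;
  sSum[threadIdx.x] = value;
  sMax[threadIdx.x] = value;
  __syncthreads();

  // Classic shared-memory tree reduction, as in the CUDA samples.
  for (unsigned int stride = blockDim.x / 2; stride > 0; stride /= 2) {
    if (threadIdx.x < stride) {
      sSum[threadIdx.x] += sSum[threadIdx.x + stride];
      sMax[threadIdx.x] = max(sMax[threadIdx.x], sMax[threadIdx.x + stride]);
    }
    __syncthreads();
  }
  // One atomic per block combines the per-block results.
  if (threadIdx.x == 0) {
    atomicAdd(sum, sSum[0]);
    atomicMax(maximum, sMax[0]);
  }
}
```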
Continuing from #371 (comment)...

For every middle spacepoint the code ends up first launching the

It does so in a 2-D loop, over

Similar to the dublet finding, this kernel also outputs its results in 2 variables.
(It also stores the

And next comes the final step that my implementation still executes on the GPU: the filtering of the triplets for every fixed middle-bottom spacepoint combination. This is done using a 2-D iteration over an

The latter is a counter that the

The filter kernel (

In Beomki's code, as far as I understood, this was implemented using a "triple" loop.

https://github.com/acts-project/acts/blob/master/Plugins/Cuda/src/Seeding/Kernels.cu#L721-L778

I instead left all but one loop up to CUDA to parallelise for me. This filter outputs all the triplets surviving the selections into a single

I just have one more post to write, then I'll leave you guys alone for now... 😛
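As a sketch of that "leave all but one loop to CUDA" idea: the 2-D grid below covers the (bottom spacepoint, triplet candidate) pairs for one middle spacepoint, and only the comparison against the other candidates of the same pair remains a sequential loop. Every name and the flattened-array layout here are my assumptions:

```
// For one middle spacepoint: decide for each triplet candidate whether it
// survives, by comparing it to the other candidates of the same
// middle-bottom spacepoint combination.
__global__ void filterTriplets(unsigned int nBottom, unsigned int maxTriplets,
                               const unsigned int* tripletCounts,  // per bottom SP
                               const float* weights,  // [nBottom * maxTriplets]
                               unsigned int* survivorFlags) {
  const unsigned int bottom = blockIdx.x * blockDim.x + threadIdx.x;
  const unsigned int triplet = blockIdx.y * blockDim.y + threadIdx.y;
  if (bottom >= nBottom || triplet >= tripletCounts[bottom]) {
    return;
  }
  const unsigned int index = bottom * maxTriplets + triplet;
  unsigned int keep = 1u;
  // The one loop not parallelised by CUDA: a toy "largest weight wins"
  // selection, standing in for the real filter logic.
  for (unsigned int j = 0; j < tripletCounts[bottom]; ++j) {
    if (j != triplet && weights[bottom * maxTriplets + j] > weights[index]) {
      keep = 0u;
      break;
    }
  }
  survivorFlags[index] = keep;
}
```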
Continuing from #371 (comment)... The final filtering is home to another imperfection in my implementation. 😦 In the CPU implementation the triplet filtering is done using user-defined functions, with the user providing a class that implements the Acts::IExperimentCuts interface. Like this class does:
For the GPU code however we would need to let the user give us device code that would perform similar filtering. And for a brief moment on Monday-Tuesday I thought I had figured out how to do this. Introducing the

The user would give such an object to

The user would have to implement fairly complicated code for setting up these function pointers correctly, but it would at least be technically possible. And as I wrote earlier, for a while I thought it would work. In my debug build test runs this worked beautifully. But then I switched to running tests in an optimised build, and everything broke. 😦 Without any error message from CUDA, the results from my test jobs became very different from the CPU code's results. 😕 I'll be following up about that with the (one) NVidia developer that I know. But for now I had to give up on using user-defined functions in the GPU code, and just hardcoded the settings used by Beomki's code into the "central" CUDA code.

In conclusion, the current code "sort of" works. But as we discussed on Monday, the final filtering step of the CPU code is very hard to do on a GPU. As long as I only have enough memory for a single middle spacepoint's data, it's a filtering that can only happen on a single thread. So in the current implementation it is done on the CPU. But for larger spacepoint multiplicities this makes it very clear that my GPU is not nearly 100% utilised, as doing the filtering on the CPU takes a noticeable time between kernel launches. 😦

So there's definitely a lot to improve. But I do think that there are enough interesting ideas in this code for it to be added to the plugin like this for now.
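For completeness, the usual CUDA pattern for passing a user-defined device function to a kernel looks something like the sketch below; the names are mine, and as described above, this is exactly the kind of setup that silently broke in optimised builds:

```
#include <cuda_runtime.h>

// A user-defined filter, compiled as device code.
using TripletFilter = float (*)(float bottomWeight, float topWeight);

__device__ float myExperimentFilter(float bottomWeight, float topWeight) {
  return bottomWeight + topWeight;  // toy weight calculation
}

// A device-side variable holds the function's address...
__device__ TripletFilter deviceFilter = myExperimentFilter;

// ...which the host copies out, so that it can later be passed to a
// kernel as an argument.
TripletFilter getFilterPointer() {
  TripletFilter hostCopy = nullptr;
  cudaMemcpyFromSymbol(&hostCopy, deviceFilter, sizeof(TripletFilter));
  return hostCopy;
}
```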
This is necessary to make it possible to give user-defined filter functions to the triplet finding code. (It's a long-standing issue in CUDA that device code split across shared libraries does not work well...)
😕 I'm becoming senile... The issue with not being able to plug "user functions" into the triplet filtering code is something that I've been seeing for 2 years now. See: https://github.com/krasznaa/CUDALinkTest

It's the same problem here. If the kernel is located in a shared library, and the user's filter function in a separate executable (or shared library), then they just cannot "find each other correctly" at runtime. So I have now turned the plugin library into a static one. Unfortunately I don't see a way around this for now. It's not great to implement it like this, but it's also probably not the end of the world. We just need to make sure that the plugin gets linked into the final shared libraries and executables such that symbols are found correctly in larger jobs. (As long as glibc cooperates, this should not be an issue. But then again, CUDA should've handled symbol lookups correctly as well...)
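In build-system terms, the change could amount to something like this sketch; the target name appears elsewhere in the PR, the rest is my assumption:

```
# Build the plugin as a static library, so that the device code of the
# kernels and of the user-provided filter functions ends up in the same
# final binary, where the symbols can find each other.
add_library(ActsCuda2Plugin STATIC ${plugin_sources})
set_target_properties(ActsCuda2Plugin PROPERTIES
    CUDA_SEPARABLE_COMPILATION ON)  # relocatable device code across files
```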
Hi @beomki-yeo, could you please drop a line if you are fine with the suggested changes?
I am good with the changes |
Do we want to give @robertlangenberg a chance to also have a look before merging? |
I think we should definitely target the next release for getting this in.
Hi @krasznaa, thanks for the work you put in the PR and the detailed commentary! To avoid losing this somewhere in the PR archives, I think it would be great to have this in the documentation (in another PR, this one is big enough as is).
This would mean "number of SPs" kernel calls; does this cost a significant amount of time? Would it make a difference to call it for several (but not all) middle SPs at once?
I assume you mean here "only one, if the only other option is ALL middle SPs"? Similarly to the case above, would it be possible to filter "many" at once?
I'm a little nervous to agree that having diverging algorithms would be harmless... Did you check that the agreement was NOT 100% between the CPU implementation and Beomki's implementation with YOUR test file minus the duplicate SPs? I can't tell which implementation you're testing in the timing output. I think it would make further development on the GPU and CPU code quite difficult if 100% agreement cannot be achieved.
Hi Robert,

Thanks for the comments! I agree that the current naive implementation should definitely still be improved. Probably a good algorithm for the triplet finding would be for the code to check how much memory it may use, figure out how many middle spacepoints it can process at once with that much memory, and then divide the computation into that many steps, instead of processing every single middle spacepoint one-by-one. Doing this one-by-one is definitely a significant overhead, and the current implementation is pretty slow. 😦 My thinking was just that after the initial PR these improvements would come in additional PRs. (A sketch of the batching idea follows below.)

As for the agreement: my current understanding is that the triplet filtering unfortunately depends on the order in which it sees the triplets, just because the current filtering that the example does can easily assign the same weight to multiple triplets. (Beomki was also struggling with this, as far as I understand.) My hope is that with more meaningful filtering the differences will disappear completely. If not, we should debug the issue at that point. With the current test setup I didn't think it would be useful to chase those tiny differences any further.

Cheers,

P.S. Your proposal about adding a description of the code to the documentation is a good one. We will do that for both the CUDA and the upcoming SYCL implementation.
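The memory-driven batching proposed above could be sketched like this; the function and the per-middle-spacepoint cost model are my assumptions:

```
#include <cuda_runtime.h>
#include <cstddef>

// Decide how many middle spacepoints to process per kernel launch, based
// on how much device memory is actually available right now.
std::size_t middleSPsPerLaunch(std::size_t bytesPerMiddleSP) {
  std::size_t freeBytes = 0, totalBytes = 0;
  cudaMemGetInfo(&freeBytes, &totalBytes);
  const std::size_t n = freeBytes / bytesPerMiddleSP;
  return (n == 0 ? 1 : n);  // always process at least one middle SP
}
```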
This is why Beomki added an additional sorting criterion, to order by SP radius if two seeds otherwise have the exact same weight. This would only be expected to happen with the exact same impact parameters, which could happen when there's more than one SP at the exact same location, or when there's a perfectly homogeneous magnetic field and no material interaction. A third possibility may be that float isn't actually accurate enough to distinguish very close SPs: the z coordinate easily exceeds 1000 mm, and x and y can come close. I'm assuming GPU floats are similar to IEEE 754 single-precision floats; for those, Wikipedia claims there are between 6 and 9 significant decimal digits.
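As a quick back-of-the-envelope check of that last point (my arithmetic, not from the thread): the spacing between neighbouring single-precision values near $z = 1000\,\mathrm{mm}$ is

$$\mathrm{ulp}(1000) = 2^{\lfloor\log_2 1000\rfloor - 23} = 2^{-14} \approx 6.1\times10^{-5}\ \mathrm{mm},$$

i.e. about seven significant decimal digits, consistent with the 6-9 digits quoted above.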
This might not be the culprit, as he used a debug build, where the fast-math mode is turned off.
This could be right. But we won't know until we check it thoroughly. As far as I know, @krasznaa removed SPs with the same location?
This is just a property of the data type (i.e. it is independent of the math involved). One more thing that came to my mind as to why there may be a difference between the CPU and GPU implementations: a bug fix for the CPU which I didn't implement in the GPU code, and which would lead to a small difference between the two! #358
It may play a role. (I'll test that tomorrow.) But I doubt it, since I never observed any count differences after the "dublet search" between the CPU and (my) GPU code. As I wrote before, I really don't want to debug the differences in detail on these very imperfect tests. 😦 (For Christ's sake, even the selection of bottom/middle/top spacepoints is ridiculously busted in my test.) I'd much rather we wait for the proper testing setup that we've been promised, and do the "serious debugging" with that.
Turns out that the code was only counting the number of spacepoint duplicates so far; it was not discarding them... At the same time, updated the name of the executable to be in line with Andi's "new naming".
The update from #358 didn't make much of a difference, as I suspected. But for a moment I thought that I had found the reason for my issues. Previously the code was just complaining about spacepoint duplicates coming from the input file, but wasn't actually removing any of them from the test. It was yet another instance of exiting for-loops with

Plus I also gave in, and started comparing the spacepoints at the end of the seed finding not by pointers, but rather using their equality operators. (Previously I used pointer comparisons mainly for speed, and to test that the two algorithms grouped the exact same objects in memory into triplets.)

Unfortunately after all of this, the agreement is still not perfect. 😦 In a debug build I see this:
While in an optimised build:
(I only used

At this point I'm really out of ideas. But just the same as before, I would really prefer to do the rest of the debugging on better defined/implemented tests. So... any chance that this could go in at one point? At this stage I don't want to make any large changes to the PR anymore, but there are a couple of things that I could do in future PRs.
While some debugging is still required, we agreed to push this in as it doesn't affect any other code.
This adds two things:

1) Change the equal operator (again...) to consider data at the same location, which happened in [PR371](#371)
2) Fix a bug in Kernels.cu by putting syncthreads() in the right place. Without it, the code freezes with the data file provided at [PR371](#371)

I also tried to check briefly when the agreement gets to 100%, by increasing the limits on the number of triplets stored:

```
./ActsUnitTestSeedfinderCuda -f sp2.txt -q -l 2 -m 10   #-> 2.6 speedup | 72.6% matching rate
./ActsUnitTestSeedfinderCuda -f sp2.txt -q -l 2 -m 100  #-> 1.3 speedup | 98.6% matching rate
./ActsUnitTestSeedfinderCuda -f sp2.txt -q -l 10 -m 500 #-> 0.7 speedup | 99.9% matching rate
./ActsUnitTestSeedfinderCuda -f sp2.txt -q -l 20 -m 500 #-> 0.6 speedup | 100% matching rate
```

`-l` is a limit on the average number of triplets per bottom spacepoint. This is used for determining the matrix size for the triplets per middle spacepoint. `-m` is a limit on the number of triplets per bottom spacepoint.

Giving large values for those options slows the process in several ways: (1) large CPU & GPU DRAM allocations are made; (2) large shared memory is declared, which reduces the number of GPU blocks launched in parallel. Since the number of triplets per bottom spacepoint varies a lot, a large value must be given to the `-m` flag to get a good matching rate. Counting their numbers in advance, before launching the kernel function, would probably be better for this kind of data distribution. However, doing that efficiently would be another task.
This is meant as a replacement for #353. As discussed in that PR, it turned out that the "current" CUDA SeedFinder implementation is really not behaving great for the test files @asalzburger gave me a while ago.
So I set out to write my own implementation for the seed finding with CUDA, to see what's what. This is the result. 😉 Note that I don't touch the existing CUDA implementation in the PR. Mainly to allow for comparisons between the two implementations, since (as you'll see further down) my implementation is definitely not perfect either.
To put this implementation in place, I first of all had to make some small changes in the core code:

- Moved `Acts::SeedFilterConfig` into its own header, so that I could more easily use it independently of the `Acts::SeedFilter` class;
- Added an `Acts::BinnedSPGroupIterator::operator!=(...)` operator to make it easier to iterate using such objects in a `for(...)` loop.

Since the current CUDA code is compiled in a non-optimal way "CMake configuration-wise", I decided to introduce a separate shared library (`ActsCuda2Plugin`) for the new source files. It would take most of #353 to be able to merge all code into the same shared library in a nice way. It could very well be possible in the future, but in this PR I wanted to modify the existing code as little as possible.

I also introduced a new test executable for the new plugin, under `Tests/UnitTests/Plugins/Cuda/Seeding2`. This was necessary as setting up this new seedfinder class works a bit differently than the previous one. (And the build setup for the test's source files is also quite different.)

I'll discuss the implementation details in further comments on the PR; here let me just give some of the results that it can produce. I get the following results with it on my private desktop, which sports an Intel i7-6700 CPU and an NVidia GTX 960 GPU. (So an "okay" CPU, and a fairly old/low-end GPU.)
The last test I ran on only part of the spacepoints, so that it wouldn't take too long. And note that I did not run the test on the largest test file I have available, as that one does not fit into the 2 GB of memory of my GTX card. 😦
As you can see, for a small number of spacepoints it behaves worse than the CPU implementation. But this is not the end of the world. During development I was able to write an algorithm that was faster than my CPU for small numbers of spacepoints. It just didn't allow me to run on larger numbers of spacepoints. So the idea would be to set up the code eventually to choose an algorithm based on the complexity of the event. But that's also something for a future development... 😉
About the non-perfect agreement: Some of it is due to the floating point precision differences between my CPU and GPU. But some of it seems to be due to some weirdness in the test files that I use. (They seem to hold multiple spacepoints with the exact same x-y-z coordinates. Making it impossible to reproduce the CPU results on the GPU.) Hopefully some new test files (already in the works...) will improve things around this later on.
Pinging @cgleggett, @vpascuzz, @beomki-yeo, @paulgessinger and @czangela.