-
Notifications
You must be signed in to change notification settings - Fork 51
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Changed device clusterization to flat EDM & cell-level parallelism #299
Changed device clusterization to flat EDM & cell-level parallelism #299
Conversation
I ran our throughput tests at ttbar_mu300 with 10K events and it seems that (running on a NVIDIA RTX A4000) we're getting benefits on the single-threaded throughput of about 60% on both SYCL and CUDA. However, on the multi-threaded application I only saw an improvement of about 20% for CUDA and less than 10% for SYCL. |
Brilliant, very good! When we get this in, can we get rid of the old FastSV implementation? I assume this supersedes it. |
For sure. The only real difference between the version I implemented here and that on the CCA directory is that I chose not to have a distinction between number of threads and number of cells per partition, so that all threads only ever deal with 1 cell, unlike what happened in the original algorithm. From my experiments, there was no gain performancewise in having the distinction between these 2. |
The CI is currently running into some issues with sycl as I was using the 2020 version on my end. Any plans to update the dependencies on the CI machine to also run the newer version or should I change my code to be compatible with the older one? |
As an additional note, these changes should probably only be made after ensuring the CPU-only algorithms do not incur in any performance loss from changing to this flat EDM at event read-in (the change from cells to alt-cells introduced in #288 , any other EDM types internal to clusterization don't need to change if there are losses there), as we can't really have 2 different ways of reading input data depending on whether said event is going to run on a CPU or a GPU. |
f45c5be
to
1028f75
Compare
I think this should be reasonably ready for a review at this point. For whoever is assigned this heroic task, I suggest taking a look at the initial comment I made when opening this PR which explains / organises things nicely |
1028f75
to
54812a1
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Wow... is this a monster PR...
device/common/include/traccc/seeding/device/count_grid_capacities.hpp
Outdated
Show resolved
Hide resolved
426ac95
to
4afb4af
Compare
I have addressed the comments above. Also added the functionality in #308 but, instead of entirely removing the partition class, I made it into an alias of unsigned int. I am, however, fine with just removing it. |
I also intend on adding #309 on top of this but in a separate PR. |
4afb4af
to
cdee30f
Compare
f38c896
to
116578e
Compare
scalar weight; | ||
scalar z_vertex; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can discuss this later, perhaps on a different PR, but this information should not really live in the seed EDM.
#include "traccc/clusterization/device/count_cluster_cells.hpp" | ||
#include "traccc/clusterization/device/create_measurements.hpp" | ||
#include "traccc/clusterization/device/find_clusters.hpp" | ||
#include "traccc/clusterization/device/aggregate_cluster.hpp" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am going to assume that this is just a copy of the CUDA code in component_connection.cu
, so I will skip over it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Pretty much. With the difference of using simply 1 cell per thread (which I wanted to address in a separate PR with the unrolling we discussed)
@@ -9,45 +9,325 @@ | |||
#include "../utils/get_queue.hpp" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have this code hooked up to our CCA tests? Would be useful to see whether this code does what it's supposed to do. I trust that it's a faithful translation but I am worried about subtle differences in memory and execution models between CUDA and SYCL.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This would definately be good but if I understand correctly we currently don't have a way of executing SYCL code in our CI tests. I can however double check this by doing some further testing locally.
I think we should move forward with this, flattening the EDM is going to be great for our throughput + will allow a lot more future improvements to the code. |
6b6c66e
to
ea896e1
Compare
8d9fccd
to
330eef4
Compare
I have a question in Another question is that if the speed gain is mainly either from the new EDM or the CCA algorithm. If the latter is the case and CCA algorithm can be applied to the old EDM, we might not have to change the EDM at all, I suppose 🤔 |
@beomki-yeo even with the current EDM we are not using the detector modules for anything outside of clustering, and this information is not passed down to seeding. If we need the detector modules for CKF that's something we'd need to add at that point. Regarding the performance gain, it's down to both, as the CCA algorithm heavily relies on having a flat EDM, and I don't see why we would want to make it less efficient by sticking to the old one. |
…e with deprecated accessors
Changed partition class into an alias of unsigned
330eef4
to
bf4ed85
Compare
@stephenswat can clarify this but I would think that a flat EDM and CCA are orthogonal. Could you educate me about how a flat EDM contributes to the performance gain? Also have you tested the original algorithm with a flat EDM and compare with the current performance? And.. is it necessary to remove the original algorithm? 🤔 |
@beomki-yeo on top of jagged vectors just being inherently slightly slower to handle with copying, our "prefix sums", etc. the module-based jaggedness results in very differently sized subvectors. By having all cells in a single vector it eases the creation of these partitions which all have around the same size which makes the algorithm a lot faster by fully using all GPU threads. |
I have not, this could be interesting to look at though |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Okay this has been sitting here for more than long enough, if the CI is happy with the changes let's put them in.
This depends on #288now mergedThis very extensive PR changes the device clusterization algorithm to a handling single vectors' only, without any jaggeds.
The main changes in this PR are:
alt_measurement
. Similarly toalt_cell
, this holds a measurement and a link to a module to store all information in a single collection rather than a container with item + header.reduce_problem_cell
andaggregate_cluster
.Further changes are mostly minor, and some are almost just duplication of code. Feel free to be somewhat less thourough in reviewing these.
There's also still some stuff missing which I opted not to add in this PR so as not to make it even longer, namely creating "comparator" code for comparing a collection with a container. Adding a "read_alt" function which reads more than one event in parallel using TBB with the new alt_cell_collection. Create physics performance testing code using the new EDM types. Check whether CPU/Core algorithms benefit from using the jagged clusterization rather than a non-jaggged one. If it does not, this would be quite useful, as we could just ditch this duplicate code and delete the "old" edm by replacing it with the "alt".