Changed device clusterization to flat EDM & cell-level parallelism #299

Merged

Conversation

guilhermeAlmeida1
Collaborator

@guilhermeAlmeida1 guilhermeAlmeida1 commented Dec 21, 2022

This depends on #288, which is now merged.
This very extensive PR changes the device clusterization algorithm to handle only single (flat) vectors, without any jagged vectors.

The main changes in this PR are:

  • Introduced a partitioning algorithm (on the CPU), which separates the cell vector into similarly sized groups for the device clusterization algorithm.
  • Introduced alt_measurement. Similarly to alt_cell, this holds a measurement and a link to a module, so that all information is stored in a single collection rather than in a container with item + header.
  • Removed all old common device kernels in clusterization and introduced new ones working with the new EDM types: reduce_problem_cell and aggregate_cluster.
  • Changed the clusterization algorithm to use the new EDM types and no jagged vectors. This comes with the small step backwards that a part of our code is no longer single source, as the usage of barriers and group voting requires backend-specific functions.
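To illustrate the partitioning idea from the list above, here is a minimal sketch rather than the code in this PR. The `flat_cell` struct, the `make_partitions` function, and the `target_size` parameter are hypothetical names. The rule for where a cut is allowed (a module change or a gap of more than one row) is an assumption about what makes a cut safe, and it presumes the cells are sorted by module and then by row.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical, simplified stand-in for a flat cell type: the cell position
// plus the index of the module it belongs to, all stored in one vector.
struct flat_cell {
    unsigned int channel0;     // column index within the module
    unsigned int channel1;     // row index within the module
    unsigned int module_link;  // index of the owning module
};

// Returns the start index of every partition.
// Assumption: cells are sorted by module_link, then by channel1.
// A cut is only made between two neighbouring cells that cannot belong to
// the same cluster, so that no cluster is split across two partitions.
std::vector<std::size_t> make_partitions(const std::vector<flat_cell>& cells,
                                         std::size_t target_size) {
    assert(target_size > 0);
    std::vector<std::size_t> starts;
    if (cells.empty()) {
        return starts;
    }
    starts.push_back(0);
    std::size_t current_size = 1;
    for (std::size_t i = 1; i < cells.size(); ++i) {
        const flat_cell& prev = cells[i - 1];
        const flat_cell& curr = cells[i];
        // Safe to cut if the module changes or the rows are not adjacent.
        const bool safe_cut = curr.module_link != prev.module_link ||
                              curr.channel1 > prev.channel1 + 1;
        if (current_size >= target_size && safe_cut) {
            starts.push_back(i);
            current_size = 0;
        }
        ++current_size;
    }
    return starts;
}
```

Partitions produced this way can exceed `target_size` when no safe cut is available, so "similarly sized" is only approximate in this sketch.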

Further changes are mostly minor, and some are almost just duplication of code. Feel free to be somewhat less thorough in reviewing these.

  • Changed seeding to take as input a collection of spacepoints rather than a container. (Seeding algorithm, seed finding, and the various seeding kernels.)
  • Changed the sycl/cuda executables (seq_examples, seeding_examples, and full_chain_algorithms) to use new clusterization.
  • Created alternative throughput tests which use new clusterization. This duplicates the already existing throughput_st / mt code as the CPU and GPU algorithms can no longer run on the same ones.

There is also still some work missing, which I opted not to add in this PR so as not to make it even longer:

  • Creating "comparator" code for comparing a collection with a container.
  • Adding a "read_alt" function which reads more than one event in parallel using TBB with the new alt_cell_collection.
  • Creating physics performance testing code using the new EDM types.
  • Checking whether the CPU/core algorithms benefit from using the jagged clusterization rather than a non-jagged one. If they do not, this would be quite useful, as we could just ditch the duplicate code and delete the "old" EDM by replacing it with the "alt" one.

@guilhermeAlmeida1
Collaborator Author

I ran our throughput tests on ttbar_mu300 with 10K events (on an NVIDIA RTX A4000), and it seems that we're getting an improvement in single-threaded throughput of about 60% on both SYCL and CUDA. However, in the multi-threaded application I only saw an improvement of about 20% for CUDA and less than 10% for SYCL.

@stephenswat
Member

Brilliant, very good! When we get this in, can we get rid of the old FastSV implementation? I assume this supersedes it.

@guilhermeAlmeida1
Collaborator Author

guilhermeAlmeida1 commented Dec 22, 2022

For sure. The only real difference between the version I implemented here and the one in the CCA directory is that I chose not to distinguish between the number of threads and the number of cells per partition, so that every thread only ever deals with 1 cell, unlike in the original algorithm. In my experiments, there was no performance gain from keeping these two separate.
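As a rough illustration of the "one thread, one cell" mapping, the sketch below emulates it serially on the CPU. It is a plain min-label propagation, not the FastSV implementation and not the kernel from this PR. The `cell_pos` struct and the `is_adjacent` and `label_partition` functions are hypothetical names. Each iteration of the `tid` loop stands in for one device thread, the double buffering stands in for a barrier, and the `changed` flag stands in for a group vote on whether another round is needed.

```cpp
#include <cstddef>
#include <cstdlib>
#include <vector>

// Hypothetical minimal cell position within one partition.
struct cell_pos {
    int channel0;
    int channel1;
};

// Two cells are neighbours if they touch, including diagonally.
inline bool is_adjacent(const cell_pos& a, const cell_pos& b) {
    return std::abs(a.channel0 - b.channel0) <= 1 &&
           std::abs(a.channel1 - b.channel1) <= 1;
}

// Labels every cell with the smallest cell index in its connected cluster.
std::vector<std::size_t> label_partition(const std::vector<cell_pos>& cells) {
    const std::size_t n = cells.size();
    std::vector<std::size_t> labels(n);
    for (std::size_t tid = 0; tid < n; ++tid) {
        labels[tid] = tid;  // every cell starts as its own cluster
    }
    bool changed = true;
    while (changed) {
        changed = false;
        // Write into a copy so all "threads" read the same previous state.
        std::vector<std::size_t> next = labels;
        for (std::size_t tid = 0; tid < n; ++tid) {  // one "thread" per cell
            for (std::size_t other = 0; other < n; ++other) {
                if (is_adjacent(cells[tid], cells[other]) &&
                    labels[other] < next[tid]) {
                    next[tid] = labels[other];
                    changed = true;
                }
            }
        }
        labels.swap(next);
    }
    return labels;
}
```

A variant with several cells per thread would wrap the body of the `tid` loop in a strided inner loop; the comment above reports no measurable gain from that extra level.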

@guilhermeAlmeida1
Collaborator Author

guilhermeAlmeida1 commented Dec 22, 2022

The CI is currently running into some issues with SYCL, as I was using the 2020 version on my end. Are there any plans to update the dependencies on the CI machine to also run the newer version, or should I change my code to be compatible with the older one?
Edit: we opted to go with code compatible with an older version of SYCL.

@guilhermeAlmeida1
Collaborator Author

guilhermeAlmeida1 commented Jan 5, 2023

As an additional note, these changes should probably only be merged after ensuring that the CPU-only algorithms do not incur any performance loss from changing to this flat EDM at event read-in. This concerns the change from cells to alt-cells introduced in #288; any other EDM types internal to clusterization don't need to change if there are losses there. The reason is that we can't really have 2 different ways of reading input data depending on whether the event is going to run on a CPU or a GPU.
I fully expect not to see any performance drop, though I'll still verify this in the coming days.

@guilhermeAlmeida1 guilhermeAlmeida1 marked this pull request as draft January 16, 2023 14:03
@guilhermeAlmeida1 guilhermeAlmeida1 force-pushed the wipClusterizationNewBoth branch 5 times, most recently from f45c5be to 1028f75 Compare January 17, 2023 14:26
@guilhermeAlmeida1 guilhermeAlmeida1 marked this pull request as ready for review January 17, 2023 15:56
@guilhermeAlmeida1
Collaborator Author

I think this should be reasonably ready for a review at this point. For whoever is assigned this heroic task, I suggest taking a look at the initial comment I made when opening this PR, which explains and organises things nicely.

Member

@krasznaa krasznaa left a comment

Wow... is this a monster PR...

core/include/traccc/edm/alt_measurement.hpp (outdated, resolved)
core/include/traccc/edm/ccl_partition.hpp (outdated, resolved)
core/include/traccc/edm/ccl_partition.hpp (outdated, resolved)
core/include/traccc/edm/internal_spacepoint.hpp (outdated, resolved)
core/src/clusterization/partitioning_algorithm.cpp (outdated, resolved)
device/cuda/src/clusterization/clusterization_algorithm.cu (outdated, resolved)
examples/run/common/throughput_mt_alt.hpp (resolved)
examples/run/cuda/full_chain_algorithm.cpp (outdated, resolved)
examples/run/cuda/seeding_example_cuda.cpp (resolved)
@stephenswat stephenswat added cuda Changes related to CUDA improvement Improve an existing feature sycl Changes related to SYCL edm Changes to the data model kokkos Changes related to Kokkos labels Jan 25, 2023
@guilhermeAlmeida1
Collaborator Author

I have addressed the comments above. Also added the functionality in #308 but, instead of entirely removing the partition class, I made it into an alias of unsigned int. I am, however, fine with just removing it.

@guilhermeAlmeida1 guilhermeAlmeida1 requested review from krasznaa and stephenswat and removed request for stephenswat January 27, 2023 13:27
@guilhermeAlmeida1
Collaborator Author

I also intend to add #309 on top of this, but in a separate PR.

@stephenswat stephenswat changed the title Changed device clusterization to flat edm Changed device clusterization to flat EDM Jan 30, 2023
@guilhermeAlmeida1 guilhermeAlmeida1 force-pushed the wipClusterizationNewBoth branch 2 times, most recently from f38c896 to 116578e Compare January 31, 2023 12:17
core/include/traccc/edm/alt_measurement.hpp (outdated, resolved)
Comment on lines +25 to +26
scalar weight;
scalar z_vertex;
Member

We can discuss this later, perhaps on a different PR, but this information should not really live in the seed EDM.

#include "traccc/clusterization/device/count_cluster_cells.hpp"
#include "traccc/clusterization/device/create_measurements.hpp"
#include "traccc/clusterization/device/find_clusters.hpp"
#include "traccc/clusterization/device/aggregate_cluster.hpp"
Member

I am going to assume that this is just a copy of the CUDA code in component_connection.cu, so I will skip over it.

Collaborator Author

Pretty much, with the difference that it simply uses 1 cell per thread (which I want to address in a separate PR with the unrolling we discussed).

@@ -9,45 +9,325 @@
#include "../utils/get_queue.hpp"
Member

Do we have this code hooked up to our CCA tests? Would be useful to see whether this code does what it's supposed to do. I trust that it's a faithful translation but I am worried about subtle differences in memory and execution models between CUDA and SYCL.

Collaborator Author

This would definitely be good, but if I understand correctly we currently don't have a way of executing SYCL code in our CI tests. I can, however, double-check this by doing some further testing locally.

examples/run/kokkos/seeding_example_kokkos.cpp (outdated, resolved)
examples/run/kokkos/seeding_example_kokkos.cpp (outdated, resolved)
examples/run/sycl/full_chain_algorithm.hpp (resolved)
@stephenswat
Member

I think we should move forward with this, flattening the EDM is going to be great for our throughput + will allow a lot more future improvements to the code.

@guilhermeAlmeida1 guilhermeAlmeida1 force-pushed the wipClusterizationNewBoth branch 2 times, most recently from 6b6c66e to ea896e1 Compare February 1, 2023 09:02
@guilhermeAlmeida1 guilhermeAlmeida1 force-pushed the wipClusterizationNewBoth branch 3 times, most recently from 8d9fccd to 330eef4 Compare February 6, 2023 15:55
@beomki-yeo
Contributor

beomki-yeo commented Feb 11, 2023

I have a question about the alt_measurement implementation. How would I retrieve the set of measurements in a certain module, for example in combinatorial Kalman filtering? @guilhermeAlmeida1 @krasznaa

Another question is whether the speed gain comes mainly from the new EDM or from the CCA algorithm. If it is the latter, and the CCA algorithm can be applied to the old EDM, we might not have to change the EDM at all, I suppose 🤔

@guilhermeAlmeida1
Collaborator Author

@beomki-yeo even with the current EDM we are not using the detector modules for anything outside of clustering, and this information is not passed down to seeding. If we need the detector modules for CKF that's something we'd need to add at that point.

Regarding the performance gain, it's down to both, as the CCA algorithm heavily relies on having a flat EDM, and I don't see why we would want to make it less efficient by sticking to the old one.
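On the question of retrieving the measurements of one module from a flat collection, one possible approach (purely a sketch, not something this PR implements) is to keep the collection sorted by module link and look up the index range with a binary search. The `flat_measurement` struct, its fields, and the `module_range` function are hypothetical names, and the sorting requirement is an assumption.

```cpp
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical, simplified stand-in for a flat measurement type: the local
// position plus the index of the module the measurement belongs to.
struct flat_measurement {
    float local0;
    float local1;
    unsigned int module_link;
};

// Returns the half-open index range [first, last) of the measurements that
// belong to `module`. Assumption: `ms` is sorted by module_link.
std::pair<std::size_t, std::size_t> module_range(
    const std::vector<flat_measurement>& ms, unsigned int module) {
    auto lo = std::lower_bound(
        ms.begin(), ms.end(), module,
        [](const flat_measurement& m, unsigned int mod) {
            return m.module_link < mod;
        });
    auto hi = std::upper_bound(
        lo, ms.end(), module,
        [](unsigned int mod, const flat_measurement& m) {
            return mod < m.module_link;
        });
    return {static_cast<std::size_t>(lo - ms.begin()),
            static_cast<std::size_t>(hi - ms.begin())};
}
```

A module with no measurements yields an empty range (`first == last`), so callers do not need a separate existence check.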

@beomki-yeo
Contributor

beomki-yeo commented Feb 13, 2023

the CCA algorithm heavily relies on having a flat EDM

@stephenswat can clarify this, but I would think that a flat EDM and CCA are orthogonal. Could you educate me about how a flat EDM contributes to the performance gain? Also, have you tested the original algorithm with a flat EDM and compared it with the current performance?

And.. is it necessary to remove the original algorithm? 🤔

@guilhermeAlmeida1
Collaborator Author

guilhermeAlmeida1 commented Feb 13, 2023

@beomki-yeo on top of jagged vectors being inherently slightly slower to handle (copying, our "prefix sums", etc.), the module-based jaggedness results in very differently sized subvectors. Having all cells in a single vector makes it easy to create partitions which all have around the same size, which makes the algorithm a lot faster by fully using all GPU threads.
I have nothing against keeping the old algorithm in, but I don't really see any major benefit to it.
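The load-balancing argument above can be made concrete with a toy metric; this is a sketch and not a measurement from this PR. The `thread_utilisation` function is a hypothetical name, and it assumes that every group (a module's subvector or a partition) gets a thread block as wide as the largest group, with one cell per thread.

```cpp
#include <algorithm>
#include <cstddef>
#include <numeric>
#include <vector>

// Fraction of launched threads that actually have a cell to work on.
// Toy assumption: every group is given a thread block as wide as the largest
// group, and each thread handles exactly one cell.
double thread_utilisation(const std::vector<std::size_t>& group_sizes) {
    if (group_sizes.empty()) {
        return 0.0;
    }
    const std::size_t total = std::accumulate(
        group_sizes.begin(), group_sizes.end(), std::size_t{0});
    const std::size_t widest =
        *std::max_element(group_sizes.begin(), group_sizes.end());
    if (widest == 0) {
        return 0.0;
    }
    return static_cast<double>(total) /
           static_cast<double>(widest * group_sizes.size());
}
```

With made-up, very uneven group sizes such as {1, 1, 2, 12}, only 16 of 48 threads do any work, while evenly sized groups such as {4, 4, 4, 4} keep all 16 threads busy.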

@guilhermeAlmeida1
Collaborator Author

guilhermeAlmeida1 commented Feb 13, 2023

Also, have you tested the original algorithm with a flat EDM and compared it with the current performance?

I have not; this could be interesting to look at, though.

@guilhermeAlmeida1 guilhermeAlmeida1 changed the title Changed device clusterization to flat EDM Changed device clusterization to flat EDM & cell-level parallelism Feb 15, 2023
Member

@stephenswat stephenswat left a comment

Okay this has been sitting here for more than long enough, if the CI is happy with the changes let's put them in.

@stephenswat stephenswat merged commit 90d37f8 into acts-project:main Feb 15, 2023
@guilhermeAlmeida1 guilhermeAlmeida1 deleted the wipClusterizationNewBoth branch February 15, 2023 13:42