Changed device clusterization to flat EDM & cell-level parallelism #299

guilhermeAlmeida1 · 2022-12-21T17:00:01Z

~~This depends on #288~~ now merged
This very extensive PR changes the device clusterization algorithm to handle single (flat) vectors only, without any jagged vectors.

The main changes in this PR are:

Introduced Partitioning algorithm (on the CPU), which separates the cells' vector into similarly sized groups for the device clusterization algorithm.
Introduced alt_measurement. Similarly to alt_cell, this holds a measurement and a link to a module to store all information in a single collection rather than a container with item + header.
Removed all old common device kernels in clusterization and introduced new ones working with the new EDM types: reduce_problem_cell and aggregate_cluster.
Changed clusterization algorithm to use new EDM types and no jaggeds. This comes with the small reversal that a part of our code is no longer single source, as the usage of barriers and group voting requires backend specific functions.
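
To make the partitioning step above more concrete, here is a minimal sketch of how a flat cell vector could be split into similarly sized groups. It is not the code from this PR: the `cell` layout, the `make_partitions` name and the `target_size` parameter are assumptions for illustration, and it assumes the cells are sorted by module and then by `channel1`. A partition is represented by the index of its first cell, and a cut is only placed where no cluster can span it (a module change, or a gap of more than one row).

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical stand-in for the flat cell type; the field names are assumptions.
struct cell {
    unsigned int channel0;     // column index within the module
    unsigned int channel1;     // row index within the module
    unsigned int module_link;  // index of the module this cell belongs to
};

// A partition is identified by the index of its first cell.
using partition = unsigned int;

// Split a flat, sorted cell vector into groups of at least `target_size` cells.
// A group is closed only at a "safe" position, so it can exceed `target_size`.
std::vector<partition> make_partitions(const std::vector<cell>& cells,
                                       std::size_t target_size) {
    assert(target_size > 0);
    std::vector<partition> starts;
    if (cells.empty()) {
        return starts;
    }
    starts.push_back(0);
    std::size_t current = 1;  // number of cells in the currently open group
    for (std::size_t i = 1; i < cells.size(); ++i) {
        const cell& prev = cells[i - 1];
        const cell& cur = cells[i];
        // No cluster can span this position if the module changes or if
        // there is a gap of more than one row between consecutive cells.
        const bool safe_cut = cur.module_link != prev.module_link ||
                              cur.channel1 > prev.channel1 + 1;
        if (current >= target_size && safe_cut) {
            starts.push_back(static_cast<partition>(i));
            current = 0;
        }
        ++current;
    }
    return starts;
}
```

With this representation the i-th group spans `[starts[i], starts[i + 1])`, and the last group ends at `cells.size()`, so no extra per-module header is needed.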

Further changes are mostly minor, and some are almost just duplication of code. Feel free to be somewhat less thorough in reviewing these.

Changed seeding to take as input a collection of spacepoints rather than a container. (Seeding algorithm, seed finding, and the various seeding kernels.)
Changed the sycl/cuda executables (seq_examples, seeding_examples, and full_chain_algorithms) to use new clusterization.
Created alternative throughput tests which use new clusterization. This duplicates the already existing throughput_st / mt code as the CPU and GPU algorithms can no longer run on the same ones.

There's also still some stuff missing which I opted not to add in this PR so as not to make it even longer, namely:
Creating "comparator" code for comparing a collection with a container.
Adding a "read_alt" function which reads more than one event in parallel using TBB with the new alt_cell_collection.
Creating physics performance testing code using the new EDM types.
Checking whether CPU/Core algorithms benefit from using the jagged clusterization rather than a non-jagged one. If they do not, this would be quite useful, as we could just ditch this duplicate code and delete the "old" EDM by replacing it with the "alt".

guilhermeAlmeida1 · 2022-12-21T18:18:35Z

I ran our throughput tests at ttbar_mu300 with 10K events and it seems that (running on an NVIDIA RTX A4000) we're getting benefits on the single-threaded throughput of about 60% on both SYCL and CUDA. However, on the multi-threaded application I only saw an improvement of about 20% for CUDA and less than 10% for SYCL.

stephenswat · 2022-12-21T20:23:16Z

Brilliant, very good! When we get this in, can we get rid of the old FastSV implementation? I assume this supersedes it.

guilhermeAlmeida1 · 2022-12-22T09:28:07Z

For sure. The only real difference between the version I implemented here and the one in the CCA directory is that I chose not to distinguish between the number of threads and the number of cells per partition, so that every thread only ever deals with 1 cell, unlike what happened in the original algorithm. From my experiments, there was no performance gain in keeping the two separate.
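
As an illustration of the "1 cell per thread" mapping, the sketch below emulates it sequentially on the CPU. It deliberately uses plain minimum-label propagation as a stand-in and is not the FastSV variant discussed here; the `cell` layout and the function names are assumptions. Each iteration of the `tid` loop corresponds to the work one device thread would do for the single cell it owns.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical flat cell type; the field names are assumptions.
struct cell {
    unsigned int channel0;
    unsigned int channel1;
    unsigned int module_link;
};

// Two cells are neighbours if they share a module and touch (8-connectivity).
bool is_adjacent(const cell& a, const cell& b) {
    if (a.module_link != b.module_link) {
        return false;
    }
    const unsigned int d0 = a.channel0 > b.channel0 ? a.channel0 - b.channel0
                                                    : b.channel0 - a.channel0;
    const unsigned int d1 = a.channel1 > b.channel1 ? a.channel1 - b.channel1
                                                    : b.channel1 - a.channel1;
    return d0 <= 1 && d1 <= 1;
}

// Give every cell the smallest cell index found in its connected component.
std::vector<std::size_t> label_cells(const std::vector<cell>& cells) {
    const std::size_t n = cells.size();
    std::vector<std::size_t> label(n);
    for (std::size_t i = 0; i < n; ++i) {
        label[i] = i;  // every cell starts as its own cluster
    }
    bool changed = true;
    while (changed) {
        changed = false;
        // One "thread" per cell: tid owns exactly one cell.
        for (std::size_t tid = 0; tid < n; ++tid) {
            for (std::size_t other = 0; other < n; ++other) {
                if (is_adjacent(cells[tid], cells[other]) &&
                    label[other] < label[tid]) {
                    label[tid] = label[other];
                    changed = true;
                }
            }
        }
        // On a device, a barrier and a group-wide vote on `changed` would be
        // needed here; those are the backend-specific parts.
    }
    return label;
}
```

Cells that end up with the same label belong to the same cluster; a real kernel would restrict the inner search to the cells of its own partition rather than scanning the whole vector.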

guilhermeAlmeida1 · 2022-12-22T09:29:43Z

The CI is currently running into some issues with sycl as I was using the 2020 version on my end. Any plans to update the dependencies on the CI machine to also run the newer version or should I change my code to be compatible with the older one?
Edit: we opted to go with code compatible with an older version of sycl

guilhermeAlmeida1 · 2023-01-05T15:37:28Z

As an additional note, these changes should probably only be made after ensuring the CPU-only algorithms do not incur any performance loss from changing to this flat EDM at event read-in (the change from cells to alt-cells introduced in #288; any other EDM types internal to clusterization don't need to change if there are losses there), as we can't really have 2 different ways of reading input data depending on whether said event is going to run on a CPU or a GPU.
I fully expect not to see any performance drop, though I'll still verify this in the coming days.

guilhermeAlmeida1 · 2023-01-17T16:00:33Z

I think this should be reasonably ready for a review at this point. For whoever is assigned this heroic task, I suggest taking a look at the initial comment I made when opening this PR, which explains and organises things nicely.

krasznaa

Wow... is this a monster PR...

core/include/traccc/edm/alt_measurement.hpp

core/include/traccc/edm/ccl_partition.hpp

core/include/traccc/edm/internal_spacepoint.hpp

core/src/clusterization/partitioning_algorithm.cpp

device/common/include/traccc/seeding/device/count_grid_capacities.hpp

device/cuda/src/clusterization/clusterization_algorithm.cu

examples/run/common/throughput_mt_alt.hpp

examples/run/cuda/full_chain_algorithm.cpp

examples/run/cuda/seeding_example_cuda.cpp

guilhermeAlmeida1 · 2023-01-27T13:27:02Z

I have addressed the comments above. Also added the functionality in #308 but, instead of entirely removing the partition class, I made it into an alias of unsigned int. I am, however, fine with just removing it.

guilhermeAlmeida1 · 2023-01-27T13:28:51Z

I also intend on adding #309 on top of this but in a separate PR.

core/include/traccc/edm/alt_measurement.hpp

stephenswat · 2023-01-31T12:23:32Z

core/include/traccc/edm/alt_seed.hpp

+    scalar weight;
+    scalar z_vertex;


We can discuss this later, perhaps on a different PR, but this information should not really live in the seed EDM.

device/common/src/clusterization/partitioning_algorithm.cpp

stephenswat · 2023-01-31T12:35:45Z

device/cuda/src/clusterization/clusterization_algorithm.cu

-#include "traccc/clusterization/device/count_cluster_cells.hpp"
-#include "traccc/clusterization/device/create_measurements.hpp"
-#include "traccc/clusterization/device/find_clusters.hpp"
+#include "traccc/clusterization/device/aggregate_cluster.hpp"


I am going to assume that this is just a copy of the CUDA code in component_connection.cu, so I will skip over it.

guilhermeAlmeida1 · 2023-01-31

Pretty much. The difference is that this simply uses 1 cell per thread (which I wanted to address in a separate PR with the unrolling we discussed).

stephenswat · 2023-01-31T12:38:57Z

device/sycl/src/clusterization/clusterization_algorithm.sycl

@@ -9,45 +9,325 @@
 #include "../utils/get_queue.hpp"


Do we have this code hooked up to our CCA tests? Would be useful to see whether this code does what it's supposed to do. I trust that it's a faithful translation but I am worried about subtle differences in memory and execution models between CUDA and SYCL.

guilhermeAlmeida1 · 2023-02-01

This would definitely be good, but if I understand correctly we currently don't have a way of executing SYCL code in our CI tests. I can, however, double check this by doing some further testing locally.

examples/run/kokkos/seeding_example_kokkos.cpp

examples/run/sycl/full_chain_algorithm.hpp

stephenswat · 2023-01-31T12:56:46Z

I think we should move forward with this, flattening the EDM is going to be great for our throughput + will allow a lot more future improvements to the code.

beomki-yeo · 2023-02-11T16:45:35Z

I have a question about the alt_measurement implementation. How would I retrieve the set of measurements in a certain module, for example in combinatorial Kalman filtering? @guilhermeAlmeida1 @krasznaa

Another question is whether the speed gain comes mainly from the new EDM or from the CCA algorithm. If it is the latter and the CCA algorithm can be applied to the old EDM, we might not have to change the EDM at all, I suppose 🤔

guilhermeAlmeida1 · 2023-02-13T08:17:10Z

@beomki-yeo even with the current EDM we are not using the detector modules for anything outside of clustering, and this information is not passed down to seeding. If we need the detector modules for CKF that's something we'd need to add at that point.

Regarding the performance gain, it's down to both, as the CCA algorithm heavily relies on having a flat EDM, and I don't see why we would want to make it less efficient by sticking to the old one.
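
For the per-module lookup asked about above, one possible approach with a flat collection is a binary search on the module link. This is only a sketch under stated assumptions: the `flat_measurement` struct and `measurements_in_module` function are hypothetical names, not part of this PR, and the collection is assumed to be sorted by module link, which the PR does not promise.

```cpp
#include <algorithm>
#include <cassert>
#include <utility>
#include <vector>

// Hypothetical flat measurement: a local position plus a link to its module.
struct flat_measurement {
    float local0;
    float local1;
    unsigned int module_link;
};

using meas_iter = std::vector<flat_measurement>::const_iterator;

// Return the [first, last) range of measurements belonging to `module`.
// Assumption: `ms` is sorted by module_link.
std::pair<meas_iter, meas_iter> measurements_in_module(
    const std::vector<flat_measurement>& ms, unsigned int module) {
    assert(std::is_sorted(
        ms.begin(), ms.end(),
        [](const flat_measurement& a, const flat_measurement& b) {
            return a.module_link < b.module_link;
        }));
    // First measurement whose link is not smaller than `module`.
    const meas_iter first = std::lower_bound(
        ms.begin(), ms.end(), module,
        [](const flat_measurement& m, unsigned int value) {
            return m.module_link < value;
        });
    // First measurement whose link is larger than `module`.
    const meas_iter last = std::upper_bound(
        first, ms.end(), module,
        [](unsigned int value, const flat_measurement& m) {
            return value < m.module_link;
        });
    return {first, last};
}
```

If the collection were not sorted, the same interface could be backed by a one-off sort or by a per-module offset table built once per event.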

beomki-yeo · 2023-02-13T10:18:53Z

> the CCA algorithm heavily relies on having a flat EDM

@stephenswat can clarify this, but I would think that a flat EDM and CCA are orthogonal. Could you educate me about how a flat EDM contributes to the performance gain? Also, have you tested the original algorithm with a flat EDM and compared it with the current performance?

And.. is it necessary to remove the original algorithm? 🤔

guilhermeAlmeida1 · 2023-02-13T10:23:21Z

@beomki-yeo on top of jagged vectors just being inherently slightly slower to handle (copying, our "prefix sums", etc.), the module-based jaggedness results in very differently sized subvectors. Having all cells in a single vector makes it easy to create partitions which all have around the same size, which makes the algorithm a lot faster by fully using all GPU threads.
I've nothing against keeping the old algorithm in, but I don't really see any major benefit to it.
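
The load-balancing argument can be illustrated with a toy model (not a measurement): if every group of cells is processed by a thread block sized for the largest group, the fraction of busy threads is the total number of cells divided by the number of groups times the largest group size. The function name and the sizes used below are made up purely for illustration.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <numeric>
#include <vector>

// Toy model: one thread block per group, every block sized for the largest
// group. Returns the fraction of threads that have a cell to work on.
double thread_utilisation(const std::vector<std::size_t>& group_sizes) {
    assert(!group_sizes.empty());
    const std::size_t widest =
        *std::max_element(group_sizes.begin(), group_sizes.end());
    assert(widest > 0);
    const std::size_t total = std::accumulate(
        group_sizes.begin(), group_sizes.end(), std::size_t{0});
    return static_cast<double>(total) /
           static_cast<double>(widest * group_sizes.size());
}
```

In this model, four module-sized groups of 1000, 10, 10 and 4 cells (invented numbers) keep only about a quarter of the threads busy, whereas the same 1024 cells split into four partitions of 256 keep all of them busy.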

guilhermeAlmeida1 · 2023-02-13T10:26:43Z

> Also, have you tested the original algorithm with a flat EDM and compared it with the current performance?

I have not; this could be interesting to look at, though.

stephenswat

Okay, this has been sitting here for more than long enough; if the CI is happy with the changes, let's put them in.

guilhermeAlmeida1 marked this pull request as draft January 16, 2023 14:03

guilhermeAlmeida1 force-pushed the wipClusterizationNewBoth branch 5 times, most recently from f45c5be to 1028f75 Compare January 17, 2023 14:26

guilhermeAlmeida1 marked this pull request as ready for review January 17, 2023 15:56

guilhermeAlmeida1 requested review from krasznaa and stephenswat January 17, 2023 16:08

guilhermeAlmeida1 mentioned this pull request Jan 19, 2023

Changed CPU clustering to use flat vectors of alt_cells as input and alt_measurements internally. #304

Closed

guilhermeAlmeida1 force-pushed the wipClusterizationNewBoth branch from 1028f75 to 54812a1 Compare January 23, 2023 09:04

krasznaa requested changes Jan 24, 2023

stephenswat added cuda Changes related to CUDA improvement Improve an existing feature sycl Changes related to SYCL edm Changes to the data model kokkos Changes related to Kokkos labels Jan 25, 2023

guilhermeAlmeida1 force-pushed the wipClusterizationNewBoth branch from 426ac95 to 4afb4af Compare January 27, 2023 13:24

guilhermeAlmeida1 requested review from krasznaa and stephenswat and removed request for stephenswat January 27, 2023 13:27

guilhermeAlmeida1 force-pushed the wipClusterizationNewBoth branch from 4afb4af to cdee30f Compare January 30, 2023 08:37

stephenswat changed the title ~~Changed device clusterization to flat edm~~ Changed device clusterization to flat EDM Jan 30, 2023

guilhermeAlmeida1 force-pushed the wipClusterizationNewBoth branch 2 times, most recently from f38c896 to 116578e Compare January 31, 2023 12:17

stephenswat reviewed Jan 31, 2023

guilhermeAlmeida1 force-pushed the wipClusterizationNewBoth branch 2 times, most recently from 6b6c66e to ea896e1 Compare February 1, 2023 09:02

guilhermeAlmeida1 requested a review from stephenswat February 2, 2023 08:33

guilhermeAlmeida1 force-pushed the wipClusterizationNewBoth branch 3 times, most recently from 8d9fccd to 330eef4 Compare February 6, 2023 15:55

guilhermeAlmeida1 added 9 commits February 13, 2023 09:21

Added clusterization algorithm based on collections only (no jaggeds).

983a6ea

Added new clusterization to sycl reconstruction chains. Minor fixes t…

bcf545b

…o previous

Changed kokkos to use new clusterization

2bbcf40

changed sycl clusterization/seed finding kernel calls to be compatibl…

605c370

…e with deprecated accessors

Addressed comments in PR#299

14df078

Changed partition class into an alias of unsigned

Changed max_cells_per_partition to input variable in executables

ce6015e

Changed partition class and partitioning algorithm to traccc::device

cd04b94

Minor fixes

3b023e8

Fixed alt multi-threaded throughput to use different events

bf4ed85

guilhermeAlmeida1 force-pushed the wipClusterizationNewBoth branch from 330eef4 to bf4ed85 Compare February 13, 2023 08:21

guilhermeAlmeida1 changed the title ~~Changed device clusterization to flat EDM~~ Changed device clusterization to flat EDM & cell-level parallelism Feb 15, 2023

stephenswat approved these changes Feb 15, 2023

stephenswat merged commit 90d37f8 into acts-project:main Feb 15, 2023

guilhermeAlmeida1 deleted the wipClusterizationNewBoth branch February 15, 2023 13:42
