Tracking Throughput Measurements, main branch (2024.05.04.) #572
Conversation
To give some "evidence" of what I see. 😉
But honestly, VS Code was using more of my GPU in that time window than the throughput test. 🙃
Oof, that's grim. But at least it works!
Out of curiosity, can you run it through nvprof to see the time distribution of the kernels?
Tomorrow. Yes, I'll be able to. (I expect huge valleys between the kernels... Since even nvtop showed the GPU not doing anything for half the time.)
This is good!
As I wrote yesterday, there should be some easy wins here, even if I don't yet know what our current bottleneck is. 🤔 The kernels are big. 🤔 But for some reason no kernel is running for most of the time. 😕 That's the thing that we'll need to understand first and foremost, as it should give us almost an order of magnitude speedup right away. 🙏
track_candidates);

    // Return the final container, copied back to the host.
    return m_result_copy(track_states);
Turns out, it is this line that is killing us. 🤔
Right now the throughput measurement applications use a vecmem::cuda::host_memory_resource, wrapped by a vecmem::binary_page_memory_resource, as the "host resource" for this copy back to the host. And I see our caching memory resource using all the CPU time in the world during that copy. 😦
The result jagged vector seems to be very big: when I turn off host memory caching, things get even slower, this time with the CUDA memory allocations and de-allocations taking forever...
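For illustration, here is a minimal sketch of the memory-resource stack described above. It is not the actual traccc throughput-application code, just the wiring pattern in question, assuming the usual vecmem headers.

```cpp
// Sketch: a pinned host memory resource wrapped by vecmem's caching,
// binary-page resource -- the layer that reportedly eats the CPU time
// during the copy of the large jagged result vector back to the host.
#include <vecmem/memory/binary_page_memory_resource.hpp>
#include <vecmem/memory/cuda/host_memory_resource.hpp>

int main() {
    // Pinned host memory, allocated through CUDA.
    vecmem::cuda::host_memory_resource pinned_host_mr;
    // Caching wrapper: serves small requests out of larger upstream pages, so
    // repeated (de-)allocations do not all go through CUDA. With a very large,
    // jagged result container its internal bookkeeping can become the hot spot;
    // without it, every allocation hits the (slow) CUDA host allocator instead.
    vecmem::binary_page_memory_resource cached_host_mr{pinned_host_mr};

    // Buffers for the device->host result copy would be allocated from
    // cached_host_mr -- the "host resource" mentioned above.
    return 0;
}
```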
If I remove the copy back to the host at the end of the algorithm chain, I get:
[bash][Legolas]:traccc > ./out/build/sycl/bin/traccc_throughput_st_cuda --detector-file=geometries/odd/odd-detray_geometry_detray.json --grid-file=geometries/odd/odd-detray_surface_grids_detray.json --use-detray-detector --digitization-file=geometries/odd/odd-digi-geometric-config.json --input-directory=odd/geant4_ttbar_mu200/ --input-events=10 --processed-events=100
Running Single-threaded CUDA GPU throughput tests
>>> Detector Options <<<
Detector file : geometries/odd/odd-detray_geometry_detray.json
Material file :
Surface grid file : geometries/odd/odd-detray_surface_grids_detray.json
Use detray::detector: yes
Digitization file : geometries/odd/odd-digi-geometric-config.json
>>> Input Data Options <<<
Input data format : csv
Input directory : odd/geant4_ttbar_mu200/
Number of input events : 10
Number of input events to skip: 0
>>> Clusterization Options <<<
Target cells per partition: 1024
>>> Track Seeding Options <<<
None
>>> Track Finding Options <<<
Track candidates range : 3:100
Minimum step length for the next surface: 0.5 [mm]
Maximum step counts for the next surface: 100
Maximum Chi2 : 30
Maximum branches per step: 4294967295
Maximum number of skipped steps per candidates: 3
>>> Track Propagation Options <<<
Constraint step size : 3.40282e+38 [mm]
Overstep tolerance : -100 [um]
Minimum mask tolerance: 1e-05 [mm]
Maximum mask tolerance: 1 [mm]
Search window : 0 x 0
Runge-Kutta tolerance : 0.0001
>>> Throughput Measurement Options <<<
Cold run event(s) : 10
Processed event(s): 100
Log file :
WARNING: No material in detector
WARNING: No entries in volume finder
Detector check: OK
WARNING: No material in detector
WARNING: No entries in volume finder
Detector check: OK
WARNING: No material in detector
WARNING: No entries in volume finder
Detector check: OK
WARNING: @traccc::io::csv::read_cells: 19157 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000000-cells.csv
WARNING: @traccc::io::csv::read_cells: 24524 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000001-cells.csv
WARNING: @traccc::io::csv::read_cells: 17547 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000002-cells.csv
WARNING: @traccc::io::csv::read_cells: 20889 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000003-cells.csv
WARNING: @traccc::io::csv::read_cells: 15151 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000004-cells.csv
WARNING: @traccc::io::csv::read_cells: 21299 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000005-cells.csv
WARNING: @traccc::io::csv::read_cells: 20111 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000006-cells.csv
WARNING: @traccc::io::csv::read_cells: 17117 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000007-cells.csv
WARNING: @traccc::io::csv::read_cells: 14836 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000008-cells.csv
WARNING: @traccc::io::csv::read_cells: 14147 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000009-cells.csv
Using CUDA device: NVIDIA GeForce RTX 3080 [id: 0, bus: 1, device: 0]
Reconstructed track parameters: 0
Time totals:
File reading 7505 ms
Warm-up processing 1187 ms
Event processing 9605 ms
Throughput:
Warm-up processing 118.77 ms/event, 8.41966 events/s
Event processing 96.0576 ms/event, 10.4104 events/s
[bash][Legolas]:traccc >
Which is at least semi-competitive against my 64 CPU threads.
[bash][Legolas]:traccc > ./out/build/sycl/bin/traccc_throughput_mt --detector-file=geometries/odd/odd-detray_geometry_detray.json --grid-file=geometries/odd/odd-detray_surface_grids_detray.json --use-detray-detector --digitization-file=geometries/odd/odd-digi-geometric-config.json --input-directory=odd/geant4_ttbar_mu200/ --input-events=10 --processed-events=100 --cpu-threads=64
Running Multi-threaded host-only throughput tests
>>> Detector Options <<<
Detector file : geometries/odd/odd-detray_geometry_detray.json
Material file :
Surface grid file : geometries/odd/odd-detray_surface_grids_detray.json
Use detray::detector: yes
Digitization file : geometries/odd/odd-digi-geometric-config.json
>>> Input Data Options <<<
Input data format : csv
Input directory : odd/geant4_ttbar_mu200/
Number of input events : 10
Number of input events to skip: 0
>>> Clusterization Options <<<
Target cells per partition: 1024
>>> Track Seeding Options <<<
None
>>> Track Finding Options <<<
Track candidates range : 3:100
Minimum step length for the next surface: 0.5 [mm]
Maximum step counts for the next surface: 100
Maximum Chi2 : 30
Maximum branches per step: 4294967295
Maximum number of skipped steps per candidates: 3
>>> Track Propagation Options <<<
Constraint step size : 3.40282e+38 [mm]
Overstep tolerance : -100 [um]
Minimum mask tolerance: 1e-05 [mm]
Maximum mask tolerance: 1 [mm]
Search window : 0 x 0
Runge-Kutta tolerance : 0.0001
>>> Throughput Measurement Options <<<
Cold run event(s) : 10
Processed event(s): 100
Log file :
>>> Multi-Threading Options <<<
CPU threads: 64
WARNING: No material in detector
WARNING: No entries in volume finder
Detector check: OK
WARNING: No material in detector
WARNING: No entries in volume finder
Detector check: OK
WARNING: No material in detector
WARNING: No entries in volume finder
Detector check: OK
WARNING: @traccc::io::csv::read_cells: 19157 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000000-cells.csv
WARNING: @traccc::io::csv::read_cells: 24524 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000001-cells.csv
WARNING: @traccc::io::csv::read_cells: 17547 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000002-cells.csv
WARNING: @traccc::io::csv::read_cells: 20889 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000003-cells.csv
WARNING: @traccc::io::csv::read_cells: 15151 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000004-cells.csv
WARNING: @traccc::io::csv::read_cells: 21299 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000005-cells.csv
WARNING: @traccc::io::csv::read_cells: 20111 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000006-cells.csv
WARNING: @traccc::io::csv::read_cells: 17117 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000007-cells.csv
WARNING: @traccc::io::csv::read_cells: 14836 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000008-cells.csv
WARNING: @traccc::io::csv::read_cells: 14147 duplicate cells found in /data/ssd-1tb/projects/traccc/traccc/data/odd/geant4_ttbar_mu200/event000000009-cells.csv
Reconstructed track parameters: 2831983
Time totals:
File reading 6869 ms
Warm-up processing 4840 ms
Event processing 13350 ms
Throughput:
Warm-up processing 484.024 ms/event, 2.06601 events/s
Event processing 133.508 ms/event, 7.49017 events/s
[bash][Legolas]:traccc >
So: Once again we'll be in the business of EDM / memory management optimizations!
So that the throughput examples could use it for reading in their input data.
Taught the host and CUDA algorithms how to perform track finding and fitting when a Detray geometry is used. Updated the common code of the throughput applications to deal with reading a Detray geometry when appropriate, and to create the processing algorithms with their new interfaces. Updated the SYCL algorithm to mimic the host and CUDA algorithms, even though it will not perform any track finding or fitting for the moment.
Thanks for the profiling. Some comments:
Before the KF, the CKF kernels become smaller and smaller due to the decreasing number of tracks. I observed that there is only one track still running in the GPU CKF after ~15 steps (= the 15th surface-to-surface transport), which is quite a waste of GPU. We will need to think about how to improve this. (Not an easy thing to fix, though.)
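As a toy illustration of the occupancy problem described above (this is not traccc code, and the per-step survival probability is made up), the amount of parallel work available to each per-step kernel shrinks roughly geometrically as candidates die out:

```cpp
// Toy model: start with many track candidates and keep only the survivors
// after each surface-to-surface step. Once only a handful remain, a kernel
// launched over the active candidates can no longer fill the GPU.
#include <algorithm>
#include <cstdio>
#include <iterator>
#include <numeric>
#include <random>
#include <vector>

int main() {
    std::mt19937 rng{42};
    std::bernoulli_distribution survives{0.6};  // assumed survival probability per step

    std::vector<int> active(10000);  // candidate indices alive before step 1
    std::iota(active.begin(), active.end(), 0);

    for (int step = 1; step <= 20 && !active.empty(); ++step) {
        // A per-step kernel would be launched over active.size() work items.
        std::vector<int> next;
        std::copy_if(active.begin(), active.end(), std::back_inserter(next),
                     [&](int) { return survives(rng); });
        active.swap(next);
        std::printf("step %2d: %zu active candidates\n", step, active.size());
    }
    return 0;
}
```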
Approved. Now we need 10^5 GPUs to meet 1 MHz data rate.
It looks to me like the detector is being copied to the device properly. What I would like to understand is how the navigator is used in propagate_to_next_surface: i.e. is it allowed to hold its state in between runs, or is the navigation state recreated every time, so that the cache has to be rebuilt at every new surface that is reached? The latter option could be costly.
The latter is the case, unfortunately. Meanwhile, I am not sure whether the major bottleneck is coming from rebuilding the navigation state or from the matrix operations in the stepping. (I would bet on the matrix operations though.)
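To make the distinction concrete, here is a hypothetical sketch of the two patterns being discussed; none of the type or function names below come from detray or traccc.

```cpp
// Contrast of (A) keeping the navigation state alive across surface-to-surface
// calls versus (B) recreating it -- and hence rebuilding the candidate cache
// via a grid query, intersections and a sort -- at every surface.
#include <vector>

struct surface_candidate { int surface_id; float path; };

struct nav_state {
    std::vector<surface_candidate> cache;  // candidate surfaces ahead of the track
};

// Stand-in for the expensive part: grid lookup + intersection of the returned
// surfaces + sorting them by path length.
nav_state build_navigation_cache() {
    return nav_state{};
}

// Pattern A: the state persists; the cache is only consumed/updated per step.
void propagate_with_persistent_state(int n_steps) {
    nav_state state = build_navigation_cache();
    for (int step = 0; step < n_steps; ++step) {
        // Advance to the next candidate in state.cache; refill only when empty.
        (void)state;
    }
}

// Pattern B (reportedly the current situation): the full rebuild cost is paid
// at every one of the n_steps surfaces.
void propagate_with_recreated_state(int n_steps) {
    for (int step = 0; step < n_steps; ++step) {
        nav_state state = build_navigation_cache();  // repeated expensive work
        (void)state;
    }
}

int main() {
    propagate_with_persistent_state(15);
    propagate_with_recreated_state(15);
    return 0;
}
```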
Yes, that would be interesting to know. I think @andiwand just managed to speed up the ACTS stepper by a lot. However, having to query the grid that often, and then doing the intersection of up to ~50 surfaces and the subsequent sorting of the cache, could have a sizeable impact, too.
I guess you have something similar to the ... That said, this 2x speedup only results in a 15% speedup of the track finding, because more time is spent in the navigation, the KF math and a somewhat suboptimal measurement calibration step.
This PR is made on top of #571. We'll need to merge in that one first. 😉
With the host and CUDA track finding and fitting more or less working on the ODD ttbar simulations, I spent a bit of time to update the throughput measurement applications to be able to use the ODD files. This is what came out of it. 😉
The code technically works. But boy, will we have some work with making it fast... Because right now it's really not. 😦 I can keep up with what my RTX 3080 is doing, with about 8 CPU threads. 🤔 And this relationship is pretty similar across all pileup values. (Though I didn't do a thorough measurement yet.)
The "good news" is that nvtop shows that my GPU is not getting too busy throughout the measurements, so there should be some big bottlenecks in the code right now. Also, at$\mu$ = 200 I can not use more than a single CPU thread for the jobs anymore with
traccc_throughput_mt_cuda
, otherwise I run out of GPU memory.So... Technically things work, but we have plenty of good work ahead of us to make this all perform well! 😉
P.S. The applications for now continue to work on our old TML files as well. For now it was not a major pain to keep that alive. But soon we'll probably want to get rid of the TML functionality from this code. 🤔