Using this issue to keep track of some notes on timings discussed on Gitter:
(from @ronawho)
As a rule of thumb -- ordered fine-grained comm can achieve an injection rate of ~80 MB/s per node, unordered fine-grained comm can achieve ~400 MB/s per node, and ordered bulk comm can achieve ~8 GB/s per node.
Aries is capable of ~8 GB/s unidirectional and, I think, ~15 GB/s bidirectional.
"""For applications in which traffic is uniformly distributed from each node to each of the other nodes
(e.g., all-to-all), global bandwidth controls performance rather than the bisection — and all the optical
links contribute. Peak global bandwidth is 11.7 GB/s per node for a full network. With the payload
efficiency of 64 percent this equates to 7.5 GB/s per direction. """
Refs --
https://chapel-lang.org/perf/16-node-xc/?configs=gnuugniqthreads&graphs=smallarraygetperformance,largearraygetperformance,smallarrayputperformance,largearrayputperformance has those numbers
https://www.cray.com/sites/default/files/resources/CrayXCNetwork.pdf has more information on that than you'll ever want
The NPB-FT class D problem is a 32 GB array. Our default configuration happens to be 8 locales, so 4 GB per locale. A single FFT requires both GET-ing and PUT-ing this data twice, for a total of 4 × 4 GB, or 16 GB of communication per locale per FFT. At ~8 GB/s of bulk bandwidth per node, if we don't overlap communication, the best we can do (ignoring the actual FFT time) is 2 s per FFT.
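For concreteness, here is a minimal back-of-envelope sketch of that communication lower bound; the 32 GB / 8-locale split and the ~8 GB/s bulk rate come from the numbers above, and everything else is plain arithmetic:

```python
# Back-of-envelope: communication lower bound per FFT for NPB-FT class D on 8 locales.
total_array_gb = 32.0        # class D problem size (from above)
num_locales = 8              # default configuration (from above)
bulk_bw_gb_per_s = 8.0       # ordered bulk comm rate per node (rule of thumb above)

per_locale_gb = total_array_gb / num_locales       # 4 GB per locale
# Each FFT both GETs and PUTs this data twice: 4 transfers of the local block.
comm_gb_per_fft = 4 * per_locale_gb                # 16 GB per locale per FFT

comm_time_s = comm_gb_per_fft / bulk_bw_gb_per_s   # 2.0 s per FFT with no comm overlap
print(f"communication lower bound: {comm_time_s:.1f} s per FFT")
```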
The YZ FFT (which is completely local) takes ~0.2 s per iteration, so the total compute time for the full XYZ FFT might be estimated at ~0.3 s per FFT.
So the estimated time is ~2.3 s per FFT, or about 60 s total. So maybe #4 is already at this limit, or very close.
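And a sketch of the corresponding total-runtime estimate; note that the 25-iteration count is my assumption based on the standard NPB class D setting, not something stated above:

```python
# Rough total-runtime estimate: (comm lower bound + local FFT time) per FFT, times iterations.
comm_s_per_fft = 2.0       # from the bandwidth estimate above
compute_s_per_fft = 0.3    # local YZ FFT ~0.2 s; full XYZ estimated at ~0.3 s (from above)
iterations = 25            # ASSUMPTION: the NPB class D default iteration count

total_s = iterations * (comm_s_per_fft + compute_s_per_fft)
print(f"estimated total: {total_s:.0f} s")  # prints ~58 s, in line with the ~60 s figure above
```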