In part 1, we implement a 3D multi-XPU diffusion solver computing the steady-state solution of a diffusive process (see equation (1)) for given physical timesteps using an implicit approach leveraging pseudo-transient acceleration (using the so-called "dual-time" method).
For additional reading related to this method, we refer the reader here.
We solve the diffusion equation on a cubic domain.
As initial condition, we define a Gaussian distribution of H centered in the domain's center with amplitude 2 and standard deviation 1.
Moreover, we enforce Dirichlet boundary conditions H = 0 on all 6 faces.
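To make the setup concrete, a minimal sketch of the initial and boundary conditions could look as follows; the array names and the exact Gaussian normalization are illustrative, not taken verbatim from our solver:

```julia
# Illustrative setup sketch; names and normalization are assumptions.
lx, ly, lz = 10.0, 10.0, 10.0      # physical domain extent
nx, ny, nz = 128, 128, 128         # grid resolution
dx, dy, dz = lx / nx, ly / ny, lz / nz

# cell-centered coordinates, centered on the domain's midpoint
xc = LinRange(-lx / 2 + dx / 2, lx / 2 - dx / 2, nx)
yc = LinRange(-ly / 2 + dy / 2, ly / 2 - dy / 2, ny)
zc = LinRange(-lz / 2 + dz / 2, lz / 2 - dz / 2, nz)

# Gaussian initial condition: amplitude 2, standard deviation 1
H = [2.0 * exp(-(x^2 + y^2 + z^2) / (2 * 1.0^2)) for x in xc, y in yc, z in zc]

# Dirichlet boundary conditions: H = 0 on all 6 faces
H[1, :, :] .= 0.0;  H[end, :, :] .= 0.0
H[:, 1, :] .= 0.0;  H[:, end, :] .= 0.0
H[:, :, 1] .= 0.0;  H[:, :, end] .= 0.0
```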
In our implementation we use ParallelStencil.jl and ImplicitGlobalGrid.jl for the (multi-)XPU implementation.
We implemented the solver using both array programming and kernel programming.
However, we focus on the analysis of the solver and report results only for our kernel programming implementation.
For the kernel programming approach we implemented the solver such that it can use shared memory or `hide_communication`.
Ideally we should combine both approaches, however, this is not possible with the current implementation of ParallelStencil.jl, so in our experiments we consider them mutually exclusive.
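For reference, the `hide_communication` variant follows the pattern documented for ParallelStencil.jl, where computation on the inner points overlaps with the boundary computation and the halo update; the kernel name, its arguments, and the boundary width below are placeholders, not the exact values from our code:

```julia
# Sketch only: `compute_H!` stands for our @parallel diffusion kernel,
# and (16, 2, 2) is a typical boundary width, not a tuned value.
@hide_communication (16, 2, 2) begin
    @parallel compute_H!(H2, H, Hold, D, dt, dτ, dx, dy, dz)
    update_halo!(H2)
end
```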
We conducted all numerical experiments on a single compute node of octopus.
Such a node has 2 Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz CPUs (6 cores each) and 4 NVIDIA GeForce GTX TITAN X GPUs.
An overview of the architecture of the node is given below:
```
ntselepidis@node15 ➜ ~ nvidia-smi topo -m
        GPU0  GPU1  GPU2  GPU3  mlx4_0  mlx4_1  CPU Affinity  NUMA Affinity
GPU0     X    PHB   SYS   SYS   PHB     SYS     0-5           0
GPU1    PHB    X    SYS   SYS   PHB     SYS     0-5           0
GPU2    SYS   SYS    X    PHB   SYS     PHB     6-11          1
GPU3    SYS   SYS   PHB    X    SYS     PHB     6-11          1
mlx4_0  PHB   PHB   SYS   SYS    X      SYS
mlx4_1  SYS   SYS   PHB   PHB   SYS      X

Legend:
  X    = Self
  SYS  = Connection traversing PCIe as well as the SMP interconnect between NUMA nodes (e.g., QPI/UPI)
  NODE = Connection traversing PCIe as well as the interconnect between PCIe Host Bridges within a NUMA node
  PHB  = Connection traversing PCIe as well as a PCIe Host Bridge (typically the CPU)
  PXB  = Connection traversing multiple PCIe bridges (without traversing the PCIe Host Bridge)
  PIX  = Connection traversing at most a single PCIe bridge
  NV#  = Connection traversing a bonded set of # NVLinks
```
We note that for an NVIDIA GeForce GTX TITAN X the theoretical performance for double-precision arithmetic is 209.1 GFlop/s and the memory bandwidth is 336.6 GB/s, while for an Intel(R) Xeon(R) CPU E5-2620 v3 @ 2.40GHz the memory bandwidth is 59 GB/s.
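These specs imply a roofline "ridge point" for the GPU that is useful to keep in mind when interpreting the operational-intensity results later; the arithmetic below follows directly from the quoted numbers:

```julia
# Roofline ridge point of the GTX TITAN X (double precision)
peak_flops = 209.1e9                # Flop/s
peak_bw    = 336.6e9                # B/s
ridge      = peak_flops / peak_bw   # ≈ 0.62 Flop/B
# Kernels with operational intensity below ~0.62 Flop/B are memory
# bound on this GPU; above it, compute bound.
```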
Below we show an animation of the 3D solution of the diffusive process. Note that we continually adjust the colorbar for better visualization.
We instrumented our code and computed the operational intensity for the two approaches. We summarize our findings in the following plot.
Using shared memory allows the reuse of local data for the stencil computations and thus reduces the number of loads and stores. Therefore, given that the number of flops for the two approaches is the same, the approach that uses shared memory leads to higher operational intensity.
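As an illustration of why this happens, consider a back-of-the-envelope estimate for a 7-point stencil update; the flop and byte counts below are assumptions for a generic kernel, not the instrumented values from our code:

```julia
# Per-grid-point estimate for a generic 7-point stencil (assumed counts)
flops        = 10                # ~adds and mults per grid point
bytes_naive  = (7 + 1) * 8       # naive kernel: 7 Float64 reads + 1 write
bytes_shared = (1 + 1) * 8       # shared memory: each value loaded ~once

oi_naive  = flops / bytes_naive   # ≈ 0.16 Flop/B
oi_shared = flops / bytes_shared  # ≈ 0.63 Flop/B
```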
In this section we present the results of our multi-XPU strong and weak scaling experiments. When running on the CPU, we map one MPI rank per core (i.e. for this experiment we do not use multi-threading). When running on GPUs, we map one MPI rank per GPU. In our strong scaling experiments we increase the total number of MPI ranks while keeping the overall work constant. In our weak scaling experiments we increase the total number of MPI ranks, and also proportionally increase the overall work, so that the work per MPI rank is constant. We run all experiments using both our kernel programming approaches, i.e.:
- using shared memory (without `hide_communication`),
- using `hide_communication` (without shared memory),
and report performance [GFlop/s], throughput [GB/s], and time [s].
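The rank-to-device mapping for the GPU runs can be sketched as follows; `select_device()` is provided by ImplicitGlobalGrid.jl and binds each MPI rank to one GPU on the node (the surrounding code is illustrative, and `select_device()` applies to the GPU runs only):

```julia
using ImplicitGlobalGrid, MPI

nx = ny = nz = 2^7
me, dims, nprocs = init_global_grid(nx, ny, nz)  # also initializes MPI
select_device()   # one MPI rank per GPU (GPU runs only)
# ... run the solver ...
finalize_global_grid()
```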
Notes:
- It should be mentioned that the operational intensity of the two approaches is different, because using shared memory substantially reduces the bytes moved (see previous section). Therefore, the throughput plots do not directly show the difference in runtime. On the other hand, our performance plots can directly reflect solving time, since the total amount of floating-point operations is the same in both of our approaches.
- We expect that using shared memory through ParallelStencil.jl does not have an impact on the CPU, since this is purely a GPU feature.
Below we show the part of our benchmark code that computes the local problem sizes in our strong and weak scaling experiments along with some reasoning about our choice in the comments.
```julia
# We need 7 buffers for the solver
# memory(nx) = nx^3 * sizeof(Float64) * 7
# => with nx = ny = nz = 2^9 we have memory(nx) ≈ 7.5 GB => fits on a Titan X (12 GB)
# After experimenting with 2^9 we found it very hard to converge.
# Empirically, nx = ny = nz = 2^7 gives a good trade-off between convergence speed and precision.
nx, ny, nz =
    if strong_scaling
        dims_dict = Dict(
            1 => (1, 1, 1),
            2 => (2, 1, 1),
            4 => (2, 2, 1),
            8 => (2, 2, 2),
        )
        2^7 .÷ dims_dict[MPI.Comm_size(MPI.COMM_WORLD)] # global 2^7 gets divided to determine local size
    else
        2^7 .* (1, 1, 1) # local 2^7 (assigned to every MPI rank)
    end
```
We also note that, to ensure the consistency of our weak scaling experiments, we scale the physical dimensions lx, ly, lz so that the workload per MPI rank stays constant.
We achieve this with the code shown below:
```julia
# Allow scaling the physical problem dimensions with the number of grid points.
# This is useful for weak scaling experiments where we want to keep the work per worker (i.e. MPI rank) fixed.
# In particular, note that usually increasing nx -> 2*nx => dx -> dx/2 => dτ -> dτ/4 => !nonlinear!
# Therefore, by scaling the physical size together with (nx, ny, nz) we can keep (dx, dy, dz) constant
# and therefore keep the work per worker constant.
if scale_physical_size
    lx, ly, lz = dims .* (10., 10., 10.)
else
    lx, ly, lz = 10.0, 10.0, 10.0
end
```
Strong and Weak Scaling on the CPU
Observations
- Both the approach that uses shared memory and the approach using `hide_communication` exhibit great strong as well as weak scaling on the CPU.
- In particular, for weak scaling the solving time stays approximately constant, and for both strong and weak scaling the throughput and performance scale proportionally to the compute resources.
- Interestingly, the fastest variant is the one that does not use `hide_communication`. Since shared memory is a GPU feature, it is not clear how ParallelStencil.jl implements it on the CPU and whether such a performance difference should be expected. On the other hand, `hide_communication` might also slow the computation down in the CPU case.
- In terms of memory throughput, the approach that uses `hide_communication` achieves higher throughput, but as mentioned earlier this is due to an incomparable operational intensity.
- The successful (strong) scaling indicates that we are compute bound on the CPU.
Strong and Weak Scaling on the GPU
Observations
- Both the approach that uses shared memory and the approach using `hide_communication` exhibit great weak scaling on the GPU, while providing more complicated results for strong scaling.
- In the strong scaling plots we see some performance increase when moving from 1 GPU to 2 GPUs in the shared memory case, although for the `hide_communication` case there is no improvement. On the other hand, when moving from 2 GPUs to 4 GPUs we observe a small performance drop, roughly back to the level of 1 GPU. This is probably because when moving from 1 GPU to 2 GPUs we transition from a compute bound regime to a memory bound regime, whereas when moving from 2 GPUs to 4 GPUs we transition from a memory bound regime to a communication bound regime. We further note the following:
  - when running on 1 GPU, using shared memory does not really impact the runtime, since we are compute bound
  - when running on 2 GPUs, using shared memory improves the runtime compared to `hide_communication`, since we are memory bound
  - when running on 4 GPUs, using `hide_communication` leads to smaller runtime compared to shared memory, since we are communication bound
- In the weak scaling plots we see that increasing the problem size, while keeping the work per GPU fixed, leads to almost constant runtime (while increasing performance and throughput). This is the best result one could hope for.
- Moreover, in this case using `hide_communication` seems to bring gradually more benefit as the problem size (and proportionally the number of GPUs used) increases.
- Finally, we argue that nx = ny = nz = 2^7 is an appropriate local grid size, as we saturate a single GPU (as seen in the strong scaling results) and can successfully scale in the weak scaling benchmark.
In the following figures, we plot the solution behaviour (i.e. the evolution of a value of the diffused quantity at a specific location) as a function of the solver's tolerance (left), and as a function of the numerical grid resolution (right).
Observations
- For this setup, lowering the solver's tolerance improves precision up to approximately tol = 1e-6, after which the solution seems to have converged.
- Increasing the numerical grid resolution is important for achieving good results. It seems that the largest grid size that fits on a single NVIDIA GeForce GTX TITAN X is still not enough to fully converge the solution, which further motivates the use of distributed computations.