Starting from the MPI example of computing pi
and using this Intel example, construct a code that computes the value of pi
using 2 or more GPUs, with 1 GPU device per MPI task.
An approximation to the value of π can be calculated from the following expression
where the answer becomes more accurate with increasing N. As each term is independent, the summation over i can be parallelized nearly trivially. The work is divided among ntasks MPI tasks
so that rank 0 does i = 1, 2, ..., N / ntasks, rank 1 does i = N / ntasks + 1, N / ntasks + 2, ..., and so on (we assume that N is evenly divisible by the number of tasks). Each task computes its own partial sum. Once finished with the calculation, all ranks except rank 0 send their partial sums to rank 0, which then calculates the final result and prints it out.
Starting from the MPI parallel code pi.cpp, make a version that performs the local reduction with SYCL, similar to the reduction with buffer or reduction with USM examples. Remember to assign 1 GPU to each MPI task, as in the MPI examples, taking into account that each Mahti GPU node has 4 GPUs and each LUMI-G node has 8 GPUs.
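One simple way to map tasks to GPUs is to take the rank modulo the number of GPUs per node. The sketch below shows only that mapping; the helper name `local_device_index` is hypothetical, and it assumes that ranks are placed on nodes in consecutive blocks with exactly one task per GPU.

```cpp
// Hypothetical helper: index of the GPU (within its node) that an MPI
// task should use. Assumes ranks fill the nodes in consecutive blocks
// and that gpus_per_node tasks run on every node
// (4 on a Mahti GPU node, 8 on a LUMI-G node).
int local_device_index(int rank, int gpus_per_node)
{
    return rank % gpus_per_node;
}
```

With 8 tasks on two Mahti nodes, ranks 0 to 3 would get devices 0 to 3 on the first node and ranks 4 to 7 would get devices 0 to 3 on the second. In a SYCL code the index could be used to pick an entry from the list returned by `sycl::device::get_devices(sycl::info::device_type::gpu)` when constructing the queue. If the rank placement is not known in advance, a node-local rank obtained with `MPI_Comm_split_type` and `MPI_COMM_TYPE_SHARED` is a more robust choice than the global rank.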