In this initial step you have to express the parallelism available around the most expensive loops of the application. We assume you already have some OpenACC knowledge.
In this lab we use the kernels directive extensively, but the loop directive could also have been used.
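For reference, here is a minimal sketch (illustrative only, not taken from the FWI sources) of the two styles applied to a trivial loop: kernels lets the compiler decide what to parallelize, while parallel loop asserts it explicitly.

/* Illustrative sketch only (not from the FWI sources): the same loop
   written with the two OpenACC directive styles. */
void scale_kernels(int n, float a, const float *in, float *restrict out)
{
    /* Compiler-driven: the compiler analyses the region and decides
       which loops are safe to parallelize. */
    #pragma acc kernels
    for (int i = 0; i < n; i++)
        out[i] = a * in[i];
}

void scale_parallel_loop(int n, float a, const float *in, float *restrict out)
{
    /* Programmer-driven: we assert that the iterations are independent. */
    #pragma acc parallel loop
    for (int i = 0; i < n; i++)
        out[i] = a * in[i];
}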
To save time, we will assume that we have already profiled the mini-app and obtained these results (see Appendix 0):
Number of frequencies 1
Number of shots 1
Number of gradient iterations 1
Number of test iterations 1
Output directory path: results
FWI Program finished in 1059.575686 seconds
======== CPU profiling result (bottom up):
Time(%) Time Name
19.99% 211.766s IDX
16.53% 175.115s compute_component_scell_TR
16.28% 172.505s compute_component_scell_BL
15.63% 165.634s compute_component_scell_BR
9.47% 100.343s compute_component_scell_TL
5.87% 62.2217s compute_component_vcell_TR
5.83% 61.7416s compute_component_vcell_BR
5.17% 54.7615s compute_component_vcell_BL
5.15% 54.5714s compute_component_vcell_TL
======== Data collected at 100Hz frequency
======== Percentage threshold: 1%
We can see that the scell and vcell functions dominate the execution time. IDX is the function that linearizes the (i,j,k) triplet used to index the 3D volume into the linear index used in memory. Usually the compiler is smart enough to inline it, but in this execution it didn't. Since we know that IDX is only called from the scell and vcell functions, we can safely attribute its execution time to them. Therefore we can say that scell and vcell account for 99% of the application's execution time.
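As a point of reference, a linearization helper of this kind typically looks like the following minimal sketch (the exact formula, name and types used in src/fwi_propagator.c may differ; here we assume z is the fastest-varying dimension):

/* Illustrative sketch of a 3D-to-1D index linearization; the actual IDX in
   the FWI sources may use a different layout or its own integer typedef. */
static inline int idx3d(int z, int x, int y, int dimmz, int dimmx)
{
    /* z varies fastest, then x, then y (assumed layout). */
    return (y * dimmx + x) * dimmz + z;
}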
If we take a look at those functions in src/fwi_propagator.c we arrive at the following conclusions:
- They are embarrassingly parallel.
- All four variants (TL/TR/BL/BR) are very similar.
- We have to apply the same parallelization strategy to all scell and vcell functions.
We are going to use CUDA Unified Memory to reduce the initial porting complexity. For that purpose we have already modified CMakeLists.txt to add the managed option to the -ta=tesla OpenACC target:
set(OpenACC_C_FLAGS "${OpenACC_C_FLAGS} -ta=tesla,cuda8.0,cc20,cc35,cc50,cc60,lineinfo,managed")
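With managed memory the CUDA runtime migrates pages between host and device on demand, so no explicit data clauses are needed at this stage. Without it, each kernels region would need data movement along the lines of the following sketch (illustrative only; the array names and extents are placeholders, not the actual FWI arguments):

/* Illustrative only: the explicit data clauses that the managed option
   lets us skip. Names below are placeholders. */
void accumulate(int nelems, const float *in, float *restrict out)
{
    #pragma acc data copyin(in[0:nelems]) copy(out[0:nelems])
    {
        #pragma acc kernels
        for (int i = 0; i < nelems; i++)
            out[i] += in[i];
    }
}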
To facilitate your work we have already implemented the majority of the OpenACC pragmas, leaving vcell_TL and scell_TR for you to implement.
You have to add a #pragma acc kernels directive on the outer-most loop of src/fwi_propagator.c:166 (compute_component_vcell_TL) and src/fwi_propagator.c:624 (compute_component_scell_TR).
Example for vcell_TL:
#pragma acc kernels
for (integer y = ny0; y < nyf; y++)
{
    for (integer x = nx0; x < nxf; x++)
    {
        for (integer z = nz0; z < nzf; z++)
        {
            const real lrho = rho_TL(rho, z, x, y, dimmz, dimmx);

            const real stx = stencil_X(_SX, sxptr, dxi, z, x, y, dimmz, dimmx);
            const real sty = stencil_Y(_SY, syptr, dyi, z, x, y, dimmz, dimmx);
            const real stz = stencil_Z(_SZ, szptr, dzi, z, x, y, dimmz, dimmx);

            vptr[IDX(z, x, y, dimmz, dimmx)] += (stx + sty + stz) * dt * lrho;
        }
    }
}
Then compile fwi (make sure to use the pgcc compiler and to enable OpenACC with -DUSE_OPENACC=ON!) and observe the compiler output:
$ mkdir build
$ cd build
$ cmake -DCMAKE_C_COMPILER=pgcc -DUSE_OPENMP=OFF -DUSE_OPENACC=ON ..
$ make
...
[ 11%] Building C object src/CMakeFiles/fwi-core.dir/fwi_propagator.c.o
...
compute_component_vcell_TL:
166, Generating implicit copy(vptr[:])
170, Loop carried dependence due to exposed use of vptr[:*] prevents parallelization
Accelerator scalar kernel generated
Accelerator kernel generated
Generating Tesla code
170, #pragma acc loop seq
174, #pragma acc loop seq
180, #pragma acc loop seq
174, Loop carried dependence due to exposed use of vptr[:*] prevents parallelization
180, Complex loop carried dependence of vptr-> prevents parallelization
Loop carried dependence due to exposed use of vptr[:*] prevents parallelization
...
compute_component_scell_TR:
636, Loop carried dependence due to exposed use of sxzptr[:*],szzptr[:*],syzptr[:*],syyptr[:*],sxyptr[:*],sxxptr[:*] prevents parallelization
Accelerator scalar kernel generated
Accelerator kernel generated
Generating Tesla code
636, #pragma acc loop seq
640, #pragma acc loop seq
646, #pragma acc loop seq
640, Loop carried dependence due to exposed use of sxzptr[:*],syzptr[:*],szzptr[:*],syyptr[:*],sxyptr[:*],sxxptr[:*] prevents parallelization
646, Complex loop carried dependence of sxxptr->,syyptr-> prevents parallelization
Loop carried dependence due to exposed use of sxzptr[:*] prevents parallelization
Complex loop carried dependence of szzptr-> prevents parallelization
Loop carried dependence due to exposed use of szzptr[:*],syzptr[:*] prevents parallelization
Complex loop carried dependence of syzptr-> prevents parallelization
Loop carried dependence due to exposed use of syyptr[:*] prevents parallelization
Complex loop carried dependence of sxzptr-> prevents parallelization
Loop carried dependence due to exposed use of sxyptr[:*] prevents parallelization
Complex loop carried dependence of sxyptr-> prevents parallelization
Loop carried dependence due to exposed use of sxxptr[:*] prevents parallelization
...
Oops! The compiler detects a dependence and refuses to parallelize (it generates a scalar kernel!). The reported dependences typically arise because the compiler cannot prove that the pointers do not alias. Since we know that vcell_TL is embarrassingly parallel and there is no real dependence, we have to tell the compiler to ignore its analysis and parallelize the loops anyway.
For that we add #pragma acc loop independent before each loop level:
#pragma acc kernels
#pragma acc loop independent
for (integer y = ny0; y < nyf; y++) {
    #pragma acc loop independent
    for (integer x = nx0; x < nxf; x++) {
        #pragma acc loop independent
        for (integer z = nz0; z < nzf; z++) {
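The same pattern applies to compute_component_scell_TR. As a rough sketch (assuming it has the same y/x/z loop nest; the actual body at src/fwi_propagator.c:624 computes the six stress components and is not reproduced here):

/* Sketch of the annotation for compute_component_scell_TR; the loop bounds
   follow the same pattern as vcell_TL and the body is elided. */
#pragma acc kernels
#pragma acc loop independent
for (integer y = ny0; y < nyf; y++) {
    #pragma acc loop independent
    for (integer x = nx0; x < nxf; x++) {
        #pragma acc loop independent
        for (integer z = nz0; z < nzf; z++) {
            /* stress-cell update of sxxptr, syyptr, szzptr,
               syzptr, sxzptr and sxyptr goes here (unchanged) */
        }
    }
}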
Then we can compile the application again:
$ make
Scanning dependencies of target fwi-core
[ 11%] Building C object src/CMakeFiles/fwi-core.dir/fwi_propagator.c.o
...
compute_component_vcell_TL:
166, Generating implicit copy(vptr[:])
171, Loop is parallelizable
Generating Multicore code
171, #pragma acc loop gang
176, Loop is parallelizable
183, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
171, #pragma acc loop gang /* blockIdx.y */
176, #pragma acc loop gang, vector(4) /* blockIdx.z threadIdx.y */
183, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */
...
compute_component_scell_TR:
640, Loop is parallelizable
Generating Multicore code
640, #pragma acc loop gang
645, Loop is parallelizable
652, Loop is parallelizable
Accelerator kernel generated
Generating Tesla code
640, #pragma acc loop gang /* blockIdx.y */
645, #pragma acc loop gang, vector(4) /* blockIdx.z threadIdx.y */
652, #pragma acc loop gang, vector(32) /* blockIdx.x threadIdx.x */
...
[ 9%] Linking C static library ../lib/libfwi-core.a
[ 33%] Built target fwi-core
[ 38%] Linking C executable ../../bin/fwi-data-generator
[ 42%] Built target fwi-data-generator
[ 47%] Linking C executable ../../bin/fwi-sched-generator
[ 52%] Built target fwi-sched-generator
[ 57%] Linking C executable ../bin/fwi
[ 61%] Built target fwi
[ 76%] Built target Unity
[ 80%] Linking C executable ../bin/fwi-tests
[100%] Built target fwi-tests
Note: Functions called from a device parallel region must be declared as device functions. For that, OpenACC provides the #pragma acc routine <type> directive. To speed up the hands-on lab we already provide the code with those pragmas. We encourage you to check include/fwi/fwi_propagator.h to see how it's done.
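As an illustration, a header declaration annotated this way looks roughly like the following (a sketch, not the exact contents of include/fwi/fwi_propagator.h; the seq clause marks the routine as sequential on the device):

/* Sketch of a device-callable helper declaration; the real header may use
   different prototypes for IDX, rho_TL, stencil_X, etc. */
#pragma acc routine seq
real rho_TL(const real *rho,
            const integer z, const integer x, const integer y,
            const integer dimmz, const integer dimmx);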
After implementing all scell and vcell functions we can proceed to measure the execution time:
$ make irun
[ 62%] Built target fwi-core
[ 87%] Built target fwi
Scanning dependencies of target irun
[100%] outputs will be in /home/ubuntu/FWI/scripts/output/
PROJECT_SOURCE_DIR: /home/ubuntu/FWI
PROJECT_BINARY_DIR: /home/ubuntu/FWI/build/bin
COMPILER_ID: PGI
---
/home/ubuntu/FWI/build/bin/fwi fwi_schedule.txt
---
MPI rank 0 with GPU 0 (1)
Number of frequencies 1
Number of shots 1
Number of gradient iterations 1
Number of test iterations 1
Output directory path: results
FWI Program finished in 8.664497 seconds
[100%] Built target irun
That is, 13.9x faster than the OpenMP execution.
Remember that you can see the differences from the solution with git diff gtc2018-step1-sol.