Skip to content

Using APEX with Kokkos

Kevin Huck edited this page Sep 11, 2024 · 17 revisions

Building/Using APEX to measure Kokkos applications

General

No special support in APEX is needed to measure Kokkos loops in an application. To configure APEX for a vanilla Kokkos profiling scenario, do (changing gcc/g++ to whatever compilers you are using for your application, and modifying the installation prefix if desired):

git clone https://github.com/UO-OACISS/apex.git
cd apex
mkdir build
cd build
cmake \
-DCMAKE_C_COMPILER=`which gcc` \
-DCMAKE_CXX_COMPILER=`which g++` \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_INSTALL_PREFIX=${cwd}/install \
..
make -j8
make install

To use APEX with a sample application, you just prefix your program launch with apex_exec and any APEX options. For example, with this test program:

#include <cmath>
#include <iostream>
#include <vector>
#include <thread>
#include <Kokkos_Core.hpp>

#ifndef EXECUTION_SPACE
#define EXECUTION_SPACE DefaultExecutionSpace
#endif

void go(size_t i) {
    int n = 512;
    Kokkos::View<double*> a("a",n);
    Kokkos::View<double*> b("b",n);
    Kokkos::View<double*> c("c",n);

    auto range = Kokkos::RangePolicy<Kokkos::EXECUTION_SPACE>(0,n);

    Kokkos::parallel_for(
        "initialize", range, KOKKOS_LAMBDA(size_t const i) {
            auto x = static_cast<double>(i);
            a(i) = sin(x) * sin(x);
            b(i) = cos(x) * cos(x);
        }
    );

    Kokkos::parallel_for(
        "xpy", range, KOKKOS_LAMBDA(size_t const i) {
            c(i) = a(i) + b(i);
        }
    );

    double sum = 0.0;

    Kokkos::parallel_reduce(
        "sum", range, KOKKOS_LAMBDA(size_t const i, double &lsum) {
            lsum += c(i);
        }, sum
    );

    if (i % 10 == 0) {
        std::cout << "sum = " << sum / n << std::endl;
    }
}

int main(int argc, char *argv[]) {
    Kokkos::initialize(argc, argv);
    std::cout << "Kokkos execution space: "
        << Kokkos::DefaultExecutionSpace::name() << std::endl;
    for (size_t i = 0 ; i < 10 ; i++) {
        go(i);
    }
    Kokkos::finalize();
}

...you would run with:

$ ../apex/install/bin/apex_exec --apex:kokkos ./test_default
Program to run : ./test_default
sum = 1

Elapsed time: 0.001552 seconds
Total processes detected: 1
Cores detected on rank 0: 8
Worker Threads observed on rank 0: 1
Available CPU time on rank 0: 0.001552 seconds
Available CPU time on all ranks: 0.001552 seconds

GPU Timers                                           : #calls  |    mean  |   total  |  % total  
------------------------------------------------------------------------------------------------
------------------------------------------------------------------------------------------------

CPU Timers                                           : #calls  |    mean  |   total  |  % total  
------------------------------------------------------------------------------------------------
Kokkos::parallel_for [Type:Threads, Device: 0] in... :       10      0.000      0.000      9.085
  Kokkos::parallel_for [Type:Threads, Device: 0] xpy :       10      0.000      0.000      3.608
Kokkos::parallel_reduce [Type:Threads, Device: 0]... :       10      0.000      0.000      2.706
Kokkos::parallel_for [Type:Threads, Device: 0] hello :        1      0.000      0.000      2.577
Kokkos::parallel_for [Type:Threads, Device: 0] Ko... :       10      0.000      0.000      0.966
Kokkos::parallel_for [Type:Threads, Device: 0] Ko... :       10      0.000      0.000      0.838
Kokkos::parallel_for [Type:Threads, Device: 0] Ko... :       10      0.000      0.000      0.709
------------------------------------------------------------------------------------------------

                                           APEX Idle :                          0.001     79.510
------------------------------------------------------------------------------------------------
                                        Total timers : 61

Run apex_exec for a list of supported options:

% ./install/bin/apex_exec 

Usage:
apex_exec <APEX options> executable <executable options>

where APEX options are zero or more of:
    --apex:help             show this usage message
    --apex:debug            run with APEX in debugger
    --apex:verbose          enable verbose list of APEX environment variables
    --apex:screen           enable screen text output (on by default)
    --apex:quiet            disable screen text output
    --apex:csv              enable csv text output
    --apex:tau              enable tau profile output
    --apex:taskgraph        enable taskgraph output
                            (graphviz required for post-processing)
    --apex:tasktree         enable tasktree output
                            (graphviz required for post-processing)
    --apex:otf2             enable OTF2 trace output
    --apex:otf2path <value> specify location of OTF2 archive
                            (default: ./OTF2_archive)
    --apex:otf2name <value> specify name of OTF2 file (default: APEX)
    --apex:gtrace           enable Google Trace Events output
    --apex:scatter          enable scatterplot output
                            (python required for post-processing)
    --apex:openacc          enable OpenACC support
    --apex:kokkos           enable Kokkos support
    --apex:kokkos_tuning    enable Kokkos runtime autotuning support
    --apex:kokkos_fence     enable Kokkos fences for async kernels
    --apex:raja             enable RAJA support
    --apex:pthread          enable pthread wrapper support
    --apex:memory           enable memory wrapper support
    --apex:untied           enable tasks to migrate cores/OS threads
                            during execution (not compatible with trace output)
    --apex:cuda             enable CUDA/CUPTI measurement (default: off)
    --apex:cuda_counters    enable CUDA/CUPTI counter support (default: off)
    --apex:cuda_driver      enable CUDA driver API callbacks (default: off)
    --apex:cuda_details     enable per-kernel statistics where available (default: off)
    --apex:hip              enable HIP/ROCTracer measurement (default: off)
    --apex:hip_metrics      enable HIP/ROCProfiler metric support (default: off)
    --apex:hip_counters     enable HIP/ROCTracer counter support (default: off)
    --apex:hip_driver       enable HIP/ROCTracer KSA driver API callbacks (default: off)
    --apex:hip_details      enable per-kernel statistics where available (default: off)
    --apex:monitor_gpu      enable GPU monitoring services (CUDA NVML, ROCm SMI)
    --apex:cpuinfo          enable sampling of /proc/cpuinfo (Linux only)
    --apex:meminfo          enable sampling of /proc/meminfo (Linux only)
    --apex:net              enable sampling of /proc/net/dev (Linux only)
    --apex:status           enable sampling of /proc/self/status (Linux only)
    --apex:io               enable sampling of /proc/self/io (Linux only)
    --apex:period <value>   specify frequency of OS/HW sampling
    --apex:ompt             enable OpenMP profiling (requires runtime support)
    --apex:ompt_simple      only enable OpenMP Tools required events
    --apex:ompt_details     enable all OpenMP Tools events
    --apex:source           resolve function, file and line info for address lookups with binutils
                            (default: function only)
    --apex:preload <lib>    extra libraries to load with LD_PRELOAD _before_ APEX libraries
                            (LD_PRELOAD value is added _after_ APEX libraries)
    

CUDA Support

To enable CUDA / CUPTI support, reconfigure/recompile APEX with CUDA support enabled (assuming the CUDA environment is set up and nvcc is in your path). You may also need to provide the CUDAToolkit path with -DCUDAToolkit_ROOT=/path/to/cuda:

cmake \
-DCMAKE_C_COMPILER=`which gcc` \
-DCMAKE_CXX_COMPILER=`which g++` \
-DCMAKE_BUILD_TYPE=Release \
-DCMAKE_INSTALL_PREFIX=../install \
-DAPEX_WITH_CUDA=TRUE \
-DCUDAToolkit_ROOT=/packages/nvhpc/22.11_cuda11.8/Linux_x86_64/22.11/cuda/11.8 \
..

Then, to add measurement of CUDA, add the --apex:cuda option:

$ ../apex/install/bin/apex_exec --apex:kokkos --apex:cuda ./test_default.cuda 
Program to run : ./test_default.cuda
Kokkos execution space: Cuda
sum = 1

Elapsed time: 0.559486 seconds
Total processes detected: 1
Cores detected on rank 0: 160
Worker Threads observed on rank 0: 1
Available CPU time on rank 0: 0.559486 seconds
Available CPU time on all ranks: 0.559486 seconds

Counter                                   : #samples | minimum |    mean  |  maximum |  stddev 
------------------------------------------------------------------------------------------------
                    1 Minute Load average :        1      3.210      3.210      3.210      0.000   
                     GPU: Bytes Allocated :       37      4.000   5.20e+04   6.55e+05   1.55e+05   
                         GPU: Bytes Freed :       31      4.000   4087.871   4224.000    745.609   
      GPU: Total Bytes Occupied on Device :       68      0.000   1.65e+06   1.81e+06   4.82e+05   
        GPU: Total Bytes Occupied on Host :        3    256.000   3.61e+05   5.57e+05   2.55e+05   
        Host: Page-locked Bytes Allocated :        3    256.000   1.86e+05   5.24e+05   2.40e+05   
                           status:Threads :        1     66.000     66.000     66.000      0.000   
                            status:VmData :        1   5.48e+05   5.48e+05   5.48e+05      0.000   
                             status:VmExe :        1    896.000    896.000    896.000      0.000   
                             status:VmHWM :        1   1.85e+04   1.85e+04   1.85e+04      0.000   
                             status:VmLck :        1      0.000      0.000      0.000      0.000   
                             status:VmLib :        1   2.98e+04   2.98e+04   2.98e+04      0.000   
                             status:VmPMD :        1     16.000     16.000     16.000      0.000   
                             status:VmPTE :        1     21.000     21.000     21.000      0.000   
                            status:VmPeak :        1   6.49e+05   6.49e+05   6.49e+05      0.000   
                             status:VmPin :        1      0.000      0.000      0.000      0.000   
                             status:VmRSS :        1   1.85e+04   1.85e+04   1.85e+04      0.000   
                            status:VmSize :        1   6.49e+05   6.49e+05   6.49e+05      0.000   
                             status:VmStk :        1    192.000    192.000    192.000      0.000   
                            status:VmSwap :        1      0.000      0.000      0.000      0.000   
        status:nonvoluntary_ctxt_switches :        1      0.000      0.000      0.000      0.000   
           status:voluntary_ctxt_switches :        1     24.000     24.000     24.000      0.000   
------------------------------------------------------------------------------------------------

GPU Timers                                           : #calls  |    mean  |   total  |  % total  
------------------------------------------------------------------------------------------------
                             GPU: Stream Synchronize :       53      0.001      0.078     13.996
                            GPU: Context Synchronize :        5      0.001      0.003      0.453
GPU: void Kokkos::Impl::cuda_parallel_launch_loca... :       10      0.000      0.000      0.014
                                         GPU: Memset :       32      0.000      0.000      0.012
GPU: void Kokkos::Impl::cuda_parallel_launch_loca... :       10      0.000      0.000      0.012
                                    GPU: Memcpy HtoD :       42      0.000      0.000      0.010
GPU: void Kokkos::Impl::cuda_parallel_launch_loca... :       10      0.000      0.000      0.007
GPU: desul::(anonymous namespace)::init_lock_arra... :        1      0.000      0.000      0.005
GPU: Kokkos::(anonymous namespace)::init_lock_arr... :        1      0.000      0.000      0.001
GPU: Kokkos::(anonymous namespace)::init_lock_arr... :        1      0.000      0.000      0.001
GPU: Kokkos::Impl::(anonymous namespace)::query_c... :        1      0.000      0.000      0.001
                                    GPU: Memcpy DtoH :        1      0.000      0.000      0.000
------------------------------------------------------------------------------------------------

CPU Timers                                           : #calls  |    mean  |   total  |  % total  
------------------------------------------------------------------------------------------------
                               cudaDeviceSynchronize :        5      0.051      0.255     45.664
                                          cudaMemset :        2      0.071      0.141     25.255
                               cudaStreamSynchronize :       53      0.001      0.079     14.047
  Kokkos::parallel_reduce [Type:Cuda, Device: 0] sum :       10      0.002      0.021      3.676
                             cudaGetDeviceProperties :        4      0.000      0.002      0.321
                                          cudaMemcpy :        2      0.001      0.002      0.310
                                          cudaMalloc :       37      0.000      0.001      0.206
                                            cudaFree :       31      0.000      0.001      0.144
                                    cudaLaunchKernel :       34      0.000      0.001      0.111
                                       cudaHostAlloc :        2      0.000      0.000      0.075
                                     cudaMemcpyAsync :       33      0.000      0.000      0.063
Kokkos::parallel_for [Type:Cuda, Device: 0] initi... :       10      0.000      0.000      0.056
                                     cudaMemsetAsync :       30      0.000      0.000      0.053
     Kokkos::parallel_for [Type:Cuda, Device: 0] xpy :       10      0.000      0.000      0.040
Kokkos::parallel_for [Type:Cuda, Device: 0] Kokko... :       10      0.000      0.000      0.027
Kokkos::parallel_for [Type:Cuda, Device: 0] Kokko... :       10      0.000      0.000      0.026
Kokkos::parallel_for [Type:Cuda, Device: 0] Kokko... :       10      0.000      0.000      0.025
                                  cudaMemcpyToSymbol :        8      0.000      0.000      0.023
                                  cudaGetDeviceCount :        1      0.000      0.000      0.008
                               cudaFuncGetAttributes :        3      0.000      0.000      0.006
                              cudaFuncSetCacheConfig :        3      0.000      0.000      0.003
                                      cudaMallocHost :        1      0.000      0.000      0.003
                                       cudaSetDevice :        1      0.000      0.000      0.003
                                     cudaEventCreate :        1      0.000      0.000      0.002
                            cudaDeviceSetCacheConfig :        1      0.000      0.000      0.001
------------------------------------------------------------------------------------------------

------------------------------------------------------------------------------------------------
                                        Total timers : 479

To see the task dependency hierarchy, run with the --apex:tasktree option, which will generate a apex_tasktree.csv file that can be post-processed by apex-treesummary.py to generate a Graphviz DOT file as well as a human-readable tasktree.txt file: images/kokkos-tasktree.png

|-> 0.43476 - 100.0000% [1] {min=0.4348, max=0.4348, mean=0.4348, var=0.0000, std dev=0.0000} APEX MAIN 
|   |-> 0.20985 - 48.2670% [5] {min=0.0022, max=0.2008, mean=0.0420, var=0.0063, std dev=0.0794} cudaDeviceSynchronize 
|   |   |-> 0.00915 - 4.3614% [5] {min=0.0001, max=0.0023, mean=0.0018, var=0.0000, std dev=0.0009} GPU: Context Synchronize 
|   |   Remainder: 0.2007 - 46.1619%
|   |-> 0.07169 - 16.4887% [43] {min=0.0000, max=0.0023, mean=0.0017, var=0.0000, std dev=0.0010} cudaStreamSynchronize 
|   |   |-> 0.07145 - 99.6687% [43] {min=0.0000, max=0.0023, mean=0.0017, var=0.0000, std dev=0.0010} GPU: Stream Synchronize 
|   |   Remainder: 0.0002 - 0.0546%
|   |-> 0.03703 - 8.5169% [2] {min=0.0000, max=0.0370, mean=0.0185, var=0.0003, std dev=0.0185} cudaMemset 
|   |   |-> 0.00001 - 0.0239% [2] {min=0.0000, max=0.0000, mean=0.0000, var=0.0000, std dev=0.0000} GPU: Memset 
|   |   Remainder: 0.0370 - 8.5148%
|   |-> 0.02271 - 5.2234% [10] {min=0.0023, max=0.0023, mean=0.0023, var=0.0000, std dev=0.0000} Kokkos::parallel_reduce [Type:Cuda, Device: 0] sum 
|   |   |-> 0.02243 - 98.7593% [10] {min=0.0022, max=0.0023, mean=0.0022, var=0.0000, std dev=0.0000} cudaStreamSynchronize 
|   |   |   |-> 0.02237 - 99.7432% [10] {min=0.0022, max=0.0023, mean=0.0022, var=0.0000, std dev=0.0000} GPU: Stream Synchronize 
|   |   |   Remainder: 0.0001 - 0.2536%
|   |   |-> 0.00015 - 0.6575% [10] {min=0.0000, max=0.0000, mean=0.0000, var=0.0000, std dev=0.0000} cudaLaunchKernel 
|   |   |   |-> 0.00008 - 52.0754% [10] {min=0.0000, max=0.0000, mean=0.0000, var=0.0000, std dev=0.0000} GPU: void Kokkos::Impl::cuda_parallel_launch_local_memory<Kokkos::Impl::ParallelReduce<go(unsigned long)::{lambda(unsigned long, double&)#1}, Kokkos::RangePolicy<Kokkos::Cuda>, Kokkos::InvalidType, Kokkos::RangePolicy> >(Kokkos::Impl::ParallelReduce<go(unsigned long)::{lambda(unsigned long, double&)#1}, Kokkos::RangePolicy<Kokkos::Cuda>, Kokkos::InvalidType, Kokkos::RangePolicy>) 
|   |   |   Remainder: 0.0001 - 0.3151%
|   |   |-> 0.00001 - 0.0390% [1] {min=0.0000, max=0.0000, mean=0.0000, var=-0.0000, std dev=nan} cudaFuncGetAttributes 
|   |   |-> 0.00000 - 0.0216% [1] {min=0.0000, max=0.0000, mean=0.0000, var=0.0000, std dev=0.0000} cudaFuncSetCacheConfig 
|   |   Remainder: 0.0001 - 0.0273%
|   |-> 0.00713 - 1.6391% [10] {min=0.0000, max=0.0069, mean=0.0007, var=0.0000, std dev=0.0021} Kokkos::parallel_for [Type:Cuda, Device: 0] initialize 
|   |   |-> 0.00685 - 96.1635% [3] {min=0.0023, max=0.0023, mean=0.0023, var=0.0000, std dev=0.0000} cudaMemcpyToSymbol 
|   |   |   |-> 0.00000 - 0.0579% [3] {min=0.0000, max=0.0000, mean=0.0000, var=0.0000, std dev=0.0000} GPU: Memcpy HtoD 
|   |   |   Remainder: 0.0068 - 96.1078%
|   |   |-> 0.00016 - 2.2617% [10] {min=0.0000, max=0.0000, mean=0.0000, var=0.0000, std dev=0.0000} cudaLaunchKernel 
|   |   |   |-> 0.00006 - 39.7099% [10] {min=0.0000, max=0.0000, mean=0.0000, var=0.0000, std dev=0.0000} GPU: void Kokkos::Impl::cuda_parallel_launch_local_memory<Kokkos::Impl::ParallelFor<go(unsigned long)::{lambda(unsigned long)#1}, Kokkos::RangePolicy<Kokkos::Cuda>, Kokkos::RangePolicy> >(Kokkos::Impl::ParallelFor<go(unsigned long)::{lambda(unsigned long)#1}, Kokkos::RangePolicy<Kokkos::Cuda>, Kokkos::RangePolicy>) 
|   |   |   Remainder: 0.0001 - 1.3636%
|   |   |-> 0.00001 - 0.1766% [1] {min=0.0000, max=0.0000, mean=0.0000, var=0.0000, std dev=0.0000} cudaFuncGetAttributes 
|   |   |-> 0.00001 - 0.1023% [1] {min=0.0000, max=0.0000, mean=0.0000, var=0.0000, std dev=0.0000} cudaFuncSetCacheConfig 
|   |   Remainder: 0.0001 - 0.0212%
|   |-> 0.00464 - 1.0682% [5] {min=0.0000, max=0.0023, mean=0.0009, var=0.0000, std dev=0.0011} cudaMemcpyToSymbol 
|   |   |-> 0.00001 - 0.1426% [5] {min=0.0000, max=0.0000, mean=0.0000, var=0.0000, std dev=0.0000} GPU: Memcpy HtoD 
|   |   Remainder: 0.0046 - 1.0666%
|   |-> 0.00160 - 0.3675% [4] {min=0.0004, max=0.0004, mean=0.0004, var=0.0000, std dev=0.0000} cudaGetDeviceProperties 
|   |-> 0.00115 - 0.2652% [37] {min=0.0000, max=0.0003, mean=0.0000, var=0.0000, std dev=0.0001} cudaMalloc 
|   |-> 0.00079 - 0.1819% [31] {min=0.0000, max=0.0002, mean=0.0000, var=0.0000, std dev=0.0000} cudaFree 
|   |-> 0.00041 - 0.0943% [2] {min=0.0000, max=0.0004, mean=0.0002, var=0.0000, std dev=0.0002} cudaHostAlloc 
|   |-> 0.00035 - 0.0806% [33] {min=0.0000, max=0.0000, mean=0.0000, var=0.0000, std dev=0.0000} cudaMemcpyAsync 
|   |   |-> 0.00004 - 12.8374% [33] {min=0.0000, max=0.0000, mean=0.0000, var=0.0000, std dev=0.0000} GPU: Memcpy HtoD 
|   |   Remainder: 0.0003 - 0.0702%
|   |-> 0.00023 - 0.0531% [10] {min=0.0000, max=0.0000, mean=0.0000, var=0.0000, std dev=0.0000} Kokkos::parallel_for [Type:Cuda, Device: 0] xpy 
|   |   |-> 0.00014 - 61.3124% [10] {min=0.0000, max=0.0000, mean=0.0000, var=0.0000, std dev=0.0000} cudaLaunchKernel 
|   |   |   |-> 0.00004 - 28.8812% [10] {min=0.0000, max=0.0000, mean=0.0000, var=0.0000, std dev=0.0000} GPU: void Kokkos::Impl::cuda_parallel_launch_local_memory<Kokkos::Impl::ParallelFor<go(unsigned long)::{lambda(unsigned long)#2}, Kokkos::RangePolicy<Kokkos::Cuda>, Kokkos::RangePolicy> >(Kokkos::Impl::ParallelFor<go(unsigned long)::{lambda(unsigned long)#2}, Kokkos::RangePolicy<Kokkos::Cuda>, Kokkos::RangePolicy>) 
|   |   |   Remainder: 0.0001 - 43.6046%
|   |   |-> 0.00001 - 4.0673% [1] {min=0.0000, max=0.0000, mean=0.0000, var=-0.0000, std dev=nan} cudaFuncGetAttributes 
|   |   |-> 0.00001 - 2.2429% [1] {min=0.0000, max=0.0000, mean=0.0000, var=-0.0000, std dev=nan} cudaFuncSetCacheConfig 
|   |   Remainder: 0.0001 - 0.0172%
|   |-> 0.00016 - 0.0367% [4] {min=0.0000, max=0.0001, mean=0.0000, var=0.0000, std dev=0.0000} cudaLaunchKernel 
|   |   |-> 0.00002 - 14.2163% [1] {min=0.0000, max=0.0000, mean=0.0000, var=-0.0000, std dev=nan} GPU: desul::(anonymous namespace)::init_lock_arrays_cuda_kernel() 
|   |   |-> 0.00001 - 3.3486% [1] {min=0.0000, max=0.0000, mean=0.0000, var=-0.0000, std dev=nan} GPU: Kokkos::(anonymous namespace)::init_lock_array_kernel_threadid(int) 
|   |   |-> 0.00000 - 2.8072% [1] {min=0.0000, max=0.0000, mean=0.0000, var=0.0000, std dev=0.0000} GPU: Kokkos::(anonymous namespace)::init_lock_array_kernel_atomic() 
|   |   |-> 0.00000 - 2.3460% [1] {min=0.0000, max=0.0000, mean=0.0000, var=-0.0000, std dev=nan} GPU: Kokkos::Impl::(anonymous namespace)::query_cuda_kernel_arch(int*) 
|   |   Remainder: 0.0001 - 0.0284%
|   |-> 0.00015 - 0.0351% [10] {min=0.0000, max=0.0000, mean=0.0000, var=0.0000, std dev=0.0000} Kokkos::parallel_for [Type:Cuda, Device: 0] Kokkos::View::initialization [a] via memset 
|   |   |-> 0.00010 - 67.3574% [10] {min=0.0000, max=0.0000, mean=0.0000, var=0.0000, std dev=0.0000} cudaMemsetAsync 
|   |   |   |-> 0.00002 - 16.8062% [10] {min=0.0000, max=0.0000, mean=0.0000, var=0.0000, std dev=0.0000} GPU: Memset 
|   |   |   Remainder: 0.0001 - 56.0371%
|   |   Remainder: 0.0000 - 0.0115%
|   |-> 0.00014 - 0.0330% [10] {min=0.0000, max=0.0000, mean=0.0000, var=0.0000, std dev=0.0000} Kokkos::parallel_for [Type:Cuda, Device: 0] Kokkos::View::initialization [b] via memset 
|   |   |-> 0.00010 - 67.2569% [10] {min=0.0000, max=0.0000, mean=0.0000, var=0.0000, std dev=0.0000} cudaMemsetAsync 
|   |   |   |-> 0.00002 - 18.0111% [10] {min=0.0000, max=0.0000, mean=0.0000, var=0.0000, std dev=0.0000} GPU: Memset 
|   |   |   Remainder: 0.0001 - 55.1432%
|   |   Remainder: 0.0000 - 0.0108%
|   |-> 0.00014 - 0.0326% [10] {min=0.0000, max=0.0000, mean=0.0000, var=0.0000, std dev=0.0000} Kokkos::parallel_for [Type:Cuda, Device: 0] Kokkos::View::initialization [c] via memset 
|   |   |-> 0.00010 - 67.8282% [10] {min=0.0000, max=0.0000, mean=0.0000, var=0.0000, std dev=0.0000} cudaMemsetAsync 
|   |   |   |-> 0.00002 - 18.0413% [10] {min=0.0000, max=0.0000, mean=0.0000, var=0.0000, std dev=0.0000} GPU: Memset 
|   |   |   Remainder: 0.0001 - 55.5911%
|   |   Remainder: 0.0000 - 0.0105%
|   |-> 0.00008 - 0.0179% [2] {min=0.0000, max=0.0000, mean=0.0000, var=0.0000, std dev=0.0000} cudaMemcpy 
|   |   |-> 0.00000 - 2.4648% [1] {min=0.0000, max=0.0000, mean=0.0000, var=-0.0000, std dev=nan} GPU: Memcpy DtoH 
|   |   |-> 0.00000 - 2.0951% [1] {min=0.0000, max=0.0000, mean=0.0000, var=-0.0000, std dev=nan} GPU: Memcpy HtoD 
|   |   Remainder: 0.0001 - 0.0171%
|   |-> 0.00003 - 0.0078% [1] {min=0.0000, max=0.0000, mean=0.0000, var=-0.0000, std dev=nan} cudaGetDeviceCount 
|   |-> 0.00002 - 0.0043% [1] {min=0.0000, max=0.0000, mean=0.0000, var=-0.0000, std dev=nan} cudaMallocHost 
|   |-> 0.00001 - 0.0028% [1] {min=0.0000, max=0.0000, mean=0.0000, var=-0.0000, std dev=nan} cudaEventCreate 
|   |-> 0.00001 - 0.0026% [1] {min=0.0000, max=0.0000, mean=0.0000, var=-0.0000, std dev=nan} cudaSetDevice 
|   |-> 0.00001 - 0.0017% [1] {min=0.0000, max=0.0000, mean=0.0000, var=-0.0000, std dev=nan} cudaDeviceSetCacheConfig 
|   Remainder: 0.0764 - 17.5800%

To inclusively capture the time spent in the Kokkos kernel (not just the launch), use the --apex:kokkos_fence option, which will explicitly fence for each Kokkos region:

|-> 0.72432 - 100.0000% [1] {min=0.7243, max=0.7243, mean=0.7243, var=-0.0000, std dev=nan} APEX MAIN 
|   |-> 0.25000 - 34.5144% [65] {min=0.0000, max=0.2402, mean=0.0038, var=0.0009, std dev=0.0296} cudaDeviceSynchronize 
|   |   |-> 0.00953 - 3.8132% [65] {min=0.0000, max=0.0023, mean=0.0001, var=0.0000, std dev=0.0005} GPU: Context Synchronize 
|   |   Remainder: 0.2405 - 33.1983%
|   |-> 0.15381 - 21.2350% [2] {min=0.0000, max=0.1538, mean=0.0769, var=0.0059, std dev=0.0769} cudaMemset 
|   |   |-> 0.00001 - 0.0057% [2] {min=0.0000, max=0.0000, mean=0.0000, var=0.0000, std dev=0.0000} GPU: Memset 
|   |   Remainder: 0.1538 - 21.2338%
|   |-> 0.07558 - 10.4340% [43] {min=0.0000, max=0.0047, mean=0.0018, var=0.0000, std dev=0.0010} cudaStreamSynchronize 
|   |   |-> 0.07533 - 99.6801% [43] {min=0.0000, max=0.0047, mean=0.0018, var=0.0000, std dev=0.0010} GPU: Stream Synchronize 
|   |   Remainder: 0.0002 - 0.0334%
|   |-> 0.02767 - 3.8196% [10] {min=0.0023, max=0.0070, mean=0.0028, var=0.0000, std dev=0.0014} Kokkos::parallel_for [Type:Cuda, Device: 0] initialize 
|   |   |-> 0.02271 - 82.0969% [10] {min=0.0022, max=0.0023, mean=0.0023, var=0.0000, std dev=0.0000} cudaDeviceSynchronize 
|   |   |   |-> 0.02266 - 99.7500% [10] {min=0.0022, max=0.0023, mean=0.0023, var=0.0000, std dev=0.0000} GPU: Context Synchronize 
|   |   |   Remainder: 0.0001 - 0.2053%
|   |   |-> 0.00459 - 16.6082% [3] {min=0.0000, max=0.0023, mean=0.0015, var=0.0000, std dev=0.0011} cudaMemcpyToSymbol 
|   |   |   |-> 0.00000 - 0.0898% [3] {min=0.0000, max=0.0000, mean=0.0000, var=0.0000, std dev=0.0000} GPU: Memcpy HtoD 
|   |   |   Remainder: 0.0046 - 16.5933%
|   |   |-> 0.00018 - 0.6675% [10] {min=0.0000, max=0.0000, mean=0.0000, var=0.0000, std dev=0.0000} cudaLaunchKernel 
|   |   |   |-> 0.00006 - 34.7238% [10] {min=0.0000, max=0.0000, mean=0.0000, var=0.0000, std dev=0.0000} GPU: void Kokkos::Impl::cuda_parallel_launch_local_memory<Kokkos::Impl::ParallelFor<go(unsigned long)::{lambda(unsigned long)#1}, Kokkos::RangePolicy<Kokkos::Cuda>, Kokkos::RangePolicy> >(Kokkos::Impl::ParallelFor<go(unsigned long)::{lambda(unsigned long)#1}, Kokkos::RangePolicy<Kokkos::Cuda>, Kokkos::RangePolicy>) 
|   |   |   Remainder: 0.0001 - 0.4357%
|   |   |-> 0.00001 - 0.0506% [1] {min=0.0000, max=0.0000, mean=0.0000, var=0.0000, std dev=0.0000} cudaFuncGetAttributes 
|   |   |-> 0.00001 - 0.0262% [1] {min=0.0000, max=0.0000, mean=0.0000, var=0.0000, std dev=0.0000} cudaFuncSetCacheConfig 
|   |   Remainder: 0.0002 - 0.0210%
|   |-> 0.02314 - 3.1940% [10] {min=0.0023, max=0.0023, mean=0.0023, var=0.0000, std dev=0.0000} Kokkos::parallel_reduce [Type:Cuda, Device: 0] sum 
|   |   |-> 0.02269 - 98.0673% [10] {min=0.0022, max=0.0023, mean=0.0023, var=0.0000, std dev=0.0000} cudaStreamSynchronize 
|   |   |   |-> 0.02263 - 99.7442% [10] {min=0.0022, max=0.0023, mean=0.0023, var=0.0000, std dev=0.0000} GPU: Stream Synchronize 
|   |   |   Remainder: 0.0001 - 0.2509%
|   |   |-> 0.00016 - 0.7057% [10] {min=0.0000, max=0.0000, mean=0.0000, var=0.0000, std dev=0.0000} cudaLaunchKernel 
|   |   |   |-> 0.00008 - 49.5468% [10] {min=0.0000, max=0.0000, mean=0.0000, var=0.0000, std dev=0.0000} GPU: void Kokkos::Impl::cuda_parallel_launch_local_memory<Kokkos::Impl::ParallelReduce<go(unsigned long)::{lambda(unsigned long, double&)#1}, Kokkos::RangePolicy<Kokkos::Cuda>, Kokkos::InvalidType, Kokkos::RangePolicy> >(Kokkos::Impl::ParallelReduce<go(unsigned long)::{lambda(unsigned long, double&)#1}, Kokkos::RangePolicy<Kokkos::Cuda>, Kokkos::InvalidType, Kokkos::RangePolicy>) 
|   |   |   Remainder: 0.0001 - 0.3561%
|   |   |-> 0.00011 - 0.4690% [10] {min=0.0000, max=0.0000, mean=0.0000, var=0.0000, std dev=0.0000} cudaDeviceSynchronize 
|   |   |   |-> 0.00006 - 54.9500% [10] {min=0.0000, max=0.0000, mean=0.0000, var=0.0000, std dev=0.0000} GPU: Context Synchronize 
|   |   |   Remainder: 0.0000 - 0.2113%
|   |   |-> 0.00001 - 0.0373% [1] {min=0.0000, max=0.0000, mean=0.0000, var=0.0000, std dev=0.0000} cudaFuncGetAttributes 
|   |   |-> 0.00000 - 0.0209% [1] {min=0.0000, max=0.0000, mean=0.0000, var=0.0000, std dev=0.0000} cudaFuncSetCacheConfig 
|   |   Remainder: 0.0002 - 0.0223%
|   |-> 0.02305 - 3.1827% [10] {min=0.0023, max=0.0023, mean=0.0023, var=0.0000, std dev=0.0000} Kokkos::parallel_for [Type:Cuda, Device: 0] Kokkos::View::initialization [b] via memset 
|   |   |-> 0.02287 - 99.1884% [10] {min=0.0023, max=0.0023, mean=0.0023, var=0.0000, std dev=0.0000} cudaDeviceSynchronize 
|   |   |   |-> 0.02282 - 99.7834% [10] {min=0.0022, max=0.0023, mean=0.0023, var=0.0000, std dev=0.0000} GPU: Context Synchronize 
|   |   |   Remainder: 0.0000 - 0.2149%
|   |   |-> 0.00010 - 0.4228% [10] {min=0.0000, max=0.0000, mean=0.0000, var=0.0000, std dev=0.0000} cudaMemsetAsync 
|   |   |   |-> 0.00002 - 17.8274% [10] {min=0.0000, max=0.0000, mean=0.0000, var=0.0000, std dev=0.0000} GPU: Memset 
|   |   |   Remainder: 0.0001 - 0.3474%
|   |   Remainder: 0.0001 - 0.0124%
|   |-> 0.02303 - 3.1799% [10] {min=0.0023, max=0.0023, mean=0.0023, var=0.0000, std dev=0.0000} Kokkos::parallel_for [Type:Cuda, Device: 0] xpy 
|   |   |-> 0.02274 - 98.7192% [10] {min=0.0022, max=0.0023, mean=0.0023, var=0.0000, std dev=0.0000} cudaDeviceSynchronize 
|   |   |   |-> 0.02268 - 99.7610% [10] {min=0.0022, max=0.0023, mean=0.0023, var=0.0000, std dev=0.0000} GPU: Context Synchronize 
|   |   |   Remainder: 0.0001 - 0.2360%
|   |   |-> 0.00016 - 0.7077% [10] {min=0.0000, max=0.0000, mean=0.0000, var=0.0000, std dev=0.0000} cudaLaunchKernel 
|   |   |   |-> 0.00004 - 26.6773% [10] {min=0.0000, max=0.0000, mean=0.0000, var=0.0000, std dev=0.0000} GPU: void Kokkos::Impl::cuda_parallel_launch_local_memory<Kokkos::Impl::ParallelFor<go(unsigned long)::{lambda(unsigned long)#2}, Kokkos::RangePolicy<Kokkos::Cuda>, Kokkos::RangePolicy> >(Kokkos::Impl::ParallelFor<go(unsigned long)::{lambda(unsigned long)#2}, Kokkos::RangePolicy<Kokkos::Cuda>, Kokkos::RangePolicy>) 
|   |   |   Remainder: 0.0001 - 0.5189%
|   |   |-> 0.00001 - 0.0391% [1] {min=0.0000, max=0.0000, mean=0.0000, var=0.0000, std dev=0.0000} cudaFuncGetAttributes 
|   |   |-> 0.00001 - 0.0218% [1] {min=0.0000, max=0.0000, mean=0.0000, var=-0.0000, std dev=nan} cudaFuncSetCacheConfig 
|   |   Remainder: 0.0001 - 0.0163%
|   |-> 0.02303 - 3.1799% [10] {min=0.0023, max=0.0023, mean=0.0023, var=0.0000, std dev=0.0000} Kokkos::parallel_for [Type:Cuda, Device: 0] Kokkos::View::initialization [c] via memset 
|   |   |-> 0.02285 - 99.1970% [10] {min=0.0022, max=0.0023, mean=0.0023, var=0.0000, std dev=0.0000} cudaDeviceSynchronize 
|   |   |   |-> 0.02280 - 99.7757% [10] {min=0.0022, max=0.0023, mean=0.0023, var=0.0000, std dev=0.0000} GPU: Context Synchronize 
|   |   |   Remainder: 0.0001 - 0.2225%
|   |   |-> 0.00010 - 0.4226% [10] {min=0.0000, max=0.0000, mean=0.0000, var=0.0000, std dev=0.0000} cudaMemsetAsync 
|   |   |   |-> 0.00002 - 17.9168% [10] {min=0.0000, max=0.0000, mean=0.0000, var=0.0000, std dev=0.0000} GPU: Memset 
|   |   |   Remainder: 0.0001 - 0.3469%
|   |   Remainder: 0.0001 - 0.0121%
|   |-> 0.02302 - 3.1785% [10] {min=0.0023, max=0.0023, mean=0.0023, var=0.0000, std dev=0.0000} Kokkos::parallel_for [Type:Cuda, Device: 0] Kokkos::View::initialization [a] via memset 
|   |   |-> 0.02282 - 99.1317% [10] {min=0.0023, max=0.0023, mean=0.0023, var=0.0000, std dev=0.0000} cudaDeviceSynchronize 
|   |   |   |-> 0.02277 - 99.7617% [10] {min=0.0023, max=0.0023, mean=0.0023, var=0.0000, std dev=0.0000} GPU: Context Synchronize 
|   |   |   Remainder: 0.0001 - 0.2362%
|   |   |-> 0.00011 - 0.4574% [10] {min=0.0000, max=0.0000, mean=0.0000, var=0.0000, std dev=0.0000} cudaMemsetAsync 
|   |   |   |-> 0.00002 - 16.5306% [10] {min=0.0000, max=0.0000, mean=0.0000, var=0.0000, std dev=0.0000} GPU: Memset 
|   |   |   Remainder: 0.0001 - 0.3818%
|   |   Remainder: 0.0001 - 0.0131%
|   |-> 0.00464 - 0.6406% [5] {min=0.0000, max=0.0023, mean=0.0009, var=0.0000, std dev=0.0011} cudaMemcpyToSymbol 
|   |   |-> 0.00001 - 0.1462% [5] {min=0.0000, max=0.0000, mean=0.0000, var=0.0000, std dev=0.0000} GPU: Memcpy HtoD 
|   |   Remainder: 0.0046 - 0.6396%
|   |-> 0.00175 - 0.2419% [2] {min=0.0000, max=0.0017, mean=0.0009, var=0.0000, std dev=0.0008} cudaMemcpy 
|   |   |-> 0.00000 - 0.1169% [1] {min=0.0000, max=0.0000, mean=0.0000, var=-0.0000, std dev=nan} GPU: Memcpy DtoH 
|   |   |-> 0.00000 - 0.1004% [1] {min=0.0000, max=0.0000, mean=0.0000, var=0.0000, std dev=0.0000} GPU: Memcpy HtoD 
|   |   Remainder: 0.0017 - 0.2414%
|   |-> 0.00159 - 0.2195% [4] {min=0.0004, max=0.0004, mean=0.0004, var=0.0000, std dev=0.0000} cudaGetDeviceProperties 
|   |-> 0.00118 - 0.1633% [37] {min=0.0000, max=0.0003, mean=0.0000, var=0.0000, std dev=0.0001} cudaMalloc 
|   |-> 0.00081 - 0.1116% [31] {min=0.0000, max=0.0002, mean=0.0000, var=0.0000, std dev=0.0000} cudaFree 
|   |-> 0.00042 - 0.0583% [2] {min=0.0000, max=0.0004, mean=0.0002, var=0.0000, std dev=0.0002} cudaHostAlloc 
|   |-> 0.00038 - 0.0521% [33] {min=0.0000, max=0.0000, mean=0.0000, var=0.0000, std dev=0.0000} cudaMemcpyAsync 
|   |   |-> 0.00005 - 12.1538% [33] {min=0.0000, max=0.0000, mean=0.0000, var=0.0000, std dev=0.0000} GPU: Memcpy HtoD 
|   |   Remainder: 0.0003 - 0.0458%
|   |-> 0.00016 - 0.0224% [4] {min=0.0000, max=0.0001, mean=0.0000, var=0.0000, std dev=0.0000} cudaLaunchKernel 
|   |   |-> 0.00003 - 16.9626% [1] {min=0.0000, max=0.0000, mean=0.0000, var=0.0000, std dev=0.0000} GPU: desul::(anonymous namespace)::init_lock_arrays_cuda_kernel() 
|   |   |-> 0.00001 - 3.1128% [1] {min=0.0000, max=0.0000, mean=0.0000, var=0.0000, std dev=0.0000} GPU: Kokkos::(anonymous namespace)::init_lock_array_kernel_threadid(int) 
|   |   |-> 0.00000 - 2.8369% [1] {min=0.0000, max=0.0000, mean=0.0000, var=0.0000, std dev=0.0000} GPU: Kokkos::(anonymous namespace)::init_lock_array_kernel_atomic() 
|   |   |-> 0.00000 - 2.4232% [1] {min=0.0000, max=0.0000, mean=0.0000, var=0.0000, std dev=0.0000} GPU: Kokkos::Impl::(anonymous namespace)::query_cuda_kernel_arch(int*) 
|   |   Remainder: 0.0001 - 0.0167%
|   |-> 0.00003 - 0.0042% [1] {min=0.0000, max=0.0000, mean=0.0000, var=-0.0000, std dev=nan} cudaGetDeviceCount 
|   |-> 0.00002 - 0.0025% [1] {min=0.0000, max=0.0000, mean=0.0000, var=0.0000, std dev=0.0000} cudaMallocHost 
|   |-> 0.00002 - 0.0022% [1] {min=0.0000, max=0.0000, mean=0.0000, var=-0.0000, std dev=nan} cudaSetDevice 
|   |-> 0.00001 - 0.0016% [1] {min=0.0000, max=0.0000, mean=0.0000, var=0.0000, std dev=0.0000} cudaEventCreate 
|   |-> 0.00001 - 0.0010% [1] {min=0.0000, max=0.0000, mean=0.0000, var=-0.0000, std dev=nan} cudaDeviceSetCacheConfig 
|   Remainder: 0.0910 - 12.5608%

HIP Support

Similar to CUDA support, just add the -DAPEX_WITH_HIP=TRUE flag when configuring APEX, and if necessary provide the path to the ROCm install with -DROCM_ROOT=/opt/rocm-4.5.2 (or equivalent).

OpenMP Support

If the OpenMP runtime provides OMPT support, you can enable support with -DAPEX_WITH_OMPT=TRUE. The CMake configuration should detect whether the runtime and compiler provide the necessary support. If not, and the compiler is GCC or LLVM-based (Intel, Clang - but not Apple clang), you can potentially have APEX build a runtime drop-in replacement LLVM OpenMP runtime with the support with the -DAPEX_BUILD_OMPT=TRUE flag. Please note that OpenMP target offload will not be supported in this mode.

Runtime Auto Tuning Support

As of September 2024 the CUDA execution space is the only one supported by internal Kokkos Auto tuning for occupancy. All other execution spaces should support internal tuning for MDRange (tiling factors) and Team (team size, vector length) policies. All back ends are supported by explicit/manual auto tuning. Range policy tuning for all execution spaces is in development. To enable auto tuning support in Kokkos, the -DKokkos_ENABLE_TUNING=ON flag should be used to build Kokkos. To enable tuning with APEX at runtime, run with:

apex_exec --apex:kokkos --apex:kokkos_tuning <kokkos-application> --kokkos-tune-internals

To see the evolution of the counters, add the --apex:gtrace flag and visualize the trace_events.0.json.gz file with the Perfetto trace visualizer. Some of the environment variables that control APEX tuning behavior are:

Variable Values Description
APEX_KOKKOS_VERBOSE 0,1 enable verbose output
APEX_KOKKOS_COUNTERS 0,1 enable collection of counters during tuning
APEX_KOKKOS_TUNING_WINDOW non-zero positive integer how many times to evaluate each configuration (minimum time is chosen as representative)
APEX_KOKKOS_TUNING_POLICY simulated_annealing, genetic_search, exhaustive, random search strategy option, simulated_annealing is default

For a long, in-depth overview of APEX runtime auto tuning of Kokkos kernels, please see Kokkos Runtime Auto Tuning with APEX.