
VTune & nvprof Profiling

CSCI 596: Scientific Computing and Visualization

Prerequisites - Installing the profiler GUIs

Download & install the latest version of the Intel oneAPI VTune Profiler GUI from this link.

Upon installation, launch the GUI from the installation directory for your OS.

  • Windows: [Program Files]\Intel\oneAPI\vtune\<version>
  • Linux: /opt/intel/oneapi/vtune/<version>
  • macOS: /opt/intel/oneapi/vtune_profiler/<version>
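On Linux you can also start the GUI from a terminal once the environment is set up; a minimal sketch, assuming the default oneAPI install location:

source /opt/intel/oneapi/setvars.sh   # set up the oneAPI environment (default install path assumed)
vtune-gui &                           # launch the VTune Profiler GUI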

Download and install the latest version of Intel Advisor here.

Upon installation, launch the GUI from the installation directory for your OS.

  • Windows: [Program Files]\Intel\oneAPI\advisor\<version>
  • Linux: /opt/intel/oneapi/advisor/<version>
  • macOS: /opt/intel/oneapi/advisor/<version>
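As with VTune, the Advisor GUI can be started from a Linux terminal; a sketch, assuming a recent oneAPI release (the binary name advisor-gui is an assumption here; older Advisor releases named it advixe-gui):

source /opt/intel/oneapi/setvars.sh   # set up the oneAPI environment
advisor-gui &                         # launch the Advisor GUI (advixe-gui on older releases)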

Download the NVIDIA Visual Profiler from here. You may be asked to create an NVIDIA account before you can download.

CPU profiling with VTune

The purpose of profiling is to gain insight into the performance of a program while it runs on a given architecture. We will first try to understand how to carry out profiling on CPUs and then move on to GPUs.

Today's steps should give you an idea of how to carry out algorithm (hotspot), microarchitecture (memory access), and parallelism (threading) analyses for different architectures.

Intel® VTune™ is one of the many profiling tools suitable for analyzing multithreaded applications on CPUs (well... not just CPUs, but for the scope of this discussion we will limit VTune's application to CPU execution).

Some examples of other available profilers

How does a profiler work? It makes use of the hardware performance counters built into the architecture. To check whether hardware event-based sampling is enabled on your allocated compute node, run

$ cat /proc/sys/kernel/perf_event_paranoid

It should give a value of 0.
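If the value is higher, event-based sampling may be restricted. On a machine where you have root access (usually not the case on a shared cluster, where you would ask the administrators instead), it can be lowered; a sketch:

sudo sysctl -w kernel.perf_event_paranoid=0   # allow hardware event-based sampling (requires root)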

The directory cpu_profiling has the source code for this section of the tutorial.

We start with the hotspot analysis on Discovery. Within the hotspot_analysis directory we provide the code for serial (single-threaded) and parallel (multi-threaded) calculation of pi, along with the Makefile containing rules to build the binaries. You are already familiar with these codes from previous assignments.

For VTune analysis, applications must be compiled with the Intel® compiler. We can invoke the suitable compiler by loading the necessary module, intel-oneapi, on the Discovery cluster with the following commands.

salloc --nodes=1 --ntasks=1 --cpus-per-task=2 --partition=debug
module purge
module load intel-oneapi/2021.3
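To confirm the compiler is on your path after loading the module, a quick check (assuming the module exposes the classic Intel compiler as icc):

which icc   # should resolve to a path inside the intel-oneapi installation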

We can now build the binaries using the following make commands.

make singlethreaded_pi_calc
make multithreaded_pi_calc

You should have two executables in your working directory. Set the environment variable to limit the number of OpenMP threads:

export OMP_NUM_THREADS=2

Try executing the binaries and check that you get the value of pi:

$ ./singlethreaded_pi_calc
PI = 3.141593
$ ./multithreaded_pi_calc
PI = 3.141593

Now we capture some profile reports with the following commands:

vtune -collect hotspots -result-dir rSingleThread ./singlethreaded_pi_calc
vtune -collect hotspots -result-dir rMultiThread ./multithreaded_pi_calc
vtune -collect memory-consumption -result-dir rMultiMemory ./multithreaded_pi_calc
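The introduction also mentioned parallelism (threading) analysis; a sketch of the corresponding collection, using VTune's standard threading analysis type on the same binary:

vtune -collect threading -result-dir rMultiThreadingAnalysis ./multithreaded_pi_calc   # collect thread activity and synchronization data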

This will result in the creation of three reports named rSingleThread, rMultiThread, and rMultiMemory. Import the files to your local machine to view the results.

To quickly view the results from the command line, try:

vtune -report summary -result-dir rSingleThread/
vtune -report summary -result-dir rMultiThread/
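Beyond the summary, the CLI can also list the top functions directly; a sketch using VTune's hotspots report type:

vtune -report hotspots -result-dir rMultiThread/   # list the functions consuming the most CPU time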

We prefer using the GUI for analysis since it is feature-rich and provides a top-down tree view during analysis. Launch the GUI as described in the prerequisites section.

To load the profile report, click on the three lines displayed on the left bar and select Open > Result > <your report file>. Your report file will end with a .vtune extension.

Inference

  • omp_2_thread_summary: We see that a total of 2 threads are created during execution.
  • omp_2_thread_activity: We can analyze the activity of the threads that are forked.
  • omp_2_function_memory_allocation: We also see the memory allocations and deallocations happening across the call stack.

Roofline analysis with Advisor

For this section, move to working on a DevCloud compute node. We will refer to the sample code provided by Intel, made available to you under the roofline_analysis directory.

We request compute resources with the following command:

qsub -I -l nodes=1:xeon:ppn=2 -d .

Compile the project with make, then generate a roofline report with the following command:

advisor --collect=roofline --project-dir=eRooflineSample -- ./release/roofline_demo
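If you only need the chart, Advisor can also export a standalone interactive HTML roofline that opens in a browser; a sketch, assuming the project directory created above:

advisor --report=roofline --project-dir=eRooflineSample --report-output=roofline.html   # export the roofline chart as HTML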

Import the report to your local machine and view it with the Advisor GUI. Steps to download and open Advisor are given in the prerequisites section.

Quirks

To see the architecture-specific features available on your compute node, use the lscpu command. We use the -march=core-avx2 option when compiling on Discovery's compute nodes since the compute nodes in the debug queue support Advanced Vector Extensions (AVX). To check precisely for AVX compatibility, try lscpu | grep avx on your allocated compute node.
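For illustration, a hedged compile line with the AVX2 flag (the source file name pi_omp.c is hypothetical; substitute the file the Makefile actually builds):

icc -qopenmp -O2 -march=core-avx2 -o multithreaded_pi_calc pi_omp.c   # pi_omp.c is a placeholder name; -march=core-avx2 targets AVX2-capable nodes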

Inference

  • demo_roofline: We see our program sits in a region which signals it is approaching the bandwidth and compute bounds of the architecture.

GPU hotspot analysis with the NVIDIA Visual Profiler and the nvprof tool

nvprof is the command-line tool used to profile applications on NVIDIA GPU-accelerated architectures. The Visual Profiler is the GUI used to interactively analyze the collected data.
We will be going back to the HPC cluster for this part of the tutorial. The code for this part of the tutorial is available in the gpu_profiling directory.

Start by requesting a GPU compute node with the following command.

salloc --partition=debug --gres=gpu:p100:1 --time=00:30:00

Load the NVIDIA HPC SDK and compile:

module load pgi-nvhpc/20.7
nvcc -o pi pi.cu
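Optionally, you can target the requested P100 explicitly when compiling; a sketch (the P100 is compute capability 6.0, i.e. sm_60):

nvcc -arch=sm_60 -o pi pi.cu   # sm_60 matches the Pascal P100 GPU allocated above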

Collect the reports with nvprof:

nvprof --metrics achieved_occupancy,ipc -o occupancy.prof ./pi
nvprof -o timeline.prof ./pi
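For a quick look without writing an output file, nvprof can print a summary straight to the terminal, and it can list the metrics your GPU supports:

nvprof ./pi              # print a summary of kernel and API call times
nvprof --query-metrics   # list the metrics available on the attached GPU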

Open up NVVP (the NVIDIA Visual Profiler) on your local system and analyze the results.

You'll notice something like the plots below. See how the malloc and memcpy calls take up the highest bandwidth?

nvprof_timeline

Acknowledgements

A huge thanks to Prof. Aiichiro Nakano for suggesting I present performance profiling tools to the CSCI 596 class of Fall 21. I am also indebted to Dr. Marco Olguin, Computational Scientist at USC CARC for all the information and support in making the profiling and roofline analysis possible on Discovery's nodes.
