Sub pages


Instruction-mixes

Starting with lbr.py version 1.04, perf-tools supports collecting instruction-mix (i-mix) stats for user-defined instructions by passing the LBR_IMIX environment variable to a do.py run.

For example, running:
$ LBR_IMIX='jz jnz and' ./do.py profile -pm 100 -a './CLTRAMP3D 12'
will dump all stats into a *info.log file, including those for the user-defined instructions.

The *info.log file from the run above will include the user-defined instructions in its Global stats section:

Global stats:  
perf-tools' lbr.py module version 1.04  
LBR samples: {total_cycles: 563681, IPs: {}, bad: 0, bogus: 111, total: 3754, events: {r20c4:ppp: 3753}, size: {max: 1348, avg: 257.0, min: 114}}  

estimate of          non-cold code footprint [KB]:     919.06  
count of                   non-cold code 4K-pages:       1571  
proxy count of                     non-cold loops:        329   :(see hot loops below)  
count of      backward taken conditional branches:      15000   :   1.60% of ALL  
count of       forward taken conditional branches:      37745   :   4.02% of ALL  
count of                    ST-STACK instructions:      23071   :   2.46% of ALL  
count of                   CISC-TEST instructions:      32953   :   3.51% of ALL  
count of                        CALL instructions:      19779   :   2.11% of ALL  
count of                         RET instructions:      20047   :   2.14% of ALL  
count of                        PUSH instructions:      64742   :   6.90% of ALL  
count of                         POP instructions:      60693   :   6.47% of ALL  
count of                  VZEROUPPER instructions:        677   :   0.07% of ALL  
count of                          JZ instructions:      67621   :   7.21% of ALL  
count of                         JNZ instructions:      33799   :   3.60% of ALL  
count of                         AND instructions:      22232   :   2.37% of ALL  
count of                         LOAD insts-class:     164119   :  17.49% of ALL  
count of                        STORE insts-class:      58415   :   6.22% of ALL  
count of                         LOCK insts-class:         67   :   0.01% of ALL  
count of                     PREFETCH insts-class:          0   :   0.00% of ALL  
count of              VEC128-INT comp insts-class:       1287   :   0.14% of ALL  
count of              VEC256-INT comp insts-class:        300   :   0.03% of ALL  
count of              VEC512-INT comp insts-class:          0   :   0.00% of ALL  
count of                VECX-INT comp insts-class:         69   :   0.01% of ALL  
count of                         ALL instructions:     938470   : 100.00% of ALL  
count of      indirect (call/jump) of >2GB offset:          0  
count of     mispredicted indirect of >2GB offset:          0  
#Global-stats-end  
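
If you want to post-process these counts programmatically, the short Python sketch below (not part of perf-tools; the parse_imix_counts helper and the file path argument are made up for illustration) extracts the 'count of ... instructions' lines shown above from a *info.log file:

import re, sys

# Sketch: pull "count of <NAME> instructions" counters, e.g. the JZ/JNZ/AND ones
# requested via LBR_IMIX, out of the Global stats section of a *info.log file.
def parse_imix_counts(info_log_path):
    counts = {}
    pattern = re.compile(r"count of\s+(.+?) instructions:\s+(\d+)\s+:\s+([\d.]+)% of ALL")
    with open(info_log_path) as f:
        for line in f:
            m = pattern.search(line)
            if m:
                counts[m.group(1)] = (int(m.group(2)), float(m.group(3)))
    return counts

if __name__ == '__main__':
    for name, (count, pct) in parse_imix_counts(sys.argv[1]).items():
        print(f"{name:12s} {count:>10d} {pct:6.2f}%")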

JIT-profiling

Just-In-Time (JIT) profiling support is integrated into the master branch.

  • So far it has been tested with Java (OpenJDK and HotSpot) and oneDNN.
  • JIT profiling is supported only in system-wide mode.

Follow the tips that do.py prints to the screen on the first run, e.g. a recent-enough perf tool is needed.

Steps to profile a JIT workload

  1. [Java-only] Get the flags to include in your Java launcher: ./do.py profile --tune :perf-jit:1 -s1 -pm 4

This should quickly return while printing something like:

INFO: system-wide profiling.
INFO: JIT profiling: if Java; make sure JVM was started with '-XX:+PreserveFramePointer -agentpath:/usr/lib/linux-tools/5.19.0-40-generic/libperf-jvmti.so'.
  2. Start your workload, passing the flags from step 1 to your JVM, and wait until it reaches steady state.
  3. Collect profiles using: do.py profile -s10 -o my-workload --tune :perf-jit:1 :help:0 --perf /path/to/recent/perf --mode profile

This step will collect multiple profiles, each 10 seconds long. Make sure your workload is still running when do.py returns.

  4. Process the profiles using: do.py profile -s10 -o my-workload --tune :perf-jit:1 :help:0 --perf /path/to/recent/perf --mode process
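
If you repeat this flow often, the collect and process steps can be chained from a small wrapper. The sketch below is a hypothetical example, assuming a perf-tools checkout is the current directory, the workload is already in steady state, and a recent-enough perf lives at the placeholder path; it is not part of do.py:

import subprocess

PERF = '/path/to/recent/perf'   # placeholder path to a recent-enough perf tool
COMMON = ['./do.py', 'profile', '-s10', '-o', 'my-workload',
          '--tune', ':perf-jit:1', ':help:0', '--perf', PERF]

# Step 3: collect multiple 10-second profiles of the running JIT workload.
subprocess.run(COMMON + ['--mode', 'profile'], check=True)
# Step 4: process the collected profiles.
subprocess.run(COMMON + ['--mode', 'process'], check=True)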

Windows-support

Windows does not have a tool like Linux's perf, but Intel VTune ships a tool called SEP with functionality similar to perf record and perf script on Linux, although this functionality is limited to brstack output using the LBR.
How this functionality is invoked differs based on the VTune version.

1. Generating LBR brstack script in Windows using SEP

First, download the Intel oneAPI toolkit or VTune. If disk space is not a concern, we recommend installing the Intel oneAPI Base Toolkit, which includes all Intel tools, including the Intel ICX compiler:
https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html

If you have limited disk space, you can install only VTune from here:
https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler-download.html

If the VTune version is 2024.0, use this sep command:

$ sep.exe -start -out app.tb7 -ec BR_INST_RETIRED.NEAR_TAKEN:PRECISE=YES:SA=400009:pdir:lbr:USR=YES -lbr no_filter:usr -perf-script ip,brstack -app app.exe -args "options"

If the oneAPI tool or VTune version is 2024.1, there is a newer option called -atype.
Check whether your SEP supports this option with the command below:
$ sep -atypelist
The output of -atypelist should look like the listing below. If you see hwpgo in the list, you can run the sep command shown after the listing; otherwise, use the sep command above for the 2024.0 version.

$ sep -atypelist
Atype: hotspots
Atype: uarch-exploration
Atype: memory-access
Atype: io
Atype: hwpgo
$ sep -start -out app.tb7 -atype hwpgo -lbr no_filter:usr -perf-script event,ip,brstack -app app.exe -args "options"

The output file from both sep commands is app.perf.data.script.
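
To automate the version check, a sketch like the one below can inspect sep -atypelist and pick the matching command line. It assumes sep(.exe) is on the PATH, and the app.exe name and 'options' string are placeholders taken from the commands above:

import subprocess

def sep_supports_hwpgo():
    # True if 'hwpgo' appears in the 'sep -atypelist' output (the 2024.1+ flow).
    out = subprocess.run(['sep', '-atypelist'], capture_output=True, text=True).stdout
    return 'hwpgo' in out

def build_sep_cmd(app='app.exe', args='options'):   # placeholders
    if sep_supports_hwpgo():
        return ['sep', '-start', '-out', 'app.tb7', '-atype', 'hwpgo',
                '-lbr', 'no_filter:usr', '-perf-script', 'event,ip,brstack',
                '-app', app, '-args', args]
    # Fall back to the 2024.0-style event-based command.
    return ['sep', '-start', '-out', 'app.tb7', '-ec',
            'BR_INST_RETIRED.NEAR_TAKEN:PRECISE=YES:SA=400009:pdir:lbr:USR=YES',
            '-lbr', 'no_filter:usr', '-perf-script', 'ip,brstack',
            '-app', app, '-args', args]

print(' '.join(build_sep_cmd()))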

2. Generating a perf-script-output-like file

First, you'll need to clone this repository <add Yeongseon's repository here>; the main script is gen_brstackinsn.py.
To run gen_brstackinsn.py, you need the app.perf.data.script file from the previous step, app.exe, and its .pdb file. If app.exe calls another DLL that does most of the work, use that DLL file instead of app.exe. Both files should be in the same directory.

$ python gen_brstackinsn.py -i full_path\app.perf.data.script -a full_path\app.exe [or full_path\app.dll]

The output file is app.exe-c4000009.perf.script, a perf-script-output-like file that perf-tools will use.
4000009 is the sample-after value (SAV); this value is fixed at 400009 in sep.exe in VTune 2024.1. If you used VTune 2024.0 and would like to change it, you can use the -sav=value option in the related sep command above.

3. Processing using do.py

Run do.py on the .perf.script file using the process-win command:

$ ./do.py process-win -w app.exe-c4000009.perf.script

This command will run the LBR-profile step, processing the samples from the input file, and generate all the related output files, mainly .info.log.
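
Steps 2 and 3 can also be glued together; the sketch below is a hypothetical helper (the full_path locations are placeholders, and it assumes gen_brstackinsn.py and do.py are reachable from the current directory) that runs gen_brstackinsn.py and then hands its output to do.py process-win:

import subprocess

SCRIPT_IN = r'full_path\app.perf.data.script'   # output of the sep command (step 1)
BINARY    = r'full_path\app.exe'                # or the DLL that does most of the work

# Step 2: generate the perf-script-like file next to app.exe and its .pdb.
subprocess.run(['python', 'gen_brstackinsn.py', '-i', SCRIPT_IN, '-a', BINARY], check=True)
# Step 3: process it with perf-tools.
subprocess.run(['python', 'do.py', 'process-win', '-w', 'app.exe-c4000009.perf.script'],
               check=True)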

Generating Retire Latency DB

Generates a Retire Latency database .json from a study directory of .txt/.stat files.

Overview

  • retire_latency_extractor.py -- The main script to run.
    Extracts the required values of the Retire Latency histograms from .stat files and generates the database .json.
    Required values: Count, Min, Max, Mean, Median, NZ Median (non-zero), Mode, Mode Count, NZ Mode, NZ Mode Count, Buckets (see the computation sketch after this list).
    Aggregation is per event, per number of cores, per workload, and per component (sub-workload).
    The script internally converts .txt files generated by perf-script on a perf collection from a platform whose PMU supports Retire Latency.
  • perf_data_converter.py -- A script that converts .txt file(s) to Retire Latency .stat file(s) containing Retire Latency histograms.
    Can be used on a single file or a directory.
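
For reference, the sketch below illustrates how the listed values could be computed from a single retire-latency histogram (a mapping of latency bucket to sample count). It is an illustration of the definitions only, not the extractor's actual code:

import statistics

def histogram_stats(buckets):   # e.g. {0: 120, 1: 30, 3: 5} maps latency -> sample count
    samples = [lat for lat, n in buckets.items() for _ in range(n)]
    nz = [lat for lat in samples if lat != 0]
    nz_buckets = {lat: n for lat, n in buckets.items() if lat != 0}
    mode = max(buckets, key=buckets.get)
    nz_mode = max(nz_buckets, key=nz_buckets.get) if nz_buckets else None
    return {
        'Count': len(samples),
        'Min': min(samples), 'Max': max(samples),
        'Mean': statistics.mean(samples), 'Median': statistics.median(samples),
        'NZ Median': statistics.median(nz) if nz else None,
        'Mode': mode, 'Mode Count': buckets[mode],
        'NZ Mode': nz_mode, 'NZ Mode Count': nz_buckets.get(nz_mode, 0),
        'Buckets': buckets,
    }

print(histogram_stats({0: 120, 1: 30, 3: 5}))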

Usage

Generating data files

Run the following to log the system setup and collect Retire Latency retire_lat.txt files:

<clone latest and cd to perf-tools>
$ ./do.py setup-all disable-atom disable-smt disable-aslr log --tune :msr:1 :loop-ideal-ipc:1  # once after system boot
$ ./do.py profile --tune :help:0 ":model:'MTL-raw0'" -pm 4000 -a "./<local wrapper>.sh <args>" -o "<use below convention>"

Running on a Retire Latency study directory

Simply run './retire_latency_extractor.py <study_directory>'

  • To process only .stat files without converting .txt files, use '-p'.
  • Check './retire_latency_extractor.py -h' for more options.

Requirements

File naming convention

All study files must follow the filename convention:
{workload}_{component}_n{cores number}[_smt2]_{model}_tpebs-perf.data.retire_lat.{txt/stats}[.gz]
component: sub-workload.
smt2 (optional): the file is the result of an SMT run.
Values in [ ] are optional.
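
A regex sketch of this convention is shown below; the parse_name helper is hypothetical (it assumes workload, component and model contain no underscores) and is not the extractor's actual parser:

import re

# Sketch of parsing the filename convention above.
NAME_RE = re.compile(
    r'^(?P<workload>[^_]+)_(?P<component>[^_]+)_n(?P<cores>\d+)'
    r'(?P<smt2>_smt2)?_(?P<model>[^_]+)_tpebs-perf\.data\.retire_lat\.'
    r'(?P<ext>txt|stats)(?P<gz>\.gz)?$')

def parse_name(filename):
    m = NAME_RE.match(filename)
    if not m:
        raise ValueError(f'{filename} does not follow the naming convention')
    d = m.groupdict()
    return {'workload': d['workload'], 'component': d['component'],
            'cores': int(d['cores']), 'smt': bool(d['smt2']),
            'model': d['model'], 'ext': d['ext'], 'gzipped': bool(d['gz'])}

print(parse_name('myapp_phase1_n8_MTL_tpebs-perf.data.retire_lat.txt'))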

Event-list

  • The DB uses an events directory to unify event naming; the default directory is 'perfmon/ADL/events'. Use '--events_dir' to point to a different one.

System info & requirements

  • System info
    setup-system.log -- the DB includes a Platform section with system info from setup-system.log, which is generated using the perf-tools package with 'do.py log --tune :msr:1'. Make sure this file is in the cwd or use '--setup-log' to provide its path.
  • Meteor Lake client or Granite Rapids server or newer machines.
  • perf-tools (do.py) version 2.73 or newer
  • perf tool version 5.14 or newer
  • Kernel version 6.3-rc3.