Starting with lbr.py version 1.04, perf-tools can report i-mix stats for user-defined instructions when the LBR_IMIX environment variable is passed to a do.py run.
For example, running:
$ LBR_IMIX='jz jnz and' ./do.py profile -pm 100 -a './CLTRAMP3D 12'
dumps all stats, including those for the user-defined instructions, to a *info.log file.
The *info.log file from the run above lists the user-defined instructions (JZ, JNZ and AND) in its Global stats section:
Global stats:
perf-tools' lbr.py module version 1.04
LBR samples: {total_cycles: 563681, IPs: {}, bad: 0, bogus: 111, total: 3754, events: {r20c4:ppp: 3753}, size: {max: 1348, avg: 257.0, min: 114}}
estimate of non-cold code footprint [KB]: 919.06
count of non-cold code 4K-pages: 1571
proxy count of non-cold loops: 329 :(see hot loops below)
count of backward taken conditional branches: 15000 : 1.60% of ALL
count of forward taken conditional branches: 37745 : 4.02% of ALL
count of ST-STACK instructions: 23071 : 2.46% of ALL
count of CISC-TEST instructions: 32953 : 3.51% of ALL
count of CALL instructions: 19779 : 2.11% of ALL
count of RET instructions: 20047 : 2.14% of ALL
count of PUSH instructions: 64742 : 6.90% of ALL
count of POP instructions: 60693 : 6.47% of ALL
count of VZEROUPPER instructions: 677 : 0.07% of ALL
count of JZ instructions: 67621 : 7.21% of ALL
count of JNZ instructions: 33799 : 3.60% of ALL
count of AND instructions: 22232 : 2.37% of ALL
count of LOAD insts-class: 164119 : 17.49% of ALL
count of STORE insts-class: 58415 : 6.22% of ALL
count of LOCK insts-class: 67 : 0.01% of ALL
count of PREFETCH insts-class: 0 : 0.00% of ALL
count of VEC128-INT comp insts-class: 1287 : 0.14% of ALL
count of VEC256-INT comp insts-class: 300 : 0.03% of ALL
count of VEC512-INT comp insts-class: 0 : 0.00% of ALL
count of VECX-INT comp insts-class: 69 : 0.01% of ALL
count of ALL instructions: 938470 : 100.00% of ALL
count of indirect (call/jump) of >2GB offset: 0
count of mispredicted indirect of >2GB offset: 0
#Global-stats-end
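To pull the i-mix counts out of a Global stats section programmatically, a minimal sketch follows. It assumes the line layout shown in the excerpt above ("count of NAME instructions: COUNT : PCT% of ALL"); the names IMIX_LINE and parse_imix are illustrative and are not part of perf-tools.

```python
import re

# Matches lines such as "count of JZ instructions: 67621 : 7.21% of ALL"
# and "count of LOAD insts-class: 164119 : 17.49% of ALL".
# The pattern is inferred from the excerpt above, not from lbr.py itself.
IMIX_LINE = re.compile(
    r"^count of (?P<name>.+?) (?:instructions|insts-class):\s*"
    r"(?P<count>\d+)\s*:\s*(?P<pct>[\d.]+)% of ALL\s*$"
)

def parse_imix(lines):
    """Return {name: (count, percent)} for the i-mix lines found in `lines`."""
    stats = {}
    for line in lines:
        m = IMIX_LINE.match(line.strip())
        if m:
            stats[m.group("name")] = (int(m.group("count")), float(m.group("pct")))
    return stats

# A few lines copied from the Global stats excerpt above.
sample = """\
count of JZ instructions: 67621 : 7.21% of ALL
count of JNZ instructions: 33799 : 3.60% of ALL
count of AND instructions: 22232 : 2.37% of ALL
count of LOAD insts-class: 164119 : 17.49% of ALL
count of ALL instructions: 938470 : 100.00% of ALL
""".splitlines()

imix = parse_imix(sample)
for name in ("JZ", "JNZ", "AND"):
    count, pct = imix[name]
    print(f"{name}: {count} ({pct}%)")
```

Lines that do not follow this layout, such as the LBR samples summary, are simply skipped.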
Just-In-Time (JIT) profiling support is integrated into the master branch.
- So far it has been tested with Java (OpenJDK and Hotspot) and oneDNN.
- JIT profiling is supported only in system-wide mode.
Follow the tips that do.py prints to the screen on its first run; for example, a recent-enough perf tool is needed.
- [Java-only] get flags to include in your Java launcher
./do.py profile --tune :perf-jit:1 -s1 -pm 4
This should quickly return while printing something like:
INFO: system-wide profiling.
INFO: JIT profiling: if Java; make sure JVM was started with '-XX:+PreserveFramePointer -agentpath:/usr/lib/linux-tools/5.19.0-40-generic/libperf-jvmti.so'.
- Start your workload, passing the flags from the previous step to your JVM, and wait until it reaches steady state.
- Collect profiles using:
do.py profile -s10 -o my-workload --tune :perf-jit:1 :help:0 --perf /path/to/recent/perf --mode profile
This step collects multiple profiles, each 10 seconds long. Make sure your workload is still running when do.py returns.
- Process using:
do.py profile -s10 -o my-workload --tune :perf-jit:1 :help:0 --perf /path/to/recent/perf --mode process
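The collect and process steps use the same options and differ only in the value of --mode. A minimal sketch that builds both command lines from one place, so the options cannot drift apart, could look like this. The helper name jit_command and its defaults are illustrative; the flags are the ones shown above.

```python
import shlex

def jit_command(mode, output="my-workload", seconds=10, perf="/path/to/recent/perf"):
    """Build the do.py command line for one JIT phase: 'profile' or 'process'."""
    if mode not in ("profile", "process"):
        raise ValueError(f"unknown mode: {mode!r}")
    return [
        "./do.py", "profile", f"-s{seconds}", "-o", output,
        "--tune", ":perf-jit:1", ":help:0",
        "--perf", perf, "--mode", mode,
    ]

# Print both command lines; pass each list to subprocess.run() to execute them.
for mode in ("profile", "process"):
    print(shlex.join(jit_command(mode)))
```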
Windows has no equivalent of the Linux perf tool, but Intel VTune ships a tool called SEP that offers functionality similar to perf record and perf script, though it is limited to brstack collection using LBR.
The exact sep command depends on the VTune version.
First, download the Intel oneAPI toolkit or VTune. If disk space is not a concern, we recommend installing the Intel oneAPI Base Toolkit, which includes the Intel tools, among them the Intel ICX compiler:
https://www.intel.com/content/www/us/en/developer/tools/oneapi/base-toolkit-download.html
If you have limited disk space, you can install only VTune from here:
https://www.intel.com/content/www/us/en/developer/tools/oneapi/vtune-profiler-download.html
If the VTune version is 2024.0, use this sep command:
$ sep.exe -start -out app.tb7 -ec BR_INST_RETIRED.NEAR_TAKEN:PRECISE=YES:SA=400009:pdir:lbr:USR=YES -lbr no_filter:usr -perf-script ip,brstack -app app.exe -args "options"
If the oneAPI toolkit or VTune version is 2024.1, a newer option called -atype is available.
Check whether your SEP supports this option with the command below:
$ sep -atypelist
The output of -atypelist should look like the listing below. If hwpgo appears in the list, you can run the -atype command that follows it; otherwise, use the sep command shown above for version 2024.0.
$ sep -atypelist
Atype: hotspots
Atype: uarch-exploration
Atype: memory-access
Atype: io
Atype: hwpgo
$ sep -start -out app.tb7 -atype hwpgo -lbr no_filter:usr -perf-script event,ip,brstack -app app.exe -args "options"
The output file from both sep commands is app.perf.data.script.
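The choice between the two sep commands can be sketched as a small helper that inspects the -atypelist output. This is only an illustration: the function names are hypothetical, the "Atype: name" layout is taken from the sample listing above, and the command lines are the two shown earlier.

```python
def supports_hwpgo(atypelist_output):
    """Return True if 'hwpgo' is listed as an analysis type in `sep -atypelist` output."""
    for line in atypelist_output.splitlines():
        key, _, value = line.partition(":")
        if key.strip() == "Atype" and value.strip() == "hwpgo":
            return True
    return False

def sep_command(atypelist_output, app="app.exe", args="options", out="app.tb7"):
    """Pick the sep command line: -atype hwpgo if available, else the 2024.0 form."""
    if supports_hwpgo(atypelist_output):
        return ["sep", "-start", "-out", out, "-atype", "hwpgo",
                "-lbr", "no_filter:usr", "-perf-script", "event,ip,brstack",
                "-app", app, "-args", args]
    return ["sep.exe", "-start", "-out", out,
            "-ec", "BR_INST_RETIRED.NEAR_TAKEN:PRECISE=YES:SA=400009:pdir:lbr:USR=YES",
            "-lbr", "no_filter:usr", "-perf-script", "ip,brstack",
            "-app", app, "-args", args]

# Sample listing copied from the -atypelist output above.
listing = """\
Atype: hotspots
Atype: uarch-exploration
Atype: memory-access
Atype: io
Atype: hwpgo
"""
print(" ".join(sep_command(listing)))
```

In practice the listing would come from running `sep -atypelist` (for example through subprocess.run with capture_output=True) rather than from a string literal.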
First, clone this repository <add Yeongseon's repository here>; its main script is gen_brstackinsn.py.
To run gen_brstackinsn.py, you need the app.perf.data.script file from the previous step, app.exe, and its pdb file. If app.exe calls a dll file that does most of the work, use the dll file instead of app.exe. Both files used should be in the same directory.
$ python gen_brstackinsn.py -i full_path\app.perf.data.script -a full_path\app.exe [or full_path\app.dll]
The output file is app.exe-c4000009.perf.script, a perf-script-like output file that perf-tools will use.
4000009 is the sample-after value (SAV). This value is fixed at 400009 in sep.exe in VTune 2024.1; if you use VTune 2024.0 and would like to change it, add the option -sav=value to the corresponding sep command above.
Run do.py on the .perf.script file using the process-win command:
$ ./do.py process-win -w app.exe-c4000009.perf.script
This command runs the LBR-profile step, which processes the samples from the input file and generates all the related output files, mainly .info.log.
Generating the Retire Latency Database .json from the .txt/.stat files in a study directory.
- retire_latency_extractor.py -- The main script to run.
Extracts the required values of Retire Latency histograms from .stat files and generates the database .json.
Required values: Count, Min, Max, Mean, Median, NZ Median (non-zero), Mode, Mode Count, NZ Mode, NZ Mode Count, Buckets.
Aggregation is per event, per number of cores, per workload, and per component (sub-workload).
The script internally converts .txt files generated by perf script on a perf collection from a platform whose PMU supports Retire Latency.
- perf_data_converter.py -- A script that converts .txt file(s) to Retire Latency .stat file(s) with Retire Latency histograms.
Used on a single file or a directory.
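To make the required values concrete, here is a minimal sketch that computes them from a list of retire-latency samples. It is an illustration only: it assumes "NZ" means the statistic is taken over non-zero samples and that Buckets is a value-to-count histogram. The actual .stat layout and the definitions used inside retire_latency_extractor.py are not shown here and may differ.

```python
from collections import Counter
from statistics import mean, median

def summarize(latencies):
    """Summarize a non-empty list of integer retire-latency samples."""
    buckets = Counter(latencies)                 # value -> number of occurrences
    nonzero = [v for v in latencies if v != 0]   # samples used for the NZ statistics
    mode, mode_count = buckets.most_common(1)[0]
    if nonzero:
        nz_mode, nz_mode_count = Counter(nonzero).most_common(1)[0]
        nz_median = median(nonzero)
    else:
        nz_mode, nz_mode_count, nz_median = None, 0, None
    return {
        "Count": len(latencies),
        "Min": min(latencies),
        "Max": max(latencies),
        "Mean": mean(latencies),
        "Median": median(latencies),
        "NZ Median": nz_median,
        "Mode": mode,
        "Mode Count": mode_count,
        "NZ Mode": nz_mode,
        "NZ Mode Count": nz_mode_count,
        "Buckets": dict(sorted(buckets.items())),
    }

# Hypothetical samples, chosen only to exercise the zero and non-zero paths.
summary = summarize([0, 0, 0, 2, 2, 5, 9])
for key, value in summary.items():
    print(f"{key}: {value}")
```

When several values tie for the mode, Counter.most_common returns the one seen first; the real script may break ties differently.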
Run the following to log system setup and collect Retire Latency retire_lat.txt files
<clone latest and cd to perf-tools>
$ ./do.py setup-all disable-atom disable-smt disable-aslr log --tune :msr:1 :loop-ideal-ipc:1 # once after system boot
$ ./do.py profile --tune :help:0 ":model:'MTL-raw0'" -pm 4000 -a "./<local wrapper>.sh <args>" -o "<use below convention>"
Then simply run './retire_latency_extractor.py <study_directory>'
- To process only .stat files without converting .txt files, use '-p'
- Check './retire_latency_extractor.py -h' for more options
All study files must follow this filename convention:
{workload}_{component}_n{cores number}[_smt2]_{model}_tpebs-perf.data.retire_lat.{txt/stats}[.gz]
component: sub-workload.
smt2 (optional): the file is the result of an SMT run.
Values in [ ] are optional.
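A quick way to check study filenames against this convention is a regular expression. The sketch below is an assumption-laden illustration, not the extractor's own parser: it assumes that workload, component and model contain no underscores, and it accepts 'txt', 'stat' or 'stats' as the extension. The example filename is hypothetical.

```python
import re

# Pattern for {workload}_{component}_n{cores}[_smt2]_{model}_tpebs-perf.data.retire_lat.{ext}[.gz]
# Assumption: workload, component and model contain no '_' characters.
STUDY_FILE = re.compile(
    r"^(?P<workload>[^_]+)_(?P<component>[^_]+)_n(?P<cores>\d+)"
    r"(?P<smt2>_smt2)?_(?P<model>[^_]+)_tpebs-perf\.data\.retire_lat\."
    r"(?P<ext>txt|stats?)(?P<gz>\.gz)?$"
)

def parse_study_filename(name):
    """Return the fields of a study filename, or None if it breaks the convention."""
    m = STUDY_FILE.match(name)
    if m is None:
        return None
    return {
        "workload": m.group("workload"),
        "component": m.group("component"),
        "cores": int(m.group("cores")),
        "smt2": m.group("smt2") is not None,
        "model": m.group("model"),
        "ext": m.group("ext"),
        "gzipped": m.group("gz") is not None,
    }

# Hypothetical filename built from the convention above.
example = "myapp_phase1_n4_smt2_MTL-raw0_tpebs-perf.data.retire_lat.txt.gz"
print(parse_study_filename(example))
```

Files for which the function returns None can be renamed before running the extractor on the study directory.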
- The DB uses an events directory to unify event naming; the default directory is 'perfmon/ADL/events'. Use '--events_dir' to point to a different one.
- System info
setup-system.log -- The DB includes a Platform section with system info from setup-system.log, which is generated by the perf-tools package with 'do.py log --tune :msr:1'. Make sure this file is in the cwd, or use '--setup-log' to provide its path.
- Meteor Lake client or Granite Rapids server or newer machines.
- perf-tools (do.py) version 2.73 or newer
- perf tool version 5.14 or newer
- Kernel version 6.3-rc3.
For hyperlinked files, right-click the link and save the file with the name as it appears in the hyperlink.