Several new features and frameworks were added; a couple of each are highlighted below:
Feature I: yperf
– a brand-new profiler that automatically guides performance optimizations
(automatically == no human intervention!)
The idea is to hint at practical yet profitable optimizations to mitigate a given bottleneck of the workload at hand. The Advise module auto-analyzes profiles and currently supports the Instruction Fetch BW, Big Code, Cache Memory Bandwidth and Mispredictions bottlenecks, as well as >6 SW optimizations (e.g. loop alignment, loop unrolling, if-conversion and de-virtualization) and one HW tunable to disable HW prefetchers when they are useless.
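For intuition only, an advise step of this kind can be thought of as a lookup from a detected bottleneck to candidate optimizations. The sketch below is hypothetical and not yperf's actual logic; in particular, the pairing of each bottleneck with specific optimizations is an assumption:

```python
# Hypothetical bottleneck -> advice table; the pairings below are
# illustrative assumptions, NOT yperf's real mapping.
ADVICE = {
    "Instruction Fetch BW": ["loop alignment", "loop unrolling"],
    "Big Code": ["de-virtualization"],
    "Mispredictions": ["if-conversion"],
    "Cache Memory Bandwidth": ["disable HW prefetchers (HW tunable)"],
}

def advise(bottleneck: str) -> list:
    """Return candidate optimizations for a detected bottleneck."""
    return ADVICE.get(bottleneck, [])

print(advise("Mispredictions"))  # ['if-conversion']
```

A real advisor would of course rank suggestions by estimated profitability rather than return a static list.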
Feature II: function detection and stats
Added function detection in the LBR profile-step (on top of loops, supported thus far), with summary as well as detailed per-flow stats inside a function, inner/outer functions, and more.
Feature III: Pipeline-view displays issue rates at u-arch/pipeline stages
Fetched (by unit), issued, executed, and retired. This capability not only helps analyze high-IPC sequences in today's wide cores, but also provides more insights from silicon, with less need to go back to simulations.
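For intuition, a per-stage rate is simply an event count normalized by cycles. A minimal sketch, where the counter names and values are illustrative and not actual perf-tools output fields:

```python
def stage_rates(counts: dict, cycles: int) -> dict:
    """Normalize per-stage counts (e.g. u-ops fetched/issued/executed/retired)
    to per-cycle rates over a measurement window."""
    return {stage: round(n / cycles, 2) for stage, n in counts.items()}

# Hypothetical 100-cycle window on a wide core:
print(stage_rates({"fetched": 620, "issued": 580, "executed": 560, "retired": 540}, 100))
# {'fetched': 6.2, 'issued': 5.8, 'executed': 5.6, 'retired': 5.4}
```

Comparing rates across stages hints where the pipeline narrows, e.g. a fetched rate well above the retired rate suggests wasted work upstream.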
Framework I: Expanding into Windows
The LBR profile-step can now work on profiles collected on Windows, expanding perf-tools into Client workloads.
Two steps are needed: collect with VTune-SEP on Windows and convert the output using the gen_brstackinsn.py script, then feed the result to ./do.py process-win. See the Windows support wiki page for details.
Framework II: Java and JIT support
Many more enhancements:
- IPC histogram by efficiency
- Paths and top callchains to precise samples, as well as paths to loop heads
- uiCA tool to calculate a loop's ideal IPC
- study.py new modes: code-l2pf, dsb-bw
- New tools: slow-branch, lbr_filter, LBR_IPC_IPS (reports IPC histogram at selected IPs)

More new frameworks:
- Debian distribution
- Filter on a subset of CPUs
- HW models: GNR, LNL; initial AMD support for the LBR profile-step
- Test support for TMA versions 5.01 and 4.8
- Rollup of the Retire Latency value to .stat files (MTL, LNL, GNR)
- New workloads
- aibenchmark, permute, DCPerf-Django
- kernels for LCP, LTT Mispredicts, mov-op & load-op macro-fusion, MRN, store_fwd_block
- Many fixes, including
- Speed of LBR post-processing
- Workarounds for witnessed perf tool or perf_events bugs (e.g. CPUs_Utilized)
- Client Hybrid fixes for P-core
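To illustrate the kind of IPC histogram the enhancements above refer to, here is a generic sketch, not perf-tools code; the fixed 0.5-wide bucketing is an assumption:

```python
from collections import Counter

def ipc_histogram(samples, bucket=0.5):
    """Bucket per-sample IPC (instructions / cycles) into fixed-width bins.

    samples: iterable of (instructions, cycles) pairs, e.g. from LBR intervals.
    Returns a Counter mapping bucket lower-bound -> sample count.
    """
    hist = Counter()
    for insts, cycles in samples:
        if cycles == 0:
            continue  # skip degenerate samples
        hist[insts / cycles // bucket * bucket] += 1
    return hist

# Three hypothetical samples with IPC of 2.0, 2.5 and 0.8:
print(ipc_histogram([(200, 100), (250, 100), (80, 100)]))
# Counter({2.0: 1, 2.5: 1, 0.5: 1})
```

Restricting the input samples to those at selected IPs gives the per-IP view that a tool like LBR_IPC_IPS reports.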
perf-tools core team: Ahmad Yasin, Amiri Khalil, Jon Strang
contributors: Sinduri Gundu, Andi Kleen