Several new features and frameworks were added; a couple of each are highlighted below:
Feature I: yperf
– a brand-new profiler that automatically guides performance optimizations
(automatically == no human intervention!)
The idea is to hint at practical yet profitable optimizations to mitigate a given bottleneck of the workload at hand. The Advise module auto-analyzes profiles and currently supports the Instruction Fetch BW, Big Code, Cache Memory Bandwidth and Mispredictions bottlenecks, as well as >6 SW optimizations (e.g. loop alignment, loop unrolling, if-conversion and de-virtualization) and one HW tunable to disable HW prefetchers when they are useless.
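For intuition only, an advise step of this kind can be thought of as a lookup from a detected bottleneck to candidate optimizations. The sketch below is hypothetical and not yperf's actual logic; in particular, the pairing of each bottleneck with specific optimizations is an assumption:

```python
# Hypothetical bottleneck -> advice table; the pairings below are
# illustrative assumptions, NOT yperf's real mapping.
ADVICE = {
    "Instruction Fetch BW": ["loop alignment", "loop unrolling"],
    "Big Code": ["de-virtualization"],
    "Mispredictions": ["if-conversion"],
    "Cache Memory Bandwidth": ["disable HW prefetchers (HW tunable)"],
}

def advise(bottleneck: str) -> list:
    """Return candidate optimizations for a detected bottleneck."""
    return ADVICE.get(bottleneck, [])

print(advise("Mispredictions"))  # ['if-conversion']
```

A real advisor would of course rank suggestions by estimated profitability rather than return a static list.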
Feature II: function detection and stats
Added function detection in the LBR profile-step (on top of loops, supported thus far), with summary as well as detailed per-flow stats inside a function, inner/outer functions, and more.
Feature III: Pipeline-view displays issue rates at u-arch/pipeline stages
Fetched (by unit), issued, executed, and retired. This capability not only helps analyze high-IPC sequences in today's wide cores, but also provides more insights from silicon, with less need to go back to simulations.
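For intuition, a per-stage rate is simply an event count normalized by cycles. A minimal sketch, where the counter names and values are illustrative and not actual perf-tools output fields:

```python
def stage_rates(counts: dict, cycles: int) -> dict:
    """Normalize per-stage counts (e.g. u-ops fetched/issued/executed/retired)
    to per-cycle rates over a measurement window."""
    return {stage: round(n / cycles, 2) for stage, n in counts.items()}

# Hypothetical 100-cycle window on a wide core:
print(stage_rates({"fetched": 620, "issued": 580, "executed": 560, "retired": 540}, 100))
# {'fetched': 6.2, 'issued': 5.8, 'executed': 5.6, 'retired': 5.4}
```

Comparing rates across stages hints where the pipeline narrows, e.g. a fetched rate well above the retired rate suggests wasted work upstream.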
Framework I: Expanding into Windows
The LBR profile-step can now work on profiles collected on Windows, expanding perf-tools into Client workloads.
Two steps are needed: collect with VTune-SEP on Windows and convert the output using the gen_brstackinsn.py script, then feed the result to ./do.py process-win. See the Windows support wiki page for details.
Framework II: Java and JIT support
Many more enhancements:
- IPC histogram by efficiency
- Paths and top callchains to precise samples, as well as paths to loop heads
- uiCA tool to calculate a loop's ideal IPC
- study.py new modes: code-l2pf, dsb-bw
- New tools: slow-branch, lbr_filter, LBR_IPC_IPS (reports IPC histogram at selected IPs)

More new frameworks:
- Debian distribution
- Filter on a subset of CPUs
- HW models: GNR, LNL; initial AMD support for the LBR profile-step
- Test support for TMA versions 5.01 and 4.8
- Rollup of the Retire Latency value to .stat files (MTL, LNL, GNR)
- New workloads
- aibenchmark, permute, DCPerf-Django
- kernels for LCP, LTT Mispredicts, mov-op & load-op macro-fusion, MRN, store_fwd_block
- Many fixes, including
- Speed of LBR post-processing
- Workarounds for witnessed perf tool or perf_events bugs (e.g. CPUs_Utilized)
- Client Hybrid fixes for P-core
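To illustrate the kind of IPC histogram the enhancements above refer to, here is a generic sketch, not perf-tools code; the fixed 0.5-wide bucketing is an assumption:

```python
from collections import Counter

def ipc_histogram(samples, bucket=0.5):
    """Bucket per-sample IPC (instructions / cycles) into fixed-width bins.

    samples: iterable of (instructions, cycles) pairs, e.g. from LBR intervals.
    Returns a Counter mapping bucket lower-bound -> sample count.
    """
    hist = Counter()
    for insts, cycles in samples:
        if cycles == 0:
            continue  # skip degenerate samples
        hist[insts / cycles // bucket * bucket] += 1
    return hist

# Three hypothetical samples with IPC of 2.0, 2.5 and 0.8:
print(ipc_histogram([(200, 100), (250, 100), (80, 100)]))
# Counter({2.0: 1, 2.5: 1, 0.5: 1})
```

Restricting the input samples to those at selected IPs gives the per-IP view that a tool like LBR_IPC_IPS reports.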
perf-tools core team: Ahmad Yasin, Amiri Khalil, Jon Strang
contributors: Sinduri Gundu, Andi Kleen