perf-tools version 4.0 release

@aayasin released this 06 Feb 20:37

Several new features and frameworks were added in this release. A few of each are highlighted here:

  • Feature I: yperf – a brand new profiler that automatically guides performance optimizations
    automatically == no human intervention!
    The idea is to hint at practical yet profitable optimizations that mitigate a given bottleneck of the workload at hand. The Advise module auto-analyzes profiles and currently supports the Instructions Fetch BW, Big Code, Cache Memory Bandwidth and Mispredictions bottlenecks, as well as >6 SW optimizations (e.g. loop alignment, loop unrolling, if-conversion and de-virtualization) and 1 HW tunable that disables HW prefetchers when they are useless.

  • Feature II: detection and stats for functions
    The LBR profile-step now detects functions (in addition to loops, which it already handled), with summary stats as well as detailed per-flow stats inside a function, inner/outer functions and more.

  • Feature III: Pipeline-view displays issue rates at u-arch/pipeline stages
    The stages covered are fetch (by unit), issue, execute, and retire. This capability not only helps analysis of high-IPC sequences in today’s wide cores, but also provides more insight from silicon, with less need to go back to simulations.

  • Framework I: Expanding into Windows
    The LBR profile-step can now work on profiles collected from Windows. This expands perf-tools into Client workloads.
    Two steps are needed: collect with VTune-SEP on Windows, then convert the output using the gen_brstackinsn.py script. Feed the result to ./do.py process-win. See the Windows support wiki page for details.

  • Framework II: Java and JIT support

  • Many more enhancements

    • IPC histogram by efficiency
    • Paths and top callchains to precise samples, as well as paths to loop heads
    • uiCA tool to calculate loop ideal IPC
    • study.py new modes: code-l2pf, dsb-bw
    • new tools: slow-branch, lbr_filter, LBR_IPC_IPS (reports IPC histogram at selected IPs)
    • More new frameworks:
      • Debian distribution
      • Filter on subset of CPUs
      • HW models: GNR, LNL, AMD initial support for LBR profile-step
      • Test support for TMA versions 5.01 and 4.8
      • Rollup Retire Latency value to .stat files (MTL, LNL, GNR)
    • New workloads
      • aibenchmqark, permute, DCPerf-Django
      • kernels for LCP, LTT Mispredicts, mov-op & load-op macro-fusion, MRN, store_fwd_block
    • Many fixes including
      • Speed of LBR post-processing
      • Workarounds for observed perf tool or perf_events bugs (e.g. CPUs_Utilized)
      • Client Hybrid fixes for P-core
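
For readers unfamiliar with the IPC histograms mentioned in the enhancements above, here is a minimal Python sketch of the general idea. It is not the perf-tools implementation. The function name `ipc_histogram`, the `(instructions, cycles)` input shape and the 0.5 bin width are all assumptions made for illustration. It buckets per-sample IPC values (for example, one sample per loop iteration recovered from consecutive LBR entries) and counts how many samples land in each bin.

```python
from collections import Counter


def ipc_histogram(samples, bin_width=0.5):
    """Bucket per-sample IPC values into bins of `bin_width`.

    `samples` is an iterable of (instructions, cycles) pairs.
    This input shape is hypothetical, chosen for illustration only.
    Returns a dict mapping the lower edge of each bin to its sample count.
    """
    hist = Counter()
    for instructions, cycles in samples:
        if cycles <= 0:
            continue  # skip samples without a usable cycle count
        ipc = instructions / cycles
        lower_edge = int(ipc / bin_width) * bin_width
        hist[lower_edge] += 1
    return dict(sorted(hist.items()))


# Made-up samples: 8 instructions per iteration, varying cycle counts.
samples = [(8, 4), (8, 4), (8, 3), (8, 8), (8, 0)]
for lower_edge, count in ipc_histogram(samples).items():
    upper_edge = lower_edge + 0.5
    print(f"[{lower_edge:.1f}, {upper_edge:.1f}): {'#' * count} ({count})")
```

A wide spread across bins suggests that the same code runs at very different speeds from one sample to the next, which is the kind of pattern such a histogram is meant to expose.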

perf-tools core team: Ahmad Yasin, Amiri Khalil, Jon Strang
contributors: Sinduri Gundu, Andi Kleen