Skip to content

Latest commit

 

History

History
120 lines (108 loc) · 7.82 KB

README.md

File metadata and controls

120 lines (108 loc) · 7.82 KB

perf-tools

perf-tools is an open-source package that profiles workloads, identifies issues, and maps them to application code. It is the home for a collection of performance analysis tools, recipes, micro-benchmarks & more

Overview

  • do.py -- The main driver with handy shortcuts for setting up and doing profiling, over Linux perf
  • study.py -- A driver to study and compare multiple flavors of an application (it wraps do.py and employs parallel post-processing)
  • analyze.py -- A module for analyzing profiling logs. It automates the process of software optimizations described in From Top-down Microarchitecture Analysis to Structured Performance Optimizations
  • pmu.py -- A module for interface to the Performance Monitoring Unit (PMU)
  • stats.py -- A module for processing counters and profiling logs
  • lbr/ -- functionality for processing Last Branch Record (LBR) streams
    • loops.py -- A module for handling loops
    • funcs.py -- A module for handling functions / procedures
    • x86.py -- A module for handling x86 instructions
  • tma.py -- A module with modern encapsulation of the Top-down Microarchitecture Analysis (TMA) method
  • pmu-tools/ -- linked Andi Kleen's perf-based great tools
    • toplev -- profiler featuring TMA method on Intel processors
    • ocperf -- perf wrapper that converts Intel event names to perf-events syntax
    • genretlat -- a profiler to collect Retire Latencies on recent Intel processors
  • workloads/ -- an evolving collection of "micro-workloads"
  • kernels/ -- an evolving collection of x86 kernels
    • gen-kernel.py -- generator of X86 kernels
    • jumpy.py -- module for different jumping constructs
    • peakXwide.c -- sample kernels for a X-wide superscalar machine, e.g. 4 for Skylake
    • sse2avx.c -- another auto-generated kernel for SSE <-> AVX ISA transition penalty
    • memcpy.c -- a custom kernel for strings of libc demonstrating how to timestamp a region-of-interest
    • callchain.c -- a custom kernel for chain of function calls as demonstrated in Establishing a Base of Trust with Performance Counters for Enterprise Workloads
    • pagefault.c -- a custom kernel for page faults on memory data accesses
    • fp-arith-mix.c -- demonstrates utilization of extra counters in Icelake's PMU
    • rfetch3m -- a random fetcher across 3MB code footprint (auto-generated)
    • There are more kernels produced by build.sh though not uploaded to git

Checkout with:

git clone --recurse-submodules https://github.com/aayasin/perf-tools

Usage

setting up system (for more robust profiling)

  • to setup the perf tool, invoke ./do.py setup-perf
  • to turn-off SMT (CPU hyper-threading), invoke ./do.py disable-smt; don't forget to re-enable it once done, e.g. ./do.py enable-smt
  • ./do.py disable-prefetches to disable hardware prefetches. Ditto re-enable comment for this/next commands.
  • ./do.py enable-fix-freq to use fixed-frequency (in paritcular disables Turbo).
  • ./do.py disable-atom to disable E-cores in Hybrid processors.

profiling

First, edit run.sh to invoke your application or use the -a '<your app and its args>', alternatively. System-wide profiling is supported as well.

  • to profile, simply ./do.py profile which includes multiple steps:

    • logging step: collects the system setup info
    • basic counting & sampling steps: collect key metrics like time or CPUs utilized, via basic profiling and output top CPU-time consuming commands/modules/functions, their call-stack as well as the disassembly of top hotspot.
    • topdown profiling steps: collect reduced tree, auto drill-down and full-tree collections with multiple re-runs.
    • advanced sampling steps: deeper profiling using advanced capabilities of the PMU, and output certain reports at the assembly level (of hottest command). Example reports include instruction-mixes, hitcounts (basic-block execution counts), loops, as well as stats on hottest loops (identifying loops has some restrictions). Another precise event step is available but is disabled by default.
    • Additional misc profile-steps are available, e.g. tracing MSRs. Refer to --profile-mask documentation for full list of profile-steps.

    A filtered output will be dumped on screen while all logs are saved to the current directory.
    Use --profile-mask 42, as an example, to invoke subset of all steps.
    For topdown profiling and advanced sampling, see system requirements.

  • ./do.py log will only log hardware and software setup.

  • ./do.py setup-all will setup all required tool (fetch and build those needed. Internet access required).

  • ./do.py setup-perf profile will setup just perf then do default profiling (multiple commands can be used at once).

  • ./do.py tar will archive all logs into a shareable tar file.

  • ./do.py all will setup perf before doing all above profiling steps.

  • ./do.py profile -pm 13a -v1 will do selected profile steps - per-app counting, sampling, topdown 2-levels, sampling w/ LBR - and print underlying commands as well.

  • ./do.py help -m My_Metric will print description of given metric (that toplev understands)

kernels (microbenchmarks)

  • to build pre-defined ones, simply cd kernels/ && ./build.sh, or
  • GEN=0 ./build.sh from kernels/ dir to re-build the kernels without generating them
  • to run a kernel, invoke it with number-of-iterations, e.g. ./kernels/jumpy5p14 200000000
  • to create a custom kernel, set the desired parameters. e.g. ./kernels/gen-kernel.py -i PAUSE -n 10 outputs a C-file of a loop with 10 PAUSE instructions, that can be fed to your favorite compiler.

tools

A set of command-line tools to facilitate profiling

  • addrbits -- extracts certain bit-range of hexa input
  • lbr_stats -- calculates stats on LBR-based profile
  • llvm-mca -- calculates IPC-ideal for simple loops in LBR profile-step
  • loop_stats -- calculates stats for a particular loop in an LBR-based profile
  • n-copies -- invokes N-copies of an app, with CPU affinity (uses sibling thread N=2, 1 thread/core when N <= nproc)
  • n-loop -- run a given app n-times in a loop
  • ptage -- computes percentages & sum of number-prefixed input
  • slow-branch -- extracts slow sequences from Timed-LBR profile

wrappers

Shortcuts to set-up certain tools

  • build-perf.sh -- builds the perf tool from scratch; invoke with ./do.py build-perf to let it use the installer of your Linux distribution (Ubuntu is the default).
  • build-xed.sh -- downloads & builds Intel's xed. Enabled by default with ./do.py setup-all --tune :xed:1.
  • omp-bin[.sh] -- wrapper for OpenMP apps setting # of threads and CPU affinity

More information

Required Linux kernel for most recent processors 🎉

Intel product Kernel version perf version
Ice Lake 5.10
Rocket Lake 5.11
Alder Lake 5.13 5.17
Raptor Lake 5.18
Sapphire Rapids 5.18
Meteor Lake 6.4 6.3 or 6.5 onwards
Granite Rapids 6.8 6.6

Besides, perf tool version 5.13 or newer is required (except observed broken perf versions). See do.py --install-perf for more.