perf-cpp is a C++ library that provides direct access to hardware performance counters from within the application. It supports precise event counting and sampling of specific code segments, and it lets you link sampled data (e.g., memory addresses) with application-specific details (e.g., class instances).
- Count Hardware Events: Integrate performance monitoring into your application. Configure, start, and stop hardware counters to profile specific code segments.
- Sampling: Leverage sampling to record performance data periodically, e.g., instruction pointers, memory addresses, access latency, branches, and more.
- Customizable Event Configuration: Use built-in hardware events (e.g., cycles, instructions, cache-misses) and those specific to your underlying CPU. Additionally, define and use metrics (quantitative measurements such as cycles per instruction) to gain deeper insight into performance and efficiency.
- Practical Examples: Jumpstart your implementation with the diverse collection of examples that demonstrate practical applications of the library.
Get up and running with perf-cpp in seconds:
# Clone the repository
git clone https://github.com/jmuehlig/perf-cpp.git
# Switch to the repository folder
cd perf-cpp
# Optional: Switch to the latest stable version
git checkout v0.9.0
# Build the library (in build/)
cmake . -B build -DBUILD_EXAMPLES=1
cmake --build build
# Optional: Build examples (in build/examples/bin)
cmake --build build --target examples
For detailed building instructions, including how to integrate perf-cpp into your CMake projects, visit our build guide.
Quickly set up hardware event monitoring:
#include <perfcpp/event_counter.h>
#include <iostream>
/// Initialize the counter
auto counters = perf::CounterDefinition{};
auto event_counter = perf::EventCounter{ counters };
/// Specify hardware events to count
event_counter.add({"seconds", "instructions", "cycles", "cache-misses"});
/// Run the workload
event_counter.start();
your_workload(); /// <-- Your code to profile
event_counter.stop();
/// Print the result to the console
const auto result = event_counter.result();
for (const auto& [event_name, value] : result)
{
std::cout << event_name << ": " << value << std::endl;
}
Possible output:
seconds: 0.0955897
instructions: 5.92087e+07
cycles: 4.70254e+08
cache-misses: 1.35633e+07
For further details, including how to count events in parallel settings, visit our guide on recording events.
Implement detailed sampling with control over the recorded content:
#include <perfcpp/sampler.h>
#include <iostream>
/// Create the sampler
auto counters = perf::CounterDefinition{};
auto sampler = perf::Sampler{ counters };
/// Specify when a sample is recorded: every 4000th cycle
sampler.trigger("cycles", perf::Period{4000U});
/// Specify what metadata is included in a sample: time, CPU ID, instruction pointer
sampler.values()
.time(true)
.cpu_id(true)
.instruction_pointer(true);
/// Run the workload
sampler.start();
your_workload(); /// <-- Your code to profile
sampler.stop();
/// Print the samples to the console
const auto samples = sampler.result();
for (const auto& sample_record : samples)
{
const auto time = sample_record.time().value();
const auto cpu_id = sample_record.cpu_id().value();
const auto instruction = sample_record.instruction_pointer().value();
std::cout
<< "Time = " << time << " | CPU = " << cpu_id
<< " | Instruction = 0x" << std::hex << instruction << std::dec
<< std::endl;
}
Possible output:
Time = 365449130714033 | CPU = 8 | Instruction = 0x5a6e84b2075c
Time = 365449130913157 | CPU = 8 | Instruction = 0x64af7417c75c
Time = 365449131112591 | CPU = 8 | Instruction = 0x5a6e84b2075c
Time = 365449131312005 | CPU = 8 | Instruction = 0x64af7417c75c
For further details, such as which data can be included in samples, visit our sampling guide.
We include a comprehensive collection of examples demonstrating the advanced capabilities of perf-cpp, such as counting events in parallel settings and sampling memory accesses.
All code examples are available in the examples/ folder.
- Full Documentation: Explore detailed guides on every feature of perf-cpp.
- Examples: Learn how to set up different features from the code examples.
- Changelog: Stay updated with the latest changes and improvements.
- C++ Standard: Requires support for C++17 features.
- CMake Version: 3.10 or higher.
- Linux Kernel Version: 4.0 or newer (note that some features need a newer kernel).
- perf_event_paranoid Setting: Adjust as needed to allow access to performance counters (see the Paranoid Value section below).
The perf_event_paranoid setting controls access to performance counters:
- -1: No restrictions (full access).
- 0: Allow normal users access, but no raw tracepoint samples.
- 1: Allow user and kernel-level profiling (default before Linux 4.6).
- >= 2: Only user-level measurements allowed (2 is the default since Linux 4.6).
# Check the current value
cat /proc/sys/kernel/perf_event_paranoid
# Allow unrestricted access (lasts until the next reboot)
sudo sysctl -w kernel.perf_event_paranoid=-1
Note: To make this change permanent, edit /etc/sysctl.conf and add kernel.perf_event_paranoid = -1.
We welcome contributions and feedback to make perf-cpp even better. For feature requests, feedback, or bug reports, please reach out via our issue tracker or submit a pull request.
Alternatively, you can email me: jan.muehlig@tu-dortmund.de.
While perf-cpp is dedicated to providing developers with clear insights into application performance, it is part of a broader ecosystem of tools that facilitate performance analysis. Below is a non-exhaustive list of some other valuable profiling projects:
- PAPI offers access not only to CPU performance counters but also to a variety of other hardware components including GPUs, I/O systems, and more.
- Likwid is a collection of several command line tools for benchmarking, including an extensive wiki.
- PerfEvent provides lightweight access to performance counters, facilitating streamlined performance monitoring.
- Intel's Instrumentation and Tracing Technology allows applications to manage the collection of trace data effectively when used in conjunction with Intel VTune Profiler.
- For those who prefer a more hands-on approach, the perf_event_open system call can be utilized directly without any wrappers.
This is a non-exhaustive list of academic research papers and blog articles. Feel free to add to it, including your own work, e.g., via pull request.
- Quantitative Evaluation of Intel PEBS Overhead for Online System-Noise Analysis (2017)
- Analyzing memory accesses with modern processors (2020)
- Precise Event Sampling on AMD Versus Intel: Quantitative and Qualitative Comparison (2023)
- Multi-level Memory-Centric Profiling on ARM Processors with ARM SPE (2024)