linux perf event c++ wrapper
Measure how much your code costs in terms of hardware instructions, cachemisses, branchmisses and memory allocations.
You will need sashamakarenko/makefile project cloned next to the pe working copy.
$> git clone https://github.com/sashamakarenko/makefile.git makefile
$> git clone https://github.com/sashamakarenko/pe.git pe
$> cd pe
$> make
Let us invoke gettimeofday 20 times on the CPU core 3 and get statistics from this. Instrument your code or create pe/src/tests/TestGettimeofday.cpp like this:
#include <pe/Measurement.h>
int main( int argc, char** argv )
{
pe::Measurement m;
m.pinToCpuCore( 3 );
m.addEvent( pe::EventType::cpuCycles );
m.addEvent( pe::EventType::hwInstructions );
m.addEvent( pe::EventType::branchInstructions );
m.addEvent( pe::EventType::llCacheReadMisses );
m.addEvent( pe::EventType::branchMisses );
m.addEvent( pe::EventType::memory ); // you will have to use LD_PRELOAD=build/lib/release/libPePreload-1.0.so to capture memory usage
m.initialize( 20 );
std::cout << "\ngettimeofday:" << std::endl;
for( int i = 0; i < m.getMaxCaptures(); ++i )
{
m.startCapture();
// this is what we measure
gettimeofday( &tv, nullptr );
m.stopCapture();
}
m.prepareResults();
m.printCaptures();
m.showAverageValues( std::cout );
m.rewind();
return 0;
}
$> make
Enable perf events:
$> sudo echo -1 > /proc/sys/kernel/perf_event_paranoid
Bring the frequency up
$> sudo cpufreq-set -c 3 -g performance
Run the code
$> make check
Relax the CPU
$> sudo cpufreq-set -g powersave
The metrics correspond to 10 calls in a row.
Hopefully the column names are self-explanatory.
The column nanos/call
correspond to a single call time on a 5GHz
event | nanos/call | cpu.cycles | hw.instrs | br.instrs | cch.ll.rmiss | br.misses | bus.cycles | cch.l1d.rmiss | cch.l1i.rmiss |
---|---|---|---|---|---|---|---|---|---|
gettimeofday | 13.2 | 661 | 821 | 140 | 0 | 0 | 3 | 2 | 0 |
clock_gettime | 14.3 | 714 | 911 | 170 | 0 | 3 | 3 | 3 | 0 |
rdtsc | 7.0 | 352 | 61 | 0 | 0 | 0 | 2 | 0 | 0 |
See src/tests/TestLibCalls.cpp
for details.
In order to avoid compiler call evictions, for the functions returning int we actually measure:
externVolatileInt += function();
externVolatileInt += function();
... // repeated 10 times
externVolatileInt += function();
event | nanos/call | cpu.cycles | hw.instrs | br.instrs | cch.ll.rmiss | br.misses | bus.cycles | cch.l1d.rmiss | cch.l1i.rmiss |
---|---|---|---|---|---|---|---|---|---|
void() | 1.4 | 70 | 20 | 20 | 0 | 0 | 0 | 0 | 0 |
void(int) | 0.7 | 36 | 30 | 20 | 0 | 0 | 0 | 0 | 0 |
int(int) | 0.9 | 43 | 81 | 20 | 0 | 0 | 0 | 0 | 0 |
inline int base.get() | 0.7 | 37 | 43 | 0 | 0 | 0 | 0 | 0 | 0 |
int base.get() | 0.9 | 45 | 83 | 20 | 0 | 0 | 0 | 0 | 0 |
int base.*getIntPtr() | 1.2 | 61 | 123 | 30 | 0 | 0 | 0 | 0 | 0 |
virtual int base.get() | 0.9 | 43 | 93 | 20 | 0 | 0 | 0 | 0 | 0 |
virtual int derived.get() | 0.9 | 43 | 93 | 20 | 0 | 0 | 0 | 0 | 0 |
virtual int virtDerived.get() | 1.3 | 67 | 133 | 20 | 0 | 0 | 0 | 0 | 0 |
inline base.getIndirect() | 1.0 | 52 | 46 | 0 | 0 | 0 | 0 | 0 | 0 |
inline derived.getIndirect() | 1.0 | 48 | 46 | 0 | 0 | 0 | 0 | 0 | 0 |
event | nanos/call | cpu.cycles | hw.instrs | br.instrs | cch.ll.rmiss | br.misses | bus.cycles | cch.l1d.rmiss | cch.l1i.rmiss |
---|---|---|---|---|---|---|---|---|---|
void() | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
void(int) | 0.0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
int(int) | 1.0 | 48 | 30 | 0 | 0 | 0 | 0 | 0 | 0 |
inline int base.get() | 0.9 | 47 | 40 | 0 | 0 | 0 | 0 | 0 | 0 |
int base.get() | 0.9 | 44 | 40 | 0 | 0 | 0 | 0 | 0 | 0 |
int base.*getIntPtr() | 1.3 | 64 | 120 | 30 | 0 | 0 | 0 | 0 | 0 |
virtual int base.get() | 1.0 | 48 | 71 | 20 | 0 | 0 | 0 | 0 | 0 |
virtual int derived.get() | 0.9 | 45 | 71 | 20 | 0 | 0 | 0 | 0 | 0 |
virtual int virtDerived.get() | 1.5 | 77 | 120 | 20 | 0 | 0 | 0 | 0 | 0 |
inline base.getIndirect() | 1.0 | 48 | 40 | 0 | 0 | 0 | 0 | 0 | 0 |
inline derived.getIndirect() | 0.9 | 43 | 40 | 0 | 0 | 0 | 0 | 0 | 0 |