This is a simple perf timer that provides output in text and HTML format. It's very simple to work with.
The frame profiler supports a number of output modes, but the most useful is the HTML output mode, which offers multiple ways of viewing your data:
At the top of each page, you'll see a tree map view:
Below that, you'll see an expandable and collapsible table that can be configured in either a top-down tree view to see what call stacks are the most expensive:
Or in a flat view to see which functions are taking the longest in all contexts combined:
In a multithreaded environment, these outputs are presented per-thread, so you can see each thread's individual hotpoints, and they're also collected at the bottom of the page to show cumulative hotspots across all threads.
In addition to the HTML output view, there's also two text output modes (TREE and FLAT), which display the same tables you see in the HTML output as ASCII tables.
When collection is turned on, the perf timer uses collections.deque for atomic insertion of events with no locking. When disabled, it does nothing at all.
To enable:
perf_timer.EnablePerfTracking()
To disable:
import perf_timer
perf_timer.EnablePerfTracking(False)
Once enabled, you can create perf timers as context managers:
import perf_timer
with perf_timer.PerfTimer("BlockName"):
# Your code here
It will collect nested metrics on your application, recording the total time and callstacks for each block you instrument this way.
When you're finished, to print the report:
import perf_timer
perf_timer.PerfTimer.PrintPerfReport(perf_timer.ReportMode.HTML)
In addition to HTML (which will create an interactive html file with all the results), you can also use ReportMode.TREE and ReportMode.FLAT to print out an ascii table with the results.
When the report mode is HTML, the second parameter is a string indicating the output file name. It will default to <main_file_name>_PERF.html.
When the report mode is TREE or FLAT, the second parameter is a function that accepts strings to print. It defaults to 'print()'.
PerfTimer also supports frame-based profiling. The term "frame-based" here is derived from the use case of profiling video games, where hitches and long frames can be very difficult to profile. However, the implementation is generic enough that it can be used for capturing profile reports in any context where you want to do any kind of grouping of profile results.
To use frame-based profiling, simply add a numerical current frame count to your PerfTimer constructor:
import perf_timer
global frameCount
with perf_timer.PerfTimer("BlockName", frameCount):
# Your profiled code here
In the HTML output mode, this will add a frame-by-frame performance graph to the top of the HTML page. Initially, the profile output will show the collected data from all frames together, but clicking on any of these bars will load the profiling data for that specific frame below it, allowing for inspection of hotspots within the longest frames and hitches.
If you're profiling a large number of frames and having difficulty clicking on the frame you want, you can also click and drag within the header to zoom in on a segment of the output so that you can click on the one you want more easily.
Running the frame-based profiler with output modes of FLAT or TREE will generate a separate ascii table PER FRAME and PER THREAD. This could result in an extremely large amount of output. It's strongly recommended to use frame-based profiling only with the HTML output mode.
In addition to being usable as a python library, perf_timer can also be used to generate visualization output for metrics collected from other applications, potentially written in different languages. If you can generate metrics in the format required by perf_timer, it can consume those and convert them into output. There are two formats supported by perf_timer: a JSON format or a binary format.
Both standard and frame-based profiling modes are supported in the external viewer mode. See the section on frame-based profiling above for more information on this.
The JSON format is relatively simple. It contains an array of arrays, with each nested array representing a single recorded metric. The format of the array is as follows:
[
// Either 0 (ENTER CONTEXT) or 1 (EXIT CONTEXT)
operation: number,
// Unique number per thread. Doesn't matter what it is as long as it's unique.
threadId: number,
// Current frame count. If you don't wish to use frame-based profiling, set this to -1.
frameId: number,
// Timestamp of the event, in nanoseconds - this is an absolute time, not a duration
timestamp: number,
// The string name for the current block. Parent block names should not be included.
blockName: string
]
Example JSON data:
[
[0, 1, 0, 1609044830711675392, "DemoFunc_0"],
[0, 1, 0, 1609044830713163520, "DemoFunc_0_0"],
[0, 1, 0, 1609044830714651392, "DemoFunc_1_0"],
[1, 1, 0, 1609044830716141056, "DemoFunc_1_0"],
[1, 1, 0, 1609044830716141056, "DemoFunc_0_0"],
[0, 1, 0, 1609044830716141056, "DemoFunc_0_1"],
[0, 1, 0, 1609044830717628672, "DemoFunc_1_0"],
[1, 1, 0, 1609044830719115520, "DemoFunc_1_0"],
[1, 1, 0, 1609044830719115520, "DemoFunc_0_1"],
[1, 1, 0, 1609044830719115520, "DemoFunc_0"],
[0, 1, 1, 1609044830719115520, "DemoFunc_0"],
[1, 1, 1, 1609044830720603392, "DemoFunc_0"]
]
Note that it is the responsibility of the application to ensure all ENTER events are balanced with a corresponding EXIT event and that the events are ordered correctly within their thread. (Cross-thread ordering does not matter - thread events can be interleaved, or each thread can have its own section of the array.) perf_timer
will use the ordering of the ENTER and EXIT events to build call stacks for the tree views.
The binary format is similar to the JSON format, but, perhaps expectedly, is more strict. It's provided to allow for native applications to generate the output more efficiently.
Note that the binary format does not include ANY alignment padding to ensure it will work across all languages and architectures without making any assumptions about the alignment of data. Structs used to generate the metrics file should thus be packed. All values should also be provided in little endian byte order, meaning if you generate the file on a big endian system, you will need to convert your values.
The overall file format is as follows:
Field | Type | Size | Values |
---|---|---|---|
Magic | unsigned int32 | 4 bytes | MUST contain the value 0xFA57 |
Count | unsigned int32 | 4 bytes | The total number of events recorded |
Events | struct Event | (see struct definition) |
The definition of struct Event
is as follows:
Field | Type | Size | Values |
---|---|---|---|
Operation | Byte | 1 byte | 0 (START_CONTEXT) or 1 (END_CONTEXT) |
Thread ID | unsigned int64 | 8 bytes | Any integer value unique to the current thread |
Frame ID | signed int32 | 4 bytes | The current frame count for this event, or -1 if not using frame-based profiling |
Timestamp | unsigned int64 | 8 bytes | The timestamp of the event in nanoseconds |
String Length | unsigned int16 | 2 bytes | The length of the Block Name string in bytes |
Block Name | char* | Variable | The name of the current context as a string without the null terminator |
Example binary data:
The sample_output
directory in the tree includes some randomly-generated examples of the HTML output, as well as example binary and json files containing example data.
If you use the 010 editor by SweetScape, the example data can be viewed using the following template definition (as shown in the screenshot above):
int magic <format=hex, comment="Must be 0xFA57">;
Assert(magic == 0xFA57);
int count<comment="Number of events">;
enum<byte> Operation
{
Start = 0,
End = 1,
};
local int i;
for( i = 0; i < count; i++ )
{
struct Foo
{
Operation op<comment="Event operation">;
int64 threadId<comment="Numeric thread ID">;
int32 frameId<comment="Frame counter for this event">;
uint64 time<comment="Nanosecond timestamp">;
uint16 strlen<comment="String length">;
char name[strlen]<comment="Block name">;
};
Foo f;
}
You can also use that template to test the validity of any binary output you generate.
A reference implementation for generating perf timer data in another language is provided in the lib_references directory. Currently, only a header for C++ is provided.
The C++ presentation is a 100% usable implementation of the perf timer; however, it is somewhat naiive about multithreading, using std::mutex to guard events rather than using a more tailored solution such as a concurrent queue. This decision was made to keep the reference implementation as simple as possible, but it should be fairly trivial to change it to use a concurrent queue.
Note that by default, multithreading is not enabled to avoid taking the cost in single-threaded apps. You can control this by defining PERFTIMER_MULTITHREADED to true before including the header, or by modifying the header. (It's recomended to set this in the makefile or other build settings to ensure it's applied consistently to all headers or you may have crashes.)
Additionally, the perf timer itself is also disabled by default so that instrumenting your code has zero cost unless you make a build with the perf timer turned on. To turn it on, simply #define PERFTIMER_ENABLED before including the header. (Again, recommended to do this in the makefile.)
Basic usage:
void SomeFunction()
{
// This creates a performance timer for this function scope. It's automatically named with
// the function's name and results in an object that will go out of scope when the function
// returns.
PERF_TIMER(g_frameCount);
// This provides a note, which will appear on the output as an annotation on the frame it's
// associated with. This is helpful for tracking the impact on frame time of certain rare events.
PERF_NOTE("A note!", g_frameCount);
{
// This creates a block-scope perf timer that will track only the time within
// this specific block of code, and will go out of scope at the closing brace.
::PerfTimer::PerfTimer blockTimer("Some block", g_frameCount);
}
}