Reduce overhead of sampling profiler by having only one thread do it #6433
The current built-in profiler has a lot of overhead for fine-grained compute_at schedules. For example, turning on the profiler inflates the runtime of the bgu app by about 50%. This happens because all threads write their current state to the same cache line, causing a lot of cache-coherence traffic between cores. Each of these writes is effectively a cache miss.
This PR changes it so that whenever we have lots of threads all doing the same thing (i.e. in a leaf parallel loop body and not inside a fork node), one of them gets elected to write to the status field. The election is done by racing to grab a pipeline-scope token using an atomic op; the winner does the reporting. This speeds things up in two ways. First, the threads that don't write don't incur the cache misses. Second, the thread that does write can keep that line in its cache, with the sampler thread just snooping on the bus traffic when it wants to read, instead of invalidating the cache line (assuming I remember how MESI works properly).