proposal: runtime/pprof: add PMU-based profiles #36821
Comments
Will it be available to non-Linux OSes? If so, that's great. Previously, we've suggested Linux |
Hi Hyang-Ah Hana Kim, -Milind |
Related: #21295 |
The pprof docs don't indicate that Linux perf delivers more accurate reports in general; is that the case? From https://golang.org/doc/diagnostics.html#profiling: What other profilers can I use to profile Go programs? On Linux, perf tools can be used for profiling Go programs. Perf can profile and unwind cgo/SWIG code and the kernel, so it can be useful to get insights into native/kernel performance bottlenecks. On macOS, the Instruments suite can be used to profile Go programs. |
Hi Liam, |
That would be super useful and would make this proposal more desirable. |
What's required to make Linux perf deliver the superior accuracy you propose? Is it hard to configure? Apologies if that's an FAQ... |
Hi Liam,
|
Change https://golang.org/cl/219508 mentions this issue: |
/cc @aclements |
Thanks for the proposal doc (https://go.googlesource.com/proposal/+/refs/changes/08/219508/2/design/36821-perf-counter-pprof.md). Using hardware counters to do a better job than setitimer seems fairly unobjectionable. No new API would be an implementation detail and fairly easy to accept (assuming the implementation were not too large). The hard part is when the new API starts getting added. We don't want to overfit to Linux, nor to x86. How much new API is really needed here? Can the proposal be useful with zero new API? |
Thanks, Russ, for your comments. The proposal suggests introducing a handful of new APIs in the runtime/pprof package. Right now, we add a few PMU configurations that surface as public APIs. If we don't want to keep this set growing, I have a solution that will introduce only one API that takes a string argument. Let me know which parts of the proposal seem overfitted to Linux or x86. I don't think there is x86 overfitting, because any architecture on which Linux runs will implement the perf_event_open interface. |
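For illustration only, here is a minimal sketch of what such a single string-based entry point could look like; the name StartPMUProfile and the "event@period" config format are hypothetical, not text from the proposal:

```go
// Hypothetical sketch: StartPMUProfile and the "event@period" string format
// are illustrative only, not the API proposed in this issue.
package pmuprof

import (
	"fmt"
	"io"
	"strconv"
	"strings"
)

// StartPMUProfile would arm a hardware counter described by a single string
// such as "cycles@1000000" or "cache-misses@10000". Keeping the parsing
// internal means supporting a new event requires no new exported API.
func StartPMUProfile(w io.Writer, config string) error {
	parts := strings.SplitN(config, "@", 2)
	if len(parts) != 2 {
		return fmt.Errorf("pmu config %q: want event@period", config)
	}
	event := parts[0]
	period, err := strconv.ParseUint(parts[1], 10, 64)
	if err != nil || period == 0 {
		return fmt.Errorf("pmu config %q: bad sampling period", config)
	}
	// A real implementation would hand event/period to the runtime and begin
	// writing samples to w; this sketch stops at parsing.
	_, _, _ = w, event, period
	return nil
}
```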
I have a very quick question. Are the performance numbers from PMU mapped to goroutines? Based on my understanding, numbers from PMU are for different threads, while how goroutines are mapped to threads is managed by the runtime. |
PMUs are configured per OS thread. PMUs by themselves are unaware of goroutines. Just like an OS timer sample, the signal handler attributes the sample to the goroutine currently running on the OS thread. |
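To make the per-OS-thread configuration concrete, here is a hedged, Linux-only sketch (outside the runtime, using golang.org/x/sys/unix, which the actual proposal does not depend on) that arms a cycles counter for the calling thread and reads it back:

```go
//go:build linux

package main

import (
	"encoding/binary"
	"fmt"
	"runtime"
	"unsafe"

	"golang.org/x/sys/unix"
)

func main() {
	// A PMU counter is armed for one OS thread, so pin this goroutine to its
	// thread first. pid=0 means "the calling thread", cpu=-1 means "any CPU".
	runtime.LockOSThread()
	defer runtime.UnlockOSThread()

	attr := unix.PerfEventAttr{
		Type:   unix.PERF_TYPE_HARDWARE,
		Config: unix.PERF_COUNT_HW_CPU_CYCLES,
		Size:   uint32(unsafe.Sizeof(unix.PerfEventAttr{})),
		Bits:   unix.PerfBitDisabled | unix.PerfBitExcludeKernel,
	}
	fd, err := unix.PerfEventOpen(&attr, 0, -1, -1, unix.PERF_FLAG_FD_CLOEXEC)
	if err != nil {
		// Commonly EACCES unless kernel.perf_event_paranoid is relaxed.
		fmt.Println("perf_event_open:", err)
		return
	}
	defer unix.Close(fd)

	unix.IoctlSetInt(fd, unix.PERF_EVENT_IOC_RESET, 0)
	unix.IoctlSetInt(fd, unix.PERF_EVENT_IOC_ENABLE, 0)

	// Do some work on this thread; cycles spent by goroutines scheduled on
	// other threads are not counted by this fd.
	sum := 0
	for i := 0; i < 10_000_000; i++ {
		sum += i
	}

	unix.IoctlSetInt(fd, unix.PERF_EVENT_IOC_DISABLE, 0)
	var buf [8]byte
	if _, err := unix.Read(fd, buf[:]); err == nil {
		fmt.Printf("CPU cycles on this thread: %d (sum=%d)\n",
			binary.LittleEndian.Uint64(buf[:]), sum)
	}
}
```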
@chabbimilind Thanks a lot for the reply! I feel the tool would be very helpful in performance optimization and performance failure diagnosis, since both of them rely on precise time measurement. |
@chabbimilind To a first approximation, no new API means no new API. Is this proposal useful with no new API at all? |
@ianlancetaylor, the proposal will not work without a new API. That aspect is quite clearly stated in the proposal. |
The closest thing I was able to find for Windows was https://docs.microsoft.com/en-us/windows/win32/api/_hcp/. But it requires a driver for configuring things. Windows Insider builds also have DTrace. For OS X, Intel has special drivers https://www.campos.cc/post/using-intel-performance-counter-monitor-on-mac-os-x/. Both platforms have separate programs to measure counters (xperf and Instruments). But there doesn't seem to be a nice way to invoke them programmatically. I guess the canonical API is going to be processors themselves. |
@egonelbre, Windows and macOS, AFAIK, do not expose system calls to access PMUs in a clean way like Linux does. Depending on a library can be fragile because the libraries may not even exist, and there is also the question of whether we would ever make such cgo calls from inside the Go runtime. |
@aclements @dvyukov @rhysh can you please finish reviewing this proposal? |
@chabbimilind, I really did mean zero new API, which would mean making the current CPU-based profiles more accurate but not enabling profiling of other events like mispredicted branches and so on. Once we add those, I fear the API just gets arbitrarily large, since there are so many different counters. And they vary by architecture. We will take some time to read your doc, but we're not working as fast as normal these days. Thanks for your patience. |
@rsc, the concern around API growth is understandable. Hence, earlier in this thread, I proposed an alternative API that does not grow. It simply takes a string input, and the string parsing remains internal to the implementation of the single API exposed. This allows all kinds of counter names / values / configs to be passed in as strings without needing any new APIs. Please do consider that alternative. I cannot see how zero API change is possible. We need to maintain the OS interval timer as the default sampling engine to work across all platforms under all environments. The hardware performance counter-based sampling engine has to be enabled by exposing at least one API. |
Looking more closely at the implementation, I see you are depending entirely on event overflow signals (configured to use SIGPROF), while the records in the shared-memory ring buffer are completely discarded. This has the advantage of not having to depend on the kernel to correctly generate stack traces [1] and makes applying pprof labels much simpler. On the other hand, it seems that this limits scalability quite a bit and potentially introduces additional inaccuracy (SIGPROF cannot be delivered while the thread has it masked). [1] FWIW, I think we could successfully recover most missing frames (notably, inline frames) from kernel traces in post-processing. |
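For readers unfamiliar with the ring buffer under discussion, below is a small hedged sketch of how it is mapped (Linux-only, golang.org/x/sys/unix assumed; the fd comes from perf_event_open). Whether to consume these records or rely only on overflow signals is exactly the trade-off raised above:

```go
//go:build linux

// Package perfring is an illustrative sketch, not part of any proposed API.
package perfring

import (
	"fmt"
	"os"

	"golang.org/x/sys/unix"
)

// mapRingBuffer maps the shared-memory ring buffer for an already-opened
// perf event fd: one metadata page plus a power-of-two number of data pages,
// as perf_event_open requires.
func mapRingBuffer(fd int) ([]byte, error) {
	pageSize := os.Getpagesize()
	const dataPages = 8 // must be a power of two
	buf, err := unix.Mmap(fd, 0, (1+dataPages)*pageSize,
		unix.PROT_READ|unix.PROT_WRITE, unix.MAP_SHARED)
	if err != nil {
		return nil, fmt.Errorf("mmap perf ring buffer: %w", err)
	}
	// The first page holds the perf_event_mmap_page metadata (head/tail
	// positions); PERF_RECORD_SAMPLE records follow in the remaining pages.
	return buf, nil
}
```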
Correct. We can use kernel-provided stacks if needed. For example, in another profiler I worked on, we stitched in the kernel call stacks from the perf-provided ring buffer, but used the user-land frames from our own unwinding. I think there is quite a lot of flexibility (and there are decisions to be made) on these details. |
Evaluate measured cputime using the tests from github.com/golang/go/issues/36821, which demonstrate the inaccuracy and imprecision of go's 100Hz cpu profiler. The results here suggest that measured cputime is both accurate and precise with regards to computing on-CPU time.

=== RUN TestEquivalentGoroutines
  0's got  9.98% of total time
  1's got  9.53% of total time
  2's got  9.22% of total time
  3's got 10.42% of total time
  4's got  9.84% of total time
  5's got 10.43% of total time
  6's got 10.50% of total time
  7's got 10.21% of total time
  8's got 10.03% of total time
  9's got  9.86% of total time

=== RUN TestProportionalGoroutines
  0's got  1.87% of total time (1.000000x)
  1's got  3.60% of total time (1.931999x)
  2's got  5.41% of total time (2.899312x)
  3's got  7.21% of total time (3.864451x)
  4's got  9.11% of total time (4.880925x)
  5's got 10.94% of total time (5.864723x)
  6's got 12.77% of total time (6.842004x)
  7's got 14.34% of total time (7.685840x)
  8's got 16.58% of total time (8.885060x)
  9's got 18.18% of total time (9.741030x)
As of Go 1.18, the built-in profiler uses per-thread timers (via timer_create) on Linux. @chabbimilind , I ran your initial test code with go1.18.2 and saw that the results have improved from what you reported in early 2020. On a machine running Linux with 16 Intel threads (8 cores), I saw each of the 10 functions in the parallel test case reported as using 9–11% of the CPU time, across about 6900 samples. It is no longer limited to reporting ~250 samples per second, and now reports close to the 1000 per second (an average of 917 for the test run below) that I'd expect from 10 running goroutines. The serial test is short enough that each run collects only 30 samples, but when I extend it to run 100x longer (to collect about 3100 samples) I see each function get within 0.5% / 50 basis points of its expected fraction. Now that the baseline has changed, could you refresh this issue with what incremental improvements you'd like to see? Is this list now:
Thanks.
|
Fixes golang#41554.

This commit introduces a /sched/goroutine/running:nanoseconds metric, defined as the total time spent by a goroutine in the running state. This measurement is useful for systems that would benefit from fine-grained CPU attribution.

An alternative to scheduler-backed CPU attribution would be the use of profiler labels. Given that it's common to spawn multiple goroutines for the same task, goroutine-backed statistics can easily become misleading. Profiler labels instead let you measure CPU performance across a set of cooperating goroutines. That said, they have two downsides:
- performance overhead; for high-performance systems that care about fine-grained CPU attribution (e.g., databases that want to measure the total CPU time spent processing each request), profiler labels are too cost-prohibitive, especially given that the Go runtime has a much cheaper and more granular view of the data needed
- inaccuracy and imprecision, as evaluated in golang#36821

It's worth noting that we already export /sched/latencies:seconds to track scheduling latencies, i.e. time spent by a goroutine in the runnable state (go-review.googlesource.com/c/go/+/308933). This commit does effectively the same except for the running state on the requesting goroutine. Users are free to use this metric to power histograms or to track on-CPU time across a set of goroutines.

Change-Id: Ie78336a3ddeca0521ae29cce57bc7a5ea67da297
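For reference, here is a minimal sketch of the profiler-label approach that the commit message compares against; it uses the standard runtime/pprof API, and the output file name and label values are illustrative:

```go
package main

import (
	"context"
	"os"
	"runtime/pprof"
)

func work() {
	sum := 0
	for i := 0; i < 50_000_000; i++ {
		sum += i
	}
	_ = sum
}

func main() {
	f, err := os.Create("cpu.pprof") // illustrative output file
	if err != nil {
		panic(err)
	}
	defer f.Close()
	pprof.StartCPUProfile(f)
	defer pprof.StopCPUProfile()

	// Goroutines started inside pprof.Do inherit the labels, so CPU samples
	// from a set of cooperating goroutines can be grouped under one logical
	// task ("task"/"request-42" are illustrative values).
	pprof.Do(context.Background(), pprof.Labels("task", "request-42"), func(ctx context.Context) {
		done := make(chan struct{})
		go func() { work(); close(done) }()
		work()
		<-done
	})
}
```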
@rhysh, it is great to see |
Change https://go.dev/cl/410798 mentions this issue: |
Change https://go.dev/cl/410797 mentions this issue: |
Change https://go.dev/cl/410799 mentions this issue: |
Hi, since #42502 has been accepted, I submitted an implementation for it. At the same time, based on this new API, I added support for some PMU events for this issue, and the implementation does not introduce any new API. FYI. |
This is great news! |
Yes, there is currently no PMU support for http/pprof and testing. Because the OS timer cannot be completely replaced, a new API must be introduced. I'm not sure if this is appropriate, so I didn't add it. |
Change https://go.dev/cl/410796 mentions this issue: |
Thanks @chabbimilind for this great proposal and @erifan for the implementation! Recently I've been looking at this proposal but I still have a few questions:
The
I just looked up some kernel code and I found the overflow signal is not delivered directly in the PMU interrupt handler. If my understanding is right, there will be a time gap between the overflows and the signal handling. This will make the profiling results less reliable since the signal handler is not the first place where the event overflows. If we are going to add more events like From my view, maybe we can read the profiling info directly from the perf ring buffer. But I think some runtime info will be lost and I'm not sure whether we can recover it. I will investigate it further. |
This situation is indeed difficult to deal with, but it should be rare. Go's use of PMU is similar to
We can get the g from R28 if the goroutine is not dead when we write the samples. But if the goroutine is dead, that will be a problem. Maybe you can test which of the two implementations is more accurate. My previous implementation showed that the skid from overflow to signal handling does not seem to be very large. |
Having access to a unique number that is faster to obtain across multiple cores than an increment under a mutex is needed. This allows the mutex to be removed for queue processing and serial data insertion. This is accomplished in Linux with … I hope that this is implemented in Go. |
@andrewhodel Could you clarify your use-case? Are you saying that you want to enable the If so, isn't each counter per-core, and thus not actually guaranteed to be globally unique? |
I have a question and some comments about using the ring buffer. The ring buffer is accurate for the PC, but the call stack has to be found by unwinding. For the leaf frame, which PC do we use? Is it the one seen in the signal handler or the one given by the ring buffer? Due to the skid between the time the PMU takes the PC sample and when the signal is produced, the PC seen in the signal handler can be off from the PC recorded in the ring buffer. We can unwind the stack from the signal handler and override the leaf-frame PC with the ring buffer PC, but this can be wrong if the real execution has moved on to another function by the time the signal is delivered. I knew of this problem from the beginning in my proposal and preferred to use the PC seen in the signal handler. AFAIK, perf overrides the PC with the one seen in the ring buffer. I had a personal communication a while back about this matter with Stephane Eranian, who is a subject matter expert. I think it is probably "ok" to get the call stack from the signal and override the leaf PC with the PC given by the ring buffer, knowing that sometimes the call stack will look whacky. |
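A tiny hedged sketch of the "override the leaf PC" option being weighed here; the function name and types are hypothetical, for illustration only:

```go
package pcsketch

// mergeSample combines a stack unwound in the signal handler with the more
// precise leaf PC recorded by the PMU in the ring buffer. The names are
// hypothetical; this only illustrates the trade-off discussed above.
func mergeSample(signalStack []uintptr, ringBufPC uintptr) []uintptr {
	if len(signalStack) == 0 {
		return []uintptr{ringBufPC}
	}
	merged := append([]uintptr(nil), signalStack...)
	// The ring-buffer PC has little or no skid, but if execution moved to
	// another function before the signal arrived, the overridden leaf may
	// not match the rest of the (signal-time) call stack.
	merged[0] = ringBufPC
	return merged
}
```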
Hm, interesting problem, I hadn't thought about that. I agree that it seems OK to use the adjusted PC, especially if that is already what the Linux perf tool does.
Just to clarify, the impact is really the skid between the PMU count and CPU interrupt delivery? I.e., even a stack trace collected by the kernel in the interrupt handler has the same issue. I don't think there should be any skid between interrupt delivery and signal delivery? (After the interrupt, the next return to userspace should be directly to the signal handler.) [1] [1] Unless, I suppose, the skid happens exactly in the kernel where the thread is being descheduled and is made visible to other CPUs to run before the interrupt occurs. That seems vanishingly rare though. |
Yes, I meant the skid between when the PC is sampled and when the PMU raises an interrupt. Referring back to "precise_ip" at https://man7.org/linux/man-pages/man2/perf_event_open.2.html, it states the following: "This controls the amount of skid. Skid is how many …" I don't know how accurate this statement is. My understanding is that the register snapshot in the ring buffer can be set to have zero skid; but there is no guarantee on when the kernel will unwind the stack and what stack state it will see. The kernel can use the register snapshot from the ring buffer for its unwinding, but what if the processor has moved on and made changes to the state of memory, e.g., the return address on the stack? |
What version of Go are you using (go version)?
Does this issue reproduce with the latest release?
Yes; tested on go version devel +74d366f484 Mon Jan 27 20:45:39 2020 +0000 linux/amd64

What operating system and processor architecture are you using (go env)?
(go env output omitted)

What did you do?
The following experiments demonstrate that pprof CPU profiles lack accuracy (closeness to the ground truth) and precision (repeatability across different runs). The tests are highly predictable in nature and involve little runtime overhead (allocation, GC, system calls, etc.). They are carefully designed so that we can compare pprof CPU profiles against our expectations. The evaluation shows a wide difference from expectation. One should not mistake this for a runtime issue; the issue is with the OS timers used for sampling; OS timers are coarse-grained and have a high skid. In a proposal/design document that will follow, I will propose a design to extend CPU profiling by sampling the CPU Performance Monitoring Unit (PMU), aka hardware performance counters. PMU event-based sampling is a mature technology available on almost all modern CPUs. The PMU as a sampling agent offers many benefits (in no particular priority order):
There are two test cases, goroutine.go and serial.go, in this issue.

test 1 (parallel)
Download the first test case, the goroutine.go program, from https://github.com/chabbimilind/GoPprofDemo/blob/master/goroutine.go
It has ten exactly identical goroutines, f1-f10, and I use pprof to collect the CPU profiles.
Run the following command several times and notice that each time, pprof reports a different amount of time spent in each of the ten routines.
go run goroutine.go && go tool pprof -top goroutine_prof
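For readers who don't want to download the file, the parallel test is shaped roughly like the sketch below (abridged to three of the ten identical functions; the goroutine.go at the URL above is the authoritative version):

```go
package main

import (
	"os"
	"runtime/pprof"
	"sync"
)

// Each fN performs exactly the same amount of work, so an accurate profiler
// should attribute ~10% of the CPU time to each of the ten functions. Only
// three are shown here; the real test defines f1 through f10.
func f1(wg *sync.WaitGroup) {
	defer wg.Done()
	sum := 0
	for i := 0; i < 2_000_000_000; i++ {
		sum += i
	}
	_ = sum
}

func f2(wg *sync.WaitGroup) {
	defer wg.Done()
	sum := 0
	for i := 0; i < 2_000_000_000; i++ {
		sum += i
	}
	_ = sum
}

func f3(wg *sync.WaitGroup) {
	defer wg.Done()
	sum := 0
	for i := 0; i < 2_000_000_000; i++ {
		sum += i
	}
	_ = sum
}

func main() {
	out, err := os.Create("goroutine_prof")
	if err != nil {
		panic(err)
	}
	defer out.Close()
	pprof.StartCPUProfile(out)
	defer pprof.StopCPUProfile()

	var wg sync.WaitGroup
	for _, f := range []func(*sync.WaitGroup){f1, f2, f3} {
		wg.Add(1)
		go f(&wg)
	}
	wg.Wait()
}
```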
test 2 (serial)
Download the second test case, the serial.go program, from https://github.com/chabbimilind/GoPprofDemo/blob/master/serial.go
It has ten functions (A_expect_1_82 - J_expect_18_18). The function A_expect_1_82 is expected to consume 1.82% of the total execution time, J_expect_18_18 is expected to consume 18.18% of the execution time, and so on. The code is serial and there is complete data dependence between each function and each iteration of the loop in the functions to avoid any hardware-level optimizations.
Run the following command several times.
go run serial.go && go tool pprof -top serial_prof
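Likewise, the serial test is shaped roughly as sketched below (abridged to two of the ten functions; the value v threaded through every call and iteration is what creates the complete data dependence mentioned above):

```go
package main

import (
	"os"
	"runtime/pprof"
)

// Each function runs a loop whose trip count is proportional to the share of
// time it is expected to consume (weights 1..10 sum to 55, so 1/55 = 1.82%
// and 10/55 = 18.18%). Only two of the ten functions are sketched here.
func A_expect_1_82(v, n int) int {
	for i := 0; i < 1*n; i++ {
		v = (v + i) ^ (v >> 1)
	}
	return v
}

func J_expect_18_18(v, n int) int {
	for i := 0; i < 10*n; i++ {
		v = (v + i) ^ (v >> 1)
	}
	return v
}

func main() {
	out, err := os.Create("serial_prof")
	if err != nil {
		panic(err)
	}
	defer out.Close()
	pprof.StartCPUProfile(out)
	defer pprof.StopCPUProfile()

	// v is passed through every call so the work cannot be elided.
	v, n := 1, 10_000_000
	for iter := 0; iter < 10; iter++ {
		v = A_expect_1_82(v, n)
		v = J_expect_18_18(v, n)
	}
	_ = v
}
```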
What did you expect to see?
For goroutine.go, each function (f1-f10) should be attributed with exactly (or almost exactly) 10% of the execution time on each run.
For serial.go, the time attribution should roughly follow the distribution shown below in each run.

What did you see instead?
test 1 (parallel)
Run 1, 2, and 3, shown below, respectively show the pprof -top output of goroutine.go.
You will notice that in a single run (say Run 1), f1-f10 have a wide variance in the time attributed to them; the expectation is that each of them gets 10% of the execution time. There is up to a 6x difference in time attributed to the function with the highest amount of attribution (main.f7, 4210ms, in Run 1) vs. the function with the lowest amount of attribution (main.f9, 700ms, in Run 1). This shows the poor accuracy (deviation from the ground truth) of pprof timer-based profiles.
Furthermore, the time attributed to a function varies widely from run to run. Notice how the top-10 ordering changes. In Run 1, main.f7 is shown to run for 4210ms, whereas in Run 2 it is shown to run for only 520ms. The expectation is that the measurements remain the same from run to run. This shows the poor precision (unpredictability of measurement) of pprof timer-based profiles.

goroutine.go/Run 1:
goroutine.go/Run 2:
goroutine.go/Run 3:
test 2 (serial)
The output of go run serial.go && go tool pprof -top serial_prof for three runs is shown below.
Comparing the flat% (or cum%) against the expected percentage for each function shows a large difference. For example, main.H_expect_14_546, which is expected to have 14.546% execution time, is attributed 25% of the execution time in Run 1. Furthermore, run to run, there is a lack of precision; for example, main.I_expect_16_36 is attributed 6.25% (20ms) execution time in Run 1, whereas it is attributed 21.88% (70ms) execution time in Run 2.

serial.go/Run 1:
serial.go/Run 2:
serial.go/Run 3:
Improved results with PMU-based profiling.
In a prototype PMU-based profiling implementation, below are the pprof profiles for the CPU cycles hardware performance counter for the same goroutine.go program. Notice that each function gets the same (or almost the same) CPU cycles attribution within a single run and across runs.
goroutine.go/Run 1:
goroutine.go/Run 2:
Below are the pprof profiles for the CPU cycles hardware performance counter for the same serial.go program. Notice that each function gets close to the expected CPU cycles attribution within a single run and across runs.
serial.go/Run 1:
serial.go/Run 2:
Dependence on the number of cores and length of test execution:
The results of the goroutine.go test depend on the number of CPU cores available. On a multi-core CPU, if you set GOMAXPROCS=1, goroutine.go will not show a huge variation, since each goroutine runs for several seconds. However, if you set GOMAXPROCS to a larger value, say 4, you will notice a significant measurement attribution problem. One reason for this problem is that the itimer samples on Linux are not guaranteed to be delivered to the thread whose timer expired.
The results of serial.go can change based on the time of execution of each function. By passing -m=<int> to the program, you can make the program run for longer or shorter. By making it run longer, the profiles can be made more accurate, but when a function runs for less than 100ms, the accuracy is often low.