[PROF-5860] Allow new CPU Profiling 2.0 **alpha** profiler to be enabled #2209
Merged
The new Ruby profiler, aka "CPU Profiling 2.0", is considered to be in alpha state. We do not recommend turning it on. But! We actually can turn it on now, by using `DD_PROFILING_FORCE_ENABLE_NEW=true`. The rest of the pieces have been put into place in previous PRs.

**What does this PR do?**: Add a setting that allows choosing between the "old" profiler codepath and the new "CPU Profiling 2.0" codepath.

**Motivation**: Making it possible to test the new profiler.

**Additional Notes**: This PR sits atop #2208 because without the component in #2208 the profiler cannot be turned on. It is otherwise independent from that change.

**How to test the change?**: Here's a simple run of the new profiler:
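(As a rough illustration, not taken from this PR itself: a minimal sketch of how one might exercise the new profiler, assuming a standard ddtrace 1.x setup. The service name and workload below are placeholders.)

```ruby
# Sketch only: profiling is enabled as usual, and the alpha CPU Profiling 2.0
# codepath is opted into via the environment variable described in this PR:
#   DD_PROFILING_FORCE_ENABLE_NEW=true ruby profiling_smoke_test.rb
require 'ddtrace'

Datadog.configure do |c|
  c.profiling.enabled = true
  c.service = 'profiling-smoke-test' # placeholder service name
end

# Generate some CPU work so the profiler has something to sample
10.times { 1_000_000.times { Math.sqrt(rand) } }
```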
marcotc approved these changes on Aug 5, 2022
ɑ 🥳
marcotc approved these changes on Aug 8, 2022
ivoanjo added a commit that referenced this pull request on Oct 11, 2022:

**What does this PR do?**: (Important note: This feature is only available on the new CPU Profiling 2.0 profiler, which is still in **alpha**; see #2209.)

This PR is the last piece of the puzzle started in #2304 and #2308. With this change, time (both cpu-time and wall-time) spent by threads doing garbage collection is now accounted for, and shows up in the flamegraph.

This works by creating a new Ruby VM TracePoint to handle the `RUBY_INTERNAL_EVENT_GC_ENTER` and `RUBY_INTERNAL_EVENT_GC_EXIT` events. (These events are only available at the C level; Ruby-level TracePoints cannot use them.) Then, whenever Ruby calls the TracePoint, we call the previously-added `cpu_and_wall_time_collector_on_gc_start` and `cpu_and_wall_time_collector_on_gc_finish` to track the time spent in GC, and then insert it as a sample in the profiling output.

**Motivation**: Without this work, time spent doing garbage collection is invisible and gets blamed on methods directly. By making it visible, we enable customers to make better-informed decisions on what needs to be optimized (or fixed!).

**Additional Notes**: As I mentioned below, this only affects the new CPU Profiling 2.0 profiler codepath.

During development, I initially attempted to compare the time spent in GC gathered via the TracePoint to the one that Ruby exposes via [`GC::Profiler#total_time`](https://rubyapi.org/3.1/o/gc/profiler). The results were off, and when I looked into why, I discovered by looking at Ruby's `gc_start` function in `gc.c` that the time tracking for `GC::Profiler` only covers a subset of the GC process, not the entire time. Furthermore, from what I was able to observe, it accounts for cpu-time process-wide on Linux and, in some cases (macOS), only cpu-time in user mode, so that's another reason for discrepancies. Thus, when comparing values, expect `GC::Profiler` to report much less than the values exposed by the profiler. I believe the approach and values recorded by the profiler are more accurate, since Ruby calls the `ENTER` event quite early in the GC process and `EXIT` quite late. See also <ivoanjo/gvl-tracing#6> for a similar approach proposed by a Ruby core developer.

**How to test the change?**: TODO
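(Not part of the commit above, but for reference: the `GC::Profiler#total_time` values being compared against can be observed with a snippet along these lines.)

```ruby
# Reference sketch: GC::Profiler#total_time is expected to report less time
# than the profiler's tracepoint-based GC accounting, as explained above.
GC::Profiler.enable

1_000_000.times { Object.new } # allocate enough to trigger several GC runs
GC.start

puts "GC runs:                 #{GC.count}"
puts "GC::Profiler.total_time: #{GC::Profiler.total_time} seconds"

GC::Profiler.disable
```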
ivoanjo added a commit that referenced this pull request on Oct 26, 2022:

**What does this PR do?**: (Important note: This feature is only available on the new CPU Profiling 2.0 profiler, which is still in **alpha**; see #2209.)

This PR is the last piece of the puzzle started in #2304 and #2323. With this change, time (both cpu-time and wall-time) spent by threads doing garbage collection is now accounted for, and shows up in the flamegraph.

This works by creating a new Ruby VM TracePoint to handle the `RUBY_INTERNAL_EVENT_GC_ENTER` and `RUBY_INTERNAL_EVENT_GC_EXIT` events. (These events are only available at the C level; Ruby-level TracePoints cannot use them.) Then, whenever Ruby calls the TracePoint, we call the previously-added `cpu_and_wall_time_collector_on_gc_start` and `cpu_and_wall_time_collector_on_gc_finish` to track the time spent in GC, and then insert it as a sample in the profiling output (using `cpu_and_wall_time_collector_sample_after_gc`).

**Motivation**: Without this work, time spent doing garbage collection is invisible and gets blamed on methods directly. By making it visible, we enable customers to make better-informed decisions on what needs to be optimized (or fixed!).

**Additional Notes**: As I mentioned below, this only affects the new CPU Profiling 2.0 profiler codepath.

During development, I initially attempted to compare the time spent in GC gathered via the TracePoint to the one that Ruby exposes via [`GC::Profiler#total_time`](https://rubyapi.org/3.1/o/gc/profiler). The results were off, and when I looked into why, I discovered by looking at Ruby's `gc_start` function in `gc.c` that the time tracking for `GC::Profiler` only covers a subset of the GC process, not the entire time. Furthermore, from what I was able to observe, it accounts for cpu-time process-wide on Linux and, in some cases (macOS), only cpu-time in user mode, so that's another reason for discrepancies. Thus, when comparing values, expect `GC::Profiler` to report much less than the values exposed by the profiler. I believe the tracepoint-based approach and values recorded by the profiler to be more accurate, since Ruby calls the `ENTER` event quite early in the GC process and `EXIT` quite late. See also <ivoanjo/gvl-tracing#6> for a similar approach proposed by a Ruby core developer.

**How to test the change?**: Beyond the included code coverage, you should see `Garbage Collection` frames representing time spent in GC in flamegraphs (when using the new CPU Profiling 2.0 codepath, see above). Here's a tiny Ruby script that triggers a lot of allocation:

```ruby
def do_alloc(n)
  alloc(n)
end

def alloc(n)
  n.times do
    Object.new
  end
end

while true
  do_alloc(100_000)
end
```
ivoanjo added a commit that referenced this pull request on Oct 26, 2022:

**What does this PR do?**: (Important note: This fix only affects the new CPU Profiling 2.0 profiler, which is still in **alpha**; see #2209.)

In #2181 we added a concurrency-safe mechanism for `StackRecorder`. Key to how it works is that we keep two `ddog_Profile` instances at every point in time, and we alternate between using them. But this alternation introduced a bug with the `start_time` of profiles.

Previously, the `start_time` got set whenever a profile got created or reset, and we used the current time for it. That caused the following effect:

```
t=0   created slot_one_profile (start_time=0)
      created slot_two_profile (start_time=0)
      make_active slot_one_profile
t=60  serialized slot_one_profile (start_time=0, finish_time=60)
      reset slot_one_profile (start_time=60)
      make_active slot_two_profile
t=120 serialized slot_two_profile (start_time=0, finish_time=120)
      reset slot_two_profile (start_time=120)
      make_active slot_one_profile
t=180 serialized slot_one_profile (start_time=60, finish_time=180)
      reset slot_one_profile (start_time=180)
      make_active slot_two_profile
```

That is, other than the first profile (which is why we previously missed this bug), every profile got double the intended duration, because we reset it after serialization, but that profile would not be used for the next period.

To fix this issue, we additionally change the "make_active" step above (actually implemented in `serializer_flip_active_and_inactive_slots`) to set the correct `start_time` on the profile that becomes active. Thus, we get the correct behavior:

```
t=0   created slot_one_profile (start_time=0)
      created slot_two_profile (start_time=0)       # Ignored, will be changed later
      make_active slot_one_profile (start_time=0)
t=60  serialized slot_one_profile (start_time=0, finish_time=60)
      reset slot_one_profile (start_time=60)        # Ignored, will be changed later
      make_active slot_two_profile (start_time=60)  # Correct start_time
t=120 serialized slot_two_profile (start_time=60, finish_time=120)
      reset slot_two_profile (start_time=120)       # Ignored, will be changed later
      make_active slot_one_profile (start_time=120) # Correct start_time
t=180 serialized slot_one_profile (start_time=120, finish_time=180)
      reset slot_one_profile (start_time=180)
      make_active slot_two_profile (start_time=180)
```

**Motivation**: Having profiles with the wrong duration breaks profile aggregation.

**Additional notes**: As I mentioned above, this only affects the new CPU Profiling 2.0 profiler codepath, so I don't expect any customers to have ever run into this issue.

**How to test the change?**: The change includes test coverage. Furthermore, running the profiler with `DD_TRACE_DEBUG` prints the start/finish timestamps after serialization, which can be used to confirm they are correct. Finally, comparing profiles before/after in the profiler UX will also show the difference.
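(For illustration only; this is not the actual C implementation. The two-slot flip and the fix described above can be modeled in a few lines of Ruby.)

```ruby
# Toy model of the StackRecorder double-buffering described above. The fix is
# that a slot's start_time is stamped when it becomes active, not only when it
# is created or reset.
class TwoSlotRecorder
  Slot = Struct.new(:name, :start_time)

  def initialize(now)
    @slots = [Slot.new(:slot_one, now), Slot.new(:slot_two, now)]
    @active = 0
  end

  def serialize_and_flip(now)
    active = @slots[@active]
    duration = now - active.start_time
    active.start_time = now          # reset; this value ends up being ignored...
    @active = 1 - @active
    @slots[@active].start_time = now # ...because the flip stamps the real start_time
    [active.name, duration]
  end
end

recorder = TwoSlotRecorder.new(0)
p recorder.serialize_and_flip(60)   # => [:slot_one, 60]
p recorder.serialize_and_flip(120)  # => [:slot_two, 60]
p recorder.serialize_and_flip(180)  # => [:slot_one, 60]
```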
ivoanjo added a commit that referenced this pull request on Nov 2, 2022:

**What does this PR do?**: This PR extends the `thread id` label that is attached to profiling samples to include the thread's "native thread id".

Up until now, the `thread id` that we attached was the result of calling `Thread#object_id`. That approach allows us to easily match what we see inside the Ruby process (because we can log the `Thread#object_id` where something happens) with the profile. BUT if we wanted to match the profile with what the operating system's tooling showed, it was quite hard, because threads shown in OS task manager tools carry an OS-level thread identifier, which is not the same as the `Thread#object_id`.

To enable this matching, this PR changes the format of the `thread id` from `"#{thread.object_id}"` to `"#{thread.native_thread_id} (#{thread.object_id})"`, thus providing both identifiers. (This is OK with the profiling backend.)

Because `Thread#native_thread_id` is a Ruby 3.1+ feature, on older Rubies we use a fallback identifier instead (`pthread_t`). This identifier isn't as useful as the native identifier; in the future we may want to explore a better fallback.

**Motivation**: I found this helpful during development and when correlating data from the profiler with other tools.

**Additional Notes**: This includes full support for macOS, even though we don't officially support macOS. This feature is only available on the new CPU Profiling 2.0 profiler, which is still in alpha; see #2209.

**How to test the change?**: This is easiest to test by profiling a Ruby app, and then checking that the native ids for threads match up with what shows up in `/proc`. (Remember, only for Ruby >= 3.1; for older Rubies it won't match.)
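(Hypothetical illustration of the label format described above; the real fallback uses a `pthread_t`-based identifier rather than just the object id.)

```ruby
# Sketch of the "thread id" label layout: "native_thread_id (object_id)" on
# Ruby 3.1+, with a simplified fallback here for older Rubies.
def thread_id_label(thread)
  if thread.respond_to?(:native_thread_id)
    "#{thread.native_thread_id} (#{thread.object_id})"
  else
    thread.object_id.to_s # placeholder; the real code uses a pthread_t-based id
  end
end

puts thread_id_label(Thread.current)
```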