[PROF-5860] Allow new CPU Profiling 2.0 **alpha** profiler to be enabled #2209
Merged
The new Ruby profiler, aka "CPU Profiling 2.0", is considered to be in alpha state. We do not recommend turning it on. But! We actually can turn it on now, by using `DD_PROFILING_FORCE_ENABLE_NEW=true`. The rest of the pieces have been put into place in previous PRs.

**What does this PR do?**: Add a setting that allows choosing between the "old" profiler codepath and the new "CPU Profiling 2.0" codepath.

**Motivation**: Making it possible to test the new profiler.

**Additional Notes**: This PR sits atop #2208 because without the component in #2208 the profiler cannot be turned on. It is otherwise independent from that change.

**How to test the change?**: Here's a simple run of the new profiler:
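(As a rough illustration, not taken from this PR itself: a minimal sketch of how one might exercise the new profiler, assuming a standard ddtrace 1.x setup. The service name and workload below are placeholders.)

```ruby
# Sketch only: profiling is enabled as usual, and the alpha CPU Profiling 2.0
# codepath is opted into via the environment variable described in this PR:
#   DD_PROFILING_FORCE_ENABLE_NEW=true ruby profiling_smoke_test.rb
require 'ddtrace'

Datadog.configure do |c|
  c.profiling.enabled = true
  c.service = 'profiling-smoke-test' # placeholder service name
end

# Generate some CPU work so the profiler has something to sample
10.times { 1_000_000.times { Math.sqrt(rand) } }
```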
marcotc approved these changes on Aug 5, 2022
ɑ 🥳
marcotc approved these changes on Aug 8, 2022
ivoanjo added a commit that referenced this pull request on Oct 11, 2022:

**What does this PR do?**: (Important note: This feature is only available on the new CPU Profiling 2.0 profiler, which is still in **alpha**; see #2209.)

This PR is the last piece of the puzzle started in #2304 and #2308. With this change, time (both cpu-time and wall-time) spent by threads doing garbage collection is now accounted for, and shows up in the flamegraph.

This works by creating a new Ruby VM TracePoint to handle the `RUBY_INTERNAL_EVENT_GC_ENTER` and `RUBY_INTERNAL_EVENT_GC_EXIT` events. (These events are only available at the C level; Ruby-level TracePoints cannot use them.) Then, whenever Ruby calls the TracePoint, we call the previously-added `cpu_and_wall_time_collector_on_gc_start` and `cpu_and_wall_time_collector_on_gc_finish` to track the time spent in GC, and then insert it as a sample in the profiling output.

**Motivation**: Without this work, time spent doing garbage collection is invisible and gets blamed on methods directly. By making it visible, we enable customers to make better-informed decisions on what needs to be optimized (or fixed!).

**Additional Notes**: As I mentioned below, this only affects the new CPU Profiling 2.0 profiler codepath.

During development, I initially attempted to compare the time spent in GC gathered via the TracePoint to the one that Ruby exposes via [`GC::Profiler#total_time`](https://rubyapi.org/3.1/o/gc/profiler). The results were off, and when I looked into why, I discovered by looking at Ruby's `gc_start` function in `gc.c` that the time tracking for `GC::Profiler` only covers a subset of the GC process, not the entire time. Furthermore, from what I was able to observe, it accounts for cpu-time process-wide on Linux and, in some cases (macOS), only cpu-time in user mode, so that's another reason for discrepancies. Thus, when comparing values, expect `GC::Profiler` to report much less than the values exposed by the profiler. I believe the approach and values recorded by the profiler are more accurate, since Ruby calls the `ENTER` event quite early in the GC process and `EXIT` quite late. See also <ivoanjo/gvl-tracing#6> for a similar approach proposed by a Ruby core developer.

**How to test the change?**: TODO
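(Not part of the commit above, but for reference: the `GC::Profiler#total_time` values being compared against can be observed with a snippet along these lines.)

```ruby
# Reference sketch: GC::Profiler#total_time is expected to report less time
# than the profiler's tracepoint-based GC accounting, as explained above.
GC::Profiler.enable

1_000_000.times { Object.new } # allocate enough to trigger several GC runs
GC.start

puts "GC runs:                 #{GC.count}"
puts "GC::Profiler.total_time: #{GC::Profiler.total_time} seconds"

GC::Profiler.disable
```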
ivoanjo added a commit that referenced this pull request on Oct 26, 2022:

**What does this PR do?**: (Important note: This feature is only available on the new CPU Profiling 2.0 profiler, which is still in **alpha**; see #2209.)

This PR is the last piece of the puzzle started in #2304 and #2323. With this change, time (both cpu-time and wall-time) spent by threads doing garbage collection is now accounted for, and shows up in the flamegraph.

This works by creating a new Ruby VM TracePoint to handle the `RUBY_INTERNAL_EVENT_GC_ENTER` and `RUBY_INTERNAL_EVENT_GC_EXIT` events. (These events are only available at the C level; Ruby-level TracePoints cannot use them.) Then, whenever Ruby calls the TracePoint, we call the previously-added `cpu_and_wall_time_collector_on_gc_start` and `cpu_and_wall_time_collector_on_gc_finish` to track the time spent in GC, and then insert it as a sample in the profiling output (using `cpu_and_wall_time_collector_sample_after_gc`).

**Motivation**: Without this work, time spent doing garbage collection is invisible and gets blamed on methods directly. By making it visible, we enable customers to make better-informed decisions on what needs to be optimized (or fixed!).

**Additional Notes**: As I mentioned below, this only affects the new CPU Profiling 2.0 profiler codepath.

During development, I initially attempted to compare the time spent in GC gathered via the TracePoint to the one that Ruby exposes via [`GC::Profiler#total_time`](https://rubyapi.org/3.1/o/gc/profiler). The results were off, and when I looked into why, I discovered by looking at Ruby's `gc_start` function in `gc.c` that the time tracking for `GC::Profiler` only covers a subset of the GC process, not the entire time. Furthermore, from what I was able to observe, it accounts for cpu-time process-wide on Linux and, in some cases (macOS), only cpu-time in user mode, so that's another reason for discrepancies. Thus, when comparing values, expect `GC::Profiler` to report much less than the values exposed by the profiler. I believe the tracepoint-based approach and values recorded by the profiler to be more accurate, since Ruby calls the `ENTER` event quite early in the GC process and `EXIT` quite late. See also <ivoanjo/gvl-tracing#6> for a similar approach proposed by a Ruby core developer.

**How to test the change?**: Beyond the included code coverage, you should see `Garbage Collection` frames representing time spent in GC in flamegraphs (when using the new CPU Profiling 2.0 codepath, see above). Here's a tiny Ruby script that triggers a lot of allocation:

```ruby
def do_alloc(n)
  alloc(n)
end

def alloc(n)
  n.times do
    Object.new
  end
end

while true
  do_alloc(100_000)
end
```
ivoanjo added a commit that referenced this pull request on Oct 26, 2022:

**What does this PR do?**: (Important note: This fix only affects the new CPU Profiling 2.0 profiler, which is still in **alpha**; see #2209.)

In #2181 we added a concurrency-safe mechanism for `StackRecorder`. Key to how it works is that we keep two `ddog_Profile` instances at every point in time, and we alternate between using them. But this alternation introduced a bug with the `start_time` of profiles.

Previously, the `start_time` got set whenever a profile got created or reset, and we used the current time for it. That caused the following effect:

```
t=0   created slot_one_profile (start_time=0)
      created slot_two_profile (start_time=0)
      make_active slot_one_profile
t=60  serialized slot_one_profile (start_time=0, finish_time=60)
      reset slot_one_profile (start_time=60)
      make_active slot_two_profile
t=120 serialized slot_two_profile (start_time=0, finish_time=120)
      reset slot_two_profile (start_time=120)
      make_active slot_one_profile
t=180 serialized slot_one_profile (start_time=60, finish_time=180)
      reset slot_one_profile (start_time=180)
      make_active slot_two_profile
```

That is, other than the first profile (which is why we previously missed this bug), every profile got double the intended duration, because we reset it after serialization, but that profile would not be used for the next period.

To fix this issue, we additionally change the "make_active" step above (actually implemented in `serializer_flip_active_and_inactive_slots`) to set the correct `start_time` on the profile that becomes active. Thus, we get the correct behavior:

```
t=0   created slot_one_profile (start_time=0)
      created slot_two_profile (start_time=0)       # Ignored, will be changed later
      make_active slot_one_profile (start_time=0)
t=60  serialized slot_one_profile (start_time=0, finish_time=60)
      reset slot_one_profile (start_time=60)        # Ignored, will be changed later
      make_active slot_two_profile (start_time=60)  # Correct start_time
t=120 serialized slot_two_profile (start_time=60, finish_time=120)
      reset slot_two_profile (start_time=120)       # Ignored, will be changed later
      make_active slot_one_profile (start_time=120) # Correct start_time
t=180 serialized slot_one_profile (start_time=120, finish_time=180)
      reset slot_one_profile (start_time=180)
      make_active slot_two_profile (start_time=180)
```

**Motivation**: Having profiles with the wrong duration breaks profile aggregation.

**Additional notes**: As I mentioned above, this only affects the new CPU Profiling 2.0 profiler codepath, so I don't expect any customers to have ever run into this issue.

**How to test the change?**: The change includes test coverage. Furthermore, running the profiler with `DD_TRACE_DEBUG` prints the start/finish timestamps after serialization, which can be used to confirm they are correct. Finally, comparing profiles before/after in the profiler UX will also show the difference.
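(For illustration only; this is not the actual C implementation. The two-slot flip and the fix described above can be modeled in a few lines of Ruby.)

```ruby
# Toy model of the StackRecorder double-buffering described above. The fix is
# that a slot's start_time is stamped when it becomes active, not only when it
# is created or reset.
class TwoSlotRecorder
  Slot = Struct.new(:name, :start_time)

  def initialize(now)
    @slots = [Slot.new(:slot_one, now), Slot.new(:slot_two, now)]
    @active = 0
  end

  def serialize_and_flip(now)
    active = @slots[@active]
    duration = now - active.start_time
    active.start_time = now          # reset; this value ends up being ignored...
    @active = 1 - @active
    @slots[@active].start_time = now # ...because the flip stamps the real start_time
    [active.name, duration]
  end
end

recorder = TwoSlotRecorder.new(0)
p recorder.serialize_and_flip(60)   # => [:slot_one, 60]
p recorder.serialize_and_flip(120)  # => [:slot_two, 60]
p recorder.serialize_and_flip(180)  # => [:slot_one, 60]
```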
ivoanjo added a commit that referenced this pull request on Nov 2, 2022:

**What does this PR do?**: This PR extends the `thread id` label that is attached to profiling samples to include the thread's "native thread id".

Up until now, the `thread id` that we attached was the result of calling `Thread#object_id`. That approach allows us to easily match what we see inside the Ruby process (because we can log the `Thread#object_id` where something happens) with the profile. BUT if we wanted to match the profile with what the operating system's tooling showed, it was quite hard, because threads shown in OS task manager tools carry an OS-level thread identifier, which is not the same as the `Thread#object_id`.

To enable this matching, this PR changes the format of the `thread id` from `"#{thread.object_id}"` to `"#{thread.native_thread_id} (#{thread.object_id})"`, thus providing both identifiers. (This is OK with the profiling backend.)

Because `Thread#native_thread_id` is a Ruby 3.1+ feature, on older Rubies we use a fallback identifier instead (`pthread_t`). This identifier isn't as useful as the native identifier; in the future we may want to explore a better fallback.

**Motivation**: I found this helpful during development and when correlating data from the profiler with other tools.

**Additional Notes**: This includes full support for macOS, even though we don't officially support macOS. This feature is only available on the new CPU Profiling 2.0 profiler, which is still in alpha; see #2209.

**How to test the change?**: This is easiest to test by profiling a Ruby app, and then checking that the native ids for threads match up with what shows up in `/proc`. (Remember, only for Ruby >= 3.1; for older Rubies it won't match.)
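(Hypothetical illustration of the label format described above; the real fallback uses a `pthread_t`-based identifier rather than just the object id.)

```ruby
# Sketch of the "thread id" label layout: "native_thread_id (object_id)" on
# Ruby 3.1+, with a simplified fallback here for older Rubies.
def thread_id_label(thread)
  if thread.respond_to?(:native_thread_id)
    "#{thread.native_thread_id} (#{thread.object_id})"
  else
    thread.object_id.to_s # placeholder; the real code uses a pthread_t-based id
  end
end

puts thread_id_label(Thread.current)
```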