-
Notifications
You must be signed in to change notification settings - Fork 375
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[PROF-10422] Add GVL profiling as a preview feature #3929
[PROF-10422] Add GVL profiling as a preview feature #3929
Conversation
This test helper method was getting very annoying to maintain every time we needed to test new features.
The flakiness on this spec showed up again when I slightly refactored the `Collectors::Stack::Testing._native_sample` method but goes away if we use an array. I really dislike all the handwaving around this, but I'm not sure what we can do better in this case.
We're using low-level Ruby APIs that don't look before they jump, so let's be careful to do our own validation to avoid crashes.
It occurs to me that I'm sprinkling a lot of conditionals all over the place, and when we want to support 3.2 we'll need to find all of them and if we miss some there won't be anything reminding us it's missing. So by extracting the version to a single location, we'll bump it once and all specs that need to work for 3.2 will be run.
So far in the profiler, we've been avoiding adding defaults to keyword arguments that are expected to be set in production as, well, those defaults won't match what happens in production (they'll get overridden). This is all fine, but quite annoying when we use our components piecemeal for benchmarking and testing, since adding any new arguments means modifying a gazillion places in the codebase to start passing in that argument. To solve this, let's introduce a `for_testing` helper method that provides defaults that we often use for testing, and that then calls into the regular constructor.
…dcoded This setting is going to be the main overhead lever for the Waiting for GVL feature. Having it configurable will allow us to play with it and better find good defaults.
This method has way too many positional arguments, let's try to avoid this footgun by using keyword arguments instead.
+ a small refactor for skipping GVL profiling specs on legacy Rubies
This skip was only called after a previous before started the profiler, so this left the test suite in an incorrect state.
…d argument warning
will continue to mark the thread as being in this state, until `on_gvl_running` + `sample_after_gvl_running` happen and | ||
clear the `per_thread_gvl_waiting_timestamp`, which will make samples revert back to the regular behavior. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Awesome explanation. Is there some kind of bound on how much later sample_after_gvl_running
may run? I.e., could we end up with a situation where that scheduling will also occur much later than expected and a lot of cpu/wall samples get misattributed as waiting on gvl?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
From my read of the VM sources, it should be immediate.
Looking at the VM sources RUBY_INTERNAL_THREAD_EVENT_RESUMED
gets called inside thread_sched_wait_running_turn
. Following a bit along the chain, that method gets called from thread_sched_to_running_common
<- thread_sched_to_running
, etc.
At some point, we end up in BLOCKING_REGION
. This method gets called by rb_nogvl
, rb_thread_io_blocking_call
, do_select
, rb_thread_wait_for_sngle_fd
, etc: all methods where the VM needs to block the thread. All of these methods where BLOCKING_REGION
gets used have somewhere shortly after BLOCKING_REGION
ends some kind of check for interruptions. Another way to get to thread_sched_to_running
is thread_blocking_end
, that gets used by native_sleep
which also checks for interruptions.
Since there's a bunch of paths, I'm not 100% sure there's not some weird path where interruptions are not checked, but it seems like in almost all situations the VM checks for interruptions before going back to process whatever it was waiting for.
Additionally, it's not possible for
a lot of cpu/wall samples get misattributed as waiting on gvl?
because once on_gvl_running
gets called, and the timestamp is flipped to negative, the next sample (even if it's not from sample_after_gvl_running
but a regular sample) will clear the on_gvl_running
.
So even if we hit a weird corner case where sample_after_gvl_running
does not run in a timely manner, the next regular sample will anyway clear the state.
Thus, this makes me think we're mostly covered ;)
@@ -781,6 +802,21 @@ static VALUE release_gvl_and_run_sampling_trigger_loop(VALUE instance) { | |||
; | |||
} | |||
|
|||
#ifndef NO_GVL_INSTRUMENTATION | |||
if (state->gvl_profiling_enabled) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could move the ifndef inside the if and raise on #else
? Sounds like otherwise we may accidentally end up in situations with gvl_profiling_enabled == true
silently failing to do anything? I realize you're guarding against this in the ruby side of things but just like you enforce types in this code, doesn't hurt to be extra careful
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added in c443893
**What does this PR do?** This PR fixes a flaky spec introduced by #3929: a spec to check that an error was raised in a background thread implicitly depended on a race (that the background thread ran before the rspec thread did) and thus started failing when the race was lost. By actually synchronizing with the background thread using the `on_failure_proc`, we now guarantee that the background thread has the chance to run as expected. **Motivation:** Our goal is to always have zero flaky specs in the profiler! **Additional Notes:** Fixes DataDog/ruby-guild#179 **How to test the change?** Validate that CI is still green.
**What does this PR do?** This PR builds atop #3929 that added support for GVL profiling for Ruby 3.3+ and makes GVL profiling also work for Ruby 3.2 . Supporting GVL profiling on Ruby 3.2 needed special additional work. That's because while in Ruby 3.2 we have the GVL instrumentation API giving us the events we need to profile the GVL, we're missing: 1. Getting the Ruby thread `VALUE` as an argument in GVL instrumentation API events 2. The `rb_internal_thread_specific` API that allows us to attach in a thread-safe way data to Ruby thread objects Both 1 and 2 were only introduced in Ruby 3.3, and our implementation of GVL profiling relies/relied on them. This PR... reimplements 1 & 2 in an alternative way, allowing us to keep our existing design for 3.3+, while also supporting the older Ruby version. I've split it into two commits: i. Abstracting access and management of 1 & 2 into a new set of files (`gvl_profiling_helper.c`/`gvl_profiling_helper.h`). These new files are zero-overhead abstractions for most situations. ii. Implementing 1 & 2 for Ruby 3.2. **Motivation:** We believe GVL profiling is quite an important observability feature for Ruby, and thus we want to support it on all Ruby versions that provide the GVL instrumentation API. **Additional Notes:** To solve 1, we're using native level thread-locals (GCC's `__thread`) to keep a pointer to the underlying Ruby `rb_thread_t` structure. This is more complex than than "just keep it on a thread-local" because: a) Ruby reuses native threads. When a Ruby thread dies, Ruby keeps the underlying native thread around for a bit, and if another Ruby thread is born very quickly after the previous one, Ruby will reuse the native thread and attach it to the new Ruby thread. To avoid incorrectly reusing the thread-locals, we install an event hook on Ruby thread start, and make sure to clean any native thread-locals when a new thread stats. b) Some of the GVL instrumentation API events are emitted while the thread does not have the GVL and so we need to be careful when we can and cannot read VM information. Thus, we only initialize the thread-local during the `RUBY_INTERNAL_THREAD_EVENT_RESUMED` which is emitted while the thread owns the GVL. c) Since we don't get the current thread in events, we need to get a bit... creative. Thus, what we do is in `RUBY_INTERNAL_THREAD_EVENT_RESUMED`, because we know the current thread MUST own the GVL, we read from the internal Ruby VM state which thread is the GVL owner to find the info we need. With a + b + c together we are able to keep a pointer to the underlying `rb_thread_t` up-to-date in a native thread local, thus replacing the need to get a `VALUE thread` as an argument. To solve 2, we rely on an important observation: there's a `VALUE stat_insn_usage` field inside `rb_thread_t` that's unused and seems to have effectively been forgotten about. There's nowhere in the VM code that's writing or reading it (other than marking it for GC), and not even git history reveals a time where this field was used. I could not find any other references to this field anywhere else. Thus, we make use of this field to store the information we need, as a replacement for `rb_internal_thread_specific`. Since presumably Ruby 3.2 will never see this field either removed or used during its remaining maintenance release period this should work fine, and we have a nice clean solution for 3.3+. **How to test the change?** Happily, with the changes on this PR, the existing test coverage we added for GVL profiling on 3.3 is also green on 3.2! :)
👍 |
What does this PR do?
This PR introduces a new feature for the Continuous Profiler: GVL profiling.
Specifically, when enabled, GVL profiling means the profiler gathers information from threads waiting to acquire the Ruby "Global VM Lock" (GVL).
This is important because this can be a big a source of latency for Ruby applications: a thread "Waiting on the GVL" is a thread that's ready to make progress, as soon as Ruby allows it to run again.
For instance, consider this example where four threads are fighting to execute (source):
And here's how they are shown in profiler timeline view:
This new feature is off by default and requires Ruby 3.3+. This PR introduces a new environment variable (
DD_PROFILING_PREVIEW_GVL_ENABLED
) and settingc.profiling.advanced.preview_gvl_enabled
to enable the feature.For more details on why GVL profiling is relevant, check out Understanding the Ruby Global VM Lock (GVL) by observing it.
Motivation:
The latency impact of GVL contention has thus far not been observable in most profilers, including ours. You'd see that a thread wasn't running for some time period (let's say 200ms), but it may not be clear that out of those 200ms, maybe the database answered back in 100ms, but then the other 100ms were GVL contention.
This feature (built atop the GVL instrumentation API) finally exposes this information.
Additional Notes:
I've written an intro to how this feature is implemented in
docs/ProfilingDevelopment.md
. I suggest starting there to review the PR.Ruby 3.2 provides some, but not all of the new APIs needed to support this functionality. We should be able to support Ruby 3.2 in the future, but I opted to skip it for this first version as making it work for Ruby 3.3+ was already quite complex.
This PRs includes a few refactoring cleanups to make it easier to introduce the functionality. I opted not to extract them into separate PRs, but I'm open to doing so if it looks this PR is too complex for a good review.
How to test the change?
This feature includes test coverage. Additionally, I tried it with a number of examples from the https://github.com/ivoanjo/gvl-tracing gem, which you can see below.
I still plan to do more validation with our usual integration testing apps, but I'm opening this PR already as I think that this feature can be merged as-is while we do more validation. Its blast radius is expected to be extremely small (off by default, only for Ruby 3.3+) and thus we can have it live in master during this extra validation.
https://github.com/ivoanjo/gvl-tracing/blob/master/examples/example4.rb:
https://github.com/ivoanjo/gvl-tracing/blob/master/examples/rubykaigi2023/rk-example2.rb:
https://github.com/ivoanjo/gvl-tracing/blob/master/examples/rubykaigi2023/rk-example4.rb:
https://github.com/ivoanjo/gvl-tracing/blob/master/examples/rubykaigi2023/rk-example6.rb: