Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PROF-10422] Add GVL profiling as a preview feature #3929

Merged
merged 42 commits into from
Sep 20, 2024

Conversation

ivoanjo
Copy link
Member

@ivoanjo ivoanjo commented Sep 19, 2024

What does this PR do?

This PR introduces a new feature for the Continuous Profiler: GVL profiling.

Specifically, when enabled, GVL profiling means the profiler gathers information from threads waiting to acquire the Ruby "Global VM Lock" (GVL).

This is important because this can be a big a source of latency for Ruby applications: a thread "Waiting on the GVL" is a thread that's ready to make progress, as soon as Ruby allows it to run again.

For instance, consider this example where four threads are fighting to execute (source):

def fib(n)
  return n if n <= 1
  fib(n - 1) + fib(n - 2)
end

Thread.new { sleep(0.05) while true }

3.times.map { Thread.new { fib(37) } }.map(&:join)

And here's how they are shown in profiler timeline view:

image

This new feature is off by default and requires Ruby 3.3+. This PR introduces a new environment variable (DD_PROFILING_PREVIEW_GVL_ENABLED) and setting c.profiling.advanced.preview_gvl_enabled to enable the feature.

For more details on why GVL profiling is relevant, check out Understanding the Ruby Global VM Lock (GVL) by observing it.

Motivation:

The latency impact of GVL contention has thus far not been observable in most profilers, including ours. You'd see that a thread wasn't running for some time period (let's say 200ms), but it may not be clear that out of those 200ms, maybe the database answered back in 100ms, but then the other 100ms were GVL contention.

This feature (built atop the GVL instrumentation API) finally exposes this information.

Additional Notes:

I've written an intro to how this feature is implemented in docs/ProfilingDevelopment.md. I suggest starting there to review the PR.

Ruby 3.2 provides some, but not all of the new APIs needed to support this functionality. We should be able to support Ruby 3.2 in the future, but I opted to skip it for this first version as making it work for Ruby 3.3+ was already quite complex.

This PRs includes a few refactoring cleanups to make it easier to introduce the functionality. I opted not to extract them into separate PRs, but I'm open to doing so if it looks this PR is too complex for a good review.

How to test the change?

This feature includes test coverage. Additionally, I tried it with a number of examples from the https://github.com/ivoanjo/gvl-tracing gem, which you can see below.

I still plan to do more validation with our usual integration testing apps, but I'm opening this PR already as I think that this feature can be merged as-is while we do more validation. Its blast radius is expected to be extremely small (off by default, only for Ruby 3.3+) and thus we can have it live in master during this extra validation.

https://github.com/ivoanjo/gvl-tracing/blob/master/examples/example4.rb:

image

https://github.com/ivoanjo/gvl-tracing/blob/master/examples/rubykaigi2023/rk-example2.rb:

image

https://github.com/ivoanjo/gvl-tracing/blob/master/examples/rubykaigi2023/rk-example4.rb:

image

https://github.com/ivoanjo/gvl-tracing/blob/master/examples/rubykaigi2023/rk-example6.rb:

image

This test helper method was getting very annoying to maintain every
time we needed to test new features.
The flakiness on this spec showed up again when I slightly refactored
the `Collectors::Stack::Testing._native_sample` method but goes away if
we use an array.

I really dislike all the handwaving around this, but I'm not sure what
we can do better in this case.
We're using low-level Ruby APIs that don't look before they jump, so
let's be careful to do our own validation to avoid crashes.
It occurs to me that I'm sprinkling a lot of conditionals all over the
place, and when we want to support 3.2 we'll need to find all of them
and if we miss some there won't be anything reminding us it's missing.

So by extracting the version to a single location, we'll bump it once
and all specs that need to work for 3.2 will be run.
So far in the profiler, we've been avoiding adding defaults to keyword
arguments that are expected to be set in production as, well, those
defaults won't match what happens in production (they'll get
overridden).

This is all fine, but quite annoying when we use our components
piecemeal for benchmarking and testing, since adding any new arguments
means modifying a gazillion places in the codebase to start passing in
that argument.

To solve this, let's introduce a `for_testing` helper method that
provides defaults that we often use for testing, and that then calls
into the regular constructor.
…dcoded

This setting is going to be the main overhead lever for the Waiting for
GVL feature. Having it configurable will allow us to play with it and
better find good defaults.
This method has way too many positional arguments, let's try to avoid
this footgun by using keyword arguments instead.
@ivoanjo ivoanjo requested review from a team as code owners September 19, 2024 10:27
@github-actions github-actions bot added core Involves Datadog core libraries profiling Involves Datadog profiling labels Sep 19, 2024
@ivoanjo ivoanjo requested a review from AlexJF September 19, 2024 10:41
Comment on lines +293 to +294
will continue to mark the thread as being in this state, until `on_gvl_running` + `sample_after_gvl_running` happen and
clear the `per_thread_gvl_waiting_timestamp`, which will make samples revert back to the regular behavior.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Awesome explanation. Is there some kind of bound on how much later sample_after_gvl_running may run? I.e., could we end up with a situation where that scheduling will also occur much later than expected and a lot of cpu/wall samples get misattributed as waiting on gvl?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

From my read of the VM sources, it should be immediate.

Looking at the VM sources RUBY_INTERNAL_THREAD_EVENT_RESUMED gets called inside thread_sched_wait_running_turn. Following a bit along the chain, that method gets called from thread_sched_to_running_common <- thread_sched_to_running, etc.

At some point, we end up in BLOCKING_REGION. This method gets called by rb_nogvl, rb_thread_io_blocking_call, do_select, rb_thread_wait_for_sngle_fd, etc: all methods where the VM needs to block the thread. All of these methods where BLOCKING_REGION gets used have somewhere shortly after BLOCKING_REGION ends some kind of check for interruptions. Another way to get to thread_sched_to_running is thread_blocking_end, that gets used by native_sleep which also checks for interruptions.

Since there's a bunch of paths, I'm not 100% sure there's not some weird path where interruptions are not checked, but it seems like in almost all situations the VM checks for interruptions before going back to process whatever it was waiting for.


Additionally, it's not possible for

a lot of cpu/wall samples get misattributed as waiting on gvl?

because once on_gvl_running gets called, and the timestamp is flipped to negative, the next sample (even if it's not from sample_after_gvl_running but a regular sample) will clear the on_gvl_running.

So even if we hit a weird corner case where sample_after_gvl_running does not run in a timely manner, the next regular sample will anyway clear the state.

Thus, this makes me think we're mostly covered ;)

@@ -781,6 +802,21 @@ static VALUE release_gvl_and_run_sampling_trigger_loop(VALUE instance) {
;
}

#ifndef NO_GVL_INSTRUMENTATION
if (state->gvl_profiling_enabled) {
Copy link
Contributor

@AlexJF AlexJF Sep 19, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We could move the ifndef inside the if and raise on #else? Sounds like otherwise we may accidentally end up in situations with gvl_profiling_enabled == true silently failing to do anything? I realize you're guarding against this in the ruby side of things but just like you enforce types in this code, doesn't hurt to be extra careful

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added in c443893

@ivoanjo ivoanjo merged commit 52a8d99 into master Sep 20, 2024
158 of 166 checks passed
@ivoanjo ivoanjo deleted the ivoanjo/prof-10422-gvl-instrumentation-spike branch September 20, 2024 09:27
@github-actions github-actions bot added this to the 2.4.0 milestone Sep 20, 2024
ivoanjo added a commit that referenced this pull request Sep 20, 2024
**What does this PR do?**

This PR fixes a flaky spec introduced by #3929: a spec to check that
an error was raised in a background thread implicitly depended on a race
(that the background thread ran before the rspec thread did) and thus
started failing when the race was lost.

By actually synchronizing with the background thread using the
`on_failure_proc`, we now guarantee that the background thread has
the chance to run as expected.

**Motivation:**

Our goal is to always have zero flaky specs in the profiler!

**Additional Notes:**

Fixes DataDog/ruby-guild#179

**How to test the change?**

Validate that CI is still green.
ivoanjo added a commit that referenced this pull request Sep 23, 2024
**What does this PR do?**

This PR builds atop #3929 that added support for GVL profiling for
Ruby 3.3+ and makes GVL profiling also work for Ruby 3.2 .

Supporting GVL profiling on Ruby 3.2 needed special additional work.
That's because while in Ruby 3.2 we have the GVL instrumentation API
giving us the events we need to profile the GVL, we're missing:

1. Getting the Ruby thread `VALUE` as an argument in GVL instrumentation
   API events
2. The `rb_internal_thread_specific` API that allows us to attach in
   a thread-safe way data to Ruby thread objects

Both 1 and 2 were only introduced in Ruby 3.3, and our implementation
of GVL profiling relies/relied on them.

This PR... reimplements 1 & 2 in an alternative way, allowing us to
keep our existing design for 3.3+, while also supporting the older
Ruby version.

I've split it into two commits:
 i. Abstracting access and management of 1 & 2 into a new set of files
    (`gvl_profiling_helper.c`/`gvl_profiling_helper.h`). These new files
    are zero-overhead abstractions for most situations.
ii. Implementing 1 & 2 for Ruby 3.2.

**Motivation:**

We believe GVL profiling is quite an important observability feature
for Ruby, and thus we want to support it on all Ruby versions that
provide the GVL instrumentation API.

**Additional Notes:**

To solve 1, we're using native level thread-locals (GCC's
`__thread`) to keep a pointer to the underlying Ruby `rb_thread_t`
structure.

This is more complex than than "just keep it on a thread-local"
because:

a) Ruby reuses native threads. When a Ruby thread dies, Ruby keeps the
   underlying native thread around for a bit, and if another Ruby
   thread is born very quickly after the previous one, Ruby
   will reuse the native thread and attach it to the new Ruby thread.

   To avoid incorrectly reusing the thread-locals, we install an event
   hook on Ruby thread start, and make sure to clean any native
   thread-locals when a new thread stats.

b) Some of the GVL instrumentation API events are emitted while the
   thread does not have the GVL and so we need to be careful when we can
   and cannot read VM information.

   Thus, we only initialize the thread-local during the
   `RUBY_INTERNAL_THREAD_EVENT_RESUMED` which is emitted while the
   thread owns the GVL.

c) Since we don't get the current thread in events, we need to get a
   bit... creative. Thus, what we do is in
   `RUBY_INTERNAL_THREAD_EVENT_RESUMED`, because we know the current
   thread MUST own the GVL, we read from the internal Ruby VM state
   which thread is the GVL owner to find the info we need.

With a + b + c together we are able to keep a pointer to the underlying
`rb_thread_t` up-to-date in a native thread local, thus replacing the
need to get a `VALUE thread` as an argument.

To solve 2, we rely on an important observation: there's a
`VALUE stat_insn_usage` field inside `rb_thread_t` that's unused and
seems to have effectively been forgotten about.

There's nowhere in the VM code that's writing or reading it (other
than marking it for GC), and not even git history reveals a time
where this field was used. I could not find any other references to
this field anywhere else. Thus, we make use of this field to store
the information we need, as a replacement for
`rb_internal_thread_specific`.

Since presumably Ruby 3.2 will never see this field either removed or used
during its remaining maintenance release period this should work fine,
and we have a nice clean solution for 3.3+.

**How to test the change?**

Happily, with the changes on this PR, the existing test coverage
we added for GVL profiling on 3.3 is also green on 3.2! :)
@ivoanjo ivoanjo added feature Involves a product feature and removed core Involves Datadog core libraries labels Sep 23, 2024
@p-datadog
Copy link
Contributor

👍

@y9v y9v mentioned this pull request Oct 11, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Involves a product feature profiling Involves Datadog profiling
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants