[PROF-10422] Add GVL profiling as a preview feature #3929

ivoanjo · 2024-09-19T10:27:39Z

What does this PR do?

This PR introduces a new feature for the Continuous Profiler: GVL profiling.

Specifically, when enabled, GVL profiling means the profiler gathers information from threads waiting to acquire the Ruby "Global VM Lock" (GVL).

This is important because this can be a big source of latency for Ruby applications: a thread "Waiting on the GVL" is a thread that's ready to make progress, but can't do so until Ruby allows it to run again.

For instance, consider this example where four threads are fighting to execute (source):

def fib(n)
  return n if n <= 1
  fib(n - 1) + fib(n - 2)
end

Thread.new { sleep(0.05) while true }

3.times.map { Thread.new { fib(37) } }.map(&:join)

And here's how they are shown in the profiler timeline view:

This new feature is off by default and requires Ruby 3.3+. This PR introduces a new environment variable (DD_PROFILING_PREVIEW_GVL_ENABLED) and setting c.profiling.advanced.preview_gvl_enabled to enable the feature.

For more details on why GVL profiling is relevant, check out Understanding the Ruby Global VM Lock (GVL) by observing it.

Motivation:

The latency impact of GVL contention has thus far not been observable in most profilers, including ours. You'd see that a thread wasn't running for some time period (let's say 200ms), but it may not be clear that out of those 200ms, maybe the database answered back in 100ms, but then the other 100ms were GVL contention.
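To make the latency effect concrete, here is a minimal sketch that uses plain Ruby and no profiler at all. It measures how much longer than requested a `sleep` takes while CPU-bound threads compete for the GVL. The helper names (`monotonic_now`, `busy_work`, `sleep_overshoot_under_contention`) are made up for this illustration, and the measured overshoot also includes ordinary OS timer inaccuracy, so treat it as a rough indicator rather than a precise GVL measurement.

```ruby
# Monotonic clock reading in seconds (unaffected by wall-clock adjustments).
def monotonic_now
  Process.clock_gettime(Process::CLOCK_MONOTONIC)
end

# Burns CPU (while holding the GVL) for roughly `seconds`.
def busy_work(seconds)
  deadline = monotonic_now + seconds
  iterations = 0
  iterations += 1 while monotonic_now < deadline
  iterations
end

# Sleeps for `sleep_seconds` while `busy_threads` CPU-bound threads compete for
# the GVL, and returns how much longer than requested the sleep took (seconds).
def sleep_overshoot_under_contention(sleep_seconds: 0.05, busy_seconds: 0.5, busy_threads: 3)
  workers = Array.new(busy_threads) { Thread.new { busy_work(busy_seconds) } }
  started_at = monotonic_now
  sleep(sleep_seconds)
  overshoot = (monotonic_now - started_at) - sleep_seconds
  workers.each(&:join)
  overshoot
end
```

Calling `sleep_overshoot_under_contention` with and without busy threads (`busy_threads: 0`) gives a feel for the extra wait, but it cannot tell you where the time went, which is the gap GVL profiling is meant to close.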

This feature (built atop the GVL instrumentation API) finally exposes this information.

Additional Notes:

I've written an intro to how this feature is implemented in docs/ProfilingDevelopment.md. I suggest starting there to review the PR.

Ruby 3.2 provides some, but not all, of the new APIs needed to support this functionality. We should be able to support Ruby 3.2 in the future, but I opted to skip it for this first version as making it work for Ruby 3.3+ was already quite complex.

This PR includes a few refactoring cleanups to make it easier to introduce the functionality. I opted not to extract them into separate PRs, but I'm open to doing so if it looks like this PR is too complex for a good review.

How to test the change?

This feature includes test coverage. Additionally, I tried it with a number of examples from the https://github.com/ivoanjo/gvl-tracing gem, which you can see below.

I still plan to do more validation with our usual integration testing apps, but I'm opening this PR already as I think that this feature can be merged as-is while we do more validation. Its blast radius is expected to be extremely small (off by default, only for Ruby 3.3+) and thus we can have it live in master during this extra validation.

https://github.com/ivoanjo/gvl-tracing/blob/master/examples/example4.rb:

https://github.com/ivoanjo/gvl-tracing/blob/master/examples/rubykaigi2023/rk-example2.rb:

https://github.com/ivoanjo/gvl-tracing/blob/master/examples/rubykaigi2023/rk-example4.rb:

https://github.com/ivoanjo/gvl-tracing/blob/master/examples/rubykaigi2023/rk-example6.rb:

The flakiness on this spec showed up again when I slightly refactored the `Collectors::Stack::Testing._native_sample` method but goes away if we use an array. I really dislike all the handwaving around this, but I'm not sure what we can do better in this case.

It occurs to me that I'm sprinkling a lot of conditionals all over the place, and when we want to support 3.2 we'll need to find all of them and if we miss some there won't be anything reminding us it's missing. So by extracting the version to a single location, we'll bump it once and all specs that need to work for 3.2 will be run.

So far in the profiler, we've been avoiding adding defaults to keyword arguments that are expected to be set in production as, well, those defaults won't match what happens in production (they'll get overridden). This is all fine, but quite annoying when we use our components piecemeal for benchmarking and testing, since adding any new arguments means modifying a gazillion places in the codebase to start passing in that argument. To solve this, let's introduce a `for_testing` helper method that provides defaults that we often use for testing, and that then calls into the regular constructor.
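A minimal sketch of the `for_testing` pattern described in this commit message. `ExampleWorker`, its keyword arguments, and the default values are all hypothetical stand-ins and not the profiler's real classes or settings: the production constructor has no defaults, and `for_testing` supplies common testing defaults before calling it.

```ruby
# Hypothetical component used only to illustrate the pattern.
class ExampleWorker
  attr_reader :gc_profiling_enabled, :gvl_profiling_enabled, :waiting_for_gvl_threshold_ns

  # Defaults that are convenient in specs and benchmarks (assumed values).
  TESTING_DEFAULTS = {
    gc_profiling_enabled: false,
    gvl_profiling_enabled: false,
    waiting_for_gvl_threshold_ns: 10_000_000,
  }.freeze

  # Production constructor: every argument is required, so no default can
  # silently diverge from what production passes in.
  def initialize(gc_profiling_enabled:, gvl_profiling_enabled:, waiting_for_gvl_threshold_ns:)
    @gc_profiling_enabled = gc_profiling_enabled
    @gvl_profiling_enabled = gvl_profiling_enabled
    @waiting_for_gvl_threshold_ns = waiting_for_gvl_threshold_ns
  end

  # Testing helper: fills in the defaults, then calls the regular constructor.
  def self.for_testing(**overrides)
    new(**TESTING_DEFAULTS.merge(overrides))
  end
end
```

With this shape, a spec only names the arguments it cares about, for example `ExampleWorker.for_testing(gvl_profiling_enabled: true)`, and adding a new required argument means updating one defaults hash instead of every call site.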

AlexJF · 2024-09-19T11:37:53Z

docs/ProfilingDevelopment.md

+will continue to mark the thread as being in this state, until `on_gvl_running` + `sample_after_gvl_running` happen and
+clear the `per_thread_gvl_waiting_timestamp`, which will make samples revert back to the regular behavior.


Awesome explanation. Is there some kind of bound on how much later sample_after_gvl_running may run? I.e., could we end up with a situation where that scheduling will also occur much later than expected and a lot of cpu/wall samples get misattributed as waiting on gvl?

ivoanjo · 2024-09-20

From my read of the VM sources, it should be immediate.

Looking at the VM sources RUBY_INTERNAL_THREAD_EVENT_RESUMED gets called inside thread_sched_wait_running_turn. Following a bit along the chain, that method gets called from thread_sched_to_running_common <- thread_sched_to_running, etc.

At some point, we end up in BLOCKING_REGION. This gets used by rb_nogvl, rb_thread_io_blocking_call, do_select, rb_thread_wait_for_single_fd, etc.: all methods where the VM needs to block the thread. All of the methods that use BLOCKING_REGION have some kind of check for interruptions shortly after BLOCKING_REGION ends. Another way to get to thread_sched_to_running is thread_blocking_end, which gets used by native_sleep, which also checks for interruptions.

Since there's a bunch of paths, I'm not 100% sure there's not some weird path where interruptions are not checked, but it seems like in almost all situations the VM checks for interruptions before going back to process whatever it was waiting for.

Additionally, it's not possible for

> a lot of cpu/wall samples get misattributed as waiting on gvl?

because once on_gvl_running gets called and the timestamp is flipped to negative, the next sample (even if it's not from sample_after_gvl_running but a regular sample) will clear the Waiting for GVL state.

So even if we hit a weird corner case where sample_after_gvl_running does not run in a timely manner, the next regular sample will anyway clear the state.

Thus, this makes me think we're mostly covered ;)
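The argument above can be modeled with a small Ruby sketch. This is an illustrative model only, not the native implementation: the class name `GvlWaitingModel`, the `:waiting_for_gvl` / `:regular` labels, and the use of `0` for "not waiting" are assumptions made for the example. It shows why at most one sample after `on_gvl_running` can still be attributed to Waiting for GVL.

```ruby
# Toy model of the per-thread "waiting for GVL" timestamp (assumed encoding):
#   0        => thread is not waiting
#   positive => thread started waiting at that timestamp
#   negative => thread stopped waiting, but no sample has observed it yet
class GvlWaitingModel
  def initialize
    @gvl_waiting_at = 0
  end

  # The thread wants the GVL and starts waiting.
  def on_gvl_waiting(now_ns)
    @gvl_waiting_at = now_ns
  end

  # The thread got the GVL back: flip the timestamp to negative.
  def on_gvl_running
    @gvl_waiting_at = -@gvl_waiting_at if @gvl_waiting_at > 0
  end

  # Any sample (triggered by sample_after_gvl_running or a regular one).
  def sample
    if @gvl_waiting_at > 0
      :waiting_for_gvl # still waiting, state is kept
    elsif @gvl_waiting_at < 0
      @gvl_waiting_at = 0 # the first sample after running clears the state
      :waiting_for_gvl
    else
      :regular
    end
  end
end
```

In this model, once `on_gvl_running` has been called, the very next `sample` returns `:waiting_for_gvl` one final time and every later one returns `:regular`, regardless of which code path triggered it.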

AlexJF · 2024-09-19T12:42:16Z

ext/datadog_profiling_native_extension/collectors_cpu_and_wall_time_worker.c

@@ -781,6 +802,21 @@ static VALUE release_gvl_and_run_sampling_trigger_loop(VALUE instance) {
    ;
  }

+  #ifndef NO_GVL_INSTRUMENTATION
+    if (state->gvl_profiling_enabled) {


We could move the #ifndef inside the if and raise on #else? Otherwise, it sounds like we may accidentally end up in situations with gvl_profiling_enabled == true silently failing to do anything. I realize you're guarding against this on the Ruby side of things, but just as you enforce types in this code, it doesn't hurt to be extra careful.

ivoanjo · 2024-09-20

Added in c443893
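The idea behind that change, failing loudly instead of silently ignoring the flag, can be sketched in Ruby as follows. The actual guard lives in the C extension. The method name `validate_gvl_profiling!`, the constant, and the error message are hypothetical, and the 3.3 minimum reflects this PR's stated requirement.

```ruby
require "rubygems" # for Gem::Version (normally already loaded)

# Single place that records the minimum supported version (hypothetical constant).
MINIMUM_GVL_PROFILING_RUBY = Gem::Version.new("3.3")

# Returns true when GVL profiling can be enabled, false when it was not
# requested, and raises when it was requested on an unsupported Ruby.
def validate_gvl_profiling!(enabled:, ruby_version: RUBY_VERSION)
  return false unless enabled

  if Gem::Version.new(ruby_version) < MINIMUM_GVL_PROFILING_RUBY
    raise ArgumentError, "GVL profiling is not supported on Ruby #{ruby_version}"
  end

  true
end
```

Keeping the minimum version in one constant also matches the earlier commit note about not sprinkling version conditionals all over the place: supporting an older Ruby later would mean bumping a single value.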

**What does this PR do?**

This PR fixes a flaky spec introduced by #3929: a spec to check that an error was raised in a background thread implicitly depended on a race (that the background thread ran before the rspec thread did) and thus started failing when the race was lost.

By actually synchronizing with the background thread using the `on_failure_proc`, we now guarantee that the background thread has the chance to run as expected.

**Motivation:**

Our goal is to always have zero flaky specs in the profiler!

**Additional Notes:**

Fixes DataDog/ruby-guild#179

**How to test the change?**

Validate that CI is still green.
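A minimal sketch of the synchronization idea behind the flaky spec fix, in plain Ruby. `start_background_worker` and the simulated failure are hypothetical and only borrow the `on_failure_proc` name from the description. Instead of hoping the background thread runs first, the checking thread blocks on a queue that the failure callback fills.

```ruby
# Hypothetical worker: runs in a background thread and reports failures
# through the callback it was given.
def start_background_worker(on_failure_proc:)
  Thread.new do
    begin
      raise "simulated failure"
    rescue => e
      on_failure_proc.call(e)
    end
  end
end

# Racy approach (not shown): start the worker, then immediately check whether
# a failure was recorded. That only passes if the background thread ran first.

# Synchronized approach: Queue#pop blocks until the callback has actually run.
failures = Queue.new
worker = start_background_worker(on_failure_proc: ->(error) { failures << error })
reported_error = failures.pop
worker.join
```

Because `Queue#pop` waits for the callback, the assertion on `reported_error` no longer depends on thread scheduling order.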

**What does this PR do?**

This PR builds atop #3929 that added support for GVL profiling for Ruby 3.3+ and makes GVL profiling also work for Ruby 3.2.

Supporting GVL profiling on Ruby 3.2 needed special additional work. That's because while in Ruby 3.2 we have the GVL instrumentation API giving us the events we need to profile the GVL, we're missing:

1. Getting the Ruby thread `VALUE` as an argument in GVL instrumentation API events
2. The `rb_internal_thread_specific` API that allows us to attach in a thread-safe way data to Ruby thread objects

Both 1 and 2 were only introduced in Ruby 3.3, and our implementation of GVL profiling relies/relied on them.

This PR... reimplements 1 & 2 in an alternative way, allowing us to keep our existing design for 3.3+, while also supporting the older Ruby version.

I've split it into two commits:

i. Abstracting access and management of 1 & 2 into a new set of files (`gvl_profiling_helper.c`/`gvl_profiling_helper.h`). These new files are zero-overhead abstractions for most situations.
ii. Implementing 1 & 2 for Ruby 3.2.

**Motivation:**

We believe GVL profiling is quite an important observability feature for Ruby, and thus we want to support it on all Ruby versions that provide the GVL instrumentation API.

**Additional Notes:**

To solve 1, we're using native level thread-locals (GCC's `__thread`) to keep a pointer to the underlying Ruby `rb_thread_t` structure.

This is more complex than "just keep it on a thread-local" because:

a) Ruby reuses native threads. When a Ruby thread dies, Ruby keeps the underlying native thread around for a bit, and if another Ruby thread is born very quickly after the previous one, Ruby will reuse the native thread and attach it to the new Ruby thread. To avoid incorrectly reusing the thread-locals, we install an event hook on Ruby thread start, and make sure to clean any native thread-locals when a new thread starts.

b) Some of the GVL instrumentation API events are emitted while the thread does not have the GVL and so we need to be careful about when we can and cannot read VM information. Thus, we only initialize the thread-local during the `RUBY_INTERNAL_THREAD_EVENT_RESUMED` event, which is emitted while the thread owns the GVL.

c) Since we don't get the current thread in events, we need to get a bit... creative. Thus, what we do is in `RUBY_INTERNAL_THREAD_EVENT_RESUMED`, because we know the current thread MUST own the GVL, we read from the internal Ruby VM state which thread is the GVL owner to find the info we need.

With a + b + c together we are able to keep a pointer to the underlying `rb_thread_t` up-to-date in a native thread local, thus replacing the need to get a `VALUE thread` as an argument.

To solve 2, we rely on an important observation: there's a `VALUE stat_insn_usage` field inside `rb_thread_t` that's unused and seems to have effectively been forgotten about. There's nowhere in the VM code that's writing or reading it (other than marking it for GC), and not even git history reveals a time where this field was used. I could not find any other references to this field anywhere else. Thus, we make use of this field to store the information we need, as a replacement for `rb_internal_thread_specific`.

Since presumably Ruby 3.2 will never see this field either removed or used during its remaining maintenance release period this should work fine, and we have a nice clean solution for 3.3+.

**How to test the change?**

Happily, with the changes on this PR, the existing test coverage we added for GVL profiling on 3.3 is also green on 3.2! :)

p-datadog · 2024-10-01T16:20:32Z

👍

This PR adds the Ruby GVL profiling feature to the documentation. This feature was introduced in DataDog/dd-trace-rb#3929 and mentioned in the 2.4.0 release notes https://github.com/DataDog/dd-trace-rb/releases/tag/v2.4.0 but we had not yet updated the docs to mention it.

ivoanjo added 30 commits September 18, 2024 10:15

Bootstrap gvl profiling hook

e53c00f

Added recording of when gvl waiting starts

5a91954

Track waiting on GVL state in thread specific variables and trigger sampling

1ee127f

Wire up postponed job execution

f5bcd4f

Still figuring out how to handle sampling of threads after waiting on gvl

31ce66d

Wire up waiting for gvl sampling, still missing actually attaching the state

2f3dd97

Wire up is_gvl_waiting_state into a state_label

339e18d

Small cleanups from a self-review

a8f771b

Make GVL profiling not break older Rubies

e655976

Fix crash when no options are provided

3d6e238

Add default arguments handling to _native_sample

83e38c6

This test helper method was getting very annoying to maintain every time we needed to test new features.

Add testing for new is_gvl_waiting_state flag

bf4d86d

Refactor "Waiting for GVL" logic away from update_metrics_and_sample

3db9e5f

Add testing for gvl_waiting_at per-thread initialization

30a245b

Add test coverage for thread_context_collector_on_gvl_waiting

8f61036

Add validation that input is thread

206b880

We're using low-level Ruby APIs that don't look before they jump, so let's be careful to do our own validation to avoid crashes.

Add testing for on_gvl_running

e3c1138

Add test coverage for samples during/after Waiting for GVL

7faab42

Add test coverage and explain why sample_after_gvl_running is needed

f9c919a

Add test coverage for cpu-time behavior

22a54ae

Sync datadog_ruby_common.h header changes

a6f6646

Linting fixes

e6093c8

Restore support for legacy Rubies

e0ceba5

Minor spec cleanups

147f227

Make waiting_for_gvl_threshold_ns configurable instead of being hardcoded

524e9d1

This setting is going to be the main overhead lever for the Waiting for GVL feature. Having it configurable will allow us to play with it and better find good defaults.

Convert CpuAndWallTimeWorker.new to use keyword arguments

0c7c1b5

This method has way too many positional arguments, let's try to avoid this footgun by using keyword arguments instead.

Expose gvl_profiling_enabled argument when initializing CpuAndWallTimeWorker

d05beb3

ivoanjo added 8 commits September 18, 2024 13:44

Introduce setting for enabling GVL profiling

ceda29e

Update type signatures for profiling component class

6787399

Minor tweak to wording in warning

26914e1

Update ProfilingDevelopment doc with GVL profiling notes

bbebed3

Minor linting autofix

90f1aa3

Fix skipping on legacy Rubies not stopping the profiler

567a98c

This skip was only called after a previous before started the profiler, so this left the test suite in an incorrect state.

Split up variants of _native_gvl_profiling_hook_active to fix unused argument warning

ea2bb1b

Disable GC profiling in spec to avoid dealing with warning

82aa51a

ivoanjo requested review from a team as code owners September 19, 2024 10:27

github-actions bot added core (Involves Datadog core libraries) and profiling (Involves Datadog profiling) labels Sep 19, 2024

ivoanjo requested a review from AlexJF September 19, 2024 10:41

AlexJF approved these changes Sep 19, 2024

Raise ArgumentError when trying to use GVL profiling on unsupported Rubies

c443893

ivoanjo enabled auto-merge September 20, 2024 09:18

ivoanjo merged commit 52a8d99 into master Sep 20, 2024
158 of 166 checks passed

ivoanjo deleted the ivoanjo/prof-10422-gvl-instrumentation-spike branch September 20, 2024 09:27

github-actions bot added this to the 2.4.0 milestone Sep 20, 2024

ivoanjo mentioned this pull request Sep 20, 2024

[PROF-10422] Fix flaky spec in profiler due to race #3934

Merged

ivoanjo mentioned this pull request Sep 23, 2024

[PROF-10588] Support GVL profiling preview on Ruby 3.2 #3939

Merged

ivoanjo added feature (Involves a product feature) and removed core (Involves Datadog core libraries) labels Sep 23, 2024

y9v mentioned this pull request Oct 11, 2024

Bump to version 2.4.0 #3994

Merged

ivoanjo mentioned this pull request Dec 5, 2024

[NO-TICKET] Add Ruby GVL profiling to documentation DataDog/documentation#26616

Merged
