[PROF-9147] Add workaround for Ruby VM bug causing crash in gc_finalize_deferred #3473

ivoanjo · 2024-02-22T11:14:16Z

What does this PR do?

This PR adds a workaround for a VM bug (https://bugs.ruby-lang.org/issues/19991) that has affected at least two customers (the other being #3430).

The workaround is... not very pretty... but I believe will get the job done, for the reasons explained in the big comment I added.

Motivation:

The good thing about this crash is that it seems extremely rare.

The bad thing is that without this workaround, only moving to Ruby 3.3+ is the fix, and we still have a lot of customers on older Ruby versions.

We've documented in https://docs.datadoghq.com/profiler/profiler_troubleshooting/ruby/#segmentation-faults-in-gc_finalize_deferred-in-ruby-versions-26-to-32 that the no signals workaround can be used to work around the issue but:

a. Enabling it for everyone on < 3.3 is really crappy, as it yields biased samples and the crash only happens in the presence of specific types of Ruby objects (e.g. objects with finalizers) that not all apps will run into.

b. For folks that are affected by this issue, it may take quite a while for them to connect the bug to the profiler, since their crash log will point at gc_finalize_deferred and won't mention datadog at all.

Additional Notes:

N/A

How to test the change?

This is the million dollar question. I am unable to reproduce the issue locally just by running the profiler.

I was able to simulate it by changing Ruby's vm_trace.c as follows:

@@ -1748,6 +1748,8 @@ rb_workqueue_register(unsigned flags, rb_postponed_job_func_t func, void *data)
     return TRUE;
 }

+static void dummy_method(void *_unused) { }
+
 void
 rb_postponed_job_flush(rb_vm_t *vm)
 {
@@ -1775,7 +1777,11 @@ rb_postponed_job_flush(rb_vm_t *vm)
             while ((index = vm->postponed_job_index) > 0) {
                 if (ATOMIC_CAS(vm->postponed_job_index, index, index-1) == index) {
                     rb_postponed_job_t *pjob = &vm->postponed_job_buffer[index-1];
-                    (*pjob->func)(pjob->data);
+                    rb_postponed_job_func_t the_func = (*pjob->func);
+                    if (the_func != dummy_method) rb_postponed_job_register_one(0, dummy_method, NULL);
+                    //if (the_func != dummy_method) rb_postponed_job_register_one(0, dummy_method, GET_VM()->objspace);
+                    (the_func)(pjob->data);
                 }

The idea being that it's equivalent to the profiler SIGPROF interruption landing right between reading pjob->func and pjob->data.

I then ran the following Ruby snippet (which triggers gc_finalize_deferred):

class UsesFinalizer
  DUMMY_FINALIZER = proc { }
  def initialize = ObjectSpace.define_finalizer(self, DUMMY_FINALIZER)
end
UsesFinalizer.new while true

...and this gets me a crash that matches the ones linked in

If instead of passing NULL, we uncommend the second option that passes in GET_VM()->objspace, the crash goes away.

Thus I believe that my fix works and is equivalent to what the experiment above does.

I plan to reach out to both customers that saw this issue and see if they'd be up for testing the fix.

For Datadog employees:

If this PR touches code that signs or publishes builds or packages, or handles
credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance.
This PR doesn't touch any of that.

…ze_deferred **What does this PR do?** This PR adds a workaround for a VM bug (https://bugs.ruby-lang.org/issues/19991) that has affected at least two customers (the other being #3430) The workaround is... not very pretty... but I believe will get the job done, for the reasons explained in the big comment I added. **Motivation:** The good thing about this crash is that it seems extremely rare. The bad thing is that without this workaround, only moving to Ruby 3.3+ is the fix, and we still have a lot of customers on older Ruby versions. We've documented in https://docs.datadoghq.com/profiler/profiler_troubleshooting/ruby/#segmentation-faults-in-gc_finalize_deferred-in-ruby-versions-26-to-32 that the no signals workaround can be used to work around the issue but: a. Enabling it for everyone on < 3.3 is really crappy, as it yields biased samples and the crash only happens in the presence of specific types of Ruby objects (e.g. objects with finalizers) that not all apps will run into. b. For folks that are affected by this issue, it may take quite a while for them to connect the bug to the profiler, since their crash log will point at `gc_finalize_deferred` and won't mention datadog at all. **Additional Notes:** N/A **How to test the change?** This is the million dollar question. I am unable to reproduce the issue locally just by running the profiler. I was able to simulate it by changing Ruby's `vm_trace.c` as follows: ``` @@ -1748,6 +1748,8 @@ rb_workqueue_register(unsigned flags, rb_postponed_job_func_t func, void *data) return TRUE; } +static void dummy_method(void *_unused) { } + void rb_postponed_job_flush(rb_vm_t *vm) { @@ -1775,7 +1777,11 @@ rb_postponed_job_flush(rb_vm_t *vm) while ((index = vm->postponed_job_index) > 0) { if (ATOMIC_CAS(vm->postponed_job_index, index, index-1) == index) { rb_postponed_job_t *pjob = &vm->postponed_job_buffer[index-1]; - (*pjob->func)(pjob->data); + rb_postponed_job_func_t the_func = (*pjob->func); + if (the_func != dummy_method) rb_postponed_job_register_one(0, dummy_method, NULL); + //if (the_func != dummy_method) rb_postponed_job_register_one(0, dummy_method, GET_VM()->objspace); + (the_func)(pjob->data); } ``` The idea being that it's equivalent to the profiler SIGPROF interruption landing right between reading `pjob->func` and `pjob->data`. I then ran the following Ruby snippet (which triggers `gc_finalize_deferred`): ```ruby class UsesFinalizer DUMMY_FINALIZER = proc { } def initialize = ObjectSpace.define_finalizer(self, DUMMY_FINALIZER) end UsesFinalizer.new while true ``` ...and this gets me a crash that matches the ones linked in * https://bugs.ruby-lang.org/issues/19991 * #3430 If instead of passing `NULL`, we uncommend the second option that passes in `GET_VM()->objspace`, the crash goes away. Thus I believe that my fix works and is equivalent to what the experiment above does.

codecov-commenter · 2024-02-22T11:31:56Z

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (81e6060) 98.23% compared to head (f44ed93) 98.23%.

Additional details and impacted files

@@           Coverage Diff           @@
##           master    #3473   +/-   ##
=======================================
  Coverage   98.23%   98.23%           
=======================================
  Files        1275     1275           
  Lines       75167    75167           
  Branches     3548     3548           
=======================================
+ Hits        73842    73844    +2     
+ Misses       1325     1323    -2

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

AlexJF

Smart!

p-datadog · 2024-02-22T19:24:17Z

ext/ddtrace_profiling_native_extension/collectors_cpu_and_wall_time_worker.c

+    //
+    // Thus, our workaround is simple: we pass in objspace as our argument, just in case the clobbering happens.
+    // In the happy path, we never use this argument so it makes no difference. In the buggy path, we avoid crashing the VM.
+    int result = rb_postponed_job_register(0, sample_from_postponed_job, gc_finalize_deferred_workaround /* instead of NULL */);


This took me some time to understand: the gc_finalize_deferred_workaround argument will be used in the call to the other function that is in the postponed queue, and the only function (in the Ruby core itself) that has a meaningful argument takes ObjectSpace.

So, if there is some other library/extension which also uses postponed jobs, it could still crash - and could possibly be given wrong argument (i.e. ObjectSpace instead of whatever the correct argument was stored earlier).

So, if there is some other library/extension which also uses postponed jobs, it could still crash - and could possibly be given wrong argument (i.e. ObjectSpace instead of whatever the correct argument was stored earlier).

Yes! Although as per the discussion in ruby/ruby#8949 (comment) it seems this is quite rare, so if we ever run into one of those (which I really hope we don't) the no signals workaround will still work to avoid crashes.

ivoanjo · 2024-03-01T09:41:01Z

Thanks everyone for reviewing! I've additionally gotten good feedback in #3430 (comment) that this fix works in practice, so I'm going to go ahead and merge it in.

**What does this PR do?** This PR updates the Ruby troubleshooting documentation, now that a a new workaround exists and is auto-applied (<DataDog/dd-trace-rb#3473>). **Motivation:** Clarify that using the "no signals" workaround is no longer the recommended way to solve this issue.

* [PROF-9147] Document workaround for Ruby gc_finalize_deferred **What does this PR do?** This PR updates the Ruby troubleshooting documentation, now that a a new workaround exists and is auto-applied (<DataDog/dd-trace-rb#3473>). **Motivation:** Clarify that using the "no signals" workaround is no longer the recommended way to solve this issue. * Update content/en/profiler/profiler_troubleshooting/ruby.md --------- Co-authored-by: Brett Blue <84536271+brett0000FF@users.noreply.github.com>

ivoanjo requested a review from a team as a code owner February 22, 2024 11:14

github-actions bot added the profiling Involves Datadog profiling label Feb 22, 2024

AlexJF approved these changes Feb 22, 2024

View reviewed changes

ivoanjo mentioned this pull request Feb 22, 2024

Segmentation faults in Sidekiq even when no jobs running #3430

Closed

p-datadog reviewed Feb 22, 2024

View reviewed changes

p-datadog approved these changes Feb 22, 2024

View reviewed changes

ivoanjo merged commit 9e4a267 into master Mar 1, 2024
220 checks passed

ivoanjo deleted the ivoanjo/prof-9147-gc-finalize-deferred-workaround branch March 1, 2024 09:41

github-actions bot added this to the 2.0 milestone Mar 1, 2024

ivoanjo mentioned this pull request Mar 1, 2024

[PROF-9147] Document workaround for Ruby gc_finalize_deferred DataDog/documentation#22019

Merged

1 task

ivoanjo modified the milestones: 2.0, 1.21.0 Mar 14, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[PROF-9147] Add workaround for Ruby VM bug causing crash in gc_finalize_deferred #3473

[PROF-9147] Add workaround for Ruby VM bug causing crash in gc_finalize_deferred #3473

ivoanjo commented Feb 22, 2024

codecov-commenter commented Feb 22, 2024

AlexJF left a comment

p-datadog Feb 22, 2024

ivoanjo Mar 1, 2024

ivoanjo commented Mar 1, 2024

[PROF-9147] Add workaround for Ruby VM bug causing crash in gc_finalize_deferred #3473

[PROF-9147] Add workaround for Ruby VM bug causing crash in gc_finalize_deferred #3473

Conversation

ivoanjo commented Feb 22, 2024

codecov-commenter commented Feb 22, 2024

Codecov Report

AlexJF left a comment

Choose a reason for hiding this comment

p-datadog Feb 22, 2024

Choose a reason for hiding this comment

ivoanjo Mar 1, 2024

Choose a reason for hiding this comment

ivoanjo commented Mar 1, 2024