Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PROF-9147] Add workaround for Ruby VM bug causing crash in gc_finalize_deferred #3473

Merged
merged 1 commit into from
Mar 1, 2024

Conversation

ivoanjo
Copy link
Member

@ivoanjo ivoanjo commented Feb 22, 2024

What does this PR do?

This PR adds a workaround for a VM bug (https://bugs.ruby-lang.org/issues/19991) that has affected at least two customers (the other being #3430).

The workaround is... not very pretty... but I believe will get the job done, for the reasons explained in the big comment I added.

Motivation:

The good thing about this crash is that it seems extremely rare.

The bad thing is that without this workaround, only moving to Ruby 3.3+ is the fix, and we still have a lot of customers on older Ruby versions.

We've documented in https://docs.datadoghq.com/profiler/profiler_troubleshooting/ruby/#segmentation-faults-in-gc_finalize_deferred-in-ruby-versions-26-to-32 that the no signals workaround can be used to work around the issue but:

a. Enabling it for everyone on < 3.3 is really crappy, as it yields biased samples and the crash only happens in the presence of specific types of Ruby objects (e.g. objects with finalizers) that not all apps will run into.

b. For folks that are affected by this issue, it may take quite a while for them to connect the bug to the profiler, since their crash log will point at gc_finalize_deferred and won't mention datadog at all.

Additional Notes:

N/A

How to test the change?

This is the million dollar question. I am unable to reproduce the issue locally just by running the profiler.

I was able to simulate it by changing Ruby's vm_trace.c as follows:

@@ -1748,6 +1748,8 @@ rb_workqueue_register(unsigned flags, rb_postponed_job_func_t func, void *data)
     return TRUE;
 }

+static void dummy_method(void *_unused) { }
+
 void
 rb_postponed_job_flush(rb_vm_t *vm)
 {
@@ -1775,7 +1777,11 @@ rb_postponed_job_flush(rb_vm_t *vm)
             while ((index = vm->postponed_job_index) > 0) {
                 if (ATOMIC_CAS(vm->postponed_job_index, index, index-1) == index) {
                     rb_postponed_job_t *pjob = &vm->postponed_job_buffer[index-1];
-                    (*pjob->func)(pjob->data);
+                    rb_postponed_job_func_t the_func = (*pjob->func);
+                    if (the_func != dummy_method) rb_postponed_job_register_one(0, dummy_method, NULL);
+                    //if (the_func != dummy_method) rb_postponed_job_register_one(0, dummy_method, GET_VM()->objspace);
+                    (the_func)(pjob->data);
                 }

The idea being that it's equivalent to the profiler SIGPROF interruption landing right between reading pjob->func and pjob->data.

I then ran the following Ruby snippet (which triggers gc_finalize_deferred):

class UsesFinalizer
  DUMMY_FINALIZER = proc { }
  def initialize = ObjectSpace.define_finalizer(self, DUMMY_FINALIZER)
end
UsesFinalizer.new while true

...and this gets me a crash that matches the ones linked in

If instead of passing NULL, we uncommend the second option that passes in GET_VM()->objspace, the crash goes away.

Thus I believe that my fix works and is equivalent to what the experiment above does.

I plan to reach out to both customers that saw this issue and see if they'd be up for testing the fix.

For Datadog employees:

  • If this PR touches code that signs or publishes builds or packages, or handles
    credentials of any kind, I've requested a review from @DataDog/security-design-and-guidance.
  • This PR doesn't touch any of that.

…ze_deferred

**What does this PR do?**

This PR adds a workaround for a VM bug
(https://bugs.ruby-lang.org/issues/19991) that has affected at least
two customers (the other being
#3430)

The workaround is... not very pretty... but I believe will get the
job done, for the reasons explained in the big comment I added.

**Motivation:**

The good thing about this crash is that it seems extremely rare.

The bad thing is that without this workaround, only moving to Ruby 3.3+
is the fix, and we still have a lot of customers on older Ruby
versions.

We've documented in
https://docs.datadoghq.com/profiler/profiler_troubleshooting/ruby/#segmentation-faults-in-gc_finalize_deferred-in-ruby-versions-26-to-32
that the no signals workaround can be used to work around the issue
but:

a. Enabling it for everyone on < 3.3 is really crappy, as it yields
biased samples and the crash only happens in the presence of
specific types of Ruby objects (e.g. objects with finalizers) that
not all apps will run into.

b. For folks that are affected by this issue, it may take quite
a while for them to connect the bug to the profiler, since their
crash log will point at `gc_finalize_deferred` and won't mention
datadog at all.

**Additional Notes:**

N/A

**How to test the change?**

This is the million dollar question. I am unable to reproduce the
issue locally just by running the profiler.

I was able to simulate it by changing Ruby's `vm_trace.c` as follows:

```
@@ -1748,6 +1748,8 @@ rb_workqueue_register(unsigned flags, rb_postponed_job_func_t func, void *data)
     return TRUE;
 }

+static void dummy_method(void *_unused) { }
+
 void
 rb_postponed_job_flush(rb_vm_t *vm)
 {
@@ -1775,7 +1777,11 @@ rb_postponed_job_flush(rb_vm_t *vm)
             while ((index = vm->postponed_job_index) > 0) {
                 if (ATOMIC_CAS(vm->postponed_job_index, index, index-1) == index) {
                     rb_postponed_job_t *pjob = &vm->postponed_job_buffer[index-1];
-                    (*pjob->func)(pjob->data);
+                    rb_postponed_job_func_t the_func = (*pjob->func);
+                    if (the_func != dummy_method) rb_postponed_job_register_one(0, dummy_method, NULL);
+                    //if (the_func != dummy_method) rb_postponed_job_register_one(0, dummy_method, GET_VM()->objspace);
+                    (the_func)(pjob->data);
                 }
```

The idea being that it's equivalent to the profiler SIGPROF
interruption landing right between reading `pjob->func` and
`pjob->data`.

I then ran the following Ruby snippet (which triggers
`gc_finalize_deferred`):

```ruby
class UsesFinalizer
  DUMMY_FINALIZER = proc { }
  def initialize = ObjectSpace.define_finalizer(self, DUMMY_FINALIZER)
end
UsesFinalizer.new while true
```

...and this gets me a crash that matches the ones linked in
* https://bugs.ruby-lang.org/issues/19991
* #3430

If instead of passing `NULL`, we uncommend the second option that
passes in `GET_VM()->objspace`, the crash goes away.

Thus I believe that my fix works and is equivalent to what
the experiment above does.
@ivoanjo ivoanjo requested a review from a team as a code owner February 22, 2024 11:14
@github-actions github-actions bot added the profiling Involves Datadog profiling label Feb 22, 2024
@codecov-commenter
Copy link

Codecov Report

All modified and coverable lines are covered by tests ✅

Comparison is base (81e6060) 98.23% compared to head (f44ed93) 98.23%.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #3473   +/-   ##
=======================================
  Coverage   98.23%   98.23%           
=======================================
  Files        1275     1275           
  Lines       75167    75167           
  Branches     3548     3548           
=======================================
+ Hits        73842    73844    +2     
+ Misses       1325     1323    -2     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

Copy link
Contributor

@AlexJF AlexJF left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Smart!

//
// Thus, our workaround is simple: we pass in objspace as our argument, just in case the clobbering happens.
// In the happy path, we never use this argument so it makes no difference. In the buggy path, we avoid crashing the VM.
int result = rb_postponed_job_register(0, sample_from_postponed_job, gc_finalize_deferred_workaround /* instead of NULL */);
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This took me some time to understand: the gc_finalize_deferred_workaround argument will be used in the call to the other function that is in the postponed queue, and the only function (in the Ruby core itself) that has a meaningful argument takes ObjectSpace.

So, if there is some other library/extension which also uses postponed jobs, it could still crash - and could possibly be given wrong argument (i.e. ObjectSpace instead of whatever the correct argument was stored earlier).

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, if there is some other library/extension which also uses postponed jobs, it could still crash - and could possibly be given wrong argument (i.e. ObjectSpace instead of whatever the correct argument was stored earlier).

Yes! Although as per the discussion in ruby/ruby#8949 (comment) it seems this is quite rare, so if we ever run into one of those (which I really hope we don't) the no signals workaround will still work to avoid crashes.

@ivoanjo
Copy link
Member Author

ivoanjo commented Mar 1, 2024

Thanks everyone for reviewing! I've additionally gotten good feedback in #3430 (comment) that this fix works in practice, so I'm going to go ahead and merge it in.

@ivoanjo ivoanjo merged commit 9e4a267 into master Mar 1, 2024
220 checks passed
@ivoanjo ivoanjo deleted the ivoanjo/prof-9147-gc-finalize-deferred-workaround branch March 1, 2024 09:41
@github-actions github-actions bot added this to the 2.0 milestone Mar 1, 2024
ivoanjo added a commit to DataDog/documentation that referenced this pull request Mar 1, 2024
**What does this PR do?**

This PR updates the Ruby troubleshooting documentation, now that a
a new workaround exists and is auto-applied
(<DataDog/dd-trace-rb#3473>).

**Motivation:**

Clarify that using the "no signals" workaround is no longer the
recommended way to solve this issue.
@ivoanjo ivoanjo modified the milestones: 2.0, 1.21.0 Mar 14, 2024
brett0000FF added a commit to DataDog/documentation that referenced this pull request Mar 19, 2024
* [PROF-9147] Document workaround for Ruby gc_finalize_deferred

**What does this PR do?**

This PR updates the Ruby troubleshooting documentation, now that a
a new workaround exists and is auto-applied
(<DataDog/dd-trace-rb#3473>).

**Motivation:**

Clarify that using the "no signals" workaround is no longer the
recommended way to solve this issue.

* Update content/en/profiler/profiler_troubleshooting/ruby.md

---------

Co-authored-by: Brett Blue <84536271+brett0000FF@users.noreply.github.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
profiling Involves Datadog profiling
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants