-
Notifications
You must be signed in to change notification settings - Fork 377
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[PROF-9147] Add workaround for Ruby VM bug causing crash in gc_finalize_deferred #3473
[PROF-9147] Add workaround for Ruby VM bug causing crash in gc_finalize_deferred #3473
Conversation
…ze_deferred **What does this PR do?** This PR adds a workaround for a VM bug (https://bugs.ruby-lang.org/issues/19991) that has affected at least two customers (the other being #3430) The workaround is... not very pretty... but I believe will get the job done, for the reasons explained in the big comment I added. **Motivation:** The good thing about this crash is that it seems extremely rare. The bad thing is that without this workaround, only moving to Ruby 3.3+ is the fix, and we still have a lot of customers on older Ruby versions. We've documented in https://docs.datadoghq.com/profiler/profiler_troubleshooting/ruby/#segmentation-faults-in-gc_finalize_deferred-in-ruby-versions-26-to-32 that the no signals workaround can be used to work around the issue but: a. Enabling it for everyone on < 3.3 is really crappy, as it yields biased samples and the crash only happens in the presence of specific types of Ruby objects (e.g. objects with finalizers) that not all apps will run into. b. For folks that are affected by this issue, it may take quite a while for them to connect the bug to the profiler, since their crash log will point at `gc_finalize_deferred` and won't mention datadog at all. **Additional Notes:** N/A **How to test the change?** This is the million dollar question. I am unable to reproduce the issue locally just by running the profiler. I was able to simulate it by changing Ruby's `vm_trace.c` as follows: ``` @@ -1748,6 +1748,8 @@ rb_workqueue_register(unsigned flags, rb_postponed_job_func_t func, void *data) return TRUE; } +static void dummy_method(void *_unused) { } + void rb_postponed_job_flush(rb_vm_t *vm) { @@ -1775,7 +1777,11 @@ rb_postponed_job_flush(rb_vm_t *vm) while ((index = vm->postponed_job_index) > 0) { if (ATOMIC_CAS(vm->postponed_job_index, index, index-1) == index) { rb_postponed_job_t *pjob = &vm->postponed_job_buffer[index-1]; - (*pjob->func)(pjob->data); + rb_postponed_job_func_t the_func = (*pjob->func); + if (the_func != dummy_method) rb_postponed_job_register_one(0, dummy_method, NULL); + //if (the_func != dummy_method) rb_postponed_job_register_one(0, dummy_method, GET_VM()->objspace); + (the_func)(pjob->data); } ``` The idea being that it's equivalent to the profiler SIGPROF interruption landing right between reading `pjob->func` and `pjob->data`. I then ran the following Ruby snippet (which triggers `gc_finalize_deferred`): ```ruby class UsesFinalizer DUMMY_FINALIZER = proc { } def initialize = ObjectSpace.define_finalizer(self, DUMMY_FINALIZER) end UsesFinalizer.new while true ``` ...and this gets me a crash that matches the ones linked in * https://bugs.ruby-lang.org/issues/19991 * #3430 If instead of passing `NULL`, we uncommend the second option that passes in `GET_VM()->objspace`, the crash goes away. Thus I believe that my fix works and is equivalent to what the experiment above does.
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #3473 +/- ##
=======================================
Coverage 98.23% 98.23%
=======================================
Files 1275 1275
Lines 75167 75167
Branches 3548 3548
=======================================
+ Hits 73842 73844 +2
+ Misses 1325 1323 -2 ☔ View full report in Codecov by Sentry. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Smart!
// | ||
// Thus, our workaround is simple: we pass in objspace as our argument, just in case the clobbering happens. | ||
// In the happy path, we never use this argument so it makes no difference. In the buggy path, we avoid crashing the VM. | ||
int result = rb_postponed_job_register(0, sample_from_postponed_job, gc_finalize_deferred_workaround /* instead of NULL */); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This took me some time to understand: the gc_finalize_deferred_workaround
argument will be used in the call to the other function that is in the postponed queue, and the only function (in the Ruby core itself) that has a meaningful argument takes ObjectSpace.
So, if there is some other library/extension which also uses postponed jobs, it could still crash - and could possibly be given wrong argument (i.e. ObjectSpace instead of whatever the correct argument was stored earlier).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
So, if there is some other library/extension which also uses postponed jobs, it could still crash - and could possibly be given wrong argument (i.e. ObjectSpace instead of whatever the correct argument was stored earlier).
Yes! Although as per the discussion in ruby/ruby#8949 (comment) it seems this is quite rare, so if we ever run into one of those (which I really hope we don't) the no signals workaround will still work to avoid crashes.
Thanks everyone for reviewing! I've additionally gotten good feedback in #3430 (comment) that this fix works in practice, so I'm going to go ahead and merge it in. |
**What does this PR do?** This PR updates the Ruby troubleshooting documentation, now that a a new workaround exists and is auto-applied (<DataDog/dd-trace-rb#3473>). **Motivation:** Clarify that using the "no signals" workaround is no longer the recommended way to solve this issue.
* [PROF-9147] Document workaround for Ruby gc_finalize_deferred **What does this PR do?** This PR updates the Ruby troubleshooting documentation, now that a a new workaround exists and is auto-applied (<DataDog/dd-trace-rb#3473>). **Motivation:** Clarify that using the "no signals" workaround is no longer the recommended way to solve this issue. * Update content/en/profiler/profiler_troubleshooting/ruby.md --------- Co-authored-by: Brett Blue <84536271+brett0000FF@users.noreply.github.com>
What does this PR do?
This PR adds a workaround for a VM bug (https://bugs.ruby-lang.org/issues/19991) that has affected at least two customers (the other being #3430).
The workaround is... not very pretty... but I believe will get the job done, for the reasons explained in the big comment I added.
Motivation:
The good thing about this crash is that it seems extremely rare.
The bad thing is that without this workaround, only moving to Ruby 3.3+ is the fix, and we still have a lot of customers on older Ruby versions.
We've documented in https://docs.datadoghq.com/profiler/profiler_troubleshooting/ruby/#segmentation-faults-in-gc_finalize_deferred-in-ruby-versions-26-to-32 that the no signals workaround can be used to work around the issue but:
a. Enabling it for everyone on < 3.3 is really crappy, as it yields biased samples and the crash only happens in the presence of specific types of Ruby objects (e.g. objects with finalizers) that not all apps will run into.
b. For folks that are affected by this issue, it may take quite a while for them to connect the bug to the profiler, since their crash log will point at
gc_finalize_deferred
and won't mention datadog at all.Additional Notes:
N/A
How to test the change?
This is the million dollar question. I am unable to reproduce the issue locally just by running the profiler.
I was able to simulate it by changing Ruby's
vm_trace.c
as follows:The idea being that it's equivalent to the profiler SIGPROF interruption landing right between reading
pjob->func
andpjob->data
.I then ran the following Ruby snippet (which triggers
gc_finalize_deferred
):...and this gets me a crash that matches the ones linked in
If instead of passing
NULL
, we uncommend the second option that passes inGET_VM()->objspace
, the crash goes away.Thus I believe that my fix works and is equivalent to what the experiment above does.
I plan to reach out to both customers that saw this issue and see if they'd be up for testing the fix.
For Datadog employees:
credentials of any kind, I've requested a review from
@DataDog/security-design-and-guidance
.