Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PROF-9342] Introduce profiler workaround for Ruby Dir interruption bug #3720

Merged
merged 22 commits into from
Jul 2, 2024
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
Show all changes
22 commits
Select commit Hold shift + click to select a range
b03696e
Expose methods for holding/resuming interruptions
ivoanjo Apr 11, 2024
aeb2db6
Bootstrap setting for controlling dir interruption workaround
ivoanjo Apr 24, 2024
88f1164
Add monkey patches for affected dir class APIs
ivoanjo May 10, 2024
3c51234
Add inline suggestion to function
ivoanjo Jun 12, 2024
1721270
Add benchmark for hold/resume interruptions
ivoanjo Jun 12, 2024
42c40b2
Add helper to load monkey patches (similar to our Kernel monkey patches)
ivoanjo Jun 13, 2024
25a96a1
Add type signatures for dir monkey patches
ivoanjo Jun 13, 2024
57e64bf
Load new file when loading profiler
ivoanjo Jun 13, 2024
0e80c77
Wire up dir interruption workaround to setting
ivoanjo Jun 13, 2024
577fdb0
Add test coverage for DirMonkeyPatches
ivoanjo Jun 13, 2024
43a3e5e
Add comments to make Rubocop happy
ivoanjo Jun 14, 2024
9152470
Add nice description to dir interruption workaround setting
ivoanjo Jun 14, 2024
b30ca79
Tweak spec for compatibility with Ruby 2.7
ivoanjo Jun 14, 2024
d6eebe9
Load `DirMonkeyPatches` before referencing it to avoid failing on JRuby
ivoanjo Jun 14, 2024
30601a2
Tweak spec to remove assumption that no signals workaround is disable…
ivoanjo Jun 14, 2024
9ad82fe
Make rubocop happy
ivoanjo Jun 14, 2024
8099c94
Tweak naming of functions to clarify that it's signals being held
ivoanjo Jun 18, 2024
ba0d6b5
Avoid ruby2_keywords and instead have variants for Ruby 2 and 3
ivoanjo Jun 18, 2024
a6366f9
Add test to make sure we don't forget to register new profiler benchm…
ivoanjo Jun 19, 2024
3501237
Add new benchmark to validate_benchmarks_spec
ivoanjo Jun 19, 2024
533eeca
Merge branch 'master' into ivoanjo/prof-9342-dir-interruption-workaround
ivoanjo Jul 1, 2024
259e134
Add clarification for expected behavior of yield
ivoanjo Jul 2, 2024
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
44 changes: 44 additions & 0 deletions benchmarks/profiler_hold_resume_interruptions.rb
Original file line number Diff line number Diff line change
@@ -0,0 +1,44 @@
# Used to quickly run benchmark under RSpec as part of the usual test suite, to validate it didn't bitrot
VALIDATE_BENCHMARK_MODE = ENV['VALIDATE_BENCHMARK'] == 'true'

return unless __FILE__ == $PROGRAM_NAME || VALIDATE_BENCHMARK_MODE

require 'benchmark/ips'
require 'datadog'
require 'pry'
require_relative 'dogstatsd_reporter'

# This benchmark measures the performance of the hold/resume interruptions used by the DirMonkeyPatches
class ProfilerHoldResumeInterruptions
def create_profiler
Datadog.configure do |c|
c.profiling.enabled = true
end
Datadog::Profiling.wait_until_running
end

def run_benchmark
Benchmark.ips do |x|
benchmark_time = VALIDATE_BENCHMARK_MODE ? { time: 0.01, warmup: 0 } : { time: 10, warmup: 2 }
x.config(
**benchmark_time,
suite: report_to_dogstatsd_if_enabled_via_environment_variable(benchmark_name: 'profiler_hold_resume_interruptions')
)

x.report("hold / resume") do
Datadog::Profiling::Collectors::CpuAndWallTimeWorker._native_hold_signals
Datadog::Profiling::Collectors::CpuAndWallTimeWorker._native_resume_signals
end

x.save! 'profiler_hold_resume_interruptions-results.json' unless VALIDATE_BENCHMARK_MODE
x.compare!
end
end
end

puts "Current pid is #{Process.pid}"

ProfilerHoldResumeInterruptions.new.instance_exec do
create_profiler
run_benchmark
end
Original file line number Diff line number Diff line change
Expand Up @@ -222,6 +222,8 @@ static VALUE _native_with_blocked_sigprof(DDTRACE_UNUSED VALUE self);
static VALUE rescued_sample_allocation(VALUE tracepoint_data);
static void delayed_error(struct cpu_and_wall_time_worker_state *state, const char *error);
static VALUE _native_delayed_error(DDTRACE_UNUSED VALUE self, VALUE instance, VALUE error_msg);
static VALUE _native_hold_signals(DDTRACE_UNUSED VALUE self);
static VALUE _native_resume_signals(DDTRACE_UNUSED VALUE self);

// Note on sampler global state safety:
//
Expand Down Expand Up @@ -285,7 +287,9 @@ void collectors_cpu_and_wall_time_worker_init(VALUE profiling_module) {
rb_define_singleton_method(collectors_cpu_and_wall_time_worker_class, "_native_allocation_count", _native_allocation_count, 0);
rb_define_singleton_method(collectors_cpu_and_wall_time_worker_class, "_native_is_running?", _native_is_running, 1);
rb_define_singleton_method(testing_module, "_native_current_sigprof_signal_handler", _native_current_sigprof_signal_handler, 0);
// TODO: Remove `_native_is_running` from `testing_module` once `prof-correctness` has been updated to not need it
rb_define_singleton_method(collectors_cpu_and_wall_time_worker_class, "_native_hold_signals", _native_hold_signals, 0);
rb_define_singleton_method(collectors_cpu_and_wall_time_worker_class, "_native_resume_signals", _native_resume_signals, 0);
// TODO: Remove `_native_is_running` from `testing_module` (should be in class) once `prof-correctness` has been updated to not need it
rb_define_singleton_method(testing_module, "_native_is_running?", _native_is_running, 1);
rb_define_singleton_method(testing_module, "_native_install_testing_signal_handler", _native_install_testing_signal_handler, 0);
rb_define_singleton_method(testing_module, "_native_remove_testing_signal_handler", _native_remove_testing_signal_handler, 0);
Expand Down Expand Up @@ -1159,3 +1163,17 @@ static VALUE _native_delayed_error(DDTRACE_UNUSED VALUE self, VALUE instance, VA

return Qnil;
}

// Masks SIGPROF interruptions for the current thread. Please don't use this -- you may end up with incomplete
// profiling data.
static VALUE _native_hold_signals(DDTRACE_UNUSED VALUE self) {
block_sigprof_signal_handler_from_running_in_current_thread();
return Qtrue;
}

// Unmasks SIGPROF interruptions for the current thread. If there's a pending sample, it'll be triggered inside this
// method.
static VALUE _native_resume_signals(DDTRACE_UNUSED VALUE self) {
unblock_sigprof_signal_handler_from_running_in_current_thread();
return Qtrue;
}
Original file line number Diff line number Diff line change
Expand Up @@ -91,7 +91,7 @@ void remove_sigprof_signal_handler(void) {
if (sigaction(SIGPROF, &signal_handler_config, NULL) != 0) rb_sys_fail("Failure while removing the signal handler");
}

static void toggle_sigprof_signal_handler_for_current_thread(int action) {
static inline void toggle_sigprof_signal_handler_for_current_thread(int action) {
sigset_t signals_to_toggle;
sigemptyset(&signals_to_toggle);
sigaddset(&signals_to_toggle, SIGPROF);
Expand Down
16 changes: 16 additions & 0 deletions lib/datadog/core/configuration/settings.rb
Original file line number Diff line number Diff line change
Expand Up @@ -397,6 +397,22 @@ def initialize(*_)
end
end

# The profiler gathers data by sending `SIGPROF` unix signals to Ruby application threads.
#
# We've discovered that this can trigger a bug in a number of Ruby APIs in the `Dir` class, as
# described in https://github.com/DataDog/dd-trace-rb/issues/3450 . This workaround prevents the issue
# from happening by monkey patching the affected APIs.
#
# (In the future, once a fix lands upstream, we'll disable this workaround for Rubies that don't need it)
#
# @default `DD_PROFILING_DIR_INTERRUPTION_WORKAROUND_ENABLED` environment variable as a boolean,
# otherwise `true`
option :dir_interruption_workaround_enabled do |o|
o.env 'DD_PROFILING_DIR_INTERRUPTION_WORKAROUND_ENABLED'
o.type :bool
o.default true
end

# Configures how much wall-time overhead the profiler targets. The profiler will dynamically adjust the
# interval between samples it takes so as to try and maintain the property that it spends no longer than
# this amount of wall-clock time profiling. For example, with the default value of 2%, the profiler will
Expand Down
1 change: 1 addition & 0 deletions lib/datadog/profiling.rb
Original file line number Diff line number Diff line change
Expand Up @@ -136,6 +136,7 @@ def self.allocation_count # rubocop:disable Lint/NestedMethodDefinition (On purp
return false unless supported?

require_relative 'profiling/ext/forking'
require_relative 'profiling/ext/dir_monkey_patches'
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should this be more functionally named? ("what" instead of "how")

WDYT of dir_signal_masking?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I went with this name because I know it will show up in the profiling flamegraph for customers.

My thinking is that it's clearer if you see "oh, why is there a datadog thing here? ah, it's monkey patching my uses of dir". Signal masking makes me think a bit more along hte lines of "what is this signal masking thing? what is masking anyway?".

Copy link
Contributor

@lloeki lloeki Jul 1, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ha, interesting! I was thinking that if I saw that in a stack trace or something I would wonder "why is Datadog monkey patching dir stuff at all?". In a way "monkey patching" is how but not what/why. In a way it's also redundant with ext, the sort of conventional way to say "this folder contains things that extend stuff". dir_signal_fix maybe?

But at this stage that's becoming bikeshedding and certainly not a blocker. Feel free to move forward.

Adding links to upstream issues in the comment at the top of the file would help too I think (I reread it and with only the Datadog issue link it makes it like this is a Datadog issue when it's really an upstream issue)

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the joke about naming things being hard applies here >_>. I suspect/fear just relying on ext/ may be too subtle (e.g. when you're looking a flamegraph), but am open to changing if folks get confused about the current name.

+1 I've added a link to the upstream ticket in 259e134.

require_relative 'profiling/collectors/info'
require_relative 'profiling/collectors/code_provenance'
require_relative 'profiling/collectors/cpu_and_wall_time_worker'
Expand Down
13 changes: 13 additions & 0 deletions lib/datadog/profiling/component.rb
Original file line number Diff line number Diff line change
Expand Up @@ -75,6 +75,10 @@ def self.build_profiler_component(settings:, agent_settings:, optional_tracer:)
crashtracker = build_crashtracker(settings, transport)
profiler = Profiling::Profiler.new(worker: worker, scheduler: scheduler, optional_crashtracker: crashtracker)

if dir_interruption_workaround_enabled?(settings, no_signals_workaround_enabled)
Datadog::Profiling::Ext::DirMonkeyPatches.apply!
end

[profiler, { profiling_enabled: true }]
end

Expand Down Expand Up @@ -445,6 +449,15 @@ def self.build_profiler_component(settings:, agent_settings:, optional_tracer:)
libmysqlclient_version < Gem::Version.new('5.0.0') &&
header_version >= Gem::Version.new('10.0.0'))
end

private_class_method def self.dir_interruption_workaround_enabled?(settings, no_signals_workaround_enabled)
return false if no_signals_workaround_enabled

# NOTE: In the future this method will evolve to check for Ruby versions affected and not apply the workaround
# when it's not needed but currently all known Ruby versions are affected.

settings.profiling.advanced.dir_interruption_workaround_enabled
end
end
end
end
Loading
Loading