[NO-TICKET] Add memory leak testing for profiling + fix (small) memory leak in profiler #3852

Merged: ivoanjo merged 18 commits into master from ivoanjo/check-profiler-memleaks on Aug 23, 2024

Conversation

@ivoanjo (Member) commented on Aug 20, 2024

What does this PR do?

The ruby_memcheck gem provides a Ruby-specific user-friendly wrapper around valgrind.

This PR sets up ruby_memcheck to run the profiling test suite under valgrind in CI, helping us catch memory leaks during development.
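For reference, here's a minimal sketch of how this kind of rake task can be wired up, assuming ruby_memcheck's RSpec integration; the task name and spec pattern below are illustrative, not the exact setup in this PR:

```ruby
# Rakefile (sketch)
require "ruby_memcheck"
require "ruby_memcheck/rspec/rake_task"

namespace :spec do
  namespace :profiling do
    # Runs the profiling specs under valgrind; ruby_memcheck filters out
    # the many valgrind reports that come from the Ruby VM itself, leaving
    # only the ones that point at the native extension.
    RubyMemcheck::RSpec::RakeTask.new(:memcheck) do |t|
      t.pattern = "spec/datadog/profiling/**/*_spec.rb"
    end
  end
end
```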

It also fixes a few memory leaks:

  • We were leaking a small amount of memory when the Ruby process forked with the profiler enabled (we cleaned up the per-thread context hashtable, but did not call free on the things it contained)
  • The stack collector native test code was a bit YOLO with exceptions and would not release memory in a few cases when exceptions were raised. This code was only used for running specs, so it did not impact actual usage, but solving it allows us to have a 100% clean ruby_memcheck run

Motivation:

Improve validation of our native bits.

Additional Notes:

I've chosen to use GitHub Actions for the new CI job, rather than CircleCI. I don't have a very strong preference for either, it's just that it was easier to use the latest valgrind version by picking the latest Ubuntu image on GHA.

Some of our tests were hanging or timing out because valgrind runs code sequentially, so I've added a new tag that lets us skip such tests.
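As an illustration, one way such a tag can work in RSpec (the tag and environment variable names here are assumptions, not the exact ones from this PR):

```ruby
# spec_helper.rb (sketch): exclude tagged examples when running under valgrind.
RSpec.configure do |config|
  config.filter_run_excluding :skip_under_memcheck if ENV["MEMCHECK"]
end

# In a spec file: this example relies on concurrent progress between
# threads, which valgrind's sequential scheduling can stall.
it "samples all active threads", :skip_under_memcheck do
  # ...
end
```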

I also explored using valgrind's fair-sched option (see Shopify/ruby_memcheck#51) but ran into issues with its output being incomplete, so decided to go the skip route instead.

Initially I started testing ruby_memcheck with Ruby 3.3, but on that version the filtering of valgrind output is too aggressive for our setup, and it did not report anything (see here for more details). Thus, I moved my experiments to Ruby 3.4 and it worked much better.

How to test the change?

You can run the new job with `bundle exec rake spec:profiling:memcheck`.

Here's a successful run of the new CI job: https://github.com/DataDog/dd-trace-rb/actions/runs/10468881841/job/28990709088

Also, if you're curious how it looks, comment out the freeing of any libdatadog resource or other profiler native resource, and you will see valgrind complaining. Here's what happened when I did that: https://github.com/DataDog/dd-trace-rb/actions/runs/10469151578/job/28991521438

This memory leak is specific to the testing code --
`Datadog::Profiling::Collectors::Stack::Testing._native_sample` was
missing exception handling, and thus would leak memory for the specs
that were triggering exceptions.

In production, all this memory is managed via Ruby objects
(specifically the `Collectors::ThreadContext`) which make sure to
free it when they get garbage collected.

(I found this memory leak using the ruby_memcheck gem.)

While re-examining the memory management code for the per-thread
contexts, I realized we were going a bit too "high-level" during
`_native_reset_after_fork`: we cleared the hashtable containing the
per-thread contexts without also freeing the contexts themselves.

This was a real production memory leak: every time the Ruby VM used
fork, we were leaking the thread contexts for all threads active
prior to the fork.

I suspect we got lucky because most apps don't refork: there's a
master/original process, and every new child gets created from it.
The biggest exception is probably the pitchfork web server, which
explicitly employs reforking.
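The shape of the fix looks roughly like the following C sketch, assuming a Ruby `st_table` keyed by thread with heap-allocated context values; `thread_context_free` and the other names are illustrative, not the actual profiler code:

```c
#include <ruby.h>
#include <ruby/st.h>

// Hypothetical per-thread context type and destructor, for illustration.
struct thread_context;
void thread_context_free(struct thread_context *context);

static int free_context(st_data_t _thread, st_data_t context, st_data_t _unused) {
  thread_context_free((struct thread_context *) context);
  return ST_CONTINUE;
}

static void reset_after_fork(st_table *per_thread_contexts) {
  // Previously only st_clear was called, which dropped the entries but
  // leaked the heap-allocated contexts they pointed to.
  st_foreach(per_thread_contexts, free_context, 0);
  st_clear(per_thread_contexts);
}
```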
Since `sampling_buffer_new` validates its input, it can raise, and we
weren't properly handling an exception raised from there.

To avoid introducing a lot of boilerplate to solve this leak,
let's instead call `check_max_frames` before doing any allocation: this
is what we do in production, and it keeps `_native_sample` simple.
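In sketch form (with hypothetical helper signatures, not the actual code), the idea is:

```c
#include <ruby.h>

// Hypothetical helpers, for illustration only.
int check_max_frames(int max_frames);      // raises via rb_raise on bad input
void *sampling_buffer_new(int max_frames);
void sampling_buffer_free(void *buffer);

static VALUE _native_sample(VALUE _self, VALUE max_frames) {
  // Validate before allocating anything: if this raises, no memory has
  // been allocated yet, so nothing can leak. (Previously the buffer was
  // allocated first, and a raise from validation would leak it, since
  // rb_raise longjmps past any cleanup code.)
  int max_frames_int = check_max_frames(NUM2INT(max_frames));

  void *buffer = sampling_buffer_new(max_frames_int);
  // ... take the sample ...
  sampling_buffer_free(buffer);
  return Qnil;
}
```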
While using a recent Ruby 3.4 development build (to hunt down memory
leaks using `ruby_memcheck`), one of our tests started failing with

> rb_raise(rb_eRuntimeError, "BUG: rb_native_thread* is null. Is this Ruby running with RUBY_MN_THREADS=1?");

This exception had been added as a "not sure what to do, just in case"
measure, but it looks like it can also happen without MN threads.

(I suspect we may have found a Ruby thread that's so new that it hasn't
had a native thread assigned).

For now, let's instead return 0 -- this value is only used for displaying
the thread id, so it seems better to show missing info (a zero) than to
stop profiling.

We can always circle back on this once Ruby 3.4 is stable and if we
see this is happening often.
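Sketched out (the accessor and struct below are stand-ins for Ruby VM internals, purely illustrative):

```c
#include <ruby.h>
#include <stdint.h>

// Stand-in for the Ruby VM's native thread data, for illustration only.
typedef struct { uint64_t tid; } native_thread_info;
native_thread_info *native_thread_for(VALUE thread); // hypothetical accessor

static uint64_t thread_id_for(VALUE thread) {
  native_thread_info *native_thread = native_thread_for(thread);

  // Previously: rb_raise(rb_eRuntimeError, "BUG: rb_native_thread* is null. ...");
  // But a brand-new Ruby thread can apparently reach this point before a
  // native thread is attached, so report 0 instead: the id is only used
  // for display, and missing info beats stopping the profiler.
  if (native_thread == NULL) return 0;

  return native_thread->tid;
}
```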
It looks like Ruby is leaking thread data when a fork happens. I'm not
sure we can do anything about this, so added to the suppressions for
now.

It looks like for 3.4.0-preview1, the stack shown is slightly different,
so let's add a second suppression to support it.

(I opted not to generalize the existing suppression, as I'm thinking we
may not want to make it too generic, so as not to end up with
false positives.)
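For context, valgrind suppression entries follow a format like the sketch below; the suppression name and frame patterns are illustrative, not the actual suppression added in this PR:

```
{
   ruby-leaks-thread-data-during-fork
   Memcheck:Leak
   match-leak-kinds: definite
   fun:calloc
   ...
   fun:*fork*
}
```

Each entry names the tool and error kind it matches (`Memcheck:Leak`), then a partial stack trace; `...` matches any number of intermediate frames, which is why overly generic patterns can swallow unrelated reports.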
I've chosen to use GitHub Actions for the new CI job, rather than
CircleCI. I don't have a very strong preference for either, it's just
that it was easier to use the latest valgrind version by picking the
latest Ubuntu image on GHA.
Otherwise, at least at the time of writing, "ubuntu-latest" is still
Ubuntu 22.04 (lol).
@ivoanjo ivoanjo requested review from a team as code owners August 20, 2024 09:55
@github-actions github-actions bot added the profiling Involves Datadog profiling label Aug 20, 2024
@codecov-commenter commented on Aug 20, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.85%. Comparing base (d2498fb) to head (5bc00e1).
Report is 43 commits behind head on master.

Additional details and impacted files
```
@@            Coverage Diff             @@
##           master    #3852      +/-   ##
==========================================
+ Coverage   97.83%   97.85%   +0.01%
==========================================
  Files        1264     1269       +5
  Lines       75725    75868     +143
  Branches     3729     3736       +7
==========================================
+ Hits        74084    74237     +153
+ Misses       1641     1631      -10
```


@pr-commenter (bot) commented on Aug 20, 2024

Benchmarks

Benchmark execution time: 2024-08-21 14:58:38

Comparing candidate commit 168452e in PR branch ivoanjo/check-profiler-memleaks with baseline commit b7b99b2 in branch master.

Found 1 performance improvements and 3 performance regressions! Performance is the same for 19 metrics, 2 unstable metrics.

scenario:profiler - Allocations (profiling enabled)

  • 🟥 throughput [-129105.509op/s; -117994.240op/s] or [-2.532%; -2.314%]

scenario:profiler - profiler gc

  • 🟥 throughput [-11415.749op/s; -10845.169op/s] or [-3.035%; -2.883%]

scenario:profiler - sample timeline=false

  • 🟥 throughput [-0.180op/s; -0.150op/s] or [-2.510%; -2.095%]

scenario:tracing - trace.to_digest

  • 🟩 throughput [+3445.937op/s; +4248.734op/s] or [+2.103%; +2.593%]

This test failed in CI when running inside valgrind:

```
Failures:

  1) Datadog::Profiling::Collectors::ThreadContext#sample_after_gc when there is gc information to record when timeline is enabled creates a Garbage Collection sample using the timestamp set by on_gc_finish, converted to epoch ns
    Failure/Error: expect(gc_sample.labels.fetch(:end_timestamp_ns)).to be_between(@time_before, @time_after)
      expected 1724250478003178125 to be between 1724250478003215154 and 1724250478003295874 (inclusive)
    # ./spec/datadog/profiling/collectors/thread_context_spec.rb:1024:in 'block (5 levels) in <top (required)>'
    # ./spec/spec_helper.rb:231:in 'block (2 levels) in <top (required)>'
    # ./spec/spec_helper.rb:116:in 'block (2 levels) in <top (required)>'
    # ./vendor/bundle/ruby/3.4.0+0/gems/webmock-3.23.1/lib/webmock/rspec.rb:39:in 'block (2 levels) in <top (required)>'
    # ./vendor/bundle/ruby/3.4.0+0/gems/rspec-wait-0.0.10/lib/rspec/wait.rb:47:in 'block (2 levels) in <top (required)>'

Finished in 3 minutes 54.8 seconds (files took 25.77 seconds to load)
701 examples, 1 failure, 20 pending

Failed examples:

rspec ./spec/datadog/profiling/collectors/thread_context_spec.rb:1021 # Datadog::Profiling::Collectors::ThreadContext#sample_after_gc when there is gc information to record when timeline is enabled creates a Garbage Collection sample using the timestamp set by on_gc_finish, converted to epoch ns

Randomized with seed 49098
```

I couldn't quite reproduce it, but given the very small delta
between the expected time and the target time (37029 nanoseconds before
the start time), I strongly suspect this failure comes from our
monotonic-to-system timestamp conversion code.

In particular, we use monotonic timestamps in the profiler, and
then we convert them to epoch timestamps by comparing the difference
between the monotonic clock and the system clock.

There can be some drift in this comparison because we cache the
difference rather than compute it every time (see comments on
`time_helpers.c:monotonic_to_system_epoch_ns`).

So by resetting the cache before this test, hopefully any
flakiness due to the drifting will go away.
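A condensed sketch of that caching scheme follows; `monotonic_to_system_epoch_ns` and `monotonic_to_system_epoch_state` are the real names mentioned above, but the struct field and `now_ns` helper are assumptions for illustration:

```c
#include <stdint.h>
#include <time.h>

// Cached difference between the realtime and monotonic clocks.
// A zero delta means "not computed yet".
typedef struct {
  int64_t delta_to_epoch_ns;
} monotonic_to_system_epoch_state;

static int64_t now_ns(clockid_t clock_id) {
  struct timespec ts;
  clock_gettime(clock_id, &ts);
  return ((int64_t) ts.tv_sec) * 1000000000 + ts.tv_nsec;
}

int64_t monotonic_to_system_epoch_ns(monotonic_to_system_epoch_state *state, int64_t monotonic_ns) {
  // The delta is computed once and cached; later conversions reuse it,
  // which is cheap but can drift slightly versus a freshly-computed
  // delta -- exactly the drift the flaky spec tripped on.
  if (state->delta_to_epoch_ns == 0) {
    state->delta_to_epoch_ns = now_ns(CLOCK_REALTIME) - now_ns(CLOCK_MONOTONIC);
  }
  return monotonic_ns + state->delta_to_epoch_ns;
}
```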
@ivoanjo (Member, Author) commented on Aug 22, 2024

The failure on dd-gitlab/benchmarks is expected. Specifically, this is the failure:

```
+ bundle exec ruby benchmarks/profiler_gc.rb
Current pid is 3104
ruby 3.3.0 (2023-12-25 revision 5124f9ac75) [x86_64-linux]
Warming up --------------------------------------
profiler - profiler gc
benchmarks/profiler_gc.rb:41:in `_native_sample_after_gc': wrong number of arguments (given 2, expected 1) (ArgumentError)
        Datadog::Profiling::Collectors::ThreadContext::Testing._native_sample_after_gc(@collector, false)
                                                                                       ^^^^^^^^^^^^^^^^^
	from benchmarks/profiler_gc.rb:41:in `block (2 levels) in run_benchmark'
```

This failure is happening because I've added an extra argument to `_native_sample_after_gc` on this branch, but the benchmarking platform runs the new version of the benchmark code (which passes 2 arguments) against the code in master (which expects only one).

Once we merge this PR the issue will go away. I've considered a workaround, but changing the benchmark, the method, or the benchmarking platform just for this rare occasion doesn't seem worth the trouble (IMO -- happy to discuss if other people aren't convinced).

@ivoanjo ivoanjo added the performance Involves performance (e.g. CPU, memory, etc) label Aug 22, 2024
@ivoanjo ivoanjo added this to the 2.4.0 milestone Aug 22, 2024
ivoanjo added a commit that referenced this pull request Aug 23, 2024
**What does this PR do?**

This PR builds atop the work from #3852 and #3862 to enable running
the profiler test suite using the AddressSanitizer ("asan") tool
(see https://github.com/google/sanitizers/wiki/AddressSanitizerLeakSanitizer ).

This check will enable us to find bugs in the profiler that may not
be otherwise caught.

**Motivation:**

Improve validation of the profiler native extension.

**Additional Notes:**

The "asan" tool is built into the clang compiler toolchain, and
needs Ruby to be built with a special configuration.

The special configuration is not yet available in the upstream
`ruby/setup-ruby` github action, so I needed to fork it and push
a small tweak to make it available.

(The Ruby builds were added in
ruby/ruby-dev-builder#10 ).

**How to test the change?**

Here's a passing CI run:
https://github.com/DataDog/dd-trace-rb/actions/runs/10524502494/job/29161364590

I've also tested it by adding a memory leak (for instance, commenting
out `ddog_Vec_Tag_drop(tags);` in `http_transport.c`) and confirmed
the issue was flagged.

Here's the failing CI run:
https://github.com/DataDog/dd-trace-rb/actions/runs/10524803685/job/29162274392
Comment on lines +448 to +450
```c
if (reset_monotonic_to_system_state == Qtrue) {
  state->time_converter_state = (monotonic_to_system_epoch_state) MONOTONIC_TO_SYSTEM_EPOCH_INITIALIZER;
}
```
Contributor commented:
Can't seem to find any code overriding this state. Why do we need to do this?

@ivoanjo (Member, Author) replied:
See commit description for c8f2338 -- it's my attempt at fixing a flaky spec that showed up.

Contributor replied:
Ack! (not used to seeing useful commit descriptions 😂 )

@ivoanjo (Member, Author) replied:

> Ack! (not used to seeing useful commit descriptions 😂 )

🔥🔥🔥

@ivoanjo ivoanjo added dev/testing Involves testing processes (e.g. RSpec) dev/ci Involves CircleCI, GitHub Actions, or GitLab labels Aug 23, 2024
@ivoanjo ivoanjo merged commit 29e7e69 into master Aug 23, 2024
186 of 187 checks passed
@ivoanjo ivoanjo deleted the ivoanjo/check-profiler-memleaks branch August 23, 2024 13:50
@ivoanjo ivoanjo removed dev/testing Involves testing processes (e.g. RSpec) dev/ci Involves CircleCI, GitHub Actions, or GitLab labels Aug 23, 2024
@ivoanjo ivoanjo added bug Involves a bug and removed performance Involves performance (e.g. CPU, memory, etc) labels Sep 23, 2024