[NO-TICKET] Add memory leak testing for profiling + fix (small) memory leak in profiler #3852

Merged 18 commits on Aug 23, 2024

Commits on Aug 20, 2024

  1. Bootstrap support for running profiling tests with ruby_memcheck

    The [`ruby_memcheck`](https://github.com/Shopify/ruby_memcheck) gem
    provides a Ruby-specific user-friendly wrapper around valgrind.
    
    Some of our tests were hanging or timing out because of how valgrind
    runs code sequentially, so I've added a new tag that allows skipping
    such tests.
    
    I also explored using valgrind's `fair-sched` option
    (see Shopify/ruby_memcheck#51) but ran
    into issues with its output being incomplete, so I decided to go
    with the skip route instead.
    
    You can run the new job with `bundle exec rake spec:profiling:memcheck`.
    ivoanjo committed Aug 20, 2024 (6ba1e5f)
  2. Avoid leaking memory in stack collector spec

    This memory leak is specific to the testing code --
    `Datadog::Profiling::Collectors::Stack::Testing._native_sample` was
    missing exception handling, and thus would leak memory for the specs
    that were triggering exceptions.
    
    In production, all this memory is managed via Ruby objects
    (specifically the `Collectors::ThreadContext`) which make sure to
    free it when they get garbage collected.
    
    (I found this memory leak using the ruby_memcheck gem.) A sketch of
    the exception-handling pattern involved follows this entry.
    ivoanjo committed Aug 20, 2024 (252ce6b)
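    A minimal sketch of that exception-safety pattern, in generic Ruby
    C-extension terms (the types and function names below are
    hypothetical, not the actual dd-trace-rb code): any native allocation
    made before calling into code that can raise must also be released on
    the error path, which `rb_protect` makes possible.

    ```c
    #include <ruby.h>
    #include <stdlib.h>

    // Hypothetical types and names, for illustration only.
    typedef struct { int max_frames; } sampling_buffer;
    typedef struct { VALUE thread; sampling_buffer *buffer; } sample_args;

    static VALUE protected_sample(VALUE protected_args) {
      sample_args *args = (sample_args *) protected_args;
      // ... sample args->thread into args->buffer; this part may call rb_raise ...
      (void) args;
      return Qtrue;
    }

    static VALUE testing_native_sample(VALUE self, VALUE thread) {
      sampling_buffer *buffer = calloc(1, sizeof(sampling_buffer)); // native allocation, invisible to the GC
      sample_args args = {.thread = thread, .buffer = buffer};

      int state = 0;
      // rb_protect hands control back here even if protected_sample raises...
      VALUE result = rb_protect(protected_sample, (VALUE) &args, &state);

      free(buffer); // ...so the buffer is released on both the success and the error path

      if (state != 0) rb_jump_tag(state); // re-raise the pending exception
      return result;
    }
    ```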
  3. Fix memory leak when Ruby process used fork

    While re-examining the memory management code for the per-thread
    contexts, I realized we were going a bit too "high-level" during
    `_native_reset_after_fork`: we were clearing the hashtable containing
    the per-thread contexts without also freeing the contexts themselves
    (see the sketch after this entry).
    
    This was a real production memory leak: every time the Ruby VM used
    fork, we were leaking the thread contexts for all threads active
    prior to the fork.
    
    I suspect we got lucky because most apps don't refork: there's a
    master/original process, and every new child gets created from it.
    The biggest exception is probably the pitchfork web server, which
    explicitly employs reforking.
    ivoanjo committed Aug 20, 2024 (451a6ba)
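    A minimal sketch of the fix described above, assuming the per-thread
    contexts live in an `st_table` keyed by thread (the context type and
    function names are illustrative, not the actual dd-trace-rb code):
    clearing the table only drops the entries, so each context must be
    freed as its entry is removed.

    ```c
    #include <ruby.h>
    #include <ruby/st.h>
    #include <stdlib.h>

    // Hypothetical context type, for illustration only.
    typedef struct { void *sampling_state; } per_thread_context;

    static int free_context(st_data_t key, st_data_t value, st_data_t argument) {
      (void) key; (void) argument;
      free((per_thread_context *) value); // free the context itself, not just the table entry...
      return ST_DELETE;                   // ...and drop the entry while the table is being walked
    }

    // Before: something like st_clear(contexts) emptied the table but leaked every context.
    // After: walk the table, freeing each context as its entry is removed.
    static void remove_context_for_all_threads(st_table *contexts) {
      st_foreach(contexts, free_context, 0);
    }
    ```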
  4. bf84177
  5. Avoid leaking memory in stack collector spec when max_frames is invalid

    `sampling_buffer_new` validates its input and can thus raise, and we
    weren't properly handling an exception coming from there.
    
    To avoid introducing a lot of boilerplate to solve this leak, let's
    instead call `check_max_frames` before doing any allocation: this is
    what we do in production, and it keeps `_native_sample` simple (see
    the sketch after this entry).
    ivoanjo committed Aug 20, 2024 (df9b242)
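    A sketch of that ordering change, with hypothetical stand-ins for the
    real functions (`check_max_frames`, `sampling_buffer_new`,
    `testing_native_sample`): validating before any allocation means a
    raise can no longer leak anything, with no extra exception-handling
    boilerplate.

    ```c
    #include <ruby.h>
    #include <stdlib.h>

    #define MAX_FRAMES_LIMIT 10000 // illustrative limit

    // Hypothetical validation helper; raising here is the interesting case.
    static void check_max_frames(int max_frames) {
      if (max_frames < 5 || max_frames > MAX_FRAMES_LIMIT) rb_raise(rb_eArgError, "max_frames out of range");
    }

    static void *sampling_buffer_new(int max_frames) {
      check_max_frames(max_frames); // validates (and can raise) in the production path
      return calloc(max_frames, sizeof(void *));
    }

    static VALUE testing_native_sample(VALUE self, VALUE max_frames) {
      // Validate first, before anything gets allocated: if this raises,
      // there's nothing to leak yet.
      check_max_frames(NUM2INT(max_frames));

      void *other_state = calloc(1, 64);                       // only allocated after validation passed
      void *buffer = sampling_buffer_new(NUM2INT(max_frames)); // re-validates, but can no longer raise here

      // ... run the sample ...

      free(buffer);
      free(other_state);
      return Qnil;
    }
    ```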
  6. Fix exception in Ruby 3.4 development builds

    While using a recent Ruby 3.4 development build (to hunt down memory
    leaks using `ruby_memcheck`), one of our tests started failing with
    
    > rb_raise(rb_eRuntimeError, "BUG: rb_native_thread* is null. Is this Ruby running with RUBY_MN_THREADS=1?");
    
    This exception had been added as a "not sure what to do, just in
    case" safeguard, but it looks like this situation can happen even
    without MN threads.
    
    (I suspect we may have found a Ruby thread that's so new it hasn't
    had a native thread assigned yet.)
    
    For now, let's instead return 0 -- this value is only used for
    displaying the thread id, so it seems better to show missing info (a
    zero) than to stop profiling (see the sketch after this entry).
    
    We can always circle back on this once Ruby 3.4 is stable, if we see
    it happening often.
    ivoanjo committed Aug 20, 2024 (94fa2c4)
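    A sketch of that behavior change, using hypothetical stand-ins for
    Ruby's internal thread structures (the real code reads them via the
    VM's private headers): report 0 ("unknown") instead of raising when
    the native thread is missing.

    ```c
    #include <ruby.h>
    #include <stddef.h>

    // Hypothetical stand-in for Ruby's internal rb_native_thread struct.
    struct hypothetical_native_thread { unsigned long long serial; };

    static struct hypothetical_native_thread *native_thread_for(VALUE thread) {
      (void) thread;
      return NULL; // stub: the real code looks this up in the VM's internal thread data
    }

    static unsigned long long native_thread_id_for(VALUE thread) {
      struct hypothetical_native_thread *native_thread = native_thread_for(thread);

      // Previously: rb_raise(rb_eRuntimeError, "BUG: rb_native_thread* is null. ...");
      // A thread that was just created may legitimately not have a native thread
      // assigned yet, so report "unknown" (0) instead of failing the whole sample.
      if (native_thread == NULL) return 0;

      return native_thread->serial;
    }
    ```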
  7. Add suppression for Ruby leaking thread data in a fork

    It looks like Ruby is leaking thread data when a fork happens. I'm
    not sure we can do anything about this, so I've added a suppression
    for now.
    ivoanjo committed Aug 20, 2024 (8f35a5e)
  8. 5f48a9b
  9. Add another suppression for Ruby native thread leak for Ruby 3.4.0-preview1
    
    It looks like for 3.4.0-preview1, the stack shown is slightly different,
    so let's add a second suppression to support it.
    
    (I opted not to generalize the existing suppression: making it too
    generic could lead to false positives.)
    ivoanjo committed Aug 20, 2024 (92e8dd5)
  10. Add GitHub Action for testing for memory leaks in CI

    I've chosen to use GitHub Actions for the new CI job, rather than
    CircleCI. I don't have a strong preference for either; it was simply
    easier to get the latest valgrind version by picking the latest
    Ubuntu image on GHA.
    ivoanjo committed Aug 20, 2024 (e188e32)
  11. 877abf3
  12. 1688da4
  13. Make rubocop happy

    ivoanjo committed Aug 20, 2024 (a8ee486)
  14. Use ubuntu 24.04 as a base image to get latest valgrind

    Otherwise, at least at the time of writing, "ubuntu-latest" is still
    Ubuntu 22.04 (lol).
    ivoanjo committed Aug 20, 2024 (9785a30)
  15. Add suppression for tr tool

    ivoanjo committed Aug 20, 2024 (b1dc345)

Commits on Aug 21, 2024

  1. Empty commit to re-trigger CI

    ivoanjo committed Aug 21, 2024 (168452e)

Commits on Aug 22, 2024

  1. Attempt to fix flaky spec due to timing issues

    This test failed in CI when running inside valgrind:
    
    ```
    Failures:
    
      1) Datadog::Profiling::Collectors::ThreadContext#sample_after_gc when there is gc information to record when timeline is enabled creates a Garbage Collection sample using the timestamp set by on_gc_finish, converted to epoch ns
        Failure/Error: expect(gc_sample.labels.fetch(:end_timestamp_ns)).to be_between(@time_before, @time_after)
          expected 1724250478003178125 to be between 1724250478003215154 and 1724250478003295874 (inclusive)
        # ./spec/datadog/profiling/collectors/thread_context_spec.rb:1024:in 'block (5 levels) in <top (required)>'
        # ./spec/spec_helper.rb:231:in 'block (2 levels) in <top (required)>'
        # ./spec/spec_helper.rb:116:in 'block (2 levels) in <top (required)>'
        # ./vendor/bundle/ruby/3.4.0+0/gems/webmock-3.23.1/lib/webmock/rspec.rb:39:in 'block (2 levels) in <top (required)>'
        # ./vendor/bundle/ruby/3.4.0+0/gems/rspec-wait-0.0.10/lib/rspec/wait.rb:47:in 'block (2 levels) in <top (required)>'
    
    Finished in 3 minutes 54.8 seconds (files took 25.77 seconds to load)
    701 examples, 1 failure, 20 pending
    
    Failed examples:
    
    rspec ./spec/datadog/profiling/collectors/thread_context_spec.rb:1021 # Datadog::Profiling::Collectors::ThreadContext#sample_after_gc when there is gc information to record when timeline is enabled creates a Garbage Collection sample using the timestamp set by on_gc_finish, converted to epoch ns
    
    Randomized with seed 49098
    ```
    
    I couldn't quite reproduce it, but given the really small delta
    between the expected time and the target time (37029 nanoseconds
    before the start time), I strongly suspect this failure is due to
    our monotonic-to-system timestamp conversion code.
    
    In particular, the profiler records monotonic timestamps and later
    converts them to epoch timestamps by applying the difference between
    the monotonic clock and the system clock.

    There can be some drift in this conversion because we cache that
    difference rather than compute it every time (see the comments on
    `time_helpers.c:monotonic_to_system_epoch_ns`).
    
    So, by resetting the cache before this test, hopefully any flakiness
    due to the drift will go away (see the sketch after this entry).
    ivoanjo committed Aug 22, 2024 (c8f2338)
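    A minimal sketch of the caching strategy described above, with
    illustrative names (the real implementation lives in
    `time_helpers.c:monotonic_to_system_epoch_ns`): the
    monotonic-to-epoch delta is computed once and reused, which is cheap
    but can drift slightly versus a fresh comparison, and resetting the
    cached value forces it to be recomputed.

    ```c
    #include <time.h>

    static long long cached_delta_ns = 0; // (system epoch clock) - (monotonic clock), captured once

    static long long now_ns(clockid_t clock_id) {
      struct timespec ts;
      clock_gettime(clock_id, &ts);
      return (long long) ts.tv_sec * 1000000000LL + ts.tv_nsec;
    }

    long long monotonic_to_system_epoch_ns(long long monotonic_ns) {
      if (cached_delta_ns == 0) {
        // Compute the delta between the two clocks once and reuse it. Cheaper than
        // re-reading both clocks on every conversion, but it can drift slightly
        // versus a fresh comparison -- which is what the flaky expectation hit.
        cached_delta_ns = now_ns(CLOCK_REALTIME) - now_ns(CLOCK_MONOTONIC);
      }
      return monotonic_ns + cached_delta_ns;
    }

    // What the spec fix relies on: clearing the cache forces a fresh comparison
    // on the next conversion.
    void reset_monotonic_to_system_state(void) { cached_delta_ns = 0; }
    ```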
  2. 5bc00e1