Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PROF-10128] Workaround heap profiling flaky spec on Ruby 3.3+ #3848

Merged
merged 1 commit into from
Aug 19, 2024

Commits on Aug 15, 2024

  1. [PROF-10128] Workaround heap profiling flaky spec on Ruby 3.3+

    **What does this PR do?**
    
    This PR is a second attempt at a workaround for this flaky spec:
    
    * https://app.circleci.com/pipelines/github/DataDog/dd-trace-rb/15742/workflows/9da2d5f2-3a15-4659-9643-219f52b56b5d/jobs/572636
    * https://app.circleci.com/pipelines/github/DataDog/dd-trace-rb/15799/workflows/983ab3f8-2b83-4fd7-b0bc-43f257b924c3/jobs/574214
    * https://app.circleci.com/pipelines/github/DataDog/dd-trace-rb/15817/workflows/e67525cd-b5d3-435e-9eb2-ee345a8a344d/jobs/574731
    
    ```
    Failures:
    
      1) Datadog::Profiling::StackRecorder#stats with heap profiling enabled includes heap recorder snapshot
         Failure/Error:
                   expect(stack_recorder.stats).to match(
                     hash_including(
                       :heap_recorder_snapshot => hash_including(
                         # Records for dead objects should have gone away
                         :num_object_records => live_heap_samples + age0_heap_samples,
                         # We allocate from 3 different locations in this test but only 2
                         # of them are for objects which should be alive at serialization time
                         :num_heap_records => 2,
    
                         # The update done during serialization should reflect the
    
           expected {:heap_recorder_snapshot=>{:last_update_objects_alive=>7, :last_update_objects_dead=>9, :last_update_...ation_time_ns_max=>221980, :serialization_time_ns_min=>221980, :serialization_time_ns_total=>221980} to match #<RSpec::Mocks::ArgumentMatchers::HashIncludingMatcher:0x00007f974467e250 @expected={:heap_recorder_snapshot=>#<RSpec::Mocks::ArgumentMatchers::HashIncludingMatcher:0x00007f974467e318 @expected={:num_object_records=>9, :num_heap_records=>2, :last_update_objects_alive=>6, :last_update_objects_dead=>10, :last_update_objects_skipped=>3, :last_update_objects_frozen=>3}>}>
           Diff:
           @@ -1,12 +1,8 @@
           -#<RSpec::Mocks::ArgumentMatchers::HashIncludingMatcher:0x00007f974467e250
           - @expected=
           -  {:heap_recorder_snapshot=>
           -    #<RSpec::Mocks::ArgumentMatchers::HashIncludingMatcher:0x00007f974467e318
           -     @expected=
           -      {:num_object_records=>9,
           -       :num_heap_records=>2,
           -       :last_update_objects_alive=>6,
           -       :last_update_objects_dead=>10,
           -       :last_update_objects_skipped=>3,
           -       :last_update_objects_frozen=>3}>}>
           +:heap_recorder_snapshot => {:last_update_objects_alive=>7, :last_update_objects_dead=>9, :last_update_objects_frozen=>3, :last_update_objects_skipped=>3, :num_heap_records=>3, :num_object_records=>10},
           +:serialization_failures => 0,
           +:serialization_successes => 1,
           +:serialization_time_ns_avg => 221980.0,
           +:serialization_time_ns_max => 221980,
           +:serialization_time_ns_min => 221980,
           +:serialization_time_ns_total => 221980,
    
         # ./spec/datadog/profiling/stack_recorder_spec.rb:1044:in `block (4 levels) in <top (required)>'
         # ./spec/spec_helper.rb:230:in `block (2 levels) in <top (required)>'
         # ./spec/spec_helper.rb:115:in `block (2 levels) in <top (required)>'
         # /usr/local/bundle/gems/webmock-3.23.1/lib/webmock/rspec.rb:39:in `block (2 levels) in <top (required)>'
         # /usr/local/bundle/gems/rspec-wait-0.0.10/lib/rspec/wait.rb:47:in `block (2 levels) in <top (required)>'
    
    Finished in 24.11 seconds (files took 1.38 seconds to load)
    738 examples, 1 failure, 20 pending
    
    Failed examples:
    
    rspec ./spec/datadog/profiling/stack_recorder_spec.rb:1014 # Datadog::Profiling::StackRecorder#stats with heap profiling enabled includes heap recorder snapshot
    
    Randomized with seed 48141
    ```
    
    This flaky spec is quite weird: this test creates a number of throwaway
    objects, and then calls `GC.start` expecting Ruby to garbage collect
    them.
    
    Because these objects are garbage collected, our heap profiler then also
    marks the objects as dead and cleans up its own internal tracking.
    
    So, what's keeping these objects alive? Apparently, nothing.
    
    I followed the same steps from https://bugs.ruby-lang.org/issues/19460#note-4
    and there were no references active to the object!
    
    Specifically I added a
    `ObjectSpace.dump_all(output: File.open("dump-#{Process.pid}.json", "w+"))`
    AFTER the call to `GC.start`, and the object is still on the heap, even
    though there are no references to it.
    
    Thus our working theory is that this is similar to the Ruby discussion
    above: the Ruby GC is conservative, not precise, so somewhere there's
    some leftover piece of state (variable? register? that's causing Ruby
    to decide not to clean the object, even though there don't seem to be
    any references to it.
    
    As an additional datapoint, this issue seems to only happen on Ruby 3.3
    in CI. Locally, I also could only reproduce it on Ruby 3.3 (I tried
    3.3.4) and 3.4.0-preview1. Interestingly enough, I have debug builds of
    Ruby that are built with lower optimization and a lot of debug info and
    neither my 3.3 nor my 3.4 debug builds reproduced this issue, which
    perhaps suggests that some compiler optimization is leaving a
    reference to the object that hasn't been cleaned yet.
    
    For future reference, I can reproduce this using
    `rspec spec/datadog/profiling --seed 48141 --fail-fast` on top of
    `2fc03d5ae5860d4e9a75ce3825fba95ed288a1a2` (with the workaround
    removed).
    
    **Motivation:**
    
    Our usual goal of zero flaky profiler specs always applies.
    
    Fixes DataDog/ruby-guild#178
    
    **Additional Notes:**
    
    I chose to create a hash instead of an object as this shows
    up clearly in the debug output -- e.g. you can immediately
    tell which is the object that wasn't supposed to still be alive.
    
    **How to test the change?**
    
    Verify that CI still passes.
    ivoanjo committed Aug 15, 2024
    Configuration menu
    Copy the full SHA
    246d8cd View commit details
    Browse the repository at this point in the history