Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PROF-10128] Workaround heap profiling flaky spec on Ruby 3.3+ #3848

Merged
merged 1 commit into from
Aug 19, 2024

Conversation

ivoanjo
Copy link
Member

@ivoanjo ivoanjo commented Aug 15, 2024

What does this PR do?

This PR is a second attempt at a workaround for this flaky spec:

Failures:

  1) Datadog::Profiling::StackRecorder#stats with heap profiling enabled includes heap recorder snapshot
     Failure/Error:
               expect(stack_recorder.stats).to match(
                 hash_including(
                   :heap_recorder_snapshot => hash_including(
                     # Records for dead objects should have gone away
                     :num_object_records => live_heap_samples + age0_heap_samples,
                     # We allocate from 3 different locations in this test but only 2
                     # of them are for objects which should be alive at serialization time
                     :num_heap_records => 2,

                     # The update done during serialization should reflect the

       expected {:heap_recorder_snapshot=>{:last_update_objects_alive=>7, :last_update_objects_dead=>9, :last_update_...ation_time_ns_max=>221980, :serialization_time_ns_min=>221980, :serialization_time_ns_total=>221980} to match #<RSpec::Mocks::ArgumentMatchers::HashIncludingMatcher:0x00007f974467e250 @expected={:heap_recorder_snapshot=>#<RSpec::Mocks::ArgumentMatchers::HashIncludingMatcher:0x00007f974467e318 @expected={:num_object_records=>9, :num_heap_records=>2, :last_update_objects_alive=>6, :last_update_objects_dead=>10, :last_update_objects_skipped=>3, :last_update_objects_frozen=>3}>}>
       Diff:
       @@ -1,12 +1,8 @@
       -#<RSpec::Mocks::ArgumentMatchers::HashIncludingMatcher:0x00007f974467e250
       - @expected=
       -  {:heap_recorder_snapshot=>
       -    #<RSpec::Mocks::ArgumentMatchers::HashIncludingMatcher:0x00007f974467e318
       -     @expected=
       -      {:num_object_records=>9,
       -       :num_heap_records=>2,
       -       :last_update_objects_alive=>6,
       -       :last_update_objects_dead=>10,
       -       :last_update_objects_skipped=>3,
       -       :last_update_objects_frozen=>3}>}>
       +:heap_recorder_snapshot => {:last_update_objects_alive=>7, :last_update_objects_dead=>9, :last_update_objects_frozen=>3, :last_update_objects_skipped=>3, :num_heap_records=>3, :num_object_records=>10},
       +:serialization_failures => 0,
       +:serialization_successes => 1,
       +:serialization_time_ns_avg => 221980.0,
       +:serialization_time_ns_max => 221980,
       +:serialization_time_ns_min => 221980,
       +:serialization_time_ns_total => 221980,

     # ./spec/datadog/profiling/stack_recorder_spec.rb:1044:in `block (4 levels) in <top (required)>'
     # ./spec/spec_helper.rb:230:in `block (2 levels) in <top (required)>'
     # ./spec/spec_helper.rb:115:in `block (2 levels) in <top (required)>'
     # /usr/local/bundle/gems/webmock-3.23.1/lib/webmock/rspec.rb:39:in `block (2 levels) in <top (required)>'
     # /usr/local/bundle/gems/rspec-wait-0.0.10/lib/rspec/wait.rb:47:in `block (2 levels) in <top (required)>'

Finished in 24.11 seconds (files took 1.38 seconds to load)
738 examples, 1 failure, 20 pending

Failed examples:

rspec ./spec/datadog/profiling/stack_recorder_spec.rb:1014 # Datadog::Profiling::StackRecorder#stats with heap profiling enabled includes heap recorder snapshot

Randomized with seed 48141

This flaky spec is quite weird: this test creates a number of throwaway objects, and then calls GC.start expecting Ruby to garbage collect them.

Because these objects are garbage collected, our heap profiler then also marks the objects as dead and cleans up its own internal tracking.

So, what's keeping these objects alive? Apparently, nothing.

I followed the same steps from https://bugs.ruby-lang.org/issues/19460#note-4 and there were no references active to the object!

Specifically I added a
ObjectSpace.dump_all(output: File.open("dump-#{Process.pid}.json", "w+")) AFTER the call to GC.start, and the object is still on the heap, even though there are no references to it.

Thus our working theory is that this is similar to the Ruby discussion above: the Ruby GC is conservative, not precise, so somewhere there's some leftover piece of state (variable? register? that's causing Ruby to decide not to clean the object, even though there don't seem to be any references to it.

As an additional datapoint, this issue seems to only happen on Ruby 3.3 in CI. Locally, I also could only reproduce it on Ruby 3.3 (I tried 3.3.4) and 3.4.0-preview1. Interestingly enough, I have debug builds of Ruby that are built with lower optimization and a lot of debug info and neither my 3.3 nor my 3.4 debug builds reproduced this issue, which perhaps suggests that some compiler optimization is leaving a reference to the object that hasn't been cleaned yet.

For future reference, I can reproduce this using
rspec spec/datadog/profiling --seed 48141 --fail-fast on top of 2fc03d5ae5860d4e9a75ce3825fba95ed288a1a2 (with the workaround removed).

Motivation:

Our usual goal of zero flaky profiler specs always applies.

Fixes https://github.com/datadog/ruby-guild/issues/178

Additional Notes:

I chose to create a hash instead of an object as this shows up clearly in the debug output -- e.g. you can immediately tell which is the object that wasn't supposed to still be alive.

How to test the change?

Verify that CI still passes.

**What does this PR do?**

This PR is a second attempt at a workaround for this flaky spec:

* https://app.circleci.com/pipelines/github/DataDog/dd-trace-rb/15742/workflows/9da2d5f2-3a15-4659-9643-219f52b56b5d/jobs/572636
* https://app.circleci.com/pipelines/github/DataDog/dd-trace-rb/15799/workflows/983ab3f8-2b83-4fd7-b0bc-43f257b924c3/jobs/574214
* https://app.circleci.com/pipelines/github/DataDog/dd-trace-rb/15817/workflows/e67525cd-b5d3-435e-9eb2-ee345a8a344d/jobs/574731

```
Failures:

  1) Datadog::Profiling::StackRecorder#stats with heap profiling enabled includes heap recorder snapshot
     Failure/Error:
               expect(stack_recorder.stats).to match(
                 hash_including(
                   :heap_recorder_snapshot => hash_including(
                     # Records for dead objects should have gone away
                     :num_object_records => live_heap_samples + age0_heap_samples,
                     # We allocate from 3 different locations in this test but only 2
                     # of them are for objects which should be alive at serialization time
                     :num_heap_records => 2,

                     # The update done during serialization should reflect the

       expected {:heap_recorder_snapshot=>{:last_update_objects_alive=>7, :last_update_objects_dead=>9, :last_update_...ation_time_ns_max=>221980, :serialization_time_ns_min=>221980, :serialization_time_ns_total=>221980} to match #<RSpec::Mocks::ArgumentMatchers::HashIncludingMatcher:0x00007f974467e250 @expected={:heap_recorder_snapshot=>#<RSpec::Mocks::ArgumentMatchers::HashIncludingMatcher:0x00007f974467e318 @expected={:num_object_records=>9, :num_heap_records=>2, :last_update_objects_alive=>6, :last_update_objects_dead=>10, :last_update_objects_skipped=>3, :last_update_objects_frozen=>3}>}>
       Diff:
       @@ -1,12 +1,8 @@
       -#<RSpec::Mocks::ArgumentMatchers::HashIncludingMatcher:0x00007f974467e250
       - @expected=
       -  {:heap_recorder_snapshot=>
       -    #<RSpec::Mocks::ArgumentMatchers::HashIncludingMatcher:0x00007f974467e318
       -     @expected=
       -      {:num_object_records=>9,
       -       :num_heap_records=>2,
       -       :last_update_objects_alive=>6,
       -       :last_update_objects_dead=>10,
       -       :last_update_objects_skipped=>3,
       -       :last_update_objects_frozen=>3}>}>
       +:heap_recorder_snapshot => {:last_update_objects_alive=>7, :last_update_objects_dead=>9, :last_update_objects_frozen=>3, :last_update_objects_skipped=>3, :num_heap_records=>3, :num_object_records=>10},
       +:serialization_failures => 0,
       +:serialization_successes => 1,
       +:serialization_time_ns_avg => 221980.0,
       +:serialization_time_ns_max => 221980,
       +:serialization_time_ns_min => 221980,
       +:serialization_time_ns_total => 221980,

     # ./spec/datadog/profiling/stack_recorder_spec.rb:1044:in `block (4 levels) in <top (required)>'
     # ./spec/spec_helper.rb:230:in `block (2 levels) in <top (required)>'
     # ./spec/spec_helper.rb:115:in `block (2 levels) in <top (required)>'
     # /usr/local/bundle/gems/webmock-3.23.1/lib/webmock/rspec.rb:39:in `block (2 levels) in <top (required)>'
     # /usr/local/bundle/gems/rspec-wait-0.0.10/lib/rspec/wait.rb:47:in `block (2 levels) in <top (required)>'

Finished in 24.11 seconds (files took 1.38 seconds to load)
738 examples, 1 failure, 20 pending

Failed examples:

rspec ./spec/datadog/profiling/stack_recorder_spec.rb:1014 # Datadog::Profiling::StackRecorder#stats with heap profiling enabled includes heap recorder snapshot

Randomized with seed 48141
```

This flaky spec is quite weird: this test creates a number of throwaway
objects, and then calls `GC.start` expecting Ruby to garbage collect
them.

Because these objects are garbage collected, our heap profiler then also
marks the objects as dead and cleans up its own internal tracking.

So, what's keeping these objects alive? Apparently, nothing.

I followed the same steps from https://bugs.ruby-lang.org/issues/19460#note-4
and there were no references active to the object!

Specifically I added a
`ObjectSpace.dump_all(output: File.open("dump-#{Process.pid}.json", "w+"))`
AFTER the call to `GC.start`, and the object is still on the heap, even
though there are no references to it.

Thus our working theory is that this is similar to the Ruby discussion
above: the Ruby GC is conservative, not precise, so somewhere there's
some leftover piece of state (variable? register? that's causing Ruby
to decide not to clean the object, even though there don't seem to be
any references to it.

As an additional datapoint, this issue seems to only happen on Ruby 3.3
in CI. Locally, I also could only reproduce it on Ruby 3.3 (I tried
3.3.4) and 3.4.0-preview1. Interestingly enough, I have debug builds of
Ruby that are built with lower optimization and a lot of debug info and
neither my 3.3 nor my 3.4 debug builds reproduced this issue, which
perhaps suggests that some compiler optimization is leaving a
reference to the object that hasn't been cleaned yet.

For future reference, I can reproduce this using
`rspec spec/datadog/profiling --seed 48141 --fail-fast` on top of
`2fc03d5ae5860d4e9a75ce3825fba95ed288a1a2` (with the workaround
removed).

**Motivation:**

Our usual goal of zero flaky profiler specs always applies.

Fixes DataDog/ruby-guild#178

**Additional Notes:**

I chose to create a hash instead of an object as this shows
up clearly in the debug output -- e.g. you can immediately
tell which is the object that wasn't supposed to still be alive.

**How to test the change?**

Verify that CI still passes.
@ivoanjo ivoanjo requested a review from a team as a code owner August 15, 2024 18:00
@github-actions github-actions bot added the dev/testing Involves testing processes (e.g. RSpec) label Aug 15, 2024
@codecov-commenter
Copy link

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.85%. Comparing base (d2498fb) to head (246d8cd).
Report is 22 commits behind head on master.

Additional details and impacted files
@@            Coverage Diff             @@
##           master    #3848      +/-   ##
==========================================
+ Coverage   97.83%   97.85%   +0.01%     
==========================================
  Files        1264     1269       +5     
  Lines       75725    75868     +143     
  Branches     3729     3736       +7     
==========================================
+ Hits        74084    74239     +155     
+ Misses       1641     1629      -12     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@pr-commenter
Copy link

pr-commenter bot commented Aug 15, 2024

Benchmarks

Benchmark execution time: 2024-08-15 18:36:18

Comparing candidate commit 246d8cd in PR branch ivoanjo/prof-10128-workaround-flaky-spec with baseline commit 1402a32 in branch master.

Found 0 performance improvements and 1 performance regressions! Performance is the same for 22 metrics, 2 unstable metrics.

scenario:profiler - stack collector

  • 🟥 throughput [-105.751op/s; -104.475op/s] or [-3.392%; -3.351%]

@ivoanjo ivoanjo merged commit b7b99b2 into master Aug 19, 2024
186 checks passed
@ivoanjo ivoanjo deleted the ivoanjo/prof-10128-workaround-flaky-spec branch August 19, 2024 11:07
@github-actions github-actions bot added this to the 2.3.0 milestone Aug 19, 2024
@ivoanjo ivoanjo added the profiling Involves Datadog profiling label Sep 23, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dev/testing Involves testing processes (e.g. RSpec) profiling Involves Datadog profiling
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants