-
Notifications
You must be signed in to change notification settings - Fork 375
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[NO-TICKET] Add memory leak testing for profiling + fix (small) memory leak in profiler #3852
Commits on Aug 20, 2024
-
Bootstrap support for running profiling tests with ruby_memcheck
The [`ruby_memcheck`](https://github.com/Shopify/ruby_memcheck) gem provides a Ruby-specific user-friendly wrapper around valgrind. In some cases, some of our tests were hanging or timing out due to how valgrind runs code in a sequential manner. I've added a new tag to be able to skip such tests. I also explored using valgrind's `fair-sched` option (see Shopify/ruby_memcheck#51) but ran into issues with its output being incomplete, so decided to go the skip route instead. You can run the new job with `bundle exec rake spec:profiling:memcheck`.
Configuration menu - View commit details
-
Copy full SHA for 6ba1e5f - Browse repository at this point
Copy the full SHA 6ba1e5fView commit details -
Avoid leaking memory in stack collector spec
This memory leak is specific to the testing code -- `Datadog::Profiling::Collectors::Stack::Testing._native_sample` was missing exception handling, and thus would leak memory for the specs that were triggering exceptions. In production, all this memory is managed via Ruby objects (specifically the `Collectors::ThreadContext`) which make sure to free it when they get garbage collected. (I found this memory leak using the ruby_memleak gem.)
Configuration menu - View commit details
-
Copy full SHA for 252ce6b - Browse repository at this point
Copy the full SHA 252ce6bView commit details -
Fix memory leak when Ruby process used fork
While re-examining the memory management code for the per-thread contexts, I realize we were going a bit too "high-level" during `_native_reset_after_fork` and clearing the hashtable containing the per-thread contexts without also cleaning the contexts. This was a real production memory leak: every time the Ruby VM used fork, we were leaking the thread contexts for all threads active prior to the fork. I suspect we got lucky because most apps don't refork: there's a master/original process, and every new child gets created from it. The biggest exception is probably the pitchfork web server, which explicitly employs reforking.
Configuration menu - View commit details
-
Copy full SHA for 451a6ba - Browse repository at this point
Copy the full SHA 451a6baView commit details -
Configuration menu - View commit details
-
Copy full SHA for bf84177 - Browse repository at this point
Copy the full SHA bf84177View commit details -
Avoid leaking memory in stack collector spec when max_frames is invalid
Since `sampling_buffer_new` validates the input, we weren't properly handling an exception from there. To avoid introducing a lot of boilerplate to solve this leak, let's instead call `check_max_frames` before doing any allocation: this is what we do in production, and it keeps `_native_sample` simple.
Configuration menu - View commit details
-
Copy full SHA for df9b242 - Browse repository at this point
Copy the full SHA df9b242View commit details -
Fix exception in ruby 3.4 development builds
While using a recent Ruby 3.4 development build (to hunt down memory leaks using `ruby_memcheck`), one of our tests started failing with > rb_raise(rb_eRuntimeError, "BUG: rb_native_thread* is null. Is this Ruby running with RUBY_MN_THREADS=1?"); This exception had been added as a "not sure what to do, just in case", but it looks like that even without MN threads it can happen. (I suspect we may have found a Ruby thread that's so new that it hasn't had a native thread assigned). For now let's instead return 0 -- this value is only used for displaying the thread id, so it seems better to have missing info (a zero), rather than stopping profiling. We can always circle back on this once Ruby 3.4 is stable and if we see this is happening often.
Configuration menu - View commit details
-
Copy full SHA for 94fa2c4 - Browse repository at this point
Copy the full SHA 94fa2c4View commit details -
Add suppression for Ruby leaking thread data in a fork
It looks like Ruby is leaking thread data when a fork happens. I'm not sure we can do anything about this, so added to the suppressions for now.
Configuration menu - View commit details
-
Copy full SHA for 8f35a5e - Browse repository at this point
Copy the full SHA 8f35a5eView commit details -
Configuration menu - View commit details
-
Copy full SHA for 5f48a9b - Browse repository at this point
Copy the full SHA 5f48a9bView commit details -
Add another suppression for Ruby native thread leak for Ruby 3.4.0-pr…
…eview1 It looks like for 3.4.0-preview1, the stack shown is slightly different, so let's add a second suppression to support it. (I opted to not generalize the existing suppression as I'm thinking we may not want to make it too generic, so as to not end up with false-positives.)
Configuration menu - View commit details
-
Copy full SHA for 92e8dd5 - Browse repository at this point
Copy the full SHA 92e8dd5View commit details -
Add GitHub Action for testing for memory leaks in CI
I've chosen to use GitHub Actions for the new CI job, rather than CircleCI. I don't have a very strong preference for either, it's just that it was easier to use the latest valgrind version by picking the latest Ubuntu image on GHA.
Configuration menu - View commit details
-
Copy full SHA for e188e32 - Browse repository at this point
Copy the full SHA e188e32View commit details -
Configuration menu - View commit details
-
Copy full SHA for 877abf3 - Browse repository at this point
Copy the full SHA 877abf3View commit details -
Configuration menu - View commit details
-
Copy full SHA for 1688da4 - Browse repository at this point
Copy the full SHA 1688da4View commit details -
Configuration menu - View commit details
-
Copy full SHA for a8ee486 - Browse repository at this point
Copy the full SHA a8ee486View commit details -
Use ubuntu 24.04 as a base image to get latest valgrind
Otherwise, at least at time of writing, "ubuntu-latest" is still Ubuntu 22.04 (lol).
Configuration menu - View commit details
-
Copy full SHA for 9785a30 - Browse repository at this point
Copy the full SHA 9785a30View commit details -
Configuration menu - View commit details
-
Copy full SHA for b1dc345 - Browse repository at this point
Copy the full SHA b1dc345View commit details
Commits on Aug 21, 2024
-
Configuration menu - View commit details
-
Copy full SHA for 168452e - Browse repository at this point
Copy the full SHA 168452eView commit details
Commits on Aug 22, 2024
-
Attempt to fix flaky spec due to timing issues
This test failed in CI when running inside valgrind: ``` Failures: 1) Datadog::Profiling::Collectors::ThreadContext#sample_after_gc when there is gc information to record when timeline is enabled creates a Garbage Collection sample using the timestamp set by on_gc_finish, converted to epoch ns Failure/Error: expect(gc_sample.labels.fetch(:end_timestamp_ns)).to be_between(@time_before, @time_after) expected 1724250478003178125 to be between 1724250478003215154 and 1724250478003295874 (inclusive) # ./spec/datadog/profiling/collectors/thread_context_spec.rb:1024:in 'block (5 levels) in <top (required)>' # ./spec/spec_helper.rb:231:in 'block (2 levels) in <top (required)>' # ./spec/spec_helper.rb:116:in 'block (2 levels) in <top (required)>' # ./vendor/bundle/ruby/3.4.0+0/gems/webmock-3.23.1/lib/webmock/rspec.rb:39:in 'block (2 levels) in <top (required)>' # ./vendor/bundle/ruby/3.4.0+0/gems/rspec-wait-0.0.10/lib/rspec/wait.rb:47:in 'block (2 levels) in <top (required)>' Finished in 3 minutes 54.8 seconds (files took 25.77 seconds to load) 701 examples, 1 failure, 20 pending Failed examples: rspec ./spec/datadog/profiling/collectors/thread_context_spec.rb:1021 # Datadog::Profiling::Collectors::ThreadContext#sample_after_gc when there is gc information to record when timeline is enabled creates a Garbage Collection sample using the timestamp set by on_gc_finish, converted to epoch ns Randomized with seed 49098 ``` I couldn't quite reproduce it, but given the really small delta between the expected time and target time (37029 nanoseconds before start time), I'm strongly suspecting this failure may be due to our monotonic to system timestamp conversion code. In particular, we use monotonic timestamps in the profiler, and then we convert them to epoch timestamps by comparing the difference between the monotonic clock and the system clock. There can be some drift in this comparison because we cache the difference rather than compute it every time (see comments on `time_helpers.c:monotonic_to_system_epoch_ns`). So by resetting the cache before this test, hopefully any flakyness due to the drifting will go away.
Configuration menu - View commit details
-
Copy full SHA for c8f2338 - Browse repository at this point
Copy the full SHA c8f2338View commit details -
Configuration menu - View commit details
-
Copy full SHA for 5bc00e1 - Browse repository at this point
Copy the full SHA 5bc00e1View commit details