-
Notifications
You must be signed in to change notification settings - Fork 377
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allow the dynamic sampling rate overhead target to be set #3310
Allow the dynamic sampling rate overhead target to be set #3310
Conversation
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #3310 +/- ##
==========================================
- Coverage 98.23% 98.22% -0.01%
==========================================
Files 1253 1253
Lines 72690 72711 +21
Branches 3415 3415
==========================================
+ Hits 71405 71421 +16
- Misses 1285 1290 +5 ☔ View full report in Codecov by Sentry. |
Nice! I've sent a few notes privately, but I think this is in great shape :D |
The profiler has a sophisticated dynamic sampling rate mechanism to tune the sample collection rate in order to target a desired level of wall-clock overhead time. Unfortunately, the default value of 2% is too much for our application; we find a noticeable and unfortunately not acceptable impact on our application's p95/p99 response times. This commit adds a configuration option dynamic_sampling_rate_overhead_target_percentage which allows this to be controlled. The tradeoff for reducing the overhead target is that the profiles will have fewer samples in them and hence be less accurate.
This way, if we are targeting a lower overhead, the time between flushes increases, so each pprof file should have ~the same number of samples as before.
This way we can rely on a sane setting set up at initialization of the component, rather than needing to delay init in the CpuAndWallTimeWorker.
Shorten the user-visible setting (and env var) + remote the definitions in ext (make the settings be the owner of the default).
This setting now always comes from `Profiling::Component`, and so I've removed the old default to avoid confusion (since it would never be used in practice).
f39ae30
to
a17c5e9
Compare
Apparently on JRuby 9.2 the same float can be represented by a different object.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. Was a bit worried by the
If this is used to reduce the sampling rate, the flushing interval is increased to compensate, so that the size of the flushed pprof files is roughly the same.
Part of the PR description (worth removing given it's no longer true I think!) but I see this approach was changed after (what I assume to be) the initial interactions with @ivoanjo.
It's not a big problem to change the timeouts but it makes it extremely hard at the moment to have good metric plots if you don't have a fixed interval that you can rollup over (you'd potentially get a lot of see-saw action in any graphs coming from profile metrics).
Good point -- I guess this is another good reason for why this setting is not quite recommended unless you have a very specific use-case. |
Merging away! 🎉 |
@KJTsanaktsidis This has been included in our latest release. Thank you for your contribution! |
What does this PR do?
This PR adds an option
dynamic_sampling_rate_overhead_target_percentage
overhead_target_percentage
which allows the dynamic sampling rate in the profiler to be controlled.If this is used to reduce the sampling rate, the flushing interval is increased to compensate, so that the size of the flushed pprof files is roughly the same.
Motivation:
The profiler has a sophisticated dynamic sampling rate mechanism to tune the sample collection rate in order to target a desired level of wall-clock overhead time. Unfortunately, the default value of 2% is too much for our application; we find a noticeable and unfortunately not acceptable impact on our application's p95/p99 response times.
Additional Notes:
How to test the change?
(Edit by @ivoanjo): Setting
DD_PROFILING_UPLOAD_PERIOD
to > 60 will emit less profiles, and settingDD_PROFILING_OVERHEAD_TARGET_PERCENTAGE
will to something different will make the profiler sample less/more than usual.For Datadog employees:
credentials of any kind, I've requested a review from
@DataDog/security-design-and-guidance
.Unsure? Have a question? Request a review!