
[PROF-10123] Add placeholder for skipped samples in allocation profiler, second try #3792

Merged

Conversation

ivoanjo
Member

@ivoanjo ivoanjo commented Jul 18, 2024

What does this PR do?

This PR fixes a corner case issue with the allocation profiler: if too many objects get allocated, we "clamp" the weight of every individual sample (to avoid introducing bias) to a maximum value.

But we previously did not in any way represent the clamping on the profile itself, which meant that the total number of allocations reported by the profiler would be missing any object counts that were clamped.

In practice, we don't expect this to happen very often, as we have a high clamping value `MAX_ALLOC_WEIGHT = 65535` (it's in `collectors_cpu_and_wall_time_worker.c`).

This PR then introduces an extra placeholder sample, when needed, to represent allocations that went over the clamped value.

Thus the total number of allocated objects reported by the profiler now becomes accurate, even in the presence of clamping.
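
As a rough illustration of the accounting this restores (plain Ruby only; the real logic lives in the profiler's native extension, aggregates the skipped counts differently, and the recorder API and names below are invented for the example):

```ruby
MAX_ALLOC_WEIGHT = 65_535 # current clamping value

# Minimal stand-in for the profiler's sample recorder, just for this sketch.
recorder = Object.new
def recorder.sample(stack, alloc_samples:)
  puts "#{stack.join(';')} += #{alloc_samples} objects"
end

def record_allocation(recorder, stack, sampling_weight)
  clamped_weight = [sampling_weight, MAX_ALLOC_WEIGHT].min
  recorder.sample(stack, alloc_samples: clamped_weight)

  skipped = sampling_weight - clamped_weight
  # New in this PR: anything above the clamp is attributed to a
  # "Skipped Samples" placeholder, so the profile's total object count
  # still adds up to what was actually allocated.
  recorder.sample(["Skipped Samples"], alloc_samples: skipped) if skipped > 0
end

# E.g. if the dynamic sampler decided this sample represents 100k allocations:
record_allocation(recorder, ["main", "allocate_lots"], 100_000)
```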

This is the second try at this feature. The first attempt was #3760, which we ended up reverting in #3775 because of a bug in the interaction between allocation profiling and heap profiling. 5c7bd5f is the fix for the issue that led to the revert, and that commit's description includes an explanation of why the first PR was incorrect.

Motivation:

We're considering lowering the maximum clamping value, which would mean it would be hit more often, and thus it's important to fix this issue before we do that.

Additional Notes:

N/A

How to test the change?

I've added test coverage for this change. I also used the following snippet to quickly experiment while I was developing this PR:

Datadog::Profiling.wait_until_running
allocator = Thread.new { loop { BasicObject.new } }
sleep 10
allocator.kill; allocator.join
puts "Allocated #{GC.stat(:total_allocated_objects)} objects total"

Here's how it looks (it's the teal frame on the left):

[Screenshot: flame graph with the new "Skipped Samples" placeholder shown as the teal frame on the left]

You can observe that in this test case, with a massive number of allocations, we would previously be missing around 10% of allocation counts, which we now track correctly.

…ises

Due to a bug in the integration between how the heap sampler works and
the new "missing allocations/skipped samples" placeholder, we could
cause one of the heap sampler sanity checks to fail, and thus raise
an exception.

This exception would be correctly logged, e.g. something like
> W, [2024-07-08T11:24:21.212654 #226523]  WARN -- datadog: [datadog]
> (lib/datadog/profiling/collectors/cpu_and_wall_time_worker.rb:72:in
> `rescue in block (2 levels) in start') CpuAndWallTimeWorker thread
> error. Cause: RuntimeError Ended a heap recording that was not
> started Location: missing-testcase.rb:2:in `new'

The exception would also cause the profiler to try to shut down.

But this is where we would run into issues: the scheduler still tries
to report one last profile before shutting down, and because the
above exception left the stack recorder with the active slot mutex
locked, this would make the scheduler deadlock forever, thus
preventing the Ruby VM from shutting down.

Part 1 of fixing this issue is to correctly unlock the mutex, thus
getting rid of the deadlock.
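
As a hedged illustration of the deadlock and of part 1 of the fix (plain Ruby only, not the actual native stack recorder code; the mutex, thread structure, and messages are stand-ins):

```ruby
active_slot_mutex = Mutex.new

worker = Thread.new do
  active_slot_mutex.lock
  begin
    # The heap sampler sanity check failing mid-sample:
    raise "Ended a heap recording that was not started"
  rescue => e
    warn "CpuAndWallTimeWorker thread error. Cause: #{e.class} #{e.message}"
  ensure
    # Part 1 of the fix: always release the lock, even on the error path.
    # Without this, the scheduler's final flush below would block forever.
    active_slot_mutex.unlock if active_slot_mutex.owned?
  end
end
worker.join

# The scheduler still tries to report one last profile before shutting down:
active_slot_mutex.synchronize { puts "serializing final profile..." }
```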

Part 2 (next commit) will be to not report a profile when the profiler
stopped with a failure. This seems like a prudent way to go, since the
profile may be in an inconsistent state.

(As an aside, either 1 or 2 alone would suffice to avoid the issue, but
I think both are individually useful to prevent future problems.)
When the worker component fails, we also stop the scheduler. But, up
until now, if there was enough data to be reported, we would still try
to serialize and report a last profile.

This is not an entirely great idea, as the profile may not be in a
consistent state. So, instead, this commit adds a flag to mark this
situation and the scheduler will use it to skip any attempts to
report data.
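
A hedged sketch of that flag (names invented for the example; the real components are the worker and scheduler in the profiling code): when the worker stops the scheduler because of a failure, the scheduler skips its final flush instead of serializing a possibly inconsistent profile.

```ruby
class Scheduler
  def initialize
    @stop_requested_due_to_failure = false
  end

  # Called by the worker when it detects a failure and stops profiling.
  def stop_due_to_failure!
    @stop_requested_due_to_failure = true
  end

  def flush_and_report
    return if @stop_requested_due_to_failure # skip reporting after a failure

    # ... serialize and report the profile as usual ...
  end
end
```
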
Oooops! This spec definitely should not be running on legacy Rubies.

It started to flake with a recent change, and it took me a while to
remember that it shouldn't even be running in the first place.
…eap profiler, try 2

Adding the "skipped samples" placeholder breaks the heap profiler, as
it starts failing with:

> WARN -- datadog: [datadog] CpuAndWallTimeWorker thread error. Cause:
> RuntimeError Ended a heap recording that was not started
> WARN -- datadog: [datadog] Detected issue with profiler (worker
> component), stopping profiling.

This is because the stack recorder up until now used the check for
`alloc_samples > 0` as a proxy for "this is an allocation sample, and
someone already called `track_object(...)` with it".

The addition of the skipped samples placeholder broke this assumption:
the placeholder has `alloc_samples > 0`, but explicitly there's no
object being tracked.

To fix this issue, I've added a new argument to `sample_values` that
explicitly marks a sample as also being a heap sample (or not). Thus,
we can now explicitly tell the stack recorder which situation we are in
using that flag, instead of trying to derive it from `alloc_samples >
0`.
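
Roughly, and only as an illustration (the real check lives in the native stack recorder; the `heap_sample` flag name follows this description, everything else here is invented), the condition changes like this:

```ruby
SampleValues = Struct.new(:alloc_samples, :heap_sample, keyword_init: true)

def needs_heap_recording?(values)
  # Before: inferred from the weight, which wrongly matches the
  # "Skipped Samples" placeholder (positive alloc_samples, no tracked object):
  #   values.alloc_samples > 0
  #
  # After: the caller states explicitly whether track_object(...) was used:
  values.heap_sample
end

needs_heap_recording?(SampleValues.new(alloc_samples: 65_535, heap_sample: false)) # => false (placeholder)
needs_heap_recording?(SampleValues.new(alloc_samples: 10,     heap_sample: true))  # => true  (real allocation sample)
```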

This is actually the second attempt at fixing this. The first attempt
got reverted in
#3775 because it tried to
detect placeholders (and thus used a placeholder as a proxy for
"someone called `track_object(...)`"). This was incorrect because
there are other placeholders besides the "skipped samples" one (case in
point: the "In native code" placeholder), which can happen together
with an object allocation.
By getting the timestamp before allocating any memory, we can safely
rely on the exception-raising behavior of our helper, rather than
needing to handle clearing the state if we get a failure
mid-initialization.

Having a timestamp that's a few ns before is not expected to impact the
discrete dynamic sampler behavior.
@ivoanjo ivoanjo requested a review from a team as a code owner July 18, 2024 09:09
@github-actions github-actions bot added the profiling Involves Datadog profiling label Jul 18, 2024
@codecov-commenter

codecov-commenter commented Jul 18, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 97.91%. Comparing base (f6ec4ac) to head (16005f4).
Report is 77 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #3792   +/-   ##
=======================================
  Coverage   97.91%   97.91%           
=======================================
  Files        1246     1246           
  Lines       75037    75091   +54     
  Branches     3629     3632    +3     
=======================================
+ Hits        73473    73527   +54     
  Misses       1564     1564           


ivoanjo added a commit that referenced this pull request Jul 18, 2024
… weights

**What does this PR do?**

This PR lowers the profiling internal constant `MAX_ALLOC_WEIGHT` from
65535 to 10000.

This value is used when clamping very high allocation profiling
weights.

That is: each allocation sample taken by the profiler has a weight
assigned to it; this weight is clamped (i.e. limited) to this maximum
value. The remaining weight is assigned to a separate "skipped samples"
placeholder (see #3792).

**Motivation:**

Very large weights on samples can produce biased results; by lowering
the maximum we reduce the maximum bias that can ever be introduced.

**Additional Notes:**

I've gathered data from a number of apps when choosing this value, see
<https://docs.google.com/document/d/1lWLB714wlLBBq6T4xZyAc4a5wtWhSmr4-hgiPKeErlA/edit>
for details (Datadog-only link, sorry!).

**How to test the change?**

The tests for #3792 also cover this change (although the changes are
otherwise independent).
@marcotc
Member

marcotc commented Jul 23, 2024

> We're considering lowering the maximum clamping value, which would mean it would be hit more often, and thus it's important to fix this issue before we do that.

Can you explain why this is beneficial?

@ivoanjo
Member Author

ivoanjo commented Jul 24, 2024

> We're considering lowering the maximum clamping value, which would mean it would be hit more often, and thus it's important to fix this issue before we do that.
>
> Can you explain why this is beneficial?

  1. The placeholder

The placeholder is beneficial because, prior to this PR, we didn't flag the missing data in the profile in any way.

Thus, if an app allocated 1M objects and the profiler chose to sample less often than every `MAX_ALLOC_WEIGHT` objects, then prior to this PR you'd see a lower number than 1M in the profile / profiler metrics.

For instance, taking my image above as an example


...this example allocated ~96M objects, but the profiler would not report the ~9M that were skipped, so you'd only see 86M objects reported.

  2. Lowering the MAX_ALLOC_WEIGHT value

This is now done in #3793 .

Reusing some of the explanation I also used in Slack:

The basic idea with our current sampling approach is that we sample periodically and assign the resource spent during that period to the stack we chose to sample. E.g.:

  • If we sample every 10ms of time, we attribute 10ms to a given stack (for wall-time profiling)
  • If we sample every 10 objects, we attribute 10 objects (for allocation profiling)

But allocation profiling poses an issue (more than cpu, for instance): if an application is creating objects like crazy, we may not want to sample as often, to avoid imposing a lot of overhead on the application (every 10 allocations in millions of objects is a lot).

So the profiler adjusts this dynamically based on overhead and number of objects. But this poses an issue: what if the profiler decides to sample really rarely? E.g. let's say the profiler samples every 100k objects. Attributing 100k "objects" to a single sample would be quite extreme.

Thus we introduced `MAX_ALLOC_WEIGHT`, which says "you can't attribute more than this value to a single allocation sample". Lowering `MAX_ALLOC_WEIGHT` reduces the potential for bias, as it lowers the maximum that any single unlucky sample can be "blamed" for -- even if the profiler chooses to sample every 100k or more objects anyway.
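
As a hedged sketch of that clamping arithmetic (illustrative Ruby only; the real sampler lives in the native extension and these names are invented, with the constant set to the lowered value from #3793):

```ruby
MAX_ALLOC_WEIGHT = 10_000 # the value #3793 lowers the clamp to

# Suppose the dynamic sampler decided to sample only once every `interval`
# allocations. The sampled stack is "blamed" for at most MAX_ALLOC_WEIGHT
# objects; the remainder goes to the "Skipped Samples" placeholder.
def split_weight(interval)
  clamped = [interval, MAX_ALLOC_WEIGHT].min
  [clamped, interval - clamped]
end

split_weight(100_000) # => [10000, 90000]
```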

At some point "SA" stood for "Skipped Allocations", but after a few
renames elsewhere this is kinda confusing so let's make it "SS" for
"Skipped Samples".
@pr-commenter

pr-commenter bot commented Jul 24, 2024

Benchmarks

Benchmark execution time: 2024-07-24 08:57:46

Comparing candidate commit 16005f4 in PR branch ivoanjo/prof-10123-placeholder-missing-allocations-try2 with baseline commit f6ec4ac in branch master.

Found 0 performance improvements and 0 performance regressions! Performance is the same for 10 metrics, 2 unstable metrics.

@ivoanjo
Member Author

ivoanjo commented Jul 24, 2024

Thanks for the reviews!

@marcotc hopefully it's OK with you that I go ahead and merge this one. Please do comment further with any feedback; I really value it and I'm happy to open PRs to address it :)

@ivoanjo ivoanjo merged commit 9a7002c into master Jul 24, 2024
171 checks passed
@ivoanjo ivoanjo deleted the ivoanjo/prof-10123-placeholder-missing-allocations-try2 branch July 24, 2024 09:05
@github-actions github-actions bot added this to the 2.3.0 milestone Jul 24, 2024