
[PROF-4905] Implement support for code hotspots in CpuAndWallTime collector #2342

Merged: ivoanjo merged 6 commits into master from ivoanjo/prof-4905-code-hotspots on Nov 1, 2022

Conversation

@ivoanjo (Member) commented on Oct 31, 2022

**What does this PR do?**:

This PR reimplements support for the code hotspots feature (connecting traces to profiles) in the `CpuAndWallTime` collector.

Although the implementation is different, the approach is conceptually the same as provided by `Datadog::Profiling::TraceIdentifiers::Ddtrace` for the old profiler codepaths.

Key constraints in this implementation are:

  • No objects are created to get this information
  • No Ruby code is called to get this information

These constraints are key to:

  • Maintain performance on the `CpuAndWallTime` collector (we never get out of C code).
  • Keep precise control of the Global VM Lock. Since we never get out of C, we take the sample in one pass. Calling into Ruby code may mean being "interrupted" in the middle of sampling, which would break the internal consistency of the profiler. (Assuming that we never lose the GVL makes the profiler simpler and more performant).
  • Correctness of customer code. Sampling happens from customer threads directly (using `rb_postponed_job_register_one`) so we don't want to call Ruby methods and potentially trigger side effects during a sample. (A minimal sketch of this mechanism follows this list.)
  • Memory profiling. We don't yet implement allocation or heap profiling, but it's coming soon and we don't want to allocate memory to track memory allocations.
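
As a rough illustration of the sampling flow above, here's a minimal sketch (not the PR's actual code; `on_profiling_timer_signal` and `sample_current_thread` are hypothetical names):

```c
#include <ruby.h>
#include <ruby/debug.h>

// Hypothetical stand-in for the collector's real sampling logic. It runs on
// the customer thread while it still holds the GVL, at a VM "safe point",
// so the whole sample is taken in one pass without calling into Ruby code.
static void sample_current_thread(void *collector_state) {
  (void) collector_state;
}

// Hypothetical timer/signal callback: it can't safely sample directly, so it
// defers the work to a postponed job that the VM runs at the next safe point.
static void on_profiling_timer_signal(void *collector_state) {
  rb_postponed_job_register_one(0, sample_current_thread, collector_state);
}
```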

Thus, I chose to implement this feature by accessing the tracer context directly from the sampling process. Because all the information needed is available as instance variables of the objects, they can be read safely and within the needed constraints.
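
To make that concrete, here's a minimal sketch of reading tracer state this way (the instance variable names below are illustrative assumptions, not necessarily the tracer's actual internals):

```c
#include <ruby.h>

// Sketch only: rb_ivar_get reads an instance variable directly, so no Ruby
// code runs and no Ruby objects are created, which fits the constraints above.
static void read_trace_identifiers(VALUE active_trace) {
  if (active_trace == Qnil) return;

  // "@root_span" and "@active_span" are assumed names for illustration; a
  // real implementation would intern these IDs once up front, not per sample.
  VALUE root_span = rb_ivar_get(active_trace, rb_intern("@root_span"));
  VALUE active_span = rb_ivar_get(active_trace, rb_intern("@active_span"));
  if (root_span == Qnil || active_span == Qnil) return;

  // ...the span ids would then be read from these objects in the same way...
}
```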

This implementation also includes extensive checks to make sure that if tracing evolves or is not available, profiling safely skips getting the information.

Furthermore, it includes an integration test with a real tracer, so that we'll immediately know when an incompatibility is introduced and the profiler can be updated.

**Motivation**:

Code hotspots is a key feature of our APM, allowing customers to see the profile for an individual trace.

**Additional Notes**:

This first implementation does not include support for endpoint aggregation; this will come in a separate PR.

**How to test the change?**:

This change includes code coverage. It can also be tested by enabling profiling and tracing on a Ruby app, and checking that the "Code Hotspots" tab on a generated trace is showing the profile for that trace.

@ivoanjo requested a review from a team on October 31, 2022 13:38
@github-actions bot added the core (Involves Datadog core libraries) and profiling (Involves Datadog profiling) labels on Oct 31, 2022
@codecov-commenter commented

Codecov Report

Merging #2342 (b63650a) into master (36b1b9c) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master    #2342   +/-   ##
=======================================
  Coverage   98.29%   98.30%           
=======================================
  Files        1092     1092           
  Lines       58090    58145   +55     
=======================================
+ Hits        57101    57159   +58     
+ Misses        989      986    -3     
| Impacted Files | Coverage Δ |
| --- | --- |
| lib/datadog/core/configuration/components.rb | 98.85% <ø> (ø) |
| spec/datadog/core/configuration/components_spec.rb | 100.00% <ø> (ø) |
| .../datadog/profiling/collectors/cpu_and_wall_time.rb | 80.00% <100.00%> (+10.00%) ⬆️ |
| ...g/profiling/collectors/cpu_and_wall_time_worker.rb | 97.14% <100.00%> (ø) |
| ...dog/profiling/collectors/cpu_and_wall_time_spec.rb | 97.50% <100.00%> (+0.35%) ⬆️ |
| ...filing/collectors/cpu_and_wall_time_worker_spec.rb | 100.00% <100.00%> (ø) |
| ...adog/tracing/contrib/sidekiq/server_tracer_spec.rb | 100.00% <0.00%> (+0.83%) ⬆️ |
| ...adog/tracing/contrib/sidekiq/client_tracer_spec.rb | 100.00% <0.00%> (+2.08%) ⬆️ |
| spec/support/synchronization_helpers.rb | 84.21% <0.00%> (+2.63%) ⬆️ |


Comment on lines +577 to +579
if (label_pos > max_label_count) {
rb_raise(rb_eRuntimeError, "BUG: Unexpected label_pos (%d) > max_label_count (%d)", label_pos, max_label_count);
}
Member:

Can this ever happen?

Member:

It looks like someone has to modify the function for this to happen.
Does it make sense to have one of those "conditional assertion that does not exist in the production release"?
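
(For reference, the standard C mechanism for this is `assert`, which is compiled out of release builds when `NDEBUG` is defined. A minimal sketch, with `write_label` as a placeholder name:)

```c
#include <assert.h>

// With -DNDEBUG (typical for release builds) the assert compiles to nothing;
// without it, a violated assertion aborts with file/line information.
static void write_label(int label_pos, int max_label_count) {
  assert(label_pos <= max_label_count);
  // ...
}
```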

@ivoanjo (Member Author):

> Can this ever happen?
> It looks like someone has to modify the function for this to happen.

Yes, exactly! I've pushed bb95c1c to better explain why that's there so you and others don't have to "reverse engineer" my intent ;)

> Does it make sense to have one of those "conditional assertion that does not exist in the production release"?

In this particular case, I don't think so, although I have been thinking about some mechanism to do that. But before I explain why, I'm curious why you ask:

  • Is it because you're concerned about performance? If so, since this is a comparison between a constant and a local that recently got used, I'd expect this to be really cheap. In fact, I wonder if the compiler can be smart enough to optimize it out, by looking at the flow of `label_pos` in the function and proving it can never get bigger than `max_label_count`.

  • Is it because you're concerned about correctness? The profiler is built to handle exceptions correctly: when they happen, we log them and then stop profiling. I think for these "this should never happen" situations, stopping profiling entirely is a reasonable way to avoid any impact on the user application.


That said, here's why I said "in this particular case, I don't think so": I fear that contributors and customers may not realize such a mode exists.

In fact, there's a known fork of dd-trace-rb that changes exactly this function --
zendesk@192e5c8 -- and when that fork gets updated with this change, I wanted to make it easier to update this code without issues.

In its current approach, any incorrect change to this function would appear quite soon in a quite dramatic way (the profiler stops), which I think would allow us/customers to catch any bugs sooner.


Finally, I have been thinking about having some mode that we could use in CI/testing to enable extra checks, especially checks that may be expensive or may break customer setups, and that we wouldn't want to have on in production.

A good example is compiler warnings -- we don't know which version of the C compiler our customers are using, so using `-Werror` to turn compiler warnings into errors on the release version of the gem is a bad idea -- suddenly a newer version of the compiler gets released and the profiler stops working because there's a new warning.

BUUUT we could enable this mode just for CI/development, and it would allow us to catch new warnings as we introduce them.

So yeah, I haven't yet looked into such a setup, but suggestions welcome. (But in this case, even if such a setup was already in place, I would still not move this check to be done only on CI/dev mode, for the reasons I explained above).

Member:

Thanks for the detailed response.

I see this check as a compiler shortcoming: the compiler should catch this. So my thought was "let's augment the compilation step then".
It seems like leaving the check as-is is the best option today, so I'm happy with that.

@@ -759,3 +819,38 @@ static VALUE _native_stats(DDTRACE_UNUSED VALUE _self, VALUE collector_instance)

return stats_as_ruby_hash(state);
}

static void trace_identifiers_for(struct cpu_and_wall_time_collector_state *state, VALUE thread, struct trace_identifiers *trace_identifiers_result) {
Member:

Do we have the GVL when calling this function?

@ivoanjo (Member Author):

Yes! Added additional documentation in c7f6401, but currently the whole sampling approach is done while holding the GVL.

unsigned long long local_root_span_id = NUM2ULL(numeric_local_root_span_id);
unsigned long long span_id = NUM2ULL(numeric_span_id);

snprintf(trace_identifiers_result->local_root_span_id_buffer, MAXIMUM_LENGTH_64_BIT_IDENTIFIER, "%llu", local_root_span_id);
Member:

"%llu"

TIL

@ivoanjo (Member Author):

unsigned long long long long long long long long long long long long... ;)
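
(For the record: `%llu` is the standard printf conversion for `unsigned long long`, and a 64-bit id needs at most 20 decimal digits plus a NUL terminator. A tiny self-contained example:)

```c
#include <stdio.h>

int main(void) {
  // 2^64 - 1 = 18446744073709551615 has 20 decimal digits, so a 21-byte
  // buffer (digits + NUL terminator) always fits an unsigned 64-bit id.
  char buffer[21];
  snprintf(buffer, sizeof(buffer), "%llu", 18446744073709551615ULL);
  printf("%s\n", buffer);
  return 0;
}
```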

@marcotc (Member) left a comment:

The implementation of `trace_identifiers_for` looks pretty short and easy to follow!

@ivoanjo merged commit 5983634 into master on Nov 1, 2022
@ivoanjo deleted the ivoanjo/prof-4905-code-hotspots branch on November 1, 2022 16:42
@github-actions bot added this to the 1.6.0 milestone on Nov 1, 2022
ivoanjo added a commit that referenced this pull request Nov 18, 2022
…e collector

**What does this PR do?**:

This PR reimplements support for the profiling endpoint feature
in the `CpuAndWallTime` collector.

Although the implementation is different, the approach is conceptually
the same as provided by `Datadog::Profiling::TraceIdentifiers::Ddtrace`
for the old profiler codepaths.

The constraints of this implementation are the same as the ones in
#2342, and effectively this is only a small extension of the work
done in that PR.

**Motivation**:

Endpoint profiling allows customers to filter down their profiles to
individual endpoints (for web apps) that they may be interested
in investigating.

**Additional Notes**:

Comparing a Ruby string with a `char *` is really annoying.
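
(As an aside, here's a minimal sketch of one way to do that comparison without allocating; this is an illustration, not necessarily the PR's exact approach:)

```c
#include <ruby.h>
#include <string.h>
#include <stdbool.h>

// Compares a Ruby String to a NUL-terminated C string without allocating
// any Ruby objects: length check first, then a raw byte comparison.
static bool ruby_string_equals_cstring(VALUE str, const char *cstr) {
  if (!RB_TYPE_P(str, T_STRING)) return false;
  long length = RSTRING_LEN(str);
  return length == (long) strlen(cstr) &&
    memcmp(RSTRING_PTR(str), cstr, length) == 0;
}
```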

**How to test the change?**:

Beyond the code coverage included by the change, it can also be
tested by enabling profiling and tracing for a Ruby web app, and
then checking that the application's endpoints show up in the right
sidebar in the profiling UX.