
[PROF-4905] Implement support for code hotspots in CpuAndWallTime collector #2342

Merged: ivoanjo merged 6 commits into master from ivoanjo/prof-4905-code-hotspots on Nov 1, 2022

Conversation

@ivoanjo (Member) commented on Oct 31, 2022

**What does this PR do?**:

This PR reimplements support for the code hotspots feature (connecting traces to profiles) in the `CpuAndWallTime` collector.

Although the implementation is different, the approach is conceptually the same as provided by `Datadog::Profiling::TraceIdentifiers::Ddtrace` for the old profiler codepaths.

Key constraints in this implementation are:

  • No objects are created to get this information
  • No Ruby code is called to get this information

These constraints are key to:

  • Maintain performance on the `CpuAndWallTime` collector (we never get out of C code).
  • Keep precise control of the Global VM Lock. Since we never get out of C, we take the sample in one pass. Calling into Ruby code may mean being "interrupted" in the middle of sampling, which would break the internal consistency of the profiler. (Assuming that we never lose the GVL makes the profiler simpler and more performant).
  • Correctness of customer code. Sampling happens from customer threads directly (using `rb_postponed_job_register_one`) so we don't want to call Ruby methods and potentially trigger side effects during a sample. (A minimal sketch of this mechanism follows this list.)
  • Memory profiling. We don't yet implement allocation or heap profiling, but it's coming soon and we don't want to allocate memory to track memory allocations.
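
As a rough illustration of the sampling flow above, here's a minimal sketch (not the PR's actual code; `on_profiling_timer_signal` and `sample_current_thread` are hypothetical names):

```c
#include <ruby.h>
#include <ruby/debug.h>

// Hypothetical stand-in for the collector's real sampling logic. It runs on
// the customer thread while it still holds the GVL, at a VM "safe point",
// so the whole sample is taken in one pass without calling into Ruby code.
static void sample_current_thread(void *collector_state) {
  (void) collector_state;
}

// Hypothetical timer/signal callback: it can't safely sample directly, so it
// defers the work to a postponed job that the VM runs at the next safe point.
static void on_profiling_timer_signal(void *collector_state) {
  rb_postponed_job_register_one(0, sample_current_thread, collector_state);
}
```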

Thus, I chose to implement this feature by accessing the tracer context directly from the sampling process. Because all the information needed is available as instance variables of the objects, they can be read safely and within the needed constraints.
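
To make that concrete, here's a minimal sketch of reading tracer state this way (the instance variable names below are illustrative assumptions, not necessarily the tracer's actual internals):

```c
#include <ruby.h>

// Sketch only: rb_ivar_get reads an instance variable directly, so no Ruby
// code runs and no Ruby objects are created, which fits the constraints above.
static void read_trace_identifiers(VALUE active_trace) {
  if (active_trace == Qnil) return;

  // "@root_span" and "@active_span" are assumed names for illustration; a
  // real implementation would intern these IDs once up front, not per sample.
  VALUE root_span = rb_ivar_get(active_trace, rb_intern("@root_span"));
  VALUE active_span = rb_ivar_get(active_trace, rb_intern("@active_span"));
  if (root_span == Qnil || active_span == Qnil) return;

  // ...the span ids would then be read from these objects in the same way...
}
```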

This implementation also includes extensive checks to make sure that if tracing evolves or is not available, profiling safely skips getting the information.

Furthermore, it includes an integration test with a real tracer, so that we'll immediately know when an incompatibility is introduced and the profiler can be updated.

**Motivation**:

Code hotspots is a key feature of our APM, allowing customers to see the profile for an individual trace.

**Additional Notes**:

This first implementation does not include support for endpoint aggregation; this will come in a separate PR.

**How to test the change?**:

This change includes code coverage. It can also be tested by enabling profiling and tracing on a Ruby app, and checking that the "Code Hotspots" tab on a generated trace is showing the profile for that trace.

@ivoanjo requested a review from a team on October 31, 2022 13:38
@github-actions bot added the core (Involves Datadog core libraries) and profiling (Involves Datadog profiling) labels on Oct 31, 2022
@codecov-commenter commented

Codecov Report

Merging #2342 (b63650a) into master (36b1b9c) will increase coverage by 0.00%.
The diff coverage is 100.00%.

@@           Coverage Diff           @@
##           master    #2342   +/-   ##
=======================================
  Coverage   98.29%   98.30%           
=======================================
  Files        1092     1092           
  Lines       58090    58145   +55     
=======================================
+ Hits        57101    57159   +58     
+ Misses        989      986    -3     
| Impacted Files | Coverage Δ |
| --- | --- |
| lib/datadog/core/configuration/components.rb | 98.85% <ø> (ø) |
| spec/datadog/core/configuration/components_spec.rb | 100.00% <ø> (ø) |
| .../datadog/profiling/collectors/cpu_and_wall_time.rb | 80.00% <100.00%> (+10.00%) ⬆️ |
| ...g/profiling/collectors/cpu_and_wall_time_worker.rb | 97.14% <100.00%> (ø) |
| ...dog/profiling/collectors/cpu_and_wall_time_spec.rb | 97.50% <100.00%> (+0.35%) ⬆️ |
| ...filing/collectors/cpu_and_wall_time_worker_spec.rb | 100.00% <100.00%> (ø) |
| ...adog/tracing/contrib/sidekiq/server_tracer_spec.rb | 100.00% <0.00%> (+0.83%) ⬆️ |
| ...adog/tracing/contrib/sidekiq/client_tracer_spec.rb | 100.00% <0.00%> (+2.08%) ⬆️ |
| spec/support/synchronization_helpers.rb | 84.21% <0.00%> (+2.63%) ⬆️ |


Comment on lines +577 to +579
if (label_pos > max_label_count) {
rb_raise(rb_eRuntimeError, "BUG: Unexpected label_pos (%d) > max_label_count (%d)", label_pos, max_label_count);
}
Member:

Can this ever happen?

Member:

It looks like someone has to modify the function for this to happen.
Does it make sense to have one of those "conditional assertion that does not exist in the production release"?
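
(For reference, the standard C mechanism for this is `assert`, which is compiled out of release builds when `NDEBUG` is defined. A minimal sketch, with `write_label` as a placeholder name:)

```c
#include <assert.h>

// With -DNDEBUG (typical for release builds) the assert compiles to nothing;
// without it, a violated assertion aborts with file/line information.
static void write_label(int label_pos, int max_label_count) {
  assert(label_pos <= max_label_count);
  // ...
}
```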

@ivoanjo (Member Author):

> Can this ever happen?
> It looks like someone has to modify the function for this to happen.

Yes, exactly! I've pushed bb95c1c to better explain why that's there so you and others don't have to "reverse engineer" my intent ;)

> Does it make sense to have one of those "conditional assertion that does not exist in the production release"?

In this particular case, I don't think so, although I have been thinking about some mechanism to do that. But before I explain why, I'm curious why you ask:

  • Is it because you're concerned about performance? If so, since this is a comparison between a constant and a local that recently got used, I'd expect this to be really cheap. In fact, I wonder if the compiler can be smart enough to optimize it out, by looking at the flow of `label_pos` in the function and proving it can never get bigger than `max_label_count`.

  • Is it because you're concerned about correctness? The profiler is built to handle exceptions correctly: when they happen, we log them and then stop profiling. I think for these "this should never happen" situations, stopping profiling entirely is a reasonable way to avoid any impact on the user application.


That said, here's why I said "in this particular case, I don't think so": I fear that contributors and customers may not realize such a mode exists.

In fact, there's a known fork of dd-trace-rb that changes exactly this function --
zendesk@192e5c8 -- and when that fork gets updated with this change, I wanted to make it easier to update this code without issues.

In its current approach, any incorrect change to this function would appear quite soon in a quite dramatic way (the profiler stops), which I think would allow us/customers to catch any bugs sooner.


Finally, I have been thinking about having some mode that we could use in CI/testing to enable extra checks, especially checks that may be expensive or may break customer setups, and that we wouldn't want to have on in production.

A good example is compiler warnings -- we don't know which version of the C compiler our customers are using, so using `-Werror` to turn compiler warnings into errors on the release version of the gem is a bad idea -- suddenly a newer version of the compiler gets released and the profiler stops working because there's a new warning.

BUUUT we could enable this mode just for CI/development, and it would allow us to catch new warnings as we introduce them.

So yeah, I haven't yet looked into such a setup, but suggestions welcome. (But in this case, even if such a setup was already in place, I would still not move this check to be done only on CI/dev mode, for the reasons I explained above).

Member:

Thanks for the detailed response.

I see this check as a compiler shortcoming: the compiler should catch this. So my thought was "let's augment the compilation step then".
It seems like leaving the check as-is is the best option today, so I'm happy with that.

@@ -759,3 +819,38 @@ static VALUE _native_stats(DDTRACE_UNUSED VALUE _self, VALUE collector_instance)

return stats_as_ruby_hash(state);
}

static void trace_identifiers_for(struct cpu_and_wall_time_collector_state *state, VALUE thread, struct trace_identifiers *trace_identifiers_result) {
Member:

Do we have the GVL when calling this function?

@ivoanjo (Member Author):

Yes! Added additional documentation in c7f6401, but currently the whole sampling approach is done while holding the GVL.

unsigned long long local_root_span_id = NUM2ULL(numeric_local_root_span_id);
unsigned long long span_id = NUM2ULL(numeric_span_id);

snprintf(trace_identifiers_result->local_root_span_id_buffer, MAXIMUM_LENGTH_64_BIT_IDENTIFIER, "%llu", local_root_span_id);
Member:

"%llu"

TIL

@ivoanjo (Member Author):

unsigned long long long long long long long long long long long long... ;)
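
(For the record: `%llu` is the standard printf conversion for `unsigned long long`, and a 64-bit id needs at most 20 decimal digits plus a NUL terminator. A tiny self-contained example:)

```c
#include <stdio.h>

int main(void) {
  // 2^64 - 1 = 18446744073709551615 has 20 decimal digits, so a 21-byte
  // buffer (digits + NUL terminator) always fits an unsigned 64-bit id.
  char buffer[21];
  snprintf(buffer, sizeof(buffer), "%llu", 18446744073709551615ULL);
  printf("%s\n", buffer);
  return 0;
}
```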

@marcotc (Member) left a comment:

The implementation of `trace_identifiers_for` looks pretty short and easy to follow!

@ivoanjo merged commit 5983634 into master on Nov 1, 2022
@ivoanjo deleted the ivoanjo/prof-4905-code-hotspots branch on November 1, 2022 16:42
@github-actions bot added this to the 1.6.0 milestone on Nov 1, 2022
ivoanjo added a commit that referenced this pull request Nov 18, 2022
…e collector

**What does this PR do?**:

This PR reimplements support for the profiling endpoint feature
in the `CpuAndWallTime` collector.

Although the implementation is different, the approach is conceptually
the same as provided by `Datadog::Profiling::TraceIdentifiers::Ddtrace`
for the old profiler codepaths.

The constraints of this implementation are the same as the ones in
#2342, and effectively this is only a small extension of the work
done in that PR.

**Motivation**:

Endpoint profiling allows customers to filter down their profiles to
individual endpoints (for web apps) that they may be interested
in investigating.

**Additional Notes**:

Comparing a Ruby string with a `char *` is really annoying.
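
(As an aside, here's a minimal sketch of one way to do that comparison without allocating; this is an illustration, not necessarily the PR's exact approach:)

```c
#include <ruby.h>
#include <string.h>
#include <stdbool.h>

// Compares a Ruby String to a NUL-terminated C string without allocating
// any Ruby objects: length check first, then a raw byte comparison.
static bool ruby_string_equals_cstring(VALUE str, const char *cstr) {
  if (!RB_TYPE_P(str, T_STRING)) return false;
  long length = RSTRING_LEN(str);
  return length == (long) strlen(cstr) &&
    memcmp(RSTRING_PTR(str), cstr, length) == 0;
}
```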

**How to test the change?**:

Beyond the code coverage included by the change, it can also be
tested by enabling profiling and tracing for a Ruby web app, and
then checking that the application's endpoints show up in the right
sidebar in the profiling UX.