-
Notifications
You must be signed in to change notification settings - Fork 375
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[PROF-4905] Implement support for code hotspots in CpuAndWallTime collector #2342
Conversation
Right now we don't particularly need to do anything more, but just in case we evolve it in the future, this note may come in handy.
…lector **What does this PR do?**: This PR reimplements support for the code hotspots feature (connecting traces to profiles) in the `CpuAndWallTime` collector. Although the implementation is different, the approach is conceptually the same as provided by `Datadog::Profiling::TraceIdentifiers::Ddtrace` for the old profiler codepaths. Key contraints in this implementation are: * No objects are created to get this information * No Ruby code is called to get this information These constraints are key to: * Maintain performance on the `CpuAndWallTime` collector (we never get out of C code). * Keep precise control of the Global VM Lock. Since we never get out of C, we take the sample in one pass. Calling into Ruby code may mean being "interrupted" in the middle of sampling, which would break the internal consistency of the profiler. (Assuming that we never lose the GVL makes the profiler simpler and more performant). * Correctness of customer code. Sampling happens from customer threads directly (using `rb_postponed_job_register_one`) so we don't want to call Ruby methods and potentially trigger side effects during a sample. * Memory profiling. We don't yet implement allocation or heap profiling, but it's coming soon and we don't want to allocate memory to track memory allocations. Thus, I chose to implement this feature by accessing the tracer context directly from the sampling process. Because all the information needed is available as instance variables of the objects, they can be read safely and within the needed constraints. This implementation also includes extensive checks to make sure that if tracing evolves or is not available, profiling safely skips getting the information. Furthermore, it includes an integration test with a real tracer, so that we'll immediately know when an incompatibility is introduced, so that the profiler can be updated. **Motivation**: Code hotspots is a key feature of our APM, allowing customers to see the profile for an individual trace. **Additional Notes**: This first implementation does not include support for endpoint aggregation; this will come in a separate PR. **How to test the change?**: This change includes code coverage. It can also be tested by enabling profiling and tracing on a Ruby app, and checking that the "Code Hotspots" tab on a generated trace is showing the profile for that trace.
Codecov Report
@@ Coverage Diff @@
## master #2342 +/- ##
=======================================
Coverage 98.29% 98.30%
=======================================
Files 1092 1092
Lines 58090 58145 +55
=======================================
+ Hits 57101 57159 +58
+ Misses 989 986 -3
📣 We’re building smart automated test selection to slash your CI/CD build times. Learn more |
if (label_pos > max_label_count) { | ||
rb_raise(rb_eRuntimeError, "BUG: Unexpected label_pos (%d) > max_label_count (%d)", label_pos, max_label_count); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this ever happen?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like someone has to modify the function for this to happen.
Does it make sense to have one of those "conditional assertion that does not exist in the production release"?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can this ever happen?
It looks like someone has to modify the function for this to happen.
Yes, exactly! I've pushed bb95c1c to better explain why that's there so you and others don't have to "reverse engineer" my intent ;)
Does it make sense to have one of those "conditional assertion that does not exist in the production release"?
In this particular case, I don't think so, although I have been thinking about some mechanism to do that. But before I explain why, I'm curious why you ask:
-
Is it because you're concerned about performance? If so, since this is a comparison between a constant and a local that recently got used, I'd expect this to be really cheap. In fact, I wonder if the compiler can be smart enough to optimize it out, by looking at the flow of
label_pos
in the function and proving it never can get bigger thanmax_label_count
. -
Is it because you're concerned about correctness? The profiler is correctly built to handle exceptions. When they happen, we log them out and then stop profiling. I think for these "this should never happen" situations, stopping profiling entirely seems like a reasonable thing to do to avoid any impact on the user application.
That said, here's why "In this particular case, I don't think so": Because I fear that contributors and customers may not realize such a mode exists.
In fact, there's a known fork of dd-trace-rb that changes exactly this function --
zendesk@192e5c8 -- and when that fork gets updated with this change, I wanted to make it easier to update this code without issues.
In its current approach, any incorrect change to this function would appear quite soon in a quite dramatic way (the profiler stops), which I think would allow us/customers to catch any bugs sooner.
Finally, I have been thinking about having some mode that we could use in CI/testing to enable extra checks, especially extra checks that may be expensive/break customer setups and we wouldn't want to have on.
A good example is compiler warnings -- we don't know which version of the C compiler our customers are using, so using -Werror
to turn compiler warnings into errors on the release version of the gem is a bad idea -- suddenly a newer version of the compiler gets released and the profiler stops working because there's a new warning.
BUUUT we could enable this mode for just for CI/development, and it would allow us to catch new warnings as we introduce them.
So yeah, I haven't yet looked into such a setup, but suggestions welcome. (But in this case, even if such a setup was already in place, I would still not move this check to be done only on CI/dev mode, for the reasons I explained above).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the detailed response.
I see this check as a compiler shortcoming: the compiler should catch this. So my thought was "let's augment the compilation step then".
It seems like leaving the check as is is the best option today, so I'm happy with that.
@@ -759,3 +819,38 @@ static VALUE _native_stats(DDTRACE_UNUSED VALUE _self, VALUE collector_instance) | |||
|
|||
return stats_as_ruby_hash(state); | |||
} | |||
|
|||
static void trace_identifiers_for(struct cpu_and_wall_time_collector_state *state, VALUE thread, struct trace_identifiers *trace_identifiers_result) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we have the GVL when calling this function?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes! Added additional documentation in c7f6401, but currently the whole sampling approach is done while holding the GVL.
unsigned long long local_root_span_id = NUM2ULL(numeric_local_root_span_id); | ||
unsigned long long span_id = NUM2ULL(numeric_span_id); | ||
|
||
snprintf(trace_identifiers_result->local_root_span_id_buffer, MAXIMUM_LENGTH_64_BIT_IDENTIFIER, "%llu", local_root_span_id); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
unsigned long long long long long long long long long long long long
... ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The implementation of trace_identifiers_for
looks pretty short and easy to follow!
…e collector **What does this PR do?**: This PR reimplements support for the profiling endpoint feature in the `CpuAndWallTime` collector. Although the implementation is different, the approach is conceptually the same as provided by `Datadog::Profiling::TraceIdentifiers::Ddtrace` for the old profiler codepaths. The constraints of this implementation are the same as the ones in #2342 and effectively this is only a small extension of the work done in that PR. **Motivation**: Endpoint profiling allows customers to filter down their profiles to individual endpoints (for web apps) that they may be interested in investigating. **Additional Notes**: Comparing a Ruby string with a `char *` is really annoying. **How to test the change?**: Beyond the code coverage included by the change, it can also be tested by enabling profiling and tracing for a Ruby web app, and then checking that the application's endpoints show up in the right sidebar in the profiling UX.
What does this PR do?:
This PR reimplements support for the code hotspots feature (connecting traces to profiles) in the
CpuAndWallTime
collector.Although the implementation is different, the approach is conceptually the same as provided by
Datadog::Profiling::TraceIdentifiers::Ddtrace
for the old profiler codepaths.Key contraints in this implementation are:
These constraints are key to:
CpuAndWallTime
collector (we never get out of C code).rb_postponed_job_register_one
) so we don't want to call Ruby methods and potentially trigger side effects during a sample.Thus, I chose to implement this feature by accessing the tracer context directly from the sampling process. Because all the information needed is available as instance variables of the objects, they can be read safely and within the needed constraints.
This implementation also includes extensive checks to make sure that if tracing evolves or is not available, profiling safely skips getting the information.
Furthermore, it includes an integration test with a real tracer, so that we'll immediately know when an incompatibility is introduced, so that the profiler can be updated.
Motivation:
Code hotspots is a key feature of our APM, allowing customers to see the profile for an individual trace.
Additional Notes:
This first implementation does not include support for endpoint aggregation; this will come in a separate PR.
How to test the change?:
This change includes code coverage. It can also be tested by enabling profiling and tracing on a Ruby app, and checking that the "Code Hotspots" tab on a generated trace is showing the
profile for that trace.