[PROF-10679] Add preview support for correlating profiling with otel ruby gem #3984

ivoanjo · 2024-10-09T15:00:32Z

What does this PR do?

This PR picks up from where #3510 left off and adds preview support for correlating profiling data with traces/spans being tracked by the otel ruby gem (including endpoint profiling). (And specifically this PR just adds a few things on top of #3510, but does not otherwise change it significantly).

Note that this differs from the recommended way of using opentelemetry/otel with dd-trace-rb, which is to follow the instructions from https://docs.datadoghq.com/tracing/trace_collection/custom_instrumentation/otel_instrumentation/ruby/ .

The key difference is -- this PR makes code hotspots work even for setups that opt to not use require 'datadog/opentelemetry' (which we still recommend as the easier path).

The approach taken here is similar to #2342 and #3466: we peek inside the implementation of the otel gem to extract the information we need (namely the span id, local root span id, trace type, and trace endpoint). This approach is potentially brittle, which is why the code is written very defensively, with the aim of never breaking the application (or profiling) if something is off -- it just won't collect code hotspots.

This feature is off by default. This PR adds a new setting to control this feature:

Via an environment variable (DD_PROFILING_PREVIEW_OTEL_CONTEXT_ENABLED -> set to only/both/false/true)
Via code c.profiling.advanced.preview_otel_context_enabled = -> set to "only"/"both"/true/false

Setting this feature to both means that the profiler will automatically pick up context from either dd-trace-rb tracing or otel library; whereas setting it to only means that the profiler will skip dd-trace-rb tracing entirely and only read context from the otel library.

(This was suggested by @AlexJF in #3510 (comment), and also doubles as a way to raise confidence in releasing this since it allows us to have it off by default)

Motivation:

We have a customer interested in running this setup, so hopefully they'll be able to start using this feature.

Additional Notes:

For this feature to work, otel traces need to be tagged with the dd-trace-rb runtime-id. See below for an example of how to do this.

If the runtime-id is not set up, it will not be possible to see profiles scoped to individual requests in the datadog UX.

How to test the change?

On top of the added test coverage, I was able to see code hotspots working for the following sinatra example app:

require 'bundler/inline'

gemfile do
  source 'https://rubygems.org'

  gem 'rackup'
  gem 'dogstatsd-ruby'
  gem 'datadog', git: 'https://github.com/datadog/dd-trace-rb', branch: 'ivoanjo/prof-9263-otlp-ruby-code-hotspots'
  gem 'sinatra'
  gem 'opentelemetry-api'
  gem 'opentelemetry-sdk'
  gem 'opentelemetry-instrumentation-sinatra'
  gem 'opentelemetry-exporter-otlp'
  gem 'pry'
end

require 'sinatra/base'
require 'opentelemetry/sdk'
require 'pry'

Datadog.configure do |c|
  c.service = 'ivoanjo-testing-opentelemetry-test'
  c.profiling.enabled = true
  c.profiling.advanced.preview_otel_context_enabled = :only
end

# Configure OpenTelemetry
OpenTelemetry::SDK.configure do |c|
  c.service_name = 'ivoanjo-testing-opentelemetry-test'
  c.use 'OpenTelemetry::Instrumentation::Sinatra'
end

class MyApp < Sinatra::Base
  get '/' do
    OpenTelemetry::Trace.current_span.add_attributes({'runtime-id' => Datadog::Core::Environment::Identity.id})
    sleep 1
    'Hello, OpenTelemetry!'
  end
end

MyApp.run!

And here's how it looks:

and

…tel gem Things missing: * Specs conflict with ddtrace otel specs (need to poke at appraisals) * Missing endpoint support

While we don't need the actual span object to read the span ids, we will need it to read the endpoint names.

… specs I'm... unhappy about this, but couldn't think of anything better that wouldn't involve refactoring the ddtrace tracing otel support and that seems even worse.

Sigh old rubies...

This setting is off by default for now, while we're still previewing the feature.

Could not make it use the more specific signature, unfortunately.

...otherwise this is a confusing sharp edge that I myself just ran into.

codecov-commenter · 2024-10-09T15:10:23Z

Codecov Report

Attention: Patch coverage is 99.03382% with 2 lines in your changes missing coverage. Please review.

Please upload report for BASE (master@e5a774a). Learn more about missing BASE report.
Report is 3 commits behind head on master.

Files with missing lines	Patch %	Lines
lib/datadog/core/configuration/settings.rb	94.11%	1 Missing ⚠️
...atadog/profiling/collectors/thread_context_spec.rb	99.15%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff            @@
##             master    #3984   +/-   ##
=========================================
  Coverage          ?   97.84%           
=========================================
  Files             ?     1314           
  Lines             ?    78701           
  Branches          ?     3909           
=========================================
  Hits              ?    77004           
  Misses            ?     1697           
  Partials          ?        0

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

pr-commenter · 2024-10-09T15:32:15Z

Benchmarks

Benchmark execution time: 2024-10-09 15:39:55

Comparing candidate commit 9c40c0e in PR branch ivoanjo/prof-10679-otlp-ruby-code-hotspots-v2 with baseline commit e5a774a in branch master.

Found 0 performance improvements and 1 performance regressions! Performance is the same for 22 metrics, 2 unstable metrics.

scenario:profiler - sample timeline=false

🟥 throughput [-0.893op/s; -0.879op/s] or [-12.543%; -12.349%]

y9v · 2024-10-09T18:41:37Z

spec/datadog/profiling/collectors/thread_context_spec.rb

+              expect(@t1_span_id.to_i).to be > 0
+              expect(@t1_local_root_span_id.to_i).to be > 0


matter of personal preference probably, but this would also work:

expect(@t1_span_id.to_i).to be_positive

It's not a very strong reason, but I like the > 0 as being more explicit if zero is included or not.

In Ruby, 0 is neither positive nor negative... But you can still get a few weirdnesses:

[14] pry(main)> 0.0 => 0.0 [15] pry(main)> -0.0 => -0.0 [16] pry(main)> (0.0).object_id => -9223372036854775806 [17] pry(main)> (-0.0).object_id => 2660 [19] pry(main)> [0.0, -0.0].map(&:positive?) => [false, false] [20] pry(main)> [0.0, -0.0].map(&:negative?) => [false, false]

so IDK, I feel like > 0 is ultra-clear while positive may leave you wondering :)

you are right, I didn't know about -0.0.negative? thing

I guess it makes sense for -0.0 have negative? => false because -0.0 == 0.0 but yea... computers 🤣

AlexJF · 2024-10-09T21:06:33Z

ext/datadog_profiling_native_extension/collectors_thread_context.c

+  if (context_storage == Qnil || !RB_TYPE_P(context_storage, T_ARRAY)) return;
+
+  VALUE otel_current_span_key = get_otel_current_span_key(state);
+  if (otel_current_span_key == Qfalse) return;


Nit: Noting that all the other short-circuiting ifs are essentially of the form if (xxx == Qnil). Only otel_current_span_key behaves differently. I can imagine some time in the future when this code is less fresh in our minds this could bite us back. Granted tests hopefully catch it but perhaps not a wild idea to flip the values we use as not-found and not-resolved markers so all short-circuiting ifs can be of the == Qnil form?

Fixed in 2b99cb4 (PR #3987)

AlexJF · 2024-10-09T21:22:08Z

ext/datadog_profiling_native_extension/collectors_thread_context.c

+    VALUE span_context = span_and_context.span_context;
+    if (span_context == Qnil) return;
+
+    VALUE span_id = rb_ivar_get(span_context, at_span_id_id /* @span_id */);
+    VALUE span_trace_id = rb_ivar_get(span_context, at_trace_id_id /* @trace_id */);
+    if (span_id == Qnil || span_trace_id == Qnil || !RB_TYPE_P(span_id, T_STRING) || !RB_TYPE_P(span_trace_id, T_STRING)) return;


Every time you call otel_span_and_context_from you end up doing these things (refer to lines 1655-1661 above).

Could we not include all of these into otel_span_and_context_from? It seems like it'd improve readability and maintainability a fair bit? Would also lead otel_span to be something like { .span_id, .trace_id, .span } which also feels like a simpler API?

Yup, it makes for a nice simplification in 0c124ec (PR #3987)

AlexJF · 2024-10-09T21:34:00Z

ext/datadog_profiling_native_extension/collectors_thread_context.c

+
+    VALUE span_id = rb_ivar_get(span_context, at_span_id_id /* @span_id */);
+    VALUE span_trace_id = rb_ivar_get(span_context, at_trace_id_id /* @trace_id */);
+    if (span_id == Qnil || span_trace_id == Qnil || !RB_TYPE_P(span_id, T_STRING) || !RB_TYPE_P(span_trace_id, T_STRING)) return;


Minor: Giving up on the whole thing at this point seems wasteful.

I realize this is part of your strategy of assuming any unexpected thing meaning we may not be operating under the same constraints we considered when we first added this. But at this point we know we have a valid active_span and may have already found a chain of multiple valid spans.

Would it be that dangerous to break rather than return here and just carry on, taking the last valid span in the chain as the root span?

Hmmm it's a reasonable point.

My thinking is twofold:

If we get the wrong local root span id then I think we'd be sending incorrect data downstream. I can imagine us trying to investigate "hey, why does this span not show up correctly when 'span and children' is selected, but if I select it directly it works?" and how hard it would be to track it down to us being lenient here.

If this class gets changed, then potentially we won't be able to extract data from any instance. It seems unlikely (but not impossible, I'll admit), that we'll have some instances with the data, but not others. So I'm not sure if in practice this "mixed situation would happen"

...so I'm not entirely convinced it's worth trying to be lenient here 🤔

AlexJF · 2024-10-09T21:47:07Z

lib/datadog/profiling/component.rb

+          when 'both'
+            :both
+          when 'only'
+            :only
+          else
+            false
+          end


Q: I was wondering if this was a style choice or if there's some limitation of the settings system that prevents us from using :both, :only and false directly as the settings' normalized values?

Fixed in 222b4d4 (PR #3987). To be honest I wasn't sure about some of the details in the settings system, but on a second pass it works quite nicely.

ivoanjo · 2024-10-10T08:32:14Z

Thanks folks! As we want to get this out asap, I'll merge the PR as-is and open a follow-up PR to address your feedback.

…etry context This PR documents the setting introduced in version 2.4.0 of the Datadog Ruby library (DataDog/dd-trace-rb#3984) to have the Ruby profiler directly read from the opentelemetry context. This allows customers that want to use the opentelemetry Ruby library together with profiling to get the full profiler feature set. (I've also gone ahead and did some small tweaks to the text: Suggest using the latest version in the config example; Suggest using Ruby 3.1+ in the requirements).

ivoanjo added 15 commits September 25, 2024 15:58

[PROF-9263] First working version of getting trace identifiers from o…

c579077

…tel gem Things missing: * Specs conflict with ddtrace otel specs (need to poke at appraisals) * Missing endpoint support

Refactor otel_span_context_from to also return span

538aed9

While we don't need the actual span object to read the span ids, we will need it to read the endpoint names.

Add support for collecting trace endpoint from otel spans

1c8a9a7

Add coverage for trace with invalid span

fc26465

Allow specs for otel sdk without ddtrace to co-exist with the ddtrace…

a6164bd

… specs I'm... unhappy about this, but couldn't think of anything better that wouldn't involve refactoring the ddtrace tracing otel support and that seems even worse.

Fix specs breaking on Ruby 2.3 due to missing String#unpack1

5dcec3f

Sigh old rubies...

Add appraisal gemfiles/lockfiles for opentelemetry_otlp configuration

468cb13

Remove unneeded Rubocop config

dd865ee

Apply standardrb code style fixes

19646f2

Support running in situations where Gem.loaded_specs is not available

6b67c03

Merge branch 'master' into ivoanjo/prof-10679-otlp-ruby-code-hotspots-v2

83cf3de

Merge branch 'master' into ivoanjo/prof-10679-otlp-ruby-code-hotspots-v2

b34c521

Introduce new setting to control profiler reading from otel context

9c40c0e

This setting is off by default for now, while we're still previewing the feature.

Relax type signatures to make steep happy

5227d33

Could not make it use the more specific signature, unfortunately.

Also allow preview_otel_context_enabled to be set to a symbol

85d29f1

...otherwise this is a confusing sharp edge that I myself just ran into.

ivoanjo requested review from a team as code owners October 9, 2024 15:00

ivoanjo mentioned this pull request Oct 9, 2024

[PROF-9263] Add experimental support for profiling code hotspots when used with opentelemetry ruby gem #3510

Merged

y9v reviewed Oct 9, 2024

View reviewed changes

y9v approved these changes Oct 9, 2024

View reviewed changes

AlexJF approved these changes Oct 9, 2024

View reviewed changes

ivoanjo merged commit 89d23d6 into master Oct 10, 2024
252 checks passed

ivoanjo deleted the ivoanjo/prof-10679-otlp-ruby-code-hotspots-v2 branch October 10, 2024 08:32

github-actions bot added this to the 2.4.0 milestone Oct 10, 2024

ivoanjo added profiling Involves Datadog profiling feature Involves a product feature otel OpenTelemetry-related changes labels Oct 10, 2024

ivoanjo mentioned this pull request Oct 10, 2024

[PROF-10679] Fixes for comments from PR 3984 #3987

Merged

y9v mentioned this pull request Oct 11, 2024

Bump to version 2.4.0 #3994

Merged

ivoanjo mentioned this pull request Dec 3, 2024

[NO-TICKET] Document Ruby profiler setting for reading from opentelemetry context DataDog/documentation#26576

Merged

ivoanjo mentioned this pull request Jan 2, 2025

Context concurrency and parallelism issues open-telemetry/opentelemetry-ruby#1766

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[PROF-10679] Add preview support for correlating profiling with otel ruby gem #3984

[PROF-10679] Add preview support for correlating profiling with otel ruby gem #3984

ivoanjo commented Oct 9, 2024

codecov-commenter commented Oct 9, 2024

pr-commenter bot commented Oct 9, 2024 •

edited

Loading

y9v Oct 9, 2024

ivoanjo Oct 10, 2024

y9v Oct 10, 2024

ivoanjo Oct 10, 2024

AlexJF Oct 9, 2024

ivoanjo Oct 10, 2024

AlexJF Oct 9, 2024

ivoanjo Oct 10, 2024

AlexJF Oct 9, 2024

ivoanjo Oct 10, 2024 •

edited

Loading

AlexJF Oct 9, 2024 •

edited

Loading

ivoanjo Oct 10, 2024

ivoanjo commented Oct 10, 2024

		expect(@t1_span_id.to_i).to be > 0
		expect(@t1_local_root_span_id.to_i).to be > 0

[PROF-10679] Add preview support for correlating profiling with otel ruby gem #3984

[PROF-10679] Add preview support for correlating profiling with otel ruby gem #3984

Conversation

ivoanjo commented Oct 9, 2024

codecov-commenter commented Oct 9, 2024

Codecov Report

pr-commenter bot commented Oct 9, 2024 • edited Loading

Benchmarks

scenario:profiler - sample timeline=false

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ivoanjo Oct 10, 2024 • edited Loading

Choose a reason for hiding this comment

AlexJF Oct 9, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ivoanjo commented Oct 10, 2024

pr-commenter bot commented Oct 9, 2024 •

edited

Loading

ivoanjo Oct 10, 2024 •

edited

Loading

AlexJF Oct 9, 2024 •

edited

Loading