Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[PROF-10679] Add preview support for correlating profiling with otel ruby gem #3984

Merged
merged 15 commits into from
Oct 10, 2024

Conversation

ivoanjo
Copy link
Member

@ivoanjo ivoanjo commented Oct 9, 2024

What does this PR do?

This PR picks up from where #3510 left off and adds preview support for correlating profiling data with traces/spans being tracked by the otel ruby gem (including endpoint profiling). (And specifically this PR just adds a few things on top of #3510, but does not otherwise change it significantly).

Note that this differs from the recommended way of using opentelemetry/otel with dd-trace-rb, which is to follow the instructions from https://docs.datadoghq.com/tracing/trace_collection/custom_instrumentation/otel_instrumentation/ruby/ .

The key difference is -- this PR makes code hotspots work even for setups that opt to not use require 'datadog/opentelemetry' (which we still recommend as the easier path).

The approach taken here is similar to #2342 and #3466: we peek inside the implementation of the otel gem to extract the information we need (namely the span id, local root span id, trace type, and trace endpoint). This approach is potentially brittle, which is why the code is written very defensively, with the aim of never breaking the application (or profiling) if something is off -- it just won't collect code hotspots.

This feature is off by default. This PR adds a new setting to control this feature:

  • Via an environment variable (DD_PROFILING_PREVIEW_OTEL_CONTEXT_ENABLED -> set to only/both/false/true)
  • Via code c.profiling.advanced.preview_otel_context_enabled = -> set to "only"/"both"/true/false

Setting this feature to both means that the profiler will automatically pick up context from either dd-trace-rb tracing or otel library; whereas setting it to only means that the profiler will skip dd-trace-rb tracing entirely and only read context from the otel library.

(This was suggested by @AlexJF in #3510 (comment), and also doubles as a way to raise confidence in releasing this since it allows us to have it off by default)

Motivation:

We have a customer interested in running this setup, so hopefully they'll be able to start using this feature.

Additional Notes:

For this feature to work, otel traces need to be tagged with the dd-trace-rb runtime-id. See below for an example of how to do this.

If the runtime-id is not set up, it will not be possible to see profiles scoped to individual requests in the datadog UX.

How to test the change?

On top of the added test coverage, I was able to see code hotspots working for the following sinatra example app:

require 'bundler/inline'

gemfile do
  source 'https://rubygems.org'

  gem 'rackup'
  gem 'dogstatsd-ruby'
  gem 'datadog', git: 'https://github.com/datadog/dd-trace-rb', branch: 'ivoanjo/prof-9263-otlp-ruby-code-hotspots'
  gem 'sinatra'
  gem 'opentelemetry-api'
  gem 'opentelemetry-sdk'
  gem 'opentelemetry-instrumentation-sinatra'
  gem 'opentelemetry-exporter-otlp'
  gem 'pry'
end

require 'sinatra/base'
require 'opentelemetry/sdk'
require 'pry'

Datadog.configure do |c|
  c.service = 'ivoanjo-testing-opentelemetry-test'
  c.profiling.enabled = true
  c.profiling.advanced.preview_otel_context_enabled = :only
end

# Configure OpenTelemetry
OpenTelemetry::SDK.configure do |c|
  c.service_name = 'ivoanjo-testing-opentelemetry-test'
  c.use 'OpenTelemetry::Instrumentation::Sinatra'
end

class MyApp < Sinatra::Base
  get '/' do
    OpenTelemetry::Trace.current_span.add_attributes({'runtime-id' => Datadog::Core::Environment::Identity.id})
    sleep 1
    'Hello, OpenTelemetry!'
  end
end

MyApp.run!

And here's how it looks:

image

and

image

…tel gem

Things missing:

* Specs conflict with ddtrace otel specs (need to poke at appraisals)
* Missing endpoint support
While we don't need the actual span object to read the span ids, we
will need it to read the endpoint names.
… specs

I'm... unhappy about this, but couldn't think of anything better that
wouldn't involve refactoring the ddtrace tracing otel support and
that seems even worse.
This setting is off by default for now, while we're still previewing
the feature.
Could not make it use the more specific signature, unfortunately.
...otherwise this is a confusing sharp edge that I myself just
ran into.
@ivoanjo ivoanjo requested review from a team as code owners October 9, 2024 15:00
@codecov-commenter
Copy link

Codecov Report

Attention: Patch coverage is 99.03382% with 2 lines in your changes missing coverage. Please review.

Please upload report for BASE (master@e5a774a). Learn more about missing BASE report.
Report is 3 commits behind head on master.

Files with missing lines Patch % Lines
lib/datadog/core/configuration/settings.rb 94.11% 1 Missing ⚠️
...atadog/profiling/collectors/thread_context_spec.rb 99.15% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff            @@
##             master    #3984   +/-   ##
=========================================
  Coverage          ?   97.84%           
=========================================
  Files             ?     1314           
  Lines             ?    78701           
  Branches          ?     3909           
=========================================
  Hits              ?    77004           
  Misses            ?     1697           
  Partials          ?        0           

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

@pr-commenter
Copy link

pr-commenter bot commented Oct 9, 2024

Benchmarks

Benchmark execution time: 2024-10-09 15:39:55

Comparing candidate commit 9c40c0e in PR branch ivoanjo/prof-10679-otlp-ruby-code-hotspots-v2 with baseline commit e5a774a in branch master.

Found 0 performance improvements and 1 performance regressions! Performance is the same for 22 metrics, 2 unstable metrics.

scenario:profiler - sample timeline=false

  • 🟥 throughput [-0.893op/s; -0.879op/s] or [-12.543%; -12.349%]

Comment on lines +421 to +422
expect(@t1_span_id.to_i).to be > 0
expect(@t1_local_root_span_id.to_i).to be > 0
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

matter of personal preference probably, but this would also work:

expect(@t1_span_id.to_i).to be_positive

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's not a very strong reason, but I like the > 0 as being more explicit if zero is included or not.

In Ruby, 0 is neither positive nor negative... But you can still get a few weirdnesses:

[14] pry(main)> 0.0
=> 0.0
[15] pry(main)> -0.0
=> -0.0
[16] pry(main)> (0.0).object_id
=> -9223372036854775806
[17] pry(main)> (-0.0).object_id
=> 2660
[19] pry(main)> [0.0, -0.0].map(&:positive?)
=> [false, false]
[20] pry(main)> [0.0, -0.0].map(&:negative?)
=> [false, false]

so IDK, I feel like > 0 is ultra-clear while positive may leave you wondering :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you are right, I didn't know about -0.0.negative? thing

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess it makes sense for -0.0 have negative? => false because -0.0 == 0.0 but yea... computers 🤣

if (context_storage == Qnil || !RB_TYPE_P(context_storage, T_ARRAY)) return;

VALUE otel_current_span_key = get_otel_current_span_key(state);
if (otel_current_span_key == Qfalse) return;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: Noting that all the other short-circuiting ifs are essentially of the form if (xxx == Qnil). Only otel_current_span_key behaves differently. I can imagine some time in the future when this code is less fresh in our minds this could bite us back. Granted tests hopefully catch it but perhaps not a wild idea to flip the values we use as not-found and not-resolved markers so all short-circuiting ifs can be of the == Qnil form?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 2b99cb4 (PR #3987)

Comment on lines +1669 to +1674
VALUE span_context = span_and_context.span_context;
if (span_context == Qnil) return;

VALUE span_id = rb_ivar_get(span_context, at_span_id_id /* @span_id */);
VALUE span_trace_id = rb_ivar_get(span_context, at_trace_id_id /* @trace_id */);
if (span_id == Qnil || span_trace_id == Qnil || !RB_TYPE_P(span_id, T_STRING) || !RB_TYPE_P(span_trace_id, T_STRING)) return;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Every time you call otel_span_and_context_from you end up doing these things (refer to lines 1655-1661 above).

Could we not include all of these into otel_span_and_context_from? It seems like it'd improve readability and maintainability a fair bit? Would also lead otel_span to be something like { .span_id, .trace_id, .span } which also feels like a simpler API?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yup, it makes for a nice simplification in 0c124ec (PR #3987)


VALUE span_id = rb_ivar_get(span_context, at_span_id_id /* @span_id */);
VALUE span_trace_id = rb_ivar_get(span_context, at_trace_id_id /* @trace_id */);
if (span_id == Qnil || span_trace_id == Qnil || !RB_TYPE_P(span_id, T_STRING) || !RB_TYPE_P(span_trace_id, T_STRING)) return;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Minor: Giving up on the whole thing at this point seems wasteful.

I realize this is part of your strategy of assuming any unexpected thing meaning we may not be operating under the same constraints we considered when we first added this. But at this point we know we have a valid active_span and may have already found a chain of multiple valid spans.

Would it be that dangerous to break rather than return here and just carry on, taking the last valid span in the chain as the root span?

Copy link
Member Author

@ivoanjo ivoanjo Oct 10, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmmm it's a reasonable point.

My thinking is twofold:

  • If we get the wrong local root span id then I think we'd be sending incorrect data downstream. I can imagine us trying to investigate "hey, why does this span not show up correctly when 'span and children' is selected, but if I select it directly it works?" and how hard it would be to track it down to us being lenient here.

  • If this class gets changed, then potentially we won't be able to extract data from any instance. It seems unlikely (but not impossible, I'll admit), that we'll have some instances with the data, but not others. So I'm not sure if in practice this "mixed situation would happen"

...so I'm not entirely convinced it's worth trying to be lenient here 🤔

Comment on lines +89 to +95
when 'both'
:both
when 'only'
:only
else
false
end
Copy link
Contributor

@AlexJF AlexJF Oct 9, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Q: I was wondering if this was a style choice or if there's some limitation of the settings system that prevents us from using :both, :only and false directly as the settings' normalized values?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed in 222b4d4 (PR #3987). To be honest I wasn't sure about some of the details in the settings system, but on a second pass it works quite nicely.

@ivoanjo
Copy link
Member Author

ivoanjo commented Oct 10, 2024

Thanks folks! As we want to get this out asap, I'll merge the PR as-is and open a follow-up PR to address your feedback.

@ivoanjo ivoanjo merged commit 89d23d6 into master Oct 10, 2024
252 checks passed
@ivoanjo ivoanjo deleted the ivoanjo/prof-10679-otlp-ruby-code-hotspots-v2 branch October 10, 2024 08:32
@github-actions github-actions bot added this to the 2.4.0 milestone Oct 10, 2024
@ivoanjo ivoanjo added profiling Involves Datadog profiling feature Involves a product feature otel OpenTelemetry-related changes labels Oct 10, 2024
@y9v y9v mentioned this pull request Oct 11, 2024
ivoanjo added a commit to DataDog/documentation that referenced this pull request Dec 3, 2024
…etry context

This PR documents the setting introduced in version 2.4.0 of the
Datadog Ruby library (DataDog/dd-trace-rb#3984)
to have the Ruby profiler directly read from the opentelemetry context.

This allows customers that want to use the opentelemetry Ruby library
together with profiling to get the full profiler feature set.

(I've also gone ahead and did some small tweaks to the text:
Suggest using the latest version in the config example; Suggest using
Ruby 3.1+ in the requirements).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
feature Involves a product feature otel OpenTelemetry-related changes profiling Involves Datadog profiling
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants