Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: Add ActionView instrumentation via ActiveSupport::Notifications #796

Merged

Conversation

ahayworth
Copy link
Contributor

@ahayworth ahayworth commented Jun 7, 2021

This commit adds a safe (I hope) subscription framework for
ActiveSupport::Notifications and then uses that framework to instrument
ActionView.

Key to this PR is the fact that Rails allows you to swap out the queue for
ActiveSupport::Notifications; although documentation on that is sparse.

We implement a thin wrapper around the existing (and default) Fanout
queue and a "SpanSubscriber" that handles the actual event instrumentation.
Together, they do three special things:

  • Sort the subscribers for a notification so that our "SpanSubscriber"
    class always runs first.
  • Start a span when the event starts
  • Finish the span when the event finishes, merges in the payload as
    span attributes, and then toggles the error bit if there was an
    uncaught exception.

There are two main problems with ActiveSupport::Notifications as they
relate to telemetry, and we attempt to solve them both here:

  1. Any other subscriber to the notification in question could throw,
    and that could mean we are left with dangling spans. Our solution to
    this lies in how we sort the subscribers for a notification when
    handling the events in the queue wrapper. By ensuring that our
    handlers run first, we ensure that anyone else who might throw
    doesn't affect our span handling.

  2. Span stack handling can be a nightmare. Here, we rely on a
    undocumented (?) feature (??) of notifications: if the callable
    passed in with the subscription responds to start/finish methods,
    then Rails will call start and finish on your queue object. And
    the return value of start from your queue is saved around the
    execution of the instrumented block, and passed back in to the
    finish method on your queue object. We can use that and pass around
    the span that was started and the previous context to restore, and
    that greatly simplifies the state management.

There are, of course, tradeoffs:

  1. In this implementation, we don't allow for any kind of real
    specialization in the "SpanSubscriber" object. We could in the
    future; and individual instrumentations could always subclass it.
  2. You always lose some detail by instrumenting in this way - it's never
    going to be as rich as patching the classes directly; but it is
    something of a handy-dandy shortcut.
  3. We add a dependency on the rails instrumentation. I don't know if
    this is the right thing to do here or not, and I don't know if
    potentially installing the instrumentation many times will have a
    huge problem.

It should be noted that the Rails instrumentation is required at the moment, if you choose to use this instrumentation.

This commit adds a safe (I hope) subscription framework for
ActiveSupport::Notifications and then uses that framework to instrument
ActionView.

Key to this PR is the fact that Rails allows you to swap out the queue for
ActiveSupport::Notifications; although documentation on that is sparse.

We implement a thin wrapper around the existing (and default) Fanout
queue and a "SpanSubscriber" that handles the actual event instrumentation.
Together, they do three special things:

- Sort the subscribers for a notification so that our "SpanSubscriber"
  class always runs first.
- Start a span when the event starts
- Finish the span when the event finishes, merges in the payload as
  span attributes, and then toggles the error bit if there was an
  uncaught exception.

There are two main problems with ActiveSupport::Notifications as they
relate to telemetry, and we attempt to solve them both here:

1. Any *other* subscriber to the notification in question could throw,
   and that could mean we are left with dangling spans. Our solution to
   this lies in how we sort the subscribers for a notification when
   handling the events in the queue wrapper. By ensuring that our
   handlers run first, we ensure that anyone *else* who might throw
   doesn't affect our span handling.

2. Span stack handling can be a nightmare. Here, we rely on a
   undocumented (?) feature (??) of notifications: if the callable
   passed in with the subscription responds to `start`/`finish` methods,
   then Rails will call `start` and `finish` on your queue object. And
   the return value of `start` from your queue is saved around the
   execution of the instrumented block, and passed back in to the
   `finish` method on your queue object. We can use that and pass around
   the span that was started *and* the previous context to restore, and
   that greatly simplifies the state management.

There are, of course, tradeoffs:

1. In this implementation, we don't allow for any kind of real
   specialization in the "SpanSubscriber" object. We could in the
   future; and individual instrumentations could always subclass it.
2. You always lose some detail by instrumenting in this way - it's never
   going to be as rich as patching the classes directly; but it is
   something of a handy-dandy shortcut.
3. We add a dependency on the `rails` instrumentation. I don't know if
   this is the right thing to do here or not, and I don't know if
   potentially installing the instrumentation many times will have a
   huge problem.

This PR is not really done; I have tested it locally but not
extensively. And notably I haven't written any tests whatsoever for it.
But I'm putting it up now because I would like early feedback on this
approach!
@ahayworth
Copy link
Contributor Author

cc @open-telemetry/ruby-approvers - if you'd like to give some early feedback on this, I'd appreciate it. And, for kicks and giggles, here's a screenshot of it all working:

Screen Shot 2021-06-07 at 3 08 46 PM

@robbkidd
Copy link
Member

robbkidd commented Jun 7, 2021

I'm in.

@fbogsany
Copy link
Contributor

fbogsany commented Jun 8, 2021

This is really nice!

@@ -26,6 +26,7 @@ Gem::Specification.new do |spec|
spec.require_paths = ['lib']
spec.required_ruby_version = '>= 2.5.0'

spec.add_dependency 'opentelemetry-instrumentation-action_view', '~> 1.0.0.rc1'
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This version constraint from inst-all to inst-action_view uses a version number later than inst-action_view's current version (0.19.0).

Found attempting to use this branch for smoke testing.

Unhappy Bundler
Resolving dependencies...
Bundler could not find compatible versions for gem "opentelemetry-instrumentation-action_view":
  In Gemfile:
    opentelemetry-instrumentation-all was resolved to 0.18.0, which depends on
      opentelemetry-instrumentation-action_view (~> 1.0.0.rc1)

Could not find gem 'opentelemetry-instrumentation-action_view (~> 1.0.0.rc1)',
  which is required by gem 'opentelemetry-instrumentation-all', in any of the sources.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think this is because the instrumentation generator uses the api/sdk version as a default, when that might not actually be what anyone wants now that they are not kept in lockstep. cc @robertlaurin

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@robbkidd I updated to the correct version, but since the gem is unreleased I don't know how bundler will actually behave with the -all gem. When I was testing locally I had just installed the gem from a file path in Bundler.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am very interested in your smoke-testing though; this "works on my machine" ™️ but I would not be at all surprised if it breaks horribly in ways I haven't foreseen.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't think we actually talked about what the new default should be but lets just start by setting it to something that makes more sense than the api version #803

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I ... think yes? I'll refresh mental state on it and report back tomorrow!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Succeeded in using the new gem, yes.

gem 'opentelemetry-api'
gem 'opentelemetry-sdk'
gem 'opentelemetry-exporter-otlp'

github 'ahayworth/opentelemetry-ruby', branch: 'ahayworth-activesupport-notifications', glob: 'instrumentation/*/*.gemspec' do
  gem 'opentelemetry-instrumentation-action_view'
  gem 'opentelemetry-instrumentation-rails'
end

Currently seeing errors and no spans produced for ActionView renderings:

E, [2021-06-26T00:11:38.603174 #1] ERROR -- : OpenTelemetry error: calls to detach should match corresponding calls to attach.
E, [2021-06-26T00:11:38.603274 #1] ERROR -- : OpenTelemetry error: calls to detach should match corresponding calls to attach.
E, [2021-06-26T00:11:38.603454 #1] ERROR -- : OpenTelemetry error: calls to detach should match corresponding calls to attach.

Of course, I could be doing something wrong.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ahayworth With some judicious puts statements sprinkled in SpanSubscriber's start() and finish(), I'm seeing that the payload does not have the span or token tucked into it. So, I'm thinking the OTel replacement for Fanout isn't taking effect, maybe?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So, I'm thinking the OTel replacement for Fanout isn't taking effect, maybe?

@robbkidd Quite probably, yes. Can you share how you're initializing the OpenTelemetry SDK in your tests? (And/or if you have any kind of public test app that reproduces it, that'd be fantastic too)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ahayworth My test lab is very messy which could very well be the origin of my problem. I'll work on getting it into a state that is shareable.

@ahayworth ahayworth marked this pull request as ready for review June 24, 2021 18:44
@ahayworth
Copy link
Contributor Author

@open-telemetry/ruby-approvers I've updated to the latest context and status methods, and I think this is ready for legitimate review.

Copy link
Contributor

@fbogsany fbogsany left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks awesome. I have one question about payload value types, but otherwise this LGTM 🎉

According to the spec, we should allow:
- string
- boolean
- numeric
- arrays where all members are the same type (and that type is string,
  boolean, or numeric)
- empty arrays

We filter out anything that doesn't meet that criteria (and we continue
filtering out exception/exception_object attributes). We allow symbols
as values and stringify them to meet the spec (we do not transform
anything else).
These will never be set as a span attribute value.
@ahayworth
Copy link
Contributor Author

For any passers-by: I'm going to test this internally at GitHub a bit more on Monday, before we merge; just to be safe. Unless someone has been lurking and testing it in their app all along, which would be fantastic. 😆

@ahayworth
Copy link
Contributor Author

@robbkidd and I found a way to initialize the queue-swapping more reliably in 967bbdc, and I'm testing this now in production.

Copy link
Member

@robbkidd robbkidd left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

With the Rails app I was using previously, the change in 967bbdc results in no more OpenTelemetry error: calls to detach should match corresponding calls to attach. errors and a successful child span for action_view render_template being emitted within a trace.

@ahayworth
Copy link
Contributor Author

I ran this in production for a few hours, and noted nothing obviously or dramatically wrong. I think we're good to merge @open-telemetry/ruby-maintainers 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants