Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Experimental support for Apollo tracing over OTLP #4982

Merged
merged 102 commits into from
Jun 10, 2024

Conversation

timbotnik
Copy link
Contributor

@timbotnik timbotnik commented Apr 18, 2024

https://apollographql.atlassian.net/browse/ROUTER-348

As the ecosystem around OpenTelemetry (OTel) has been expanding rapidly, we are evaluating a migration of Apollo's internal tracing system to use an OTel-based protocol.

In the short-term, benefits include:

  • A comprehensive way to visualize the Router execution path in Studio.
  • Additional spans that were previously not included in Studio traces, such as query parsing, planning, execution, and more.
  • Additional metadata such as subgraph fetch details, Router idle / busy timing, and more.

Long-term, we see this as a strategic enhancement to consolidate these 2 disparate tracing systems.
This will pave the way for future enhancements to more easily plug into the Studio trace visualizer.

Configuration

This change adds a new configuration option experimental_otlp_tracing_sampler. This can be used to send
a percentage of traces via OTLP instead of the native Apollo Usage Reporting protocol. Supported values:

  • always_off (default): send all traces via Apollo Usage Reporting protocol.
  • always_on: send all traces via OTLP.
  • 0.0 - 1.0: the ratio of traces to send via OTLP (0.5 = 50 / 50).

Note that this sampler is only applied after the common tracing sampler, for example:

Sample 1% of traces, send all traces via OTLP:

telemetry:
  apollo:
    # Send all traces via OTLP
    experimental_otlp_tracing_sampler: always_on

  exporters:
    tracing:
      common:
        # Sample traces at 1% of all traffic
        sampler: 0.01

Checklist

Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.

  • Changes are compatible1
  • Documentation2 completed
  • Performance impact assessed and acceptable
  • Tests added and passing3
    • Unit Tests
    • Integration Tests
    • Manual Tests

Exceptions

Note any exceptions here

Notes

Footnotes

  1. It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this.

  2. Configuration is an important part of many changes. Where applicable please try to document configuration examples.

  3. Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions.

This comment has been minimized.

@timbotnik timbotnik changed the title Timbotnik/apollo otlp/initial support WIP: initial support for Apollo over OTLP Apr 18, 2024
@timbotnik timbotnik changed the title WIP: initial support for Apollo over OTLP WIP: experiment to support for Apollo over OTLP Apr 18, 2024
@timbotnik timbotnik changed the title WIP: experiment to support for Apollo over OTLP WIP: experiment to support Apollo tracing over OTLP Apr 18, 2024
@timbotnik timbotnik requested a review from bnjjj April 18, 2024 13:38
@router-perf
Copy link

router-perf bot commented Apr 18, 2024

CI performance tests

  • reload - Reload test over a long period of time at a constant rate of users
  • events_big_cap_high_rate_callback - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity using callback mode
  • events_without_dedup_callback - Stress test for events with a lot of users and deduplication DISABLED using callback mode
  • large-request - Stress test with a 1 MB request payload
  • const - Basic stress test that runs with a constant number of users
  • no-graphos - Basic stress test, no GraphOS.
  • step-jemalloc-tuning - Clone of the basic stress test for jemalloc tuning
  • events - Stress test for events with a lot of users and deduplication ENABLED
  • events_callback - Stress test for events with a lot of users and deduplication ENABLED in callback mode
  • events_big_cap_high_rate - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity
  • events_without_dedup - Stress test for events with a lot of users and deduplication DISABLED
  • xxlarge-request - Stress test with 100 MB request payload
  • xlarge-request - Stress test with 10 MB request payload
  • step - Basic stress test that steps up the number of users over time

@timbotnik timbotnik force-pushed the timbotnik/apollo-otlp/initial-support branch from ea8c4a2 to b0755b2 Compare April 18, 2024 15:32
@timbotnik timbotnik force-pushed the timbotnik/apollo-otlp/initial-support branch from b0755b2 to 376f480 Compare April 18, 2024 15:41
@bnjjj bnjjj requested review from a team as code owners June 3, 2024 08:30
Copy link
Contributor

@garypen garypen left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It looks generally good, but I'm concerned about the lock around the export future. Maybe worth taking another look at that or at least convincing me that it works as you'd expect.

resource_template: Resource::new([
KeyValue::new(
"apollo.router.id",
ROUTER_ID.get_or_init(Uuid::new_v4).to_string(),
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is initialised in a couple of places now. Let's have a fn router_id() that contains ROUTER_ID.get_or_init(||Uuid::new_v4().to_string())

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

SGTM 👍

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@@ -0,0 +1,519 @@
//! Be aware that this test file contains some fairly flaky tests which embed a number of
Copy link
Contributor

@BrynCooke BrynCooke Jun 3, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the integration test runner was used rather than the test harness then all the tests could be isolated.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actually I haven't seen any flakiness TBH - I'll remove this line.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@timbotnik timbotnik requested review from garypen, BrynCooke and bnjjj June 4, 2024 05:27
timbotnik added 3 commits June 6, 2024 13:29
During testing, we found that some spans were missing at the top-level.  This means that our trace viewer is not able to connect the full trace tree and is rendered as an “invalid” trace.  I’ll raise a discussion about whether users can inject additional spans between the request and the supergraph / execution spans which will end up breaking our tree.
@timbotnik timbotnik merged commit cee539e into dev Jun 10, 2024
19 checks passed
@timbotnik timbotnik deleted the timbotnik/apollo-otlp/initial-support branch June 10, 2024 08:37
@lrlna lrlna mentioned this pull request Jun 18, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants