Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Telemetry roadmap config #4061

Merged
merged 22 commits into from
Oct 25, 2023
Merged

Telemetry roadmap config #4061

merged 22 commits into from
Oct 25, 2023

Conversation

BrynCooke
Copy link
Contributor

@BrynCooke BrynCooke commented Oct 19, 2023

This PR creates a config structure that will set us up a better telemetry future.

Documentation is current in progress.

The existing configuration has grown organically, and now that inconsistencies have been removed we can now look to adding new features.

This PR contains a feature gate: telemetry_next that must be enabled to get the new config.

Run the following to generate the schema for exploration:

cargo run --features telemetry_next -- --schema > schema.json

The goals of this config are:

  • Follow the Otel specs. There's plenty of standard attributes and instruments as part of the spec.
  • Custom events and instruments.
  • Standardize access to data at the relevant service.
  • Consistency across metrics/tracing/logging.

Example config:

telemetry:

  logging:
    common:
      service_name: router
    stdout:
      enabled: false
      format: bunyan
    file:
      enabled: false
      format: bunyan

  tracing:
    common: # Renamed from trace_config
      max_attributes_per_event: 128
      max_attributes_per_span: 128
      max_attributes_per_link: 128
      max_events_per_span: 128
      max_links_per_span: 128
      parent_based_sampler: true
      sampler: always_on
      service_name: router
      service_namespace: "default"
      resource:
        d: e

      # Resources are otel config not represented in the yaml config


    propagation:
      baggage: false
      jaeger: false
      datadog: false
      request:
        header_name: "X-REQUEST-ID"
      trace_context: false
      zipkin: false

    otlp:
      enabled: true
      endpoint: "http://localhost:4317/v1/traces"




  metrics:
    common:
      service_namespace: "default"
      service_name: router
      buckets:
        - 0.1

      resource:
        test: foo
    prometheus:
      enabled: true
      path: /metrics
    otlp:
      enabled: true


  instruments:
    default_attribute_requirement_level: required
    router:
      http.server.active_requests: true
      my_instrument:
        value: unit
        type: counter
        unit: kb
        description: "my description"
        event: on_error
        attributes:
          http.response.status_code: false
          "my_attribute":
            response_header: "X-MY-HEADER"
            default: "unknown"
            redact: "foo"

    supergraph:
      my_instrument:
        value: unit
        event: on_error
        type: counter
        unit: kb
        description: "my description"
    subgraph:
      my_instrument:
        value: unit
        event: on_error
        type: counter
        unit: kb
        description: "my description"


  events:
    router:
      request: true
      response: false
      error: false
      test:
        message: "foo"
        level: info
        attributes:
          http.response.body.size: false


  spans:
    default_attribute_requirement_level: required
    legacy_request_span: true
    # The request span will be disappearing
    # router is the new root span
    router:
      attributes:
        dd.trace_id: false
        http.request.body.size: false
        http.response.body.size: false
        http.request.method: false
        http.request.method.original: false
        http.response.status_code: false
        network.protocol.name: false
        network.protocol.version: false
        network.transport: false
        error.type: false
        network.type: false
        trace_id: false
        user_agent.original: false
        client.address: false
        client.port: false
        http.route: false
        network.local.address: false
        network.local.port: false
        network.peer.address: false
        network.peer.port: false
        server.address: false
        server.port: false
        url.path: false
        url.query: false
        url.scheme: false
        "x-custom1":
          trace_id: datadog
        "x-custom2":

          response_header: "X-CUSTOM2"

          default: "unknown"
        "x-custom3":
          request_header: "X-CUSTOM3"
        "x-custom5":
          response_context: "X-CUSTOM3"
        "x-custom8":
          env: "ENV_VAR"

      #etc...
    supergraph:
      attributes:
        graphql.document: false
        graphql.operation.name: true
        graphql.operation.type: true

        "x-custom":
          query_variable: "arg1"
          default: "unknown"
          redact: ""
        "x-custom2":
          response_body: "arg2"
        "x-custom4":
          request_context: "X-CUSTOM3"
        "x-custom5":
          response_context: "X-CUSTOM3"
        "x-custom6":
          operation_name: string
        "x-custom7":
          operation_name: hash
        "x-custom8":
          env: "ENV_VAR"
      #etc...
    subgraph:

      attributes:

        graphql.federation.subgraph.name: false
        graphql.operation.name: false
        graphql.operation.type: true
        "x-custom":
          subgraph_operation_name: string
          default: "unknown"
        "x-custom2":
          subgraph_response_body: "arg2"
        "x-custom4":
          request_context: "X-CUSTOM3"
        "x-custom5":
          response_context: "X-CUSTOM3"

Checklist

Complete the checklist (and note appropriate exceptions) before the PR is marked ready-for-review.

  • Changes are compatible1
  • Documentation2 completed
  • Performance impact assessed and acceptable
  • Tests added and passing3
    • Unit Tests
    • Integration Tests
    • Manual Tests

Exceptions

Note any exceptions here

Notes

Footnotes

  1. It may be appropriate to bring upcoming changes to the attention of other (impacted) groups. Please endeavour to do this before seeking PR approval. The mechanism for doing this will vary considerably, so use your judgement as to how and when to do this.

  2. Configuration is an important part of many changes. Where applicable please try to document configuration examples.

  3. Tick whichever testing boxes are applicable. If you are adding Manual Tests, please document the manual testing (extensively) in the Exceptions.

@github-actions
Copy link
Contributor

@BrynCooke, please consider creating a changeset entry in /.changesets/. These instructions describe the process and tooling.

@router-perf
Copy link

router-perf bot commented Oct 19, 2023

CI performance tests

  • events_big_cap_high_rate - Stress test for events with a lot of users, deduplication enabled and high rate event with a big queue capacity
  • events_without_dedup - Stress test for events with a lot of users and deduplication DISABLED
  • events - Stress test for events with a lot of users and deduplication ENABLED
  • large-request - Stress test with a 1 MB request payload
  • step - Basic stress test that steps up the number of users over time
  • xlarge-request - Stress test with 10 MB request payload
  • reload - Reload test over a long period of time at a constant rate of users
  • no-graphos - Basic stress test, no GraphOS.
  • xxlarge-request - Stress test with 100 MB request payload
  • step-jemalloc-tuning - Clone of the basic stress test for jemalloc tuning
  • const - Basic stress test that runs with a constant number of users

@bnjjj
Copy link
Contributor

bnjjj commented Oct 19, 2023

Is it a draft PR ? If not I would suggest to check if we can just hide it from the json schema

@BrynCooke BrynCooke force-pushed the teleemtry-config-future branch from 6e76408 to fba1330 Compare October 19, 2023 16:07
@BrynCooke BrynCooke closed this Oct 19, 2023
@BrynCooke BrynCooke reopened this Oct 19, 2023
@BrynCooke BrynCooke marked this pull request as draft October 19, 2023 17:06
@BrynCooke
Copy link
Contributor Author

@bnjjj Yes, sorry it's still draft

@BrynCooke BrynCooke force-pushed the teleemtry-config-future branch 6 times, most recently from 431e38e to b0426a5 Compare October 20, 2023 11:12
@BrynCooke BrynCooke force-pushed the teleemtry-config-future branch from b0426a5 to 4c55750 Compare October 22, 2023 20:34
@BrynCooke BrynCooke force-pushed the teleemtry-config-future branch from f16df46 to 7b307b0 Compare October 22, 2023 21:29
@bnjjj
Copy link
Contributor

bnjjj commented Oct 23, 2023

Do we agree that for custom attributes, if it's set to false, users will still see them on spans but they will be unset. Because the attributes list must be static, it's a constraint coming from the tracing crate

@BrynCooke
Copy link
Contributor Author

Do we agree that for custom attributes, if it's set to false, users will still see them on spans but they will be unset. Because the attributes list must be static, it's a constraint coming from the tracing crate

We've got a workaround for that.
When the tracing crate is called it uses a fieldset, the issue is that it requires &'static str as keys. The way we're going to tackle this is to introduce some sort of intern functionality This will leak a small amount of memory memory, but will allow us to create dynamic tracing attributes.

bryn added 5 commits October 23, 2023 11:39
Instrument type now aligns with Otel and instrument value is something that can be configured.
Possibly these should be combined though.
@BrynCooke BrynCooke marked this pull request as ready for review October 23, 2023 15:36
Copy link
Contributor

@lennyburdette lennyburdette left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

approving after discussing in person. the changes are worthwhile but will require a lot of documentation and probably a migration guide.

@BrynCooke BrynCooke merged commit d506c4b into dev Oct 25, 2023
2 checks passed
@BrynCooke BrynCooke deleted the teleemtry-config-future branch October 25, 2023 08:33
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Improve / formalize configuration of telemetry / logging / observability / metrics / tracing
3 participants