Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Improve / formalize configuration of telemetry / logging / observability / metrics / tracing #3226

Open
23 of 28 tasks
BrynCooke opened this issue Jun 6, 2023 · 34 comments · Fixed by #4061, #4123 or #4133
Open
23 of 28 tasks
Assignees

Comments

@BrynCooke
Copy link
Contributor

BrynCooke commented Jun 6, 2023

Currently we have an experimental_logging section that we should bring out of experimental and in addition support json formatting options. There have also been several other users asking for much more control over what is output in both spans and logs, so it's maybe time to take a more holistic look.

The things that users have asked us for are:

  • Secrets redaction
  • The ability to exclude spans from logging output
  • The ability to disable attributes on spans
  • The ability to only have attributes on spans if headers are present on the root request.
  • Optional log events for when there are http or graphql errors
  • Optionally set the supergraph span to error if there were graphql errors in the response
  • supergraph span is wrapped in an operation span that follows semantic conventions. Existing attributes on supergraph that are from semantic conventions are moved onto the new span.
    * Log levels are made easier to use. (see below yaml)

Config

Suggest the following format:

telemetry:

  tracing:
    common: # Renamed from trace_config
      max_attributes_per_event: 128
      max_attributes_per_span: 128
      max_attributes_per_link: 128
      max_events_per_span: 128
      max_links_per_span: 128
      parent_based_sampler: true
      sampler: always_on
      service_name: router
      service_namespace: "default"
      resource:
        d: e

      # Resources are otel config not represented in the yaml config


    propagation:
      baggage: false
      jaeger: false
      datadog: false
      request:
        header_name: "X-REQUEST-ID"
      trace_context: false
      zipkin: false

    otlp:
      enabled: true
      endpoint: "http://localhost:4317/v1/traces"




  metrics:
    common:
      service_namespace: "default"
      service_name: router
      buckets:
        - 0.1

      resource:
        test: foo
    prometheus:
      enabled: true
      path: /metrics
    otlp:
      enabled: true


  instruments:
    default_attribute_requirement_level: required
    router:
      http.server.active_requests: true
      my_instrument:
        value: unit
        type: counter
        unit: kb
        description: "my description"
        event: on_error
        attributes:

          http.response.status_code: false
          "my_attribute":
            response_header: "X-MY-HEADER"
            default: "unknown"
            redact: "foo"

    supergraph:
      my_instrument:
        value: unit
        event: on_error
        type: counter
        unit: kb
        description: "my description"
    subgraph:
      my_instrument:
        value: unit
        event: on_error
        type: counter
        unit: kb
        description: "my description"


  events:
    router:
      request: true
      response: false
      error: false
      test:
        message: "foo"
        level: info
        attributes:
          http.response.body.size: false


  spans:
    default_attribute_requirement_level: required
    legacy_request_span: true
    # The request span will be disappearing
    # router is the new root span
    router:
      attributes:
        dd.trace_id: false
        http.request.body.size: false
        http.response.body.size: false
        http.request.method: false
        http.request.method.original: false
        http.response.status_code: false
        network.protocol.name: false
        network.protocol.version: false
        network.transport: false
        error.type: false
        network.type: false
        trace_id: false
        user_agent.original: false
        client.address: false
        client.port: false
        http.route: false
        network.local.address: false
        network.local.port: false
        network.peer.address: false
        network.peer.port: false
        server.address: false
        server.port: false
        url.path: false
        url.query: false
        url.scheme: false
        "x-custom1":
          trace_id: datadog
        "x-custom2":

          response_header: "X-CUSTOM2"

          default: "unknown"
        "x-custom3":
          request_header: "X-CUSTOM3"
        "x-custom5":
          response_context: "X-CUSTOM3"
        "x-custom8":
          env: "ENV_VAR"

      #etc...
    supergraph:
      attributes:
        graphql.document: false
        graphql.operation.name: true
        graphql.operation.type: true

        "x-custom":
          query_variable: "arg1"
          default: "unknown"
          redact: ""
        "x-custom2":
          response_body: "arg2"
        "x-custom4":
          request_context: "X-CUSTOM3"
        "x-custom5":
          response_context: "X-CUSTOM3"
        "x-custom6":
          operation_name: string
        "x-custom7":
          operation_name: hash
        "x-custom8":
          env: "ENV_VAR"
      #etc...
    subgraph:

      attributes:

        graphql.federation.subgraph.name: false
        graphql.operation.name: false
        graphql.operation.type: true
        "x-custom":
          subgraph_operation_name: string
          default: "unknown"
        "x-custom2":
          subgraph_response_body: "arg2"
        "x-custom4":
          request_context: "X-CUSTOM3"
        "x-custom5":
          response_context: "X-CUSTOM3"

Related is: #1840 which also has some good ideas that have been folded into the above.

Path forward

This is a fairly large set of changes and will need to be split into several tickets. Let's get agreements that this is the way forward and try and schedule it in as it is a blocker for some users to adoption and could have some major performance improvement ramifications due to reduced logging requriements.

Related issues

Implementation plan

There are a large number of items that need to be tackled. Once the new config structure has been implemented then we can start farming the tasks out to separate people.

Clean up existing config

New config structure

  • Add new Config with fields skipped for schema generation and serialization.
  • Create documentation for new tracing functionality.

Spans

Tracing

  • Add stdout exporter

Testing

Logging

  • Implement stdout logger
    • Implement OTEL format
    • Implement JSON format
    • Implement text format

Instruments

Events

@bnjjj
Copy link
Contributor

bnjjj commented Jun 7, 2023

I suggest to use format in that way:

format:
    json:
      location: true | false
      filename: true | false
      line_number: true | false
      spans: true | false
    # text:
    #   location: true | false
    #   filename: true | false
    #   line_number: true | false
    #   spans: true | false

it's also related to the discussion #1961

@abernix
Copy link
Member

abernix commented Jun 12, 2023

Should we close #1840 since its ideas have been folded into the above?

@piclemx
Copy link

piclemx commented Jun 21, 2023

This request include another way to send logs to?

@BrynCooke
Copy link
Contributor Author

Not currently. But let's add it as this could be implemented using https://github.com/tokio-rs/tracing/tree/master/tracing-appender

@kindermax
Copy link
Contributor

Hi, I can be wrong, but does the new logging configuration support injecting custom fields into JSON?
It would be very convenient to add some custom field to all logs such as service: my-apollo-router ?

@BrynCooke
Copy link
Contributor Author

Let's add this, I''ll have a think about how to update the config.

@Geal
Copy link
Contributor

Geal commented Jun 23, 2023

pointing out now that any initiative around logging configuration has a huge risk of destroying performance, so this should be built carefully

@hrkfdn
Copy link

hrkfdn commented Jun 26, 2023

Would also be cool if field names could be specified, i.e. to fit into a certain log structure, which would be necessary to make logs searchable/indexable. The example below would "rename" message to msg:

format:
    json:
        - type: spans
          name: spans
        - type: message
          name: msg

Wording of type and name is debatable of course. Also name could be optional if it's not deviating from type.

@BrynCooke BrynCooke changed the title Formalize configuration of logging Formalize configuration of telemetry Jul 10, 2023
@Bjohnson131
Copy link

Hello, my issue #3502 was linked here. I think that if I were to propose a location for this issue, maybe

telemetry -> redaction -> subgraph_errors (boolean)

would be the proper place? That would require you to change redaction from a list to an object.

@abernix abernix changed the title Formalize configuration of telemetry Improve / formalize configuration of telemetry / logging / observability / metrics / tracing Aug 21, 2023
@Geal
Copy link
Contributor

Geal commented Aug 29, 2023

@BrynCooke
Copy link
Contributor Author

I've spent some time thinking about this and updated the example.
I have not put in anything around customization of json format. Can people who have indicated that they want this give some example of why this is needed and what formats they are targeting. My fear is that even if we do something it won't be flexible enough to actually be useful. If there are specific well known formats that you'd like to see then that would be different.

@hrkfdn
Copy link

hrkfdn commented Aug 29, 2023

Our usecase is to follow company-wide log formats so that log lines are indexed properly. Being able to select and naming fields would be nice (as outlined in #3226 (comment)), as well as populating specific fields with static fields. Though I can see how the latter may be a little too specific.

@yanns
Copy link
Contributor

yanns commented Aug 29, 2023

Our usecase is to follow company-wide log formats so that log lines are indexed properly. Being able to select and naming fields would be nice (as outlined in #3226 (comment)), as well as populating specific fields with static fields. Though I can see how the latter may be a little too specific.

same usecases here (naming fields + adding static fields)

@BrynCooke
Copy link
Contributor Author

Can you post some examples of the formats that you are seeking to work with?

@BrynCooke BrynCooke reopened this Oct 25, 2023
@BrynCooke BrynCooke linked a pull request Oct 31, 2023 that will close this issue
6 tasks
BrynCooke pushed a commit that referenced this issue Oct 31, 2023
`flatten` doesn't do what we want.
Implement custom deserializer and add tests.

Part of #3226
BrynCooke pushed a commit that referenced this issue Oct 31, 2023
`flatten` doesn't do what we want.
Implement custom deserializer and add tests.

Part of #3226
BrynCooke added a commit that referenced this issue Oct 31, 2023
Serde `flatten` doesn't do what we want it to do.
Custom `Deserializer` that will to the right thing, first trying a
custom attribute and then a standard attribute.

Part of #3226 

<!-- start metadata -->
---

**Checklist**

Complete the checklist (and note appropriate exceptions) before the PR
is marked ready-for-review.

- [ ] Changes are compatible[^1]
- [ ] Documentation[^2] completed
- [ ] Performance impact assessed and acceptable
- Tests added and passing[^3]
    - [ ] Unit Tests
    - [ ] Integration Tests
    - [ ] Manual Tests

**Exceptions**

*Note any exceptions here*

**Notes**

[^1]: It may be appropriate to bring upcoming changes to the attention
of other (impacted) groups. Please endeavour to do this before seeking
PR approval. The mechanism for doing this will vary considerably, so use
your judgement as to how and when to do this.
[^2]: Configuration is an important part of many changes. Where
applicable please try to document configuration examples.
[^3]: Tick whichever testing boxes are applicable. If you are adding
Manual Tests, please document the manual testing (extensively) in the
Exceptions.

Co-authored-by: bryn <bryn@apollographql.com>
@bnjjj bnjjj reopened this Oct 31, 2023
BrynCooke pushed a commit that referenced this issue Nov 2, 2023
Many of the features are not implemented yet and will be enabled over time.
Part of #3226
BrynCooke pushed a commit that referenced this issue Nov 2, 2023
Many of the features are not implemented yet and will be enabled over time.
Part of #3226
@BrynCooke BrynCooke linked a pull request Nov 3, 2023 that will close this issue
6 tasks
BrynCooke added a commit that referenced this issue Nov 3, 2023
Note that this code is not currently used, but is needed for the other
telemetry PR changes.

Various issues with telemetry next config were discovered during
documentation. The code changes are addressed in this PR with a separate
docs PR in the works.


Part of #3226


<!-- start metadata -->
---

**Checklist**

Complete the checklist (and note appropriate exceptions) before the PR
is marked ready-for-review.

- [ ] Changes are compatible[^1]
- [ ] Documentation[^2] completed
- [ ] Performance impact assessed and acceptable
- Tests added and passing[^3]
    - [ ] Unit Tests
    - [ ] Integration Tests
    - [ ] Manual Tests

**Exceptions**

*Note any exceptions here*

**Notes**

[^1]: It may be appropriate to bring upcoming changes to the attention
of other (impacted) groups. Please endeavour to do this before seeking
PR approval. The mechanism for doing this will vary considerably, so use
your judgement as to how and when to do this.
[^2]: Configuration is an important part of many changes. Where
applicable please try to document configuration examples.
[^3]: Tick whichever testing boxes are applicable. If you are adding
Manual Tests, please document the manual testing (extensively) in the
Exceptions.

Co-authored-by: bryn <bryn@apollographql.com>
@BrynCooke BrynCooke reopened this Nov 3, 2023
@BrynCooke
Copy link
Contributor Author

BrynCooke commented Nov 30, 2023

Just a quick update for people following this ticket.

We've spent considerable effort to make telemetry spans customizable via yaml and to follow the Otel semantic conventions. This will land in the next release.

Users with a commercial license will have fine grained control over what span attributes are present on spans and are able to attach arbitrary attributes. Free users are able to use the less granular requirement level that align with the opentelemetry semantic conventions.

Still TODO are instrument and event customization via yaml. We are hoping to get to these early next year, as they will allow users to setup conditional logging and metrics without having to reach for Rhai or a custom plugin.

Custom logging formats are also still on the table as well as exporting logs via opentelemetry bridge for collection via OTLP.

@BrynCooke
Copy link
Contributor Author

Another update on this ticket.
Custom instruments and events are now possible in yaml, and users seem to be tacking advantage of these new features!

One thing that is still unimplemented is redaction. So if you need this then get in touch.

@oskargotte
Copy link

oskargotte commented Jun 28, 2024

One thing that is still unimplemented is redaction. So if you need this then get in touch.

@BrynCooke What about #3502 ?
Are there any options to get the redacted subgraph errors included in the router logs but not in the GraphQL response? OR are there any plans to support it?

@BrynCooke
Copy link
Contributor Author

@oskargotte We haven't has many folks asking for redaction. Some of them do redaction centrally via their logging software. We'll still need more people to ask for this feature.

@Bjohnson131
Copy link

FWIW, logging redaction isn't really a standard feature for many programs that exist today, so it's understandable that this won't be supported.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment