Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

include_messages does not cover GraphQL errors from the subgraph #3851

Open
Geal opened this issue Sep 19, 2023 · 5 comments
Open

include_messages does not cover GraphQL errors from the subgraph #3851

Geal opened this issue Sep 19, 2023 · 5 comments

Comments

@Geal
Copy link
Contributor

Geal commented Sep 19, 2023

When using metrics configuration like this:

telemetry:
  metrics:
    common:
          attributes:
            subgraph:
              all:
                errors:
                  include_messages: true
                  extensions:
                    - name: error_code
                      path: .code

The include_messages option is only used when the subgraph service returned an error (more specifically a FetchError):

Err(err) => {
metric_attrs.push(KeyValue::new("status", "500"));
// Fill attributes from error
if let Some(subgraph_attributes_conf) = &*attribute_forward_config {
metric_attrs.extend(
subgraph_attributes_conf
.get_attributes_from_error(err)
.into_iter()
.map(|(k, v)| KeyValue::new(k, v)),
);
}

It should be able to report subgraph errors including the messages even if the subgraph returned a response with status 200

@Geal
Copy link
Contributor Author

Geal commented Oct 4, 2023

to reproduce: use a graph on Studio, and a subgraph that returns an error with a 200 status code. The errors page for the operation will only show <masked> and will not report the message
Screenshot from 2023-10-04 18-05-08

@BrynCooke
Copy link
Contributor

Suggest we defer this as we are touching much of this with the telemetry config work.

@lleadbet
Copy link
Contributor

lleadbet commented Oct 9, 2023

To clarify on @Geal's note on Studio above- that was unrelated to the above filed issue as AS4 now defaults to masking errors.

The messages attribute on Prometheus metrics, however, is affected by the issue reported.

@lleadbet
Copy link
Contributor

Adding more color here.

On the documentation site, it's listed that you can enable error messages via the telemetry.metrics.common.supergraph/subgraph.errors.include_messages configuration option.

Locally testing on v1.33.2, I can't seem to get this working.

To Reproduce

Given this configuration:

telemetry:
  apollo:
    field_level_instrumentation_sampler: 0.3
    errors:
      subgraph:
        all:
          redact: false
          send: true
  metrics:
    prometheus:
      enabled: true
      path: /metrics
      listen: 0.0.0.0:9999
    common:
      attributes:
        supergraph:
          response:
            body:
              - path: .errors[0].message
                name: error_message
          context:
            - named: has_errors
          errors:
            include_messages: true
        subgraph:
          all:
            response:
              body:
                - path: .errors[0].message
                  name: error_message
            errors:
              include_messages: true
              extensions:
                - name: reason
                  path: .reason

You will find that you do see an error_message attribute, but not the documented message attribute (or reason). Example entries from Prometheus:

Subgraph:

apollo_router_http_request_duration_seconds_bucket{error_message="test",status="200",subgraph="users",otel_scope_name="apollo/router",le="0.001"} 0
apollo_router_http_request_duration_seconds_bucket{error_message="test",status="200",subgraph="users",otel_scope_name="apollo/router",le="0.005"} 0

Supergraph:

apollo_router_http_request_duration_seconds_bucket{error_message="test",has_errors="true",operation_name="User",status="200",otel_scope_name="apollo/router",le="0.001"} 0
apollo_router_http_request_duration_seconds_bucket{error_message="test",has_errors="true",operation_name="User",status="200",otel_scope_name="apollo/router",le="0.005"} 0

The subgraph is returning a valid GQL error with a message + extension that includes a reason.

@o0Ignition0o
Copy link
Contributor

Not sure if it still falls under #3226 anymore.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants