Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Span.Status and http semantic convention #1818

Closed
iNikem opened this issue Jul 16, 2021 · 17 comments · Fixed by #1998
Closed

Span.Status and http semantic convention #1818

iNikem opened this issue Jul 16, 2021 · 17 comments · Fixed by #1998
Assignees
Labels
area:error-reporting Related to error reporting area:semantic-conventions Related to semantic conventions spec:trace Related to the specification/trace directory

Comments

@iNikem
Copy link
Contributor

iNikem commented Jul 16, 2021

HTTP semantic convention says:

For HTTP status codes in the 4xx and 5xx ranges, as well as any other code the client failed to interpret, status MUST be set to Error.

This was introduced in the first version of this semantic convention and never discussed or challenged later. But although this requirement probably makes sense for CLIENT http spans, it is hardly sensible for SERVER http spans. SERVER spans are created based on the "external event": somebody made a request to my server for the URL that I cannot serve. I have no control over such requests and thus I have no way to fix this "error".

I propose:

  • to relax semantic convention and say that "For HTTP status codes in 4xx range, status MAY be set to Error"
  • to allow auto-instrumentation to expose some kind of configuration how to interpret http codes in 4xx range

Alternative approach

We can say (after https://github.com/open-telemetry/opentelemetry-collector/issues/3571 is fixed) that application developer or Operator are free to override Span.Status from Error to Ok. But I think this use-case is common enough not to force all affected users to perform the same manual work.

@iNikem iNikem added the spec:trace Related to the specification/trace directory label Jul 16, 2021
@yurishkuro
Copy link
Member

I would go as far as saying 4xx on server span should NOT be an error via default instrumentation, only if application code wants to override that.

@mpetazzoni
Copy link

But although this requirement probably makes sense for CLIENT http spans, it is hardly sensible for SERVER http spans.

I'd like to second this wholeheartedly. Although considering 4xx responses as errors for a client makes sense, I really don't think it does for a server returning a 4xx. Someone making an unauthenticated request and getting a 401 back isn't a failure of the application; similarly, someone requesting a page or endpoint that doesn't exist and getting a 404 back isn't a failure of the application. The original convention seems to have been written with only clients in mind (as evidenced by the "as well as ony other code the client failed to interpret" part).

I would lean towards an even stronger proposal:

For HTTP status codes in the 4xx range, Span.Status MAY be set to Error if the SpanKind is CLIENT. If the SpanKind is SERVER, Span.Status MUST NOT be set to Error.

@Oberon00
Copy link
Member

Oberon00 commented Jul 16, 2021

I think this was discussed on the error flagging OTEP by @tedsuo and it was strongly desired that we set the "error flag" on server spans with non-success codes. On the OTEP it was proposed to use the "status source" of "instrumentation" (not user), and backends were encouraged to ignore the instrumentation status if they have their own error detection.

However, the status source was never put into the spec because the use case was not deemed strong enough after all. Now it seems that we have a perfect use case. On the OTEP, HTTP 404 was even specifically called out, https://github.com/open-telemetry/oteps/pull/136/files#diff-358e5293cd6793fee989e5c5445cfa58c38ceda4937f2b1d51eaa2971bae321fR11

Note that some servers prefer to respond with "400" instead of "500" for certain error where it's not clear who is at fault.

So in summary: I would rather introduce the status source (or error/error_override distinction) than have the instrumentations follow more complicated mappings. This is something the backend should interpret.

@Oberon00 Oberon00 added area:error-reporting Related to error reporting area:semantic-conventions Related to semantic conventions labels Jul 16, 2021
@iNikem
Copy link
Contributor Author

iNikem commented Jul 16, 2021

On the OTEP, HTTP 404 was even specifically called out

The referenced text reads to me as "instrumentation actually does not know if this status code should be interpreted as Error or not". If we don't know, we should not take a stance. Thus status should remain Unset.

@Oberon00
Copy link
Member

Oberon00 commented Jul 16, 2021

You cannot not not take a stance 😃 I think the information "{status=error, source=instrumentation}" is more helpful than "{status=unset}".

I can somehow see the logic of saying that client error codes do not constitute an error for the server side, but on the other hand, why do 5xx still constitute an error on the client side? I think this is inconsistent.

@iNikem
Copy link
Contributor Author

iNikem commented Jul 16, 2021

why do 5xx still constitute an error on the client side? I think this is inconsistent.

Because that is literally "server side error". My, server, error. That I have control upon and can fix. There is not a single situation in normal course of business that would return 5xx error. As opposed to e.g. 401 :)

@yurishkuro
Copy link
Member

@Oberon00 the server-side span will have the HTTP status code captured in another attribute, so information is not lost, it's not limited to just status=unset.

I find it useful to think about this in terms of SLOs: a server has a certain budget of errors it is allowed to return, which only makes sense to define in terms of errors that are the server's fault. Because if we include client errors (4xx) in that, then a misbehaving client could make the server look like it broke its SLOs, which is not a viable approach (one can't blame cloud service for breaking error SLO if one sends invalid requests).

With the client-side spans, however, the situation is different, the client is trying to perform an operation that either succeeds or fails. If client sends bad request, the operation fails, which should be reflected in the span status. For investigation purposes, the actual http status call shows why the operation failed.

@Oberon00
Copy link
Member

@iNikem

why do 5xx still constitute an error on the client side? I think this is inconsistent.

Because that is literally "server side error". My, server, error. That I have control upon and can fix. There is not a single situation in normal course of business that would return 5xx error. As opposed to e.g. 401 :)

I understand that for SERVER spans. But using the same logic of "404 is not my fault, it's on the client side", we should say for the CLIENT span: "5xx is not my fault, it's on the server side". Thus for CLIENT, we only set status=error for 4xx but not 5xx using that logic.

@Oberon00
Copy link
Member

Oberon00 commented Jul 16, 2021

@yurishkuro

I find it useful to think about this in terms of SLOs: a server has a certain budget of errors it is allowed to return, which only makes sense to define in terms of errors that are the server's fault. Because if we include client errors (4xx) in that, then a misbehaving client could make the server look like it broke its SLOs, which is not a viable approach (one can't blame cloud service for breaking error SLO if one sends invalid requests).

You found a use-case where it's more useful to not track client-side errors on the server side. But that does not mean that it is generally the better solution.

For example, if my 400 errors suddenly increase I should maybe roll back the latest application upgrade that changed the JSON input format. If the clients come from a 3rd party that I don't monitor I would be completely blind for these errors if the server-side does not mark them.

I think the two consistent ways of handling this are:

  • Every HTTP request that was technically successful (i.e. we receive any HTTP status code, there was no network error) should be status=unset. We do not set errors based on HTTP codes at all.
  • Every HTTP error status is considered an error on both server and client side (i.e., the status quo).

Everything in between can make lots of sense in certain situations and should be handled on the backend. The status is just some extremely simplistic fallback thingy for backends without any analysis capabilities.

@Oberon00
Copy link
Member

What I want to avoid is anything that makes users ask for instrumentation-side configurability of these status mappings. They should be viewed as fallbacks, configurability and complex rules should be on the backend.

@iNikem
Copy link
Contributor Author

iNikem commented Jul 16, 2021

For example, if my 400 errors suddenly increase I should maybe roll back the latest application upgrade that changed the JSON input format. If the clients come from a 3rd party that I don't monitor I would be completely blind for these errors if the server-side does not mark them.

Just 2c about this: this use-case will be better handled by alerts on metrics on http status code. Totally separate with error status code on spans.

@Oberon00
Copy link
Member

Wouldn't metrics cover the SLO use case just as well?

@iNikem
Copy link
Contributor Author

iNikem commented Jul 18, 2021

Wouldn't metrics cover the SLO use case just as well?

Yes, they could. But you will get mixed signals then. One metric, based on http status code, and another metric, based on spans.status, will provide conflicting information.

@jmacd
Copy link
Contributor

jmacd commented Jul 26, 2021

I read through this issue and couldn't make up my mind, which makes me want to keep the status quo.

@Oberon00
Copy link
Member

One metric, based on http status code, and another metric, based on spans.status, will provide conflicting information.

I think that's by design and not a problem. You should probably exclude spans with an http.status (and other spans where you have a more specific error analyzer) from your status metrics.

@tedsuo
Copy link
Contributor

tedsuo commented Oct 11, 2021

Reposting my comment from #1998 for visibility:

@Oberon00 I disagree about the need for configuration. We do need to define a configuration language for OpenTelemetry instrumentation, to handle issues such as these. Configuration for instrumentation should definitely not be implemented in an ad-hoc manner. Instead, all instrumentation we provide should use the same (sane, reasonable) configuration language.
This configuration language should be designed thoughtfully, as part of the spec.

The instrumentation SIG is working on this configuration issue, but we would greatly appreciate more participation from core Spec members.

@Oberon00
Copy link
Member

Oberon00 commented Oct 11, 2021

@tedsuo then I'm also repeating my comment #1998 (comment) 😃

I'm not against configurability in general, just in this specific case. For HTTP, as discussed here, the span status is just a function of the http.status_code attribute + span kind. That function can be trivially evaluated with a different configuration on the backend to override the status if desired.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area:error-reporting Related to error reporting area:semantic-conventions Related to semantic conventions spec:trace Related to the specification/trace directory
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants