Span.Status and http semantic convention #1818

iNikem · 2021-07-16T15:33:07Z

For HTTP status codes in the 4xx and 5xx ranges, as well as any other code the client failed to interpret, status MUST be set to Error.

This was introduced in the first version of this semantic convention and never discussed or challenged later. But although this requirement probably makes sense for CLIENT http spans, it is hardly sensible for SERVER http spans. SERVER spans are created based on the "external event": somebody made a request to my server for the URL that I cannot serve. I have no control over such requests and thus I have no way to fix this "error".

I propose:

to relax semantic convention and say that "For HTTP status codes in 4xx range, status MAY be set to Error"
to allow auto-instrumentation to expose some kind of configuration how to interpret http codes in 4xx range

Alternative approach

We can say (after https://github.com/open-telemetry/opentelemetry-collector/issues/3571 is fixed) that application developer or Operator are free to override Span.Status from Error to Ok. But I think this use-case is common enough not to force all affected users to perform the same manual work.

The text was updated successfully, but these errors were encountered:

yurishkuro · 2021-07-16T15:59:51Z

I would go as far as saying 4xx on server span should NOT be an error via default instrumentation, only if application code wants to override that.

mpetazzoni · 2021-07-16T15:59:53Z

But although this requirement probably makes sense for CLIENT http spans, it is hardly sensible for SERVER http spans.

I'd like to second this wholeheartedly. Although considering 4xx responses as errors for a client makes sense, I really don't think it does for a server returning a 4xx. Someone making an unauthenticated request and getting a 401 back isn't a failure of the application; similarly, someone requesting a page or endpoint that doesn't exist and getting a 404 back isn't a failure of the application. The original convention seems to have been written with only clients in mind (as evidenced by the "as well as ony other code the client failed to interpret" part).

I would lean towards an even stronger proposal:

For HTTP status codes in the 4xx range, Span.Status MAY be set to Error if the SpanKind is CLIENT. If the SpanKind is SERVER, Span.Status MUST NOT be set to Error.

Oberon00 · 2021-07-16T16:09:39Z

I think this was discussed on the error flagging OTEP by @tedsuo and it was strongly desired that we set the "error flag" on server spans with non-success codes. On the OTEP it was proposed to use the "status source" of "instrumentation" (not user), and backends were encouraged to ignore the instrumentation status if they have their own error detection.

However, the status source was never put into the spec because the use case was not deemed strong enough after all. Now it seems that we have a perfect use case. On the OTEP, HTTP 404 was even specifically called out, https://github.com/open-telemetry/oteps/pull/136/files#diff-358e5293cd6793fee989e5c5445cfa58c38ceda4937f2b1d51eaa2971bae321fR11

Note that some servers prefer to respond with "400" instead of "500" for certain error where it's not clear who is at fault.

So in summary: I would rather introduce the status source (or error/error_override distinction) than have the instrumentations follow more complicated mappings. This is something the backend should interpret.

iNikem · 2021-07-16T17:36:21Z

On the OTEP, HTTP 404 was even specifically called out

The referenced text reads to me as "instrumentation actually does not know if this status code should be interpreted as Error or not". If we don't know, we should not take a stance. Thus status should remain Unset.

Oberon00 · 2021-07-16T17:39:52Z

You cannot not not take a stance 😃 I think the information "{status=error, source=instrumentation}" is more helpful than "{status=unset}".

I can somehow see the logic of saying that client error codes do not constitute an error for the server side, but on the other hand, why do 5xx still constitute an error on the client side? I think this is inconsistent.

iNikem · 2021-07-16T17:42:50Z

why do 5xx still constitute an error on the client side? I think this is inconsistent.

Because that is literally "server side error". My, server, error. That I have control upon and can fix. There is not a single situation in normal course of business that would return 5xx error. As opposed to e.g. 401 :)

yurishkuro · 2021-07-16T17:52:23Z

@Oberon00 the server-side span will have the HTTP status code captured in another attribute, so information is not lost, it's not limited to just status=unset.

I find it useful to think about this in terms of SLOs: a server has a certain budget of errors it is allowed to return, which only makes sense to define in terms of errors that are the server's fault. Because if we include client errors (4xx) in that, then a misbehaving client could make the server look like it broke its SLOs, which is not a viable approach (one can't blame cloud service for breaking error SLO if one sends invalid requests).

With the client-side spans, however, the situation is different, the client is trying to perform an operation that either succeeds or fails. If client sends bad request, the operation fails, which should be reflected in the span status. For investigation purposes, the actual http status call shows why the operation failed.

Oberon00 · 2021-07-16T18:02:02Z

@iNikem

why do 5xx still constitute an error on the client side? I think this is inconsistent.

Because that is literally "server side error". My, server, error. That I have control upon and can fix. There is not a single situation in normal course of business that would return 5xx error. As opposed to e.g. 401 :)

I understand that for SERVER spans. But using the same logic of "404 is not my fault, it's on the client side", we should say for the CLIENT span: "5xx is not my fault, it's on the server side". Thus for CLIENT, we only set status=error for 4xx but not 5xx using that logic.

Oberon00 · 2021-07-16T18:08:44Z

@yurishkuro

I find it useful to think about this in terms of SLOs: a server has a certain budget of errors it is allowed to return, which only makes sense to define in terms of errors that are the server's fault. Because if we include client errors (4xx) in that, then a misbehaving client could make the server look like it broke its SLOs, which is not a viable approach (one can't blame cloud service for breaking error SLO if one sends invalid requests).

You found a use-case where it's more useful to not track client-side errors on the server side. But that does not mean that it is generally the better solution.

For example, if my 400 errors suddenly increase I should maybe roll back the latest application upgrade that changed the JSON input format. If the clients come from a 3rd party that I don't monitor I would be completely blind for these errors if the server-side does not mark them.

I think the two consistent ways of handling this are:

Every HTTP request that was technically successful (i.e. we receive any HTTP status code, there was no network error) should be status=unset. We do not set errors based on HTTP codes at all.
Every HTTP error status is considered an error on both server and client side (i.e., the status quo).

Everything in between can make lots of sense in certain situations and should be handled on the backend. The status is just some extremely simplistic fallback thingy for backends without any analysis capabilities.

Oberon00 · 2021-07-16T18:10:12Z

What I want to avoid is anything that makes users ask for instrumentation-side configurability of these status mappings. They should be viewed as fallbacks, configurability and complex rules should be on the backend.

iNikem · 2021-07-16T20:12:25Z

For example, if my 400 errors suddenly increase I should maybe roll back the latest application upgrade that changed the JSON input format. If the clients come from a 3rd party that I don't monitor I would be completely blind for these errors if the server-side does not mark them.

Just 2c about this: this use-case will be better handled by alerts on metrics on http status code. Totally separate with error status code on spans.

Oberon00 · 2021-07-17T15:20:04Z

Wouldn't metrics cover the SLO use case just as well?

iNikem · 2021-07-18T10:52:10Z

Wouldn't metrics cover the SLO use case just as well?

Yes, they could. But you will get mixed signals then. One metric, based on http status code, and another metric, based on spans.status, will provide conflicting information.

jmacd · 2021-07-26T22:02:41Z

I read through this issue and couldn't make up my mind, which makes me want to keep the status quo.

Oberon00 · 2021-07-27T08:44:44Z

One metric, based on http status code, and another metric, based on spans.status, will provide conflicting information.

I think that's by design and not a problem. You should probably exclude spans with an http.status (and other spans where you have a more specific error analyzer) from your status metrics.

tedsuo · 2021-10-11T18:03:40Z

Reposting my comment from #1998 for visibility:

@Oberon00 I disagree about the need for configuration. We do need to define a configuration language for OpenTelemetry instrumentation, to handle issues such as these. Configuration for instrumentation should definitely not be implemented in an ad-hoc manner. Instead, all instrumentation we provide should use the same (sane, reasonable) configuration language.
This configuration language should be designed thoughtfully, as part of the spec.

The instrumentation SIG is working on this configuration issue, but we would greatly appreciate more participation from core Spec members.

Oberon00 · 2021-10-11T18:08:13Z

@tedsuo then I'm also repeating my comment #1998 (comment) 😃

I'm not against configurability in general, just in this specific case. For HTTP, as discussed here, the span status is just a function of the http.status_code attribute + span kind. That function can be trivially evaluated with a different configuration on the backend to override the status if desired.

iNikem added the spec:trace Related to the specification/trace directory label Jul 16, 2021

github-actions bot assigned yurishkuro Jul 16, 2021

Oberon00 added area:error-reporting Related to error reporting area:semantic-conventions Related to semantic conventions labels Jul 16, 2021

iNikem mentioned this issue Oct 6, 2021

Don't set Span.Status for 4xx http status codes for SERVER spans #1998

Merged

carlosalberto closed this as completed in #1998 Oct 12, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Span.Status and http semantic convention #1818

Span.Status and http semantic convention #1818

iNikem commented Jul 16, 2021

yurishkuro commented Jul 16, 2021

mpetazzoni commented Jul 16, 2021

Oberon00 commented Jul 16, 2021 •

edited

Loading

iNikem commented Jul 16, 2021

Oberon00 commented Jul 16, 2021 •

edited

Loading

iNikem commented Jul 16, 2021

yurishkuro commented Jul 16, 2021

Oberon00 commented Jul 16, 2021

Oberon00 commented Jul 16, 2021 •

edited

Loading

Oberon00 commented Jul 16, 2021

iNikem commented Jul 16, 2021

Oberon00 commented Jul 17, 2021

iNikem commented Jul 18, 2021

jmacd commented Jul 26, 2021

Oberon00 commented Jul 27, 2021

tedsuo commented Oct 11, 2021

Oberon00 commented Oct 11, 2021 •

edited

Loading

Span.Status and http semantic convention #1818

Span.Status and http semantic convention #1818

Comments

iNikem commented Jul 16, 2021

Alternative approach

yurishkuro commented Jul 16, 2021

mpetazzoni commented Jul 16, 2021

Oberon00 commented Jul 16, 2021 • edited Loading

iNikem commented Jul 16, 2021

Oberon00 commented Jul 16, 2021 • edited Loading

iNikem commented Jul 16, 2021

yurishkuro commented Jul 16, 2021

Oberon00 commented Jul 16, 2021

Oberon00 commented Jul 16, 2021 • edited Loading

Oberon00 commented Jul 16, 2021

iNikem commented Jul 16, 2021

Oberon00 commented Jul 17, 2021

iNikem commented Jul 18, 2021

jmacd commented Jul 26, 2021

Oberon00 commented Jul 27, 2021

tedsuo commented Oct 11, 2021

Oberon00 commented Oct 11, 2021 • edited Loading

Oberon00 commented Jul 16, 2021 •

edited

Loading

Oberon00 commented Jul 16, 2021 •

edited

Loading

Oberon00 commented Jul 16, 2021 •

edited

Loading

Oberon00 commented Oct 11, 2021 •

edited

Loading