4xx client errors to be left unset #2309

jamesmoessis · 2022-02-01T23:59:54Z

Fixes part of OTEP-174

Changes

As part of OTEP-174 it was suggested that span error statuses should be left unset for client 4xx spans. The reason is that a 4xx on the client does not necessarily indicate an error. For example, a 429 might not be considered an error on the client side.

Related OTEPs

OTEP-174

jamesmoessis · 2022-02-02T00:00:32Z

cc @denisivan0v @tedsuo

bogdandrutu · 2022-02-02T02:01:23Z

> It seems reasonable to define the same/similar behavior for SpanKind.CLIENT.

This statement does not seem that convincing that this is the right thing to do :)

jamesmoessis · 2022-02-02T05:22:03Z

@bogdandrutu what are your thoughts? Should we change it or keep it as is?

I can think of reasons both to change it and to keep it as is - I don't feel strongly either way. I'm making this PR so we can make a decision, as it is a noted decision point in getting the HTTP semconv to stable.

iNikem · 2022-02-02T11:26:32Z

HTTP 4xx statuses mean "client errors". They usually mean that the client made some mistake: it was not authenticated, it asked for the wrong URL, etc. IMO the client should be made aware of this mistake so that it can fix it.

I vote against this change.

Oberon00 · 2022-02-02T12:31:04Z

From my memory, the span status was introduced after a long discussion where the agreement was that usually it is context-dependent if something is an error or not. Thus, the accepted OTEP defined the error codes UNSET, ERROR and OK together with a source field that indicates whether the status is set by an instrumentation or the user. Please see https://github.com/open-telemetry/oteps/blob/main/text/trace/0136-error_flagging.md

This distinction was deemed important, because an instrumentation-set code can only be a vague heuristic hint, because it does not know the context in which it is used.

E.g. did the HTTP library request some fixed API URL like GET /users to list users? Then a 404 is probably a serious issue because the URL routing on the server may be broken, a wrong host configured, etc. Did it request GET /users/exampleusername and get a 404? In that case, that may not be an error and the user of the HTTP library may just have been checking if the user exists. But on an HTTP level, without knowing additional context, those 404s have to be considered equal, and the HTTP RFCs do define the 4xx as client errors. So it is natural to represent this as status=ERROR, source=INSTRUMENTATION, to indicate that this was technically an error result, but we don't know if this is a problem at all in the application's context.

Of course, the source field then did not make it from the OTEP to the specification, so this got a bit unclear. But given that we already have a definition that considers 4xx an error status on the client side at least, and we did deliberately introduce a client/server distinction when changing the original definition that also considered it an error on the server side, I am against this PR. I think the case for this backwards-incompatible change is not strong enough.
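To make the default under discussion concrete, here is a minimal Python sketch of the heuristic as it is described in this thread: a 4xx is an error on CLIENT spans and left unset on SERVER spans. `SpanKind`, `StatusCode` and `default_http_span_status` are illustrative stand-ins defined here, not the OpenTelemetry API, and treating 5xx as an error on both sides is an assumption about the existing convention.

```python
from enum import Enum


class SpanKind(Enum):
    CLIENT = "client"
    SERVER = "server"


class StatusCode(Enum):
    UNSET = "unset"
    OK = "ok"
    ERROR = "error"


def default_http_span_status(kind: SpanKind, http_status: int) -> StatusCode:
    """Context-free heuristic an instrumentation could apply (hypothetical helper)."""
    if http_status >= 500:
        # Assumption: 5xx is reported as an error for both span kinds.
        return StatusCode.ERROR
    if 400 <= http_status < 500:
        # The client/server distinction discussed above:
        # 4xx is ERROR on the client side, UNSET on the server side.
        if kind is SpanKind.CLIENT:
            return StatusCode.ERROR
        return StatusCode.UNSET
    # 1xx, 2xx and 3xx are left unset; only the application can say "OK".
    return StatusCode.UNSET
```

Note that the function never returns `StatusCode.OK`: without the application's context, the instrumentation can only flag a probable error or stay silent.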

bogdandrutu · 2022-02-02T16:51:47Z

> Of course, the source field then did not make it from the OTEP to the specification, so this got a bit unclear. But given that we already have a definition that considers 4xx an error status on the client side at least, and we did deliberately introduce a client/server distinction when changing the original definition that also considered it an error on the server side, I am against this PR. I think the case for this backwards-incompatible change is not strong enough.

@Oberon00 I think this is a good example to prove that we should consider adding the source to the specs.

pyohannes · 2022-02-02T20:01:09Z

> Did it request GET /users/exampleusername and get a 404? In that case, that may not be an error and the user of the HTTP library may just have been checking if the user exists.

I had people reach out to me with exactly this use case. I think this often comes up when instrumenting libraries that use HTTP under the hood.

For me there is a general open question regarding the error status on a span: if a child span has a status of ERROR, and a parent span has a status of Ok or Unset, can one assume that the parent span gracefully handled the error of the child span? If yes, I think we should write that out somewhere and let 4xx be errors.

denisivan0v · 2022-02-03T03:34:08Z

> But on an HTTP level, without knowing additional context, those 404s have to be considered equal, and the HTTP RFCs do define the 4xx as client errors.

This part still seems unclear to me and generates even more questions, like:

- Should we decide about span status on each level independently and take into account that layer's specifics only (say, span status must be set to error on the HTTP layer since the HTTP RFCs do define the 4xx as client errors)?
- If yes, should 5xx be UNSET for CLIENT spans then (5xx are server errors)?
- In case a user (or a library they use) has resiliency policies (e.g. retry) applied on the HTTP level, how should we react to "transient" 4xx errors (like 429)?
- To @pyohannes's point, if the parent span gracefully handled the error of the child span (so, different layers), how should we react to these errors?

I feel it might be beneficial to relax the current requirements (yet the spec is still experimental), so we improve here in the future, for example by introducing some configuration options/language, so a user can adjust the instrumentation behavior according to their case (e.g., [400-403, 405-407, 411-419] -> ERROR, the rest 4xx -> UNSET).
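As a rough illustration of that kind of option, the sketch below maps the example ranges from the paragraph above to ERROR and every other 4xx to UNSET. The names `ERROR_RANGES` and `client_span_status` are hypothetical, no such setting exists in the spec, and keeping 5xx as ERROR is an assumption carried over from the current behavior.

```python
from typing import Sequence, Tuple

# Inclusive (low, high) ranges taken from the example above:
# [400-403, 405-407, 411-419] -> ERROR, the rest of 4xx -> UNSET.
ERROR_RANGES: Tuple[Tuple[int, int], ...] = ((400, 403), (405, 407), (411, 419))


def client_span_status(
    http_status: int,
    error_ranges: Sequence[Tuple[int, int]] = ERROR_RANGES,
) -> str:
    """Return "ERROR" or "UNSET" for a CLIENT span (hypothetical helper)."""
    if http_status >= 500:
        # Assumption: server errors stay errors regardless of the 4xx config.
        return "ERROR"
    if 400 <= http_status < 500:
        for low, high in error_ranges:
            if low <= http_status <= high:
                return "ERROR"
        return "UNSET"
    return "UNSET"
```

With these ranges a 401 or a 412 is still flagged, while 404 and 429 (the two cases raised in this thread) are left unset.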

Oberon00 · 2022-02-03T10:29:55Z

> I feel it might be beneficial to relax the current requirements (yet the spec is still experimental), so we improve here in the future, for example by introducing some configuration options/language

Such configuration can get quite complex and would need to be re-implemented in every OTel language (at least, if you do code sharing between different instrumentations of the same language). I feel like this is better handled on the backend or collector.

There you could define a configuration like:

If
- http.route equals /users/{username}
- AND http.status_code equals 404
Then
- Set/override span.status to OK.

Although I can see an argument for deploying such configuration along with the application.
A case for a generic ErrorDetectionSpanProcessor with a spec-defined cross-language configuration/rule file format that uses the BeforeEnd callback (#1089) to override span.status?
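A minimal sketch of how such a rule could be evaluated, written as plain Python over dictionaries rather than against any real SDK or Collector API. `StatusRule` and `apply_status_rules` are hypothetical names, the span is modeled as a dict with `attributes` and `status` keys purely for illustration, and the BeforeEnd callback from #1089 is not assumed to exist.

```python
from dataclasses import dataclass
from typing import Any, Dict, Iterable


@dataclass(frozen=True)
class StatusRule:
    """One "If ... Then ..." rule from the configuration above (hypothetical)."""
    route: str
    status_code: int
    new_status: str


def apply_status_rules(span: Dict[str, Any], rules: Iterable[StatusRule]) -> Dict[str, Any]:
    """Return a copy of the span with its status overridden by the first matching rule."""
    attributes = span.get("attributes", {})
    for rule in rules:
        if (
            attributes.get("http.route") == rule.route
            and attributes.get("http.status_code") == rule.status_code
        ):
            return {**span, "status": rule.new_status}
    # No rule matched: keep whatever the instrumentation set.
    return span


rules = [StatusRule(route="/users/{username}", status_code=404, new_status="OK")]

lookup_span = {
    "attributes": {"http.route": "/users/{username}", "http.status_code": 404},
    "status": "ERROR",
}
listing_span = {
    "attributes": {"http.route": "/users", "http.status_code": 404},
    "status": "ERROR",
}

lookup_result = apply_status_rules(lookup_span, rules)
listing_result = apply_status_rules(listing_span, rules)
```

Here the user-existence lookup ends up with status OK, while the 404 on the fixed `/users` route keeps the ERROR status set by the instrumentation.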

tedsuo · 2022-02-09T01:06:44Z

@Oberon00 100% I agree that configuring error status should happen in a collector (or even farther down the pipe). The question remains, though: what is the best default? In practice, does counting 4xx as errors just create noise that most users will immediately want to turn off?

In other words, do users want to configure 4xx errors by selectively turning them on? Or by selectively turning them off? In my experience, users want these errors suppressed, except in specific circumstances. So the proper default should be that 4xx are not marked as errors.

If others disagree, I am curious: with the observability system you currently use (and/or work on), what is the default? What default do you think operators prefer in practice, and why?

github-actions · 2022-02-16T03:17:12Z

This PR was marked stale due to lack of activity. It will be closed in 7 days.

github-actions · 2022-02-23T03:17:16Z

Closed as inactive. Feel free to reopen if this PR is still being worked on.

denisivan0v · 2022-02-24T02:12:47Z

> Closed as inactive. Feel free to reopen if this PR is still being worked on.

Up?

jamesmoessis · 2022-03-03T06:17:18Z

@carlosalberto seems we have some folks on either side of the argument, and as a result this PR isn't going anywhere. As the assignee do you have any suggestion on how we should proceed?

carlosalberto · 2022-03-10T13:50:29Z

Hey @jamesmoessis

I think we need more actual eyes on this, so let's present it on the next Spec call (Tuesday). Hopefully you are there, as the plan would be not to poke the community, but to discuss the pros and cons ;)

jamesmoessis · 2022-03-10T18:38:48Z

@carlosalberto I would love to come to a spec meeting but unfortunately they are at 3am my local time, and I'm away on vacation for the next week. I'm happy to discuss the pros and cons here (and in SIGs that I can make).

That being said it would be great if it had more eyes on it. The community needs to make a yes/no decision on this PR in order to make a step towards a stable HTTP semconv.

carlosalberto · 2022-03-10T19:35:52Z

@jamesmoessis Thanks for the heads up. I will present it in that Spec call and report the result of the discussion here.

github-actions · 2022-03-18T03:17:42Z

This PR was marked stale due to lack of activity. It will be closed in 7 days.

carlosalberto · 2022-03-18T14:41:06Z

Hey @jamesmoessis

We discussed this in the last Spec SIG call, and we want to stick with what we have right now, based on:

- We want to avoid changes to existing sections, as they may result in breaking changes for our users.
- We'd prefer to have users first see these errors AND then decide to turn them off if they want (rather than the other way around).
- We would like to see this happening in the Collector instead (i.e. as an optional processor to massage these errors) - at least for now.

Let us know what you think.

jamesmoessis · 2022-03-21T03:58:45Z

Thank you a bunch for raising this in the spec SIG @carlosalberto! I think the conclusions reached and reasons for conclusion are completely reasonable.

I'm happy to close this PR now that it's been discussed, and seemingly a decision has been made.

4xx client errors to be left unset

1c7142e

jamesmoessis requested review from a team February 1, 2022 23:59

github-actions bot assigned carlosalberto Feb 2, 2022

github-actions bot added the Stale label Feb 16, 2022

github-actions bot closed this Feb 23, 2022

tedsuo reopened this Feb 24, 2022

github-actions bot removed the Stale label Feb 25, 2022

github-actions bot added the Stale label Mar 18, 2022

carlosalberto removed the Stale label Mar 18, 2022

jamesmoessis closed this Mar 21, 2022

Aneurysm9 mentioned this pull request Apr 1, 2022

how about add an option for otelmongo: do not treat mongo.ErrNoDocuments as error open-telemetry/opentelemetry-go-contrib#2139

Closed

denisivan0v mentioned this pull request Apr 19, 2022

Agree on the scope for HTTP semantic conventions v1.0 #2499

Closed

4 tasks

tsloughter mentioned this pull request Mar 14, 2023

[Tesla middleware] non_error_statuses option open-telemetry/opentelemetry-erlang-contrib#154

Merged
