Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

4xx client errors to be left unset #2309

Conversation

jamesmoessis
Copy link
Contributor

@jamesmoessis jamesmoessis commented Feb 1, 2022

Fixes part of OTEP-174

Changes

As part of OTEP-174 it was suggested that span error statuses should be left unset for client 4xx spans. The reason being, a 4xx on the client does not necessarily indicate an error. For example, a 429 might not be considered an error on the client side.

Related oteps #

@jamesmoessis jamesmoessis requested review from a team February 1, 2022 23:59
@jamesmoessis
Copy link
Contributor Author

cc @denisivan0v @tedsuo

@bogdandrutu
Copy link
Member

It seems reasonable to define the same/similar behavior for SpanKind.CLIENT.

This statement does not seem that convincing that this is the right thing to do :)

@jamesmoessis
Copy link
Contributor Author

@bogdandrutu what are your thoughts? Should we change it or keep it as is?

I can think of reasons both to change it or keep it as is - I don't feel strongly either way. I'm making this PR so we can make a decision, as it is a noted decision point in getting the HTTP semconv to stable.

@iNikem
Copy link
Contributor

iNikem commented Feb 2, 2022

HTTP statuses 4xx mean "client errors". They usually mean that client made some mistake: it was not authenticated, it asked for a wrong url etc. IMO client should be made aware of this mistake so that it could fix it.

I vote against this change.

@Oberon00
Copy link
Member

Oberon00 commented Feb 2, 2022

From my memory, the span status was introduced after a long discussion where the agreement was that usually it is context-dependent if something is an error or not. Thus, the accepted OTEP defined the error codes UNSET, ERROR and OK together with a source field that indicates whether the status is set by an instrumentation or the user. Please see https://github.com/open-telemetry/oteps/blob/main/text/trace/0136-error_flagging.md

This distinction was deemed important, because an instrumentation-set code can only be a vague heuristic hint, because it does not know the context in which it is used.

E.g. did the HTTP library request some fixed API URL like GET /users to list users? Then a 404 is probably a serious issue because the URL routing on the server may be broken, a wrong host configured, etc. Did it request GET /users/exampleusername and got 404? In that case, that may not be an error and the user of the HTTP library may just have been checking if the user exists. But on a HTTP level, without knowing additional context, those 404s have to be considered equal, and the HTTP RFCs do define the 4xx as client errors. So it is natural to represent this as status=ERROR, source=INSTRUMENTATION, to indicate that this was technically an error result, but we don't know if this is a problem at all in the application's context.

Of course, the source field then did not make it from the OTEP to the specification, so this got a bit unclear. But given that we already have a definition that considers 4xx an error status on the client side at least, and we did deliberately introduce a client/server distinction when changing the original definition that also considered it an error on the server side, I am against this PR. I think the case for this backwards-incompatible change is not strong enough.

@bogdandrutu
Copy link
Member

Of course, the source field then did not make it from the OTEP to the specification, so this got a bit unclear. But given that we already have a definition that considers 4xx an error status on the client side at least, and we did deliberately introduce a client/server distinction when changing the original definition that also considered it an error on the server side, I am against this PR. I think the case for this backwards-incompatible change is not strong enough.

@Oberon00 I think this is a good example to prove that we should consider adding the source to the specs.

@pyohannes
Copy link
Contributor

pyohannes commented Feb 2, 2022

Did it request GET /users/exampleusername and got 404? In that case, that may not be an error and the user of the HTTP library may just have been checking if the user exists.

I had people reach out to me with exactly this use case, I think this often comes up when instrumenting libraries that use HTTP under the hood.

For me there is a general open question regarding the error status on a span: if a child span has a status of ERROR, and a parent span has a status of Ok or Unset, can one assume that the parent span gracefully handled the error of the child span? If yes, I think we should write that out somewhere and let 4xx be errors.

@denisivan0v
Copy link
Contributor

denisivan0v commented Feb 3, 2022

But on a HTTP level, without knowing additional context, those 404s have to be considered equal, and the HTTP RFCs do define the 4xx as client errors.

This part still seems to be unclear for me and generates even more questions, like

  • should we decide about span status on each level independently and take into account that layer's specifics only? (say, span status must be set to error on HTTP layer since HTTP RFCs do define the 4xx as client errors)?
  • if yes, should 5xx be UNSET for CLIENT spans then (5xx are server errors)?
  • in case a user (or a library they use) has resiliency policies (e.g. retry) applied on HTTP level, how to react on "transient" 4xx errors (like 429)?
  • to @pyohannes's point, if the parent span gracefully handled the error of the child span (so, different layers), how to react on these errors?

I feel it might be beneficial to relax the current requirements (yet the spec is still experimental), so we improve here in the future, for example by introducing some configuration options/language, so a user can adjust the instrumentation behavior according to their case (e.g., [400-403, 405-407, 411-419] -> ERROR, the rest 4xx -> UNSET).

@Oberon00
Copy link
Member

Oberon00 commented Feb 3, 2022

I feel it might be beneficial to relax the current requirements (yet the spec is still experimental), so we improve here in the future, for example by introducing some configuration options/language

Such configuration can get quite complex and would need to be re-implemented in every OTel language (at least, if you do code sharing between different instrumentations of the same language). I feel like this is better handled on the backend or collector.

There you could define a configuration like:

  • If
    • http.route equals /users/{username}
    • AND http.status_code equals 404
  • Then
    • Set/override span.status to OK.

Although I can see an argument for deploying such configuration along with the application.
A case for a generic ErrorDetectionSpanProcessor with a spec-defined cross-language configuration/rule file format that uses the BeforeEnd callback (#1089) to override span.status?

@tedsuo
Copy link
Contributor

tedsuo commented Feb 9, 2022

@Oberon00 100% I agree that configuring error status should happen in a collector (or even farther down the pipe). The question remains, though: what is the best default? In practice, does counting 4xx as errors just create noise that most users will immediately want to turn off?

In other words, do users want to configure 4xx errors by selectively turn them on? Or by selectively turning them off? In my experience, users want these errors suppressed, except in specific circumstances. So the proper default should be that 4xx are not marked as errors.

If others disagree, I am curious: with the observability system you currently use (and/or work on), what is the default? What default do you think operators prefer in practice, and why?

@github-actions
Copy link

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@github-actions github-actions bot added the Stale label Feb 16, 2022
@github-actions
Copy link

Closed as inactive. Feel free to reopen if this PR is still being worked on.

@github-actions github-actions bot closed this Feb 23, 2022
@denisivan0v
Copy link
Contributor

Closed as inactive. Feel free to reopen if this PR is still being worked on.

Up?

@tedsuo tedsuo reopened this Feb 24, 2022
@github-actions github-actions bot removed the Stale label Feb 25, 2022
@jamesmoessis
Copy link
Contributor Author

@carlosalberto seems we have some folks on either side of the argument, and as a result this PR isn't going anywhere. As the assignee do you have any suggestion on how we should proceed?

@carlosalberto
Copy link
Contributor

Hey @jamesmoessis

I think we need more actual eyes on this, so let's present it on the next Spec call (Tuesday). Hopefully you are there, as plan would be not to poke the community, but hopefully discuss the pros and cons ;)

@jamesmoessis
Copy link
Contributor Author

jamesmoessis commented Mar 10, 2022

@carlosalberto I would love to come to a spec meeting but unfortunately they are at 3am my local time, and I'm away on vacation for the next week. I'm happy to discuss the pros and cons here (and in SIGs that I can make).

That being said it would be great if it had more eyes on it. The community needs to make a yes/no decision on this PR in order to make a step towards a stable HTTP semconv.

@carlosalberto
Copy link
Contributor

@jamesmoessis Thanks for the heads up. So I will present that in that Spec call and will report here what's the result of that discussion.

@github-actions
Copy link

This PR was marked stale due to lack of activity. It will be closed in 7 days.

@carlosalberto
Copy link
Contributor

Hey @jamesmoessis

We discussed this in the last Spec SIG call, and we want to stick with what we have right now, based on:

  • We want to avoid changes for existing sections, as it may result in breaking changes for our users.
  • We'd prefer to have users first see these erros AND then decide to turn them off if they want (rather than the other way around).
  • We would like to see this happening in the Collector instead (i.e. as an optional processor to massage these errors) - at least for now.

Let us know what you think.

@jamesmoessis
Copy link
Contributor Author

Thank you a bunch for raising this in the spec SIG @carlosalberto! I think the conclusions reached and reasons for conclusion are completely reasonable.

I'm happy to close this PR now that it's been discussed, and seemingly a decision has been made.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Status: Done
Development

Successfully merging this pull request may close these issues.

8 participants