-
Notifications
You must be signed in to change notification settings - Fork 897
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Span.Status and http semantic convention #1818
Comments
I would go as far as saying 4xx on server span should NOT be an error via default instrumentation, only if application code wants to override that. |
I'd like to second this wholeheartedly. Although considering 4xx responses as errors for a client makes sense, I really don't think it does for a server returning a 4xx. Someone making an unauthenticated request and getting a 401 back isn't a failure of the application; similarly, someone requesting a page or endpoint that doesn't exist and getting a 404 back isn't a failure of the application. The original convention seems to have been written with only clients in mind (as evidenced by the "as well as ony other code the client failed to interpret" part). I would lean towards an even stronger proposal:
|
I think this was discussed on the error flagging OTEP by @tedsuo and it was strongly desired that we set the "error flag" on server spans with non-success codes. On the OTEP it was proposed to use the "status source" of "instrumentation" (not user), and backends were encouraged to ignore the instrumentation status if they have their own error detection. However, the status source was never put into the spec because the use case was not deemed strong enough after all. Now it seems that we have a perfect use case. On the OTEP, HTTP 404 was even specifically called out, https://github.com/open-telemetry/oteps/pull/136/files#diff-358e5293cd6793fee989e5c5445cfa58c38ceda4937f2b1d51eaa2971bae321fR11 Note that some servers prefer to respond with "400" instead of "500" for certain error where it's not clear who is at fault. So in summary: I would rather introduce the status source (or error/error_override distinction) than have the instrumentations follow more complicated mappings. This is something the backend should interpret. |
The referenced text reads to me as "instrumentation actually does not know if this status code should be interpreted as Error or not". If we don't know, we should not take a stance. Thus status should remain |
You cannot not not take a stance 😃 I think the information "{status=error, source=instrumentation}" is more helpful than "{status=unset}". I can somehow see the logic of saying that client error codes do not constitute an error for the server side, but on the other hand, why do 5xx still constitute an error on the client side? I think this is inconsistent. |
Because that is literally "server side error". My, server, error. That I have control upon and can fix. There is not a single situation in normal course of business that would return 5xx error. As opposed to e.g. 401 :) |
@Oberon00 the server-side span will have the HTTP status code captured in another attribute, so information is not lost, it's not limited to just I find it useful to think about this in terms of SLOs: a server has a certain budget of errors it is allowed to return, which only makes sense to define in terms of errors that are the server's fault. Because if we include client errors (4xx) in that, then a misbehaving client could make the server look like it broke its SLOs, which is not a viable approach (one can't blame cloud service for breaking error SLO if one sends invalid requests). With the client-side spans, however, the situation is different, the client is trying to perform an operation that either succeeds or fails. If client sends bad request, the operation fails, which should be reflected in the span status. For investigation purposes, the actual http status call shows why the operation failed. |
I understand that for SERVER spans. But using the same logic of "404 is not my fault, it's on the client side", we should say for the CLIENT span: "5xx is not my fault, it's on the server side". Thus for CLIENT, we only set status=error for 4xx but not 5xx using that logic. |
You found a use-case where it's more useful to not track client-side errors on the server side. But that does not mean that it is generally the better solution. For example, if my 400 errors suddenly increase I should maybe roll back the latest application upgrade that changed the JSON input format. If the clients come from a 3rd party that I don't monitor I would be completely blind for these errors if the server-side does not mark them. I think the two consistent ways of handling this are:
Everything in between can make lots of sense in certain situations and should be handled on the backend. The status is just some extremely simplistic fallback thingy for backends without any analysis capabilities. |
What I want to avoid is anything that makes users ask for instrumentation-side configurability of these status mappings. They should be viewed as fallbacks, configurability and complex rules should be on the backend. |
Just 2c about this: this use-case will be better handled by alerts on metrics on http status code. Totally separate with error status code on spans. |
Wouldn't metrics cover the SLO use case just as well? |
Yes, they could. But you will get mixed signals then. One metric, based on http status code, and another metric, based on spans.status, will provide conflicting information. |
I read through this issue and couldn't make up my mind, which makes me want to keep the status quo. |
I think that's by design and not a problem. You should probably exclude spans with an http.status (and other spans where you have a more specific error analyzer) from your status metrics. |
Reposting my comment from #1998 for visibility: @Oberon00 I disagree about the need for configuration. We do need to define a configuration language for OpenTelemetry instrumentation, to handle issues such as these. Configuration for instrumentation should definitely not be implemented in an ad-hoc manner. Instead, all instrumentation we provide should use the same (sane, reasonable) configuration language. The instrumentation SIG is working on this configuration issue, but we would greatly appreciate more participation from core Spec members. |
@tedsuo then I'm also repeating my comment #1998 (comment) 😃 I'm not against configurability in general, just in this specific case. For HTTP, as discussed here, the span status is just a function of the http.status_code attribute + span kind. That function can be trivially evaluated with a different configuration on the backend to override the status if desired. |
HTTP semantic convention says:
This was introduced in the first version of this semantic convention and never discussed or challenged later. But although this requirement probably makes sense for CLIENT http spans, it is hardly sensible for SERVER http spans. SERVER spans are created based on the "external event": somebody made a request to my server for the URL that I cannot serve. I have no control over such requests and thus I have no way to fix this "error".
I propose:
Alternative approach
We can say (after https://github.com/open-telemetry/opentelemetry-collector/issues/3571 is fixed) that application developer or Operator are free to override
Span.Status
fromError
toOk
. But I think this use-case is common enough not to force all affected users to perform the same manual work.The text was updated successfully, but these errors were encountered: