Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Loki should not send "request was cancelled by the client" as an http response #4991

Closed
bboreham opened this issue Dec 22, 2021 · 6 comments · Fixed by #5297
Closed

Loki should not send "request was cancelled by the client" as an http response #4991

bboreham opened this issue Dec 22, 2021 · 6 comments · Fixed by #5297
Assignees
Labels
type/bug Somehing is not working as expected

Comments

@bboreham
Copy link
Contributor

It is impossible for this to work if true - the way you cancel an http request is to close the socket, so if that has happened then the server cannot write the response.

But we do see such responses:
image

Such reports can come from here:

JSONError(w, StatusClientClosedRequest, ErrClientCanceled)

One improvement is to go to all the places where it is called, like


and check if r.Context() has errored - if so then we should bail out and not write anything.
If not, we know the client did not cancel; some internal condition caused this, so we can return a 500 or similar.

It appears this code was added in #4593 in response to #4502.

@cyriltovena
Copy link
Contributor

cyriltovena commented Jan 19, 2022

Thank @bboreham for the report. 499 is mostly for us to have a status code for SLO reason.

I think the problem is this

return s.Code() == codes.Canceled || s.Message() == "transport is closing"

When we rollout this gets turned into 499, the frontend thinks it's a cancel, and propagate it back up (and actually cancel instead of retrying.).

It should 500 and the frontend will retry.

@vlad-diachenko
Copy link
Contributor

The root cause of the issue is that the connection to the ingester is canceled here https://github.com/grafana/dskit/blob/main/ring/client/pool.go#L130 because the health check of GRPC connection is failed and when another goroutine that serves the request from the user tries to close the connection, it leads to the error.

@bboreham
Copy link
Contributor Author

Sorry, that may be the root cause of why you see a cancellation, but the error I described is in the logic of how errors are reported, not how they are caused.

@vlad-diachenko
Copy link
Contributor

you are completely right, and I will fix the logic of how the error is reported, but I had to understand why the issue happens... And now I see that we can also fix the root cause of the issue by adding retry mechanism to health-check or something like this...

@bboreham
Copy link
Contributor Author

Good find! Suggest filing that as another issue (in DSKit?).

@vlad-diachenko
Copy link
Contributor

Good find! Suggest filing that as another issue (in DSKit?

yes, I will report an issue in dskit. thanks.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
type/bug Somehing is not working as expected
Projects
None yet
Development

Successfully merging a pull request may close this issue.

4 participants