Ingesters should not grow without bound #665
Out of curiosity, what circumstances are leading to this behavior (OOM killing ingesters)? I'd imagine some sort of slow-down on the backing store would be the most likely culprit, but I'd love to hear specific situations if you have any. :-)
Some info on store slow-downs is at #733, but we also get cases where the incoming data blows up, e.g. someone starts sending loads more timeseries. I'm not sure that "backpressure" is available: if we slow down, the sending Prometheus will queue the data and eventually OOM itself. We could discard data in the ingester, or we could error back to the sender so it will discard it.
The Prometheus queue is bounded, so the sending Prometheus will start dropping samples.
There is already a limiter on samples/sec per user (https://github.com/weaveworks/cortex/blob/6191d41c5a99e6f89fbd1877afc8805ab0c3ffbd/pkg/distributor/distributor.go#L298), but it's not coordinated between distributors, so the effective limit "scales" up with the number of replicas; #255 tracks that. The downstream Prometheus should understand the rejection and back off: https://github.com/prometheus/prometheus/blob/master/storage/remote/queue_manager.go#L495

Re: the bounded queue, I want to make the Prometheus remote-write code tail the TSDB WAL so we can "buffer" against this kind of problem much more robustly on the client side, and support use cases with spotty internet connections.
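To make the coordination gap concrete, here is a minimal, hypothetical Go sketch of a purely local per-user limiter of the kind described above (the `localIngestLimiter` name and the numbers are made up for illustration and are not the actual Cortex distributor code):

```go
package main

import (
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

// localIngestLimiter keeps one rate limiter per user, local to a single
// distributor replica. Because nothing is shared between replicas, running
// R replicas lets a user push up to R * limit samples/sec cluster-wide,
// which is the coordination problem tracked in #255.
// (Not goroutine-safe; kept minimal for illustration.)
type localIngestLimiter struct {
	limit    rate.Limit               // configured samples/sec per user
	burst    int                      // burst size, in samples
	limiters map[string]*rate.Limiter // per-user limiter, local to this replica
}

func newLocalIngestLimiter(limit float64, burst int) *localIngestLimiter {
	return &localIngestLimiter{
		limit:    rate.Limit(limit),
		burst:    burst,
		limiters: map[string]*rate.Limiter{},
	}
}

// allowN reports whether this replica accepts n samples for the given user.
func (l *localIngestLimiter) allowN(user string, n int) bool {
	lim, ok := l.limiters[user]
	if !ok {
		lim = rate.NewLimiter(l.limit, l.burst)
		l.limiters[user] = lim
	}
	return lim.AllowN(time.Now(), n)
}

func main() {
	lim := newLocalIngestLimiter(10000, 20000)
	fmt.Println(lim.allowN("tenant-1", 5000)) // true: within this replica's local limit
	// Every other replica running the same code would accept another 5000
	// samples for the same user, so the effective cluster-wide limit grows
	// with the number of replicas.
}
```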
Samples/sec needn't be a problem, so long as the samples compress well. The number of timeseries is a problem, since each one gets 1KB plus metadata even if you only send one sample. I thought there was a limit for that, but I also think it isn't entirely effective. Again, we need better tools to understand the problem.
There are also …
It occurred to me we could inspect Go's runtime memory statistics and stop creating new series / chunks once heap usage gets too high.
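A minimal sketch of what that could look like, assuming a hypothetical `memoryGate` checked on the push path (the threshold and names are illustrative, not existing ingester code):

```go
package main

import (
	"fmt"
	"runtime"
)

// memoryGate refuses creation of new series once the Go heap crosses a
// configured threshold, read via runtime.ReadMemStats.
type memoryGate struct {
	maxHeapBytes uint64 // e.g. set somewhat below the container memory limit
}

// allowNewSeries reports whether the ingester should still create new series.
// runtime.ReadMemStats briefly stops the world, so in practice this would be
// sampled periodically rather than called on every push.
func (g *memoryGate) allowNewSeries() bool {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.HeapAlloc < g.maxHeapBytes
}

func main() {
	gate := &memoryGate{maxHeapBytes: 6 << 30} // 6 GiB, an arbitrary example
	if gate.allowNewSeries() {
		fmt.Println("accepting new series")
	} else {
		fmt.Println("heap above limit: rejecting pushes that would create new series")
	}
}
```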
Maybe relevant: golang/go#16843
This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.
The update here, after a brief discussion in our bug scrub, is to use the suggestion to inspect memory via #665 (comment) and then stop creating new series / chunks.
The approach taken in PR #3992 is to have per-ingester limits on the max number of users, series, in-flight requests, and ingestion rate. It's not auto-configuration based on memory, although an external system can monitor memory usage and the above-mentioned metrics to adjust the configuration on the fly (using the runtime-config mechanism).
You mean one could achieve #665 (comment) by having an external agent that clamped down the series limit for some/all users when it detected excess growth?
Yes. In #3992 there is a per-ingester setting, so an external agent can monitor the number of series and heap usage and adjust this per-ingester setting. The ingester will then use it within a couple of seconds, and can start rejecting push requests that try to create additional series. (It would actually affect all ingesters, since they all watch the same overrides file.)
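For illustration, a rough Go sketch of how such a runtime-adjustable per-ingester limit could be wired up; the names and structure here are assumptions, not the actual code from #3992:

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// instanceLimits holds the per-ingester limits. The whole struct is swapped
// atomically, so a runtime-config watcher (or an external agent rewriting the
// overrides file) can tighten the limits without a restart.
type instanceLimits struct {
	MaxSeries           int64
	MaxUsers            int64
	MaxInflightRequests int64
}

var currentLimits atomic.Value // holds *instanceLimits

var errMaxSeriesReached = errors.New("per-ingester series limit reached, rejecting push")

// canCreateSeries is what the push path would consult before creating a new series.
func canCreateSeries(seriesInMemory int64) error {
	limits, _ := currentLimits.Load().(*instanceLimits)
	if limits != nil && limits.MaxSeries > 0 && seriesInMemory >= limits.MaxSeries {
		return errMaxSeriesReached
	}
	return nil
}

func main() {
	// Initial limits, e.g. loaded from the runtime-config overrides file.
	currentLimits.Store(&instanceLimits{MaxSeries: 1_000_000})
	fmt.Println(canCreateSeries(900_000)) // <nil>: under the limit

	// An external agent sees heap growing and lowers the limit; the ingester
	// picks up the new value and starts rejecting series-creating pushes.
	currentLimits.Store(&instanceLimits{MaxSeries: 800_000})
	fmt.Println(canCreateSeries(900_000)) // per-ingester series limit reached, rejecting push
}
```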
The ingesters would be more manageable if they could refuse service/exert backpressure instead of growing to the point where they are OOM killed when they can't flush or are exposed to a traffic spike. Ideally they would accept a configuration parameter to control this behaviour, which we would use in conjunction with a slightly higher memory request/limit in k8s.