
Ingesters should not grow without bound #665

Closed

awh opened this issue Jan 18, 2018 · 13 comments · Fixed by #3992

Comments

@awh
Contributor

awh commented Jan 18, 2018

The ingesters would be more manageable if they could refuse service/exert backpressure instead of growing to the point where they are OOM killed when they can't flush or are exposed to a traffic spike. Ideally they would accept a configuration parameter to control this behaviour, which we would use in conjunction with a slightly higher memory request/limit in k8s.

@cboggs
Contributor

cboggs commented Feb 8, 2018

Out of curiosity, what circumstances are leading to this behavior (OOM killing ingesters)?

I'd imagine some sort of slow-down on the backing store would be the most likely culprit, but I'd love to hear specific situations if you have any. :-)

@bboreham
Contributor

bboreham commented Mar 4, 2018

Some info on store slow-downs at #733, but we also get cases where the incoming data blows up - e.g. someone starts sending loads more timeseries.

I'm not clear that "backpressure" is available - if we slow down, the sending Prometheus will queue the data and eventually OOM itself. We could discard data in the ingester; we could error back to the sender so it will discard it.

@tomwilkie
Contributor

> if we slow down, the sending Prometheus will queue the data and eventually OOM itself.

The Prometheus queue is bounded, so the sending Prometheus will start dropping samples.

> We could discard data in the ingester; we could error back to the sender so it will discard it.

There is already a limiter on samples/s per user (https://github.com/weaveworks/cortex/blob/6191d41c5a99e6f89fbd1877afc8805ab0c3ffbd/pkg/distributor/distributor.go#L298), but it's not coordinated between distributors, so it "scales" up with the number of replicas; #255 is for that. The downstream Prometheus should understand this and back off: https://github.com/prometheus/prometheus/blob/master/storage/remote/queue_manager.go#L495

Re: the bounded queue, I want to make the Prometheus remote-write code tail the TSDB WAL so we can "buffer" against this kind of problem much more robustly on the client side, and support use cases with spotty internet connections.
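A minimal sketch of that shape of per-user limiter, for illustration only — it uses golang.org/x/time/rate and made-up names, not the actual distributor code linked above. Note that because each replica keeps its own limiter state, the effective limit still multiplies by the number of distributor replicas, which is exactly the coordination gap #255 describes:

```go
// Illustrative sketch, not Cortex's distributor: a per-tenant samples/s
// limiter held locally by each distributor replica.
package distributor

import (
	"fmt"
	"net/http"
	"sync"
	"time"

	"golang.org/x/time/rate"
)

type perTenantLimiter struct {
	mtx      sync.Mutex
	limiters map[string]*rate.Limiter
	limit    rate.Limit // samples/s allowed per tenant, per replica
	burst    int
}

func newPerTenantLimiter(limit rate.Limit, burst int) *perTenantLimiter {
	return &perTenantLimiter{limiters: map[string]*rate.Limiter{}, limit: limit, burst: burst}
}

func (l *perTenantLimiter) allowN(tenant string, samples int) bool {
	l.mtx.Lock()
	lim, ok := l.limiters[tenant]
	if !ok {
		lim = rate.NewLimiter(l.limit, l.burst)
		l.limiters[tenant] = lim
	}
	l.mtx.Unlock()
	return lim.AllowN(time.Now(), samples)
}

// pushHandler rejects over-limit writes with an HTTP error instead of
// buffering them, so the memory pressure lands on the sender rather
// than on the distributor/ingester.
func (l *perTenantLimiter) pushHandler(w http.ResponseWriter, tenant string, samples int) {
	if !l.allowN(tenant, samples) {
		http.Error(w, fmt.Sprintf("ingestion rate limit exceeded for tenant %s", tenant), http.StatusTooManyRequests)
		return
	}
	// ... forward the samples to the ingesters ...
}
```

Returning an error instead of buffering the over-limit write is what moves the pressure back to the sender; whether the sender then backs off, retries, or drops the data is up to its queue_manager behaviour.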

@bboreham
Contributor

bboreham commented Mar 4, 2018

Samples/sec needn't be a problem, so long as the samples compress well. The number of timeseries is a problem, since each one costs 1K plus metadata even if you only send one sample. I thought there was a limit for that, but I also think it isn't entirely effective. Again, we need better tools to understand the problem.

@bboreham
Contributor

bboreham commented Mar 5, 2018

There are also DefaultMaxSeriesPerMetric = 50000 and DefaultMaxSeriesPerUser = 5000000; the latter is per-ingester. Empirically each timeseries costs about 4K of Go heap, so letting one user have 20GB is brave.
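As a back-of-the-envelope check on that 20GB figure (a throwaway calculation using only the numbers above):

```go
package main

import "fmt"

func main() {
	const (
		defaultMaxSeriesPerUser = 5_000_000 // per-ingester limit mentioned above
		bytesPerSeries          = 4 * 1024  // ~4K of Go heap per in-memory series (empirical)
	)
	// 5,000,000 series × 4 KiB ≈ 19 GiB (~20 GB) of heap for a single user.
	fmt.Printf("%.1f GiB\n", float64(defaultMaxSeriesPerUser)*bytesPerSeries/(1<<30))
}
```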

@bboreham
Contributor

It occurred to me we could inspect Go's runtime.MemStats.HeapAlloc and refuse to create a new series if that number is over a configurable limit, e.g. 10GB. This would get closer to the desired behaviour and protect against OOM-kills.
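A minimal sketch of that idea, purely illustrative with made-up names (this is not the mechanism Cortex eventually merged): sample runtime.MemStats on a timer, since ReadMemStats stops the world, and have the series-creation path consult the cached value against a configurable ceiling such as 10GB.

```go
// Illustrative sketch: cache runtime.MemStats.HeapAlloc on a timer and
// refuse to create new series while the heap is over a configured limit.
package ingester

import (
	"errors"
	"runtime"
	"sync/atomic"
	"time"
)

var errHeapLimitReached = errors.New("refusing to create new series: heap allocation over limit")

type heapGuard struct {
	limitBytes uint64
	heapAlloc  atomic.Uint64 // last sampled runtime.MemStats.HeapAlloc
}

// newHeapGuard starts a background sampler; limitBytes would be the
// configurable ceiling, e.g. 10 << 30 for the 10GB suggested above.
func newHeapGuard(limitBytes uint64) *heapGuard {
	g := &heapGuard{limitBytes: limitBytes}
	go func() {
		for range time.Tick(10 * time.Second) {
			var ms runtime.MemStats
			runtime.ReadMemStats(&ms) // stops the world briefly, so keep it infrequent
			g.heapAlloc.Store(ms.HeapAlloc)
		}
	}()
	return g
}

// checkNewSeries is what the series-creation path would call: an error
// here means "reject the push that needs a new series", while appends
// to existing series still proceed.
func (g *heapGuard) checkNewSeries() error {
	if g.heapAlloc.Load() > g.limitBytes {
		return errHeapLimitReached
	}
	return nil
}
```

Rejecting only new series (rather than all appends) keeps existing targets flowing while capping further heap growth.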

@awh
Contributor Author

awh commented Oct 30, 2018

Maybe relevant: golang/go#16843

@stale

stale bot commented Feb 3, 2020

This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.

@stale stale bot added the stale label Feb 3, 2020
@stale stale bot closed this as completed Feb 18, 2020
@vishnubraj

The ingester keeps consuming memory and stops without any error message. There has to be a flag to set the memory limit for the ingester, or by default it should restrict itself to using at most 80% of system memory. Please reopen this ticket.

[screenshot attached]

@gouthamve gouthamve reopened this Apr 2, 2020
@stale stale bot removed the stale label Apr 2, 2020
@gouthamve gouthamve added the keepalive Skipped by stale bot label Apr 2, 2020
@gouthamve
Contributor

The update here, after a brief discussion in our bug scrub, is to use the suggestion to inspect memory via #665 (comment) and then stop creating new series / chunks.

@pstibrany pstibrany mentioned this issue Mar 22, 2021
@pstibrany
Contributor

The approach taken in PR #3992 is to have per-ingester limits on the max number of users, series, in-flight requests, and ingestion rate. It's not auto-configuration based on memory, although an external system can monitor memory usage and the above-mentioned metrics to adjust the configuration on the fly (using the runtime-config mechanism).
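For illustration, the reloadable per-ingester limits might be expressed in the runtime-config overrides file roughly like this (key names and values are a best-effort sketch from around the time of #3992; check the docs for your version):

```yaml
# overrides file — watched by the runtime-config mechanism and reloaded
# without restarting the ingesters. Illustrative values; key names may
# vary by version.
ingester_limits:
  max_ingestion_rate: 20000          # samples/s accepted by a single ingester
  max_series: 1500000                # in-memory series per ingester
  max_tenants: 1000                  # distinct users per ingester
  max_inflight_push_requests: 30000  # concurrent pushes before rejecting
```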

@bboreham
Contributor

You mean one could achieve #665 (comment) by having an external agent that clamped down the series limit for some/all users when it detected excess growth?

@pstibrany
Contributor

pstibrany commented Mar 23, 2021

> You mean one could achieve #665 (comment) by having an external agent that clamped down the series limit for some/all users when it detected excess growth?

Yes. In #3992 there is a per-ingester setting, so an external agent can monitor the number of series and the heap, and adjust that setting. The ingester will then pick it up within a couple of seconds and can start rejecting push requests that try to create additional series. (It would actually affect all ingesters, since they all monitor the same overrides file.)
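A hypothetical shape for such an agent, as a sketch only (the metric source, the overrides path, and the override keys are all assumptions for illustration, not the real schema):

```go
// Hypothetical external agent: watch an ingester's heap and in-memory
// series counts and tighten the per-ingester series limit when the heap
// grows past a threshold.
package main

import (
	"fmt"
	"os"
	"time"
)

func main() {
	const (
		heapLimitBytes = 10 << 30                     // start clamping when heap passes ~10 GiB
		overridesPath  = "/etc/cortex/overrides.yaml" // file watched by the runtime-config mechanism (assumed path)
	)
	for range time.Tick(30 * time.Second) {
		heapBytes, inMemorySeries := scrapeIngester()
		if heapBytes > heapLimitBytes {
			// Freeze series growth just below the current count. Every
			// ingester watching this file picks the new limit up within
			// a few seconds and starts rejecting pushes that would
			// create additional series.
			newLimit := int64(float64(inMemorySeries) * 0.95)
			content := fmt.Sprintf("ingester_limits:\n  max_series: %d\n", newLimit)
			_ = os.WriteFile(overridesPath, []byte(content), 0o644)
		}
	}
}

// scrapeIngester stands in for querying the ingester's /metrics endpoint
// (or Prometheus) for heap and in-memory-series figures; stubbed so the
// sketch stays self-contained.
func scrapeIngester() (heapBytes, inMemorySeries uint64) { return 0, 0 }
```

Since every ingester watches the same overrides file, a change like this takes effect fleet-wide; a real agent would also merge its change into the existing file rather than overwriting it.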
