Ingesters should not grow without bound #665
Out of curiosity, what circumstances are leading to this behavior (OOM killing ingesters)? I'd imagine some sort of slow-down on the backing store would be the most likely culprit, but I'd love to hear specific situations if you have any. :-)
Some info on store slow-downs is at #733, but we also get cases where the incoming data blows up, e.g. someone starts sending loads more timeseries. I'm not sure that "backpressure" is available: if we slow down, the sending Prometheus will queue the data and eventually OOM itself. We could discard data in the ingester, or we could error back to the sender so it will discard it.
The Prometheus queue is bounded, so the sending Prometheus will start dropping samples.
There is already a limiter on samples/sec per user (https://github.com/weaveworks/cortex/blob/6191d41c5a99e6f89fbd1877afc8805ab0c3ffbd/pkg/distributor/distributor.go#L298), but it's not coordinated between distributors, so the effective limit "scales" up with the number of replicas; #255 tracks that. The downstream Prometheus should understand the rejection and back off: https://github.com/prometheus/prometheus/blob/master/storage/remote/queue_manager.go#L495

Re: the bounded queue, I want to make the Prometheus remote-write code tail the TSDB WAL so we can "buffer" against this kind of problem much more robustly on the client side, and support use cases with spotty internet connections.
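To make the coordination gap concrete, here is a minimal, hypothetical Go sketch of a purely local per-user limiter of the kind described above (the `localIngestLimiter` name and the numbers are made up for illustration and are not the actual Cortex distributor code):

```go
package main

import (
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

// localIngestLimiter keeps one rate limiter per user, local to a single
// distributor replica. Because nothing is shared between replicas, running
// R replicas lets a user push up to R * limit samples/sec cluster-wide,
// which is the coordination problem tracked in #255.
// (Not goroutine-safe; kept minimal for illustration.)
type localIngestLimiter struct {
	limit    rate.Limit               // configured samples/sec per user
	burst    int                      // burst size, in samples
	limiters map[string]*rate.Limiter // per-user limiter, local to this replica
}

func newLocalIngestLimiter(limit float64, burst int) *localIngestLimiter {
	return &localIngestLimiter{
		limit:    rate.Limit(limit),
		burst:    burst,
		limiters: map[string]*rate.Limiter{},
	}
}

// allowN reports whether this replica accepts n samples for the given user.
func (l *localIngestLimiter) allowN(user string, n int) bool {
	lim, ok := l.limiters[user]
	if !ok {
		lim = rate.NewLimiter(l.limit, l.burst)
		l.limiters[user] = lim
	}
	return lim.AllowN(time.Now(), n)
}

func main() {
	lim := newLocalIngestLimiter(10000, 20000)
	fmt.Println(lim.allowN("tenant-1", 5000)) // true: within this replica's local limit
	// Every other replica running the same code would accept another 5000
	// samples for the same user, so the effective cluster-wide limit grows
	// with the number of replicas.
}
```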
Samples/sec needn't be a problem, so long as the samples compress well. The number of timeseries is a problem, since each one gets 1KB plus metadata even if you only send one sample. I thought there was a limit for that, but I also think it isn't entirely effective. Again, we need better tools to understand the problem.
There are also …
It occurred to me we could inspect Go's runtime memory statistics and stop creating new series / chunks once heap usage gets too high.
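A minimal sketch of what that could look like, assuming a hypothetical `memoryGate` checked on the push path (the threshold and names are illustrative, not existing ingester code):

```go
package main

import (
	"fmt"
	"runtime"
)

// memoryGate refuses creation of new series once the Go heap crosses a
// configured threshold, read via runtime.ReadMemStats.
type memoryGate struct {
	maxHeapBytes uint64 // e.g. set somewhat below the container memory limit
}

// allowNewSeries reports whether the ingester should still create new series.
// runtime.ReadMemStats briefly stops the world, so in practice this would be
// sampled periodically rather than called on every push.
func (g *memoryGate) allowNewSeries() bool {
	var m runtime.MemStats
	runtime.ReadMemStats(&m)
	return m.HeapAlloc < g.maxHeapBytes
}

func main() {
	gate := &memoryGate{maxHeapBytes: 6 << 30} // 6 GiB, an arbitrary example
	if gate.allowNewSeries() {
		fmt.Println("accepting new series")
	} else {
		fmt.Println("heap above limit: rejecting pushes that would create new series")
	}
}
```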
Maybe relevant: golang/go#16843
This issue has been automatically marked as stale because it has not had any activity in the past 30 days. It will be closed in 7 days if no further activity occurs. Thank you for your contributions.
The update here, after a brief discussion in our bug scrub, is to use the suggestion to inspect memory via #665 (comment) and then stop creating new series / chunks.
The approach taken in PR #3992 is to have per-ingester limits on the max number of users, series, in-flight requests, and ingestion rate. It's not auto-configuration based on memory, although an external system can monitor memory usage and the above-mentioned metrics to adjust the configuration on the fly (using the runtime-config mechanism).
You mean one could achieve #665 (comment) by having an external agent that clamped down the series limit for some/all users when it detected excess growth?
Yes. In #3992 there is a per-ingester setting, so an external agent can monitor the number of series and heap usage and adjust this per-ingester setting. The ingester will then use it within a couple of seconds, and can start rejecting push requests that try to create additional series. (It would actually affect all ingesters, since they all watch the same overrides file.)
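For illustration, a rough Go sketch of how such a runtime-adjustable per-ingester limit could be wired up; the names and structure here are assumptions, not the actual code from #3992:

```go
package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// instanceLimits holds the per-ingester limits. The whole struct is swapped
// atomically, so a runtime-config watcher (or an external agent rewriting the
// overrides file) can tighten the limits without a restart.
type instanceLimits struct {
	MaxSeries           int64
	MaxUsers            int64
	MaxInflightRequests int64
}

var currentLimits atomic.Value // holds *instanceLimits

var errMaxSeriesReached = errors.New("per-ingester series limit reached, rejecting push")

// canCreateSeries is what the push path would consult before creating a new series.
func canCreateSeries(seriesInMemory int64) error {
	limits, _ := currentLimits.Load().(*instanceLimits)
	if limits != nil && limits.MaxSeries > 0 && seriesInMemory >= limits.MaxSeries {
		return errMaxSeriesReached
	}
	return nil
}

func main() {
	// Initial limits, e.g. loaded from the runtime-config overrides file.
	currentLimits.Store(&instanceLimits{MaxSeries: 1_000_000})
	fmt.Println(canCreateSeries(900_000)) // <nil>: under the limit

	// An external agent sees heap growing and lowers the limit; the ingester
	// picks up the new value and starts rejecting series-creating pushes.
	currentLimits.Store(&instanceLimits{MaxSeries: 800_000})
	fmt.Println(canCreateSeries(900_000)) // per-ingester series limit reached, rejecting push
}
```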
The ingesters would be more manageable if they could refuse service/exert backpressure instead of growing to the point where they are OOM killed when they can't flush or are exposed to a traffic spike. Ideally they would accept a configuration parameter to control this behaviour, which we would use in conjunction with a slightly higher memory request/limit in k8s.