
Allow to limit the "in-memory" chunks for Ingester #5721

Open
TeddyAndrieux opened this issue Mar 29, 2022 · 13 comments
Labels
component/ingester · type/feature (Something new we should do)

Comments

@TeddyAndrieux
Contributor

Is your feature request related to a problem? Please describe.
If I configure Loki to store on the filesystem and the disk used gets full, then since it cannot write to the disk, Loki will keep receiving data and filling memory until it gets OOM-killed (or the machine just crashes).

Describe the solution you'd like
To me, we should be able to configure a Loki ingester "in-memory max size" (the maximum size the "in-memory" chunks may take) so that once Loki reaches this limit it simply rejects new data with the right "retry later" return code.
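
For illustration only, here is a minimal sketch (hypothetical, not existing Loki code or configuration) of the kind of guard described above: pushes are rejected with a retryable status once the in-memory chunk data exceeds a configured limit.

```go
package main

import (
	"net/http"
	"sync/atomic"
)

// Hypothetical values: maxInMemoryBytes would come from configuration, and
// currentChunkBytes would be maintained by the ingester as chunks grow and flush.
var (
	maxInMemoryBytes  int64 = 4 << 30 // e.g. 4 GiB
	currentChunkBytes int64
)

func pushHandler(w http.ResponseWriter, r *http.Request) {
	if atomic.LoadInt64(&currentChunkBytes) >= maxInMemoryBytes {
		// 503 signals clients such as Promtail to back off and retry later.
		http.Error(w, "ingester over in-memory chunk limit, retry later", http.StatusServiceUnavailable)
		return
	}
	// ... append the samples to the in-memory head chunk here ...
	w.WriteHeader(http.StatusNoContent)
}

func main() {
	http.HandleFunc("/loki/api/v1/push", pushHandler)
	http.ListenAndServe(":3100", nil)
}
```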

Describe alternatives you've considered
If Loki is deployed as a Pod in Kubernetes we could just set a memory limit, but IMHO that's not a good solution, since it means that once Loki reaches this limit the Pod just gets OOM-killed and we likely lose all "in-memory" data.

Additional context

@dannykopping added the type/feature (Something new we should do) and component/ingester labels on Mar 29, 2022
@TeddyAndrieux
Contributor Author

Also, maybe Loki should expose two different endpoints (rough sketch below):

  • /health, which basically does what the /ready endpoint does today
  • /ready, which also checks that we can actually write chunks (or maybe just that we can write to the WAL)
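
For what it's worth, a rough sketch of that split, using a probe-file check as a stand-in for whatever real WAL writability check would be used (walDir and the probe are hypothetical, not Loki's actual readiness code):

```go
package main

import (
	"net/http"
	"os"
	"path/filepath"
)

const walDir = "/loki/wal" // hypothetical WAL location

// walWritable probes writability by creating and removing a small file.
func walWritable() bool {
	probe := filepath.Join(walDir, ".ready-probe")
	if err := os.WriteFile(probe, []byte("ok"), 0o600); err != nil {
		return false
	}
	os.Remove(probe)
	return true
}

func main() {
	// /health: the process is up and serving (what /ready does today).
	http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	// /ready: additionally require that the WAL can actually be written.
	http.HandleFunc("/ready", func(w http.ResponseWriter, r *http.Request) {
		if !walWritable() {
			http.Error(w, "WAL not writable", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	http.ListenAndServe(":3100", nil)
}
```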

@chaudum
Contributor

chaudum commented Mar 29, 2022

If I configure Loki to store on the filesystem and the disk used gets full, then since it cannot write to the disk, Loki will keep receiving data and filling memory until it gets OOM-killed (or the machine just crashes).

To me, these are two independent areas:

  1. The disk filling up so the WAL cannot be written any more. This usually happens first, before chunks fill up memory.
  2. Chunks filling up memory and causing the process to run OOM.

Both should be handled independently.

To me, we should be able to configure a Loki ingester "in-memory max size" (the maximum size the "in-memory" chunks may take) so that once Loki reaches this limit it simply rejects new data with the right "retry later" return code.

Instead of manually configuring the limit, what about having a "dynamic" limit based on the available memory?

@dannykopping
Contributor

I think the reason why a WAL append fails is somewhat immaterial - in all cases we should fail the write with a 5xx response, which clients like Promtail interpret as "retry later".

The disk filling up though is a case we have specific handling for, and we can expand this.

Instead of manually configuring the limit, what about having a "dynamic" limit based on the available memory?

I'm not sure we need to go this far; this could backfire in some scary ways as well. Let's focus (at least in this issue) on how the ingesters behave when the WAL cannot be appended, and not on in-memory chunk behaviour IMHO; if we define a threshold of disk fullness at which we start to reject writes, that'll save the ingester from having a large increase in memory usage anyway.
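
As an illustration of the "threshold of disk fullness" idea (a sketch under assumptions, not an existing Loki option): check how full the filesystem holding the WAL is before accepting a write, and reject with a retryable 503 above a configurable threshold. The path and threshold are hypothetical, and syscall.Statfs as written here is Linux-specific.

```go
package main

import (
	"fmt"
	"net/http"
	"syscall"
)

const (
	walPath           = "/loki/wal" // hypothetical WAL mount path
	fullnessThreshold = 0.95        // start rejecting writes at 95% used
)

// diskFullness returns used/total for the filesystem containing path.
func diskFullness(path string) (float64, error) {
	var st syscall.Statfs_t
	if err := syscall.Statfs(path, &st); err != nil {
		return 0, err
	}
	total := float64(st.Blocks) * float64(st.Bsize)
	avail := float64(st.Bavail) * float64(st.Bsize)
	return (total - avail) / total, nil
}

func pushHandler(w http.ResponseWriter, r *http.Request) {
	if used, err := diskFullness(walPath); err == nil && used >= fullnessThreshold {
		http.Error(w, fmt.Sprintf("WAL disk %.0f%% full, retry later", used*100), http.StatusServiceUnavailable)
		return
	}
	// ... append to the WAL and the in-memory head chunk ...
	w.WriteHeader(http.StatusNoContent)
}

func main() {
	http.HandleFunc("/loki/api/v1/push", pushHandler)
	http.ListenAndServe(":3100", nil)
}
```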

@dannykopping
Contributor

dannykopping commented Mar 29, 2022

Also, maybe Loki should expose two different endpoints:

  • /health, which basically does what the /ready endpoint does today
  • /ready, which also checks that we can actually write chunks (or maybe just that we can write to the WAL)

Agreed, the health/ready check should take WAL appendability into account; however, a failing health check can also remove the ingester from the ring, which may not be desirable.

@TeddyAndrieux
Contributor Author

To me, these are two independent areas:

  1. The disk filling up so the WAL cannot be written any more. This usually happens first, before chunks fill up memory.
  2. Chunks filling up memory and causing the process to run OOM.

Both should be handled independently.

I totally agree that 1. is not the point of this feature request (ideally it would be nice to have a retention size so that old chunks get automatically cleaned up when we reach it), but this feature request is really about 2.

Instead of manually configuring the limit, what about having a "dynamic" limit based on the available memory?

I'm not entirely sure what you mean by "dynamic"; do you mean we set a percentage so that the ingester checks the total memory and "computes" its own limit? That's fine with me as well.

if we define a threshold of disk fullness at which we start to reject writes, that'll save the ingester from having a large increase in memory usage anyway.

Indeed, like a retention size

@dannykopping
Contributor

dannykopping commented Mar 29, 2022

I totally agree that 1. is not the point of this feature request (ideally it would be nice to have a retention size so that old chunks get automatically cleaned up when we reach it), but this feature request is really about 2.

I don't think we should change the way we buffer chunks in memory, but rather address the underlying cause of too many chunks being buffered - which is the WAL append failure behaviour.

@TeddyAndrieux
Contributor Author

From what I see it's not the same root cause: if you are able to write the WAL but not able to write the data at the end, you will still consume all the memory for chunks.
I think in most cases when you use filesystem storage you likely use the same disk for the WAL and the data, so they are related, but they may not be in some setups.

I don't know whether the same can happen with other configurations, like Cassandra/S3 or any other storage: Loki being unable to write to its storage, so it keeps all chunks in memory (even though it can still write the WAL).

@dannykopping
Contributor

Indeed. The WAL and the chunk storage can be on separate disks, or on the same disk.

I think in most cases when you use filesystem storage you likely use the same disk for the WAL and the data, so they are related, but they may not be in some setups.

It's definitely not recommended (or really even practical) for them to be on the same disk, but for the sake of argument, let's assume they can be.

If the WAL can't be appended to, we should have an option to fail the writes. Having a memory threshold for the chunks is probably not worthwhile here. WAL append failures are pretty catastrophic, and replication should keep a single ingester's full WAL disk from causing data loss.

If the chunk storage can't be written to, yeah maybe then we could have a memory threshold for that case.

The problem, of course, is that this threshold could be reached even when there isn't a direct problem like the WAL or chunk disk being full, for example during a huge spike in traffic. In the case of a traffic spike you want to try to receive all the data you can, and even if you cause an OOM (there's no way around limited resources) you can still recover from the WAL. However, WAL replay is expensive, so yes - maybe a memory threshold is the right approach here for the chunk-storage case.

@TeddyAndrieux
Contributor Author

Fine, I agree.

To me, we should have both, as you said:
1. A check that we can write to the WAL
2. A way to set a memory limit (once Loki reaches this limit it just answers 5xx to every new POST, and stops actually reading the WAL)

@dannykopping
Contributor

I was wrong about the behaviour of the ingester, sorry for the confusion.

The ingester will receive samples into memory and then produce "chunks", which are then flushed to object storage. When this happens (and after chunk_idle_period), the chunk is purged from memory. I believe the only way an ingester will fill up its memory is if the chunks cannot be flushed successfully to the chunk store fast enough. More details here.

Here's what I think we could do moving forward (a rough sketch follows the list):

  1. Have a background process which checks the "health" of the WAL disk - i.e. is it full (or nearly full)? Before appending the samples to the head chunk, check if the WAL is ready. If not, reject writes until the WAL is ready again.
     • This is a bit complex, because determining the mount path on which the WAL resides is tricky and not portable across OSes.
  2. If the chunks cannot flush to the chunk store, measure the size of the chunk flush queue and, if it becomes dangerously long, start rejecting writes.
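
A rough sketch of those two checks combined (everything here - names, thresholds, the probe-file health check - is hypothetical, not existing Loki behaviour): a background goroutine keeps a WAL "ready" flag up to date, and the push path also rejects writes when the flush queue grows too long.

```go
package main

import (
	"net/http"
	"os"
	"path/filepath"
	"sync/atomic"
	"time"
)

const (
	walDir           = "/loki/wal"      // hypothetical
	maxFlushQueueLen = 10000            // hypothetical "dangerously long" threshold
	walCheckInterval = 15 * time.Second // how often the background check runs
)

var (
	walReady      atomic.Bool
	flushQueueLen atomic.Int64 // would be updated by the real flush loop
)

// watchWAL periodically verifies that the WAL directory is writable; a real
// implementation would check disk fullness on the WAL's mount point instead.
func watchWAL() {
	for {
		probe := filepath.Join(walDir, ".wal-probe")
		err := os.WriteFile(probe, []byte("ok"), 0o600)
		os.Remove(probe)
		walReady.Store(err == nil)
		time.Sleep(walCheckInterval)
	}
}

func pushHandler(w http.ResponseWriter, r *http.Request) {
	if !walReady.Load() {
		http.Error(w, "WAL not writable, retry later", http.StatusServiceUnavailable)
		return
	}
	if flushQueueLen.Load() > maxFlushQueueLen {
		http.Error(w, "chunk flush queue backed up, retry later", http.StatusServiceUnavailable)
		return
	}
	// ... append to the WAL and head chunk, enqueue full chunks for flushing ...
	w.WriteHeader(http.StatusNoContent)
}

func main() {
	walReady.Store(true)
	go watchWAL()
	http.HandleFunc("/loki/api/v1/push", pushHandler)
	http.ListenAndServe(":3100", nil)
}
```

Determining the actual mount point for a fullness check is the non-portable part mentioned in 1.; the probe-file approach above sidesteps that at the cost of a small periodic write.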

2. A way to set a memory limit (once Loki reaches this limit it just answers 5xx to every new POST, and stops actually reading the WAL)

I'm not yet convinced this is an idea worth exploring, but I'm open to suggestions. Solving the problem that grows ingester memory in an unacceptable way (chunks cannot flush to storage) seems like the preferable course of action first.

@dannykopping
Contributor

dannykopping commented Mar 30, 2022

Also adding this here as it's related: #3136

@stale

stale bot commented May 1, 2022

Hi! This issue has been automatically marked as stale because it has not had any
activity in the past 30 days.

We use a stalebot among other tools to help manage the state of issues in this project.
A stalebot can be very useful in closing issues in a number of cases; the most common
is closing issues or PRs where the original reporter has not responded.

Stalebots are also emotionless and cruel and can close issues which are still very relevant.

If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.

We regularly review closed issues that have a stale label, sorted by thumbs-up.

We may also:

  • Mark issues as revivable if we think an issue is valid but isn't something we are likely
    to prioritize in the future (the issue will still remain closed).
  • Add a keepalive label to silence the stalebot if the issue is very common/popular/important.

We are doing our best to respond, organize, and prioritize all issues, but it can be a challenging task;
our sincere apologies if you find yourself at the mercy of the stalebot.

stale bot added the stale (A stale issue or PR that will automatically be closed) label on May 1, 2022
@frittentheke
Contributor

I just noticed the /run/loki folder filling up most of the memory because S3 was unreachable.
An enforced limit, potentially tied to the available space, should absolutely be added to Loki so it does not push the machine into OOM.
Just returning a "503 Service Unavailable" until the issue is resolved should be enough to have the clients buffer and retry.

stale bot removed the stale (A stale issue or PR that will automatically be closed) label on May 16, 2022