
Allow to limit the "in-memory" chunks for Ingester #5721

Open
TeddyAndrieux opened this issue Mar 29, 2022 · 13 comments
Labels
component/ingester · type/feature (Something new we should do)

Comments

@TeddyAndrieux
Contributor

Is your feature request related to a problem? Please describe.
If I configure Loki to store on the filesystem and the disk used gets full, then since it cannot write to the disk, Loki will keep receiving data and filling memory until it gets OOM-killed (or the machine just crashes).

Describe the solution you'd like
To me, we should be able to configure a Loki ingester "in-memory max size" (the maximum size the "in-memory" chunks may take) so that once Loki reaches this limit it simply rejects new data with the right "retry later" return code.
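
For illustration only, here is a minimal sketch (hypothetical, not existing Loki code or configuration) of the kind of guard described above: pushes are rejected with a retryable status once the in-memory chunk data exceeds a configured limit.

```go
package main

import (
	"net/http"
	"sync/atomic"
)

// Hypothetical values: maxInMemoryBytes would come from configuration, and
// currentChunkBytes would be maintained by the ingester as chunks grow and flush.
var (
	maxInMemoryBytes  int64 = 4 << 30 // e.g. 4 GiB
	currentChunkBytes int64
)

func pushHandler(w http.ResponseWriter, r *http.Request) {
	if atomic.LoadInt64(&currentChunkBytes) >= maxInMemoryBytes {
		// 503 signals clients such as Promtail to back off and retry later.
		http.Error(w, "ingester over in-memory chunk limit, retry later", http.StatusServiceUnavailable)
		return
	}
	// ... append the samples to the in-memory head chunk here ...
	w.WriteHeader(http.StatusNoContent)
}

func main() {
	http.HandleFunc("/loki/api/v1/push", pushHandler)
	http.ListenAndServe(":3100", nil)
}
```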

Describe alternatives you've considered
If Loki is deployed as a Pod in Kubernetes we could just set a memory limit, but IMHO that's not a good solution, since it means that once Loki reaches this limit the Pod just gets OOM-killed and we likely lose all "in-memory" data.

Additional context

@dannykopping added the type/feature (Something new we should do) and component/ingester labels on Mar 29, 2022
@TeddyAndrieux
Contributor Author

Also, maybe Loki should expose two different endpoints (rough sketch below):

  • /health, which basically does what the /ready endpoint does today
  • /ready, which also checks that we can actually write chunks (or maybe just that we can write to the WAL)
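
For what it's worth, a rough sketch of that split, using a probe-file check as a stand-in for whatever real WAL writability check would be used (walDir and the probe are hypothetical, not Loki's actual readiness code):

```go
package main

import (
	"net/http"
	"os"
	"path/filepath"
)

const walDir = "/loki/wal" // hypothetical WAL location

// walWritable probes writability by creating and removing a small file.
func walWritable() bool {
	probe := filepath.Join(walDir, ".ready-probe")
	if err := os.WriteFile(probe, []byte("ok"), 0o600); err != nil {
		return false
	}
	os.Remove(probe)
	return true
}

func main() {
	// /health: the process is up and serving (what /ready does today).
	http.HandleFunc("/health", func(w http.ResponseWriter, r *http.Request) {
		w.WriteHeader(http.StatusOK)
	})
	// /ready: additionally require that the WAL can actually be written.
	http.HandleFunc("/ready", func(w http.ResponseWriter, r *http.Request) {
		if !walWritable() {
			http.Error(w, "WAL not writable", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	http.ListenAndServe(":3100", nil)
}
```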

@chaudum
Contributor

chaudum commented Mar 29, 2022

If I configure Loki to store on the filesystem and the disk used gets full, then since it cannot write to the disk, Loki will keep receiving data and filling memory until it gets OOM-killed (or the machine just crashes).

To me, these are two independent areas:

  1. The disk filling up so the WAL cannot be written any more. This usually happens first, before chunks fill up memory.
  2. Chunks filling up memory and causing the process to run OOM.

Both should be handled independently.

To me, we should be able to configure a Loki ingester "in-memory max size" (the maximum size the "in-memory" chunks may take) so that once Loki reaches this limit it simply rejects new data with the right "retry later" return code.

Instead of manually configuring the limit, what about having a "dynamic" limit based on the available memory?

@dannykopping
Contributor

I think the reason why a WAL append fails is somewhat immaterial - in all cases we should fail the write with a 5xx response, which clients like Promtail interpret as "retry later".

The disk filling up though is a case we have specific handling for, and we can expand this.

Instead of manually configuring the limit, what about having a "dynamic" limit based on the available memory?

I'm not sure we need to go this far; this could backfire in some scary ways as well. Let's focus (at least in this issue) on how the ingesters behave when the WAL cannot be appended, and not on in-memory chunk behaviour IMHO; if we define a threshold of disk fullness at which we start to reject writes, that'll save the ingester from having a large increase in memory usage anyway.
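
As an illustration of the "threshold of disk fullness" idea (a sketch under assumptions, not an existing Loki option): check how full the filesystem holding the WAL is before accepting a write, and reject with a retryable 503 above a configurable threshold. The path and threshold are hypothetical, and syscall.Statfs as written here is Linux-specific.

```go
package main

import (
	"fmt"
	"net/http"
	"syscall"
)

const (
	walPath           = "/loki/wal" // hypothetical WAL mount path
	fullnessThreshold = 0.95        // start rejecting writes at 95% used
)

// diskFullness returns used/total for the filesystem containing path.
func diskFullness(path string) (float64, error) {
	var st syscall.Statfs_t
	if err := syscall.Statfs(path, &st); err != nil {
		return 0, err
	}
	total := float64(st.Blocks) * float64(st.Bsize)
	avail := float64(st.Bavail) * float64(st.Bsize)
	return (total - avail) / total, nil
}

func pushHandler(w http.ResponseWriter, r *http.Request) {
	if used, err := diskFullness(walPath); err == nil && used >= fullnessThreshold {
		http.Error(w, fmt.Sprintf("WAL disk %.0f%% full, retry later", used*100), http.StatusServiceUnavailable)
		return
	}
	// ... append to the WAL and the in-memory head chunk ...
	w.WriteHeader(http.StatusNoContent)
}

func main() {
	http.HandleFunc("/loki/api/v1/push", pushHandler)
	http.ListenAndServe(":3100", nil)
}
```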

@dannykopping
Contributor

dannykopping commented Mar 29, 2022

Also, maybe Loki should expose two different endpoints:

  • /health, which basically does what the /ready endpoint does today
  • /ready, which also checks that we can actually write chunks (or maybe just that we can write to the WAL)

Agreed, the health/ready check should take WAL appendability into account; however, a failing health check can also remove the ingester from the ring, which may not be desirable.

@TeddyAndrieux
Contributor Author

To me, these are two independent areas:

  1. The disk filling up so the WAL cannot be written any more. This usually happens first, before chunks fill up memory.
  2. Chunks filling up memory and causing the process to run OOM.

Both should be handled independently.

I totally agree that 1. is not the point of this feature request (ideally it would be nice to have a retention size so that old chunks get automatically cleaned up when we reach it), but this feature request is really about 2.

Instead of manually configuring the limit, what about having a "dynamic" limit based on the available memory?

I'm not entirely sure what you mean by "dynamic"; do you mean we set a percentage so that the ingester checks the total memory and "computes" its own limit? That's fine with me as well.

if we define a threshold of disk fullness at which we start to reject writes, that'll save the ingester from having a large increase in memory usage anyway.

Indeed, like a retention size

@dannykopping
Contributor

dannykopping commented Mar 29, 2022

I totally agree that 1. is not the point of this feature request (ideally it would be nice to have a retention size so that old chunks get automatically cleaned up when we reach it), but this feature request is really about 2.

I don't think we should change the way we buffer chunks in memory, but rather address the underlying cause of too many chunks being buffered - which is the WAL append failure behaviour.

@TeddyAndrieux
Contributor Author

From what I see it's not the same root cause: if you are able to write the WAL but not able to write the data at the end, you will still consume all the memory for chunks.
I think in most cases when you use filesystem storage you likely use the same disk for the WAL and the data, so they are related, but they may not be in some setups.

I don't know whether the same can happen with other configurations, like Cassandra/S3 or any other storage: Loki being unable to write to its storage, so it keeps all chunks in memory (even though it can still write the WAL).

@dannykopping
Contributor

Indeed. The WAL and the chunk storage can be on separate disks, or on the same disk.

I think in most cases when you use filesystem storage you likely use the same disk for the WAL and the data, so they are related, but they may not be in some setups.

It's definitely not recommended (or really even practical) for them to be on the same disk, but for the sake of argument, let's assume they can be.

If the WAL can't be appended to, we should have an option to fail the writes. Having a memory threshold for the chunks is probably not worthwhile here. WAL append failures are pretty catastrophic, and replication should keep a single ingester's full WAL disk from causing data loss.

If the chunk storage can't be written to, yeah maybe then we could have a memory threshold for that case.

The problem, of course, is that this threshold could be reached even when there isn't a direct problem like the WAL or chunk disk being full, for example during a huge spike in traffic. In the case of a traffic spike you want to try to receive all the data you can, and even if you cause an OOM (there's no way around limited resources) you can still recover from the WAL. However, WAL replay is expensive, so yes - maybe a memory threshold is the right approach here for the chunk-storage case.

@TeddyAndrieux
Contributor Author

Fine, I agree.

To me, we should have both, as you said:
1. A check that we can write to the WAL
2. A way to set a memory limit (once Loki reaches this limit it just answers 5xx to every new POST, and stops actually reading the WAL)

@dannykopping
Contributor

I was wrong about the behaviour of the ingester, sorry for the confusion.

The ingester will receive samples into memory and then produce "chunks", which are then flushed to object storage. When this happens (and after chunk_idle_period), the chunk is purged from memory. I believe the only way an ingester will fill up its memory is if the chunks cannot be flushed successfully to the chunk store fast enough. More details here.

Here's what I think we could do moving forward (a rough sketch follows the list):

  1. Have a background process which checks the "health" of the WAL disk - i.e. is it full (or nearly full)? Before appending the samples to the head chunk, check if the WAL is ready. If not, reject writes until the WAL is ready again.
     • This is a bit complex, because determining the mount path on which the WAL resides is tricky and not portable across OSes.
  2. If the chunks cannot flush to the chunk store, measure the size of the chunk flush queue and, if it becomes dangerously long, start rejecting writes.
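
A rough sketch of those two checks combined (everything here - names, thresholds, the probe-file health check - is hypothetical, not existing Loki behaviour): a background goroutine keeps a WAL "ready" flag up to date, and the push path also rejects writes when the flush queue grows too long.

```go
package main

import (
	"net/http"
	"os"
	"path/filepath"
	"sync/atomic"
	"time"
)

const (
	walDir           = "/loki/wal"      // hypothetical
	maxFlushQueueLen = 10000            // hypothetical "dangerously long" threshold
	walCheckInterval = 15 * time.Second // how often the background check runs
)

var (
	walReady      atomic.Bool
	flushQueueLen atomic.Int64 // would be updated by the real flush loop
)

// watchWAL periodically verifies that the WAL directory is writable; a real
// implementation would check disk fullness on the WAL's mount point instead.
func watchWAL() {
	for {
		probe := filepath.Join(walDir, ".wal-probe")
		err := os.WriteFile(probe, []byte("ok"), 0o600)
		os.Remove(probe)
		walReady.Store(err == nil)
		time.Sleep(walCheckInterval)
	}
}

func pushHandler(w http.ResponseWriter, r *http.Request) {
	if !walReady.Load() {
		http.Error(w, "WAL not writable, retry later", http.StatusServiceUnavailable)
		return
	}
	if flushQueueLen.Load() > maxFlushQueueLen {
		http.Error(w, "chunk flush queue backed up, retry later", http.StatusServiceUnavailable)
		return
	}
	// ... append to the WAL and head chunk, enqueue full chunks for flushing ...
	w.WriteHeader(http.StatusNoContent)
}

func main() {
	walReady.Store(true)
	go watchWAL()
	http.HandleFunc("/loki/api/v1/push", pushHandler)
	http.ListenAndServe(":3100", nil)
}
```

Determining the actual mount point for a fullness check is the non-portable part mentioned in 1.; the probe-file approach above sidesteps that at the cost of a small periodic write.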

2. A way to set a memory limit (once Loki reaches this limit it just answers 5xx to every new POST, and stops actually reading the WAL)

I'm not yet convinced this is an idea worth exploring, but I'm open to suggestions. Solving the problem that grows ingester memory in an unacceptable way (chunks cannot flush to storage) seems like the preferable course of action first.

@dannykopping
Contributor

dannykopping commented Mar 30, 2022

Also adding this here as it's related: #3136

@stale

stale bot commented May 1, 2022

Hi! This issue has been automatically marked as stale because it has not had any
activity in the past 30 days.

We use a stalebot among other tools to help manage the state of issues in this project.
A stalebot can be very useful in closing issues in a number of cases; the most common
is closing issues or PRs where the original reporter has not responded.

Stalebots are also emotionless and cruel and can close issues which are still very relevant.

If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.

We regularly review closed issues that have a stale label, sorted by thumbs-up.

We may also:

  • Mark issues as revivable if we think an issue is valid but isn't something we are likely
    to prioritize in the future (the issue will still remain closed).
  • Add a keepalive label to silence the stalebot if the issue is very common/popular/important.

We are doing our best to respond, organize, and prioritize all issues, but it can be a challenging task;
our sincere apologies if you find yourself at the mercy of the stalebot.

stale bot added the stale (A stale issue or PR that will automatically be closed) label on May 1, 2022
@frittentheke
Contributor

I just noticed the /run/loki folder filling up most of the memory because S3 was unreachable.
An enforced limit, potentially tied to the available space, should absolutely be added to Loki so it does not push the machine into OOM.
Just returning a "503 Service Unavailable" until the issue is resolved should be enough to have the clients buffer and retry.

stale bot removed the stale (A stale issue or PR that will automatically be closed) label on May 16, 2022