Allow to limit the "in-memory" chunks for Ingester #5721
Also, maybe Loki should expose two separate endpoints (health and readiness).
To me, these are two independent areas:
1. how the ingester behaves when its disk fills up and the WAL can no longer be appended;
2. limiting how much memory the in-memory chunks are allowed to consume.

Both should be handled independently.
Instead of manually configuring the limit, what about having a "dynamic" limit based on the available memory?
I think the reason why a WAL append fails is somewhat immaterial: in all cases we should fail the write. The disk filling up, though, is a case we have specific handling for, and we can expand on this.
I'm not sure we need to go this far; this could backfire in some scary ways as well. Let's focus (at least in this issue) on how the ingesters behave when the WAL cannot be appended, and not on in-memory chunk behaviour IMHO; if we define a threshold of disk fullness at which we start to reject writes, that'll save the ingester from having a large increase in memory usage anyway.
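For illustration only, here is a minimal Go sketch (not Loki's code) of the kind of disk-fullness check described above: stat the WAL filesystem and start rejecting writes once a configurable fullness threshold is crossed. The `/loki/wal` path and the 0.95 threshold are made-up values, not real Loki settings.

```go
package main

import (
	"fmt"
	"syscall"
)

// diskFullness returns the used fraction of the filesystem containing path
// (Linux/Unix only; uses statfs).
func diskFullness(path string) (float64, error) {
	var st syscall.Statfs_t
	if err := syscall.Statfs(path, &st); err != nil {
		return 0, err
	}
	total := st.Blocks * uint64(st.Bsize)
	avail := st.Bavail * uint64(st.Bsize)
	if total == 0 {
		return 0, fmt.Errorf("statfs reported zero blocks for %s", path)
	}
	return 1 - float64(avail)/float64(total), nil
}

func main() {
	// Hypothetical values: neither the path nor the threshold is a real Loki setting.
	const walDir = "/loki/wal"
	const rejectAboveFullness = 0.95

	used, err := diskFullness(walDir)
	if err != nil {
		fmt.Println("cannot stat WAL disk, rejecting writes to be safe:", err)
		return
	}
	if used >= rejectAboveFullness {
		fmt.Printf("WAL disk %.0f%% full: rejecting writes\n", used*100)
	} else {
		fmt.Printf("WAL disk %.0f%% full: accepting writes\n", used*100)
	}
}
```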
Agreed, the health/ready check should take the WAL appendability into account; however, a failing health check can also remove the ingester from the ring, which may not be desirable.
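As a rough sketch of the distinction being made here, below is a hypothetical readiness endpoint (separate from a liveness/health check) that reports unready when the WAL directory can no longer be written to. This is not how Loki actually wires up its checks, and the probe-file approach is purely illustrative; a real implementation would reuse the WAL's own error state.

```go
package main

import (
	"log"
	"net/http"
	"os"
	"path/filepath"
)

// walAppendable does a cheap probe: can we still create and write a file in
// the WAL directory?
func walAppendable(walDir string) bool {
	probe := filepath.Join(walDir, ".ready-probe")
	f, err := os.Create(probe)
	if err != nil {
		return false
	}
	_, werr := f.Write([]byte("ok"))
	f.Close()
	os.Remove(probe)
	return werr == nil
}

func main() {
	// Failing readiness stops new traffic being routed to this ingester without
	// necessarily marking the process unhealthy (and, e.g., restarting it).
	http.HandleFunc("/ready", func(w http.ResponseWriter, r *http.Request) {
		if !walAppendable("/loki/wal") { // hypothetical WAL path
			http.Error(w, "WAL not appendable", http.StatusServiceUnavailable)
			return
		}
		w.WriteHeader(http.StatusOK)
	})
	log.Fatal(http.ListenAndServe(":3100", nil))
}
```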
I totally agree that 1. is not the point of this feature request (ideally it could be cool to have a retention size so that old chunks get automatically cleaned up when we reach that size), but the point of this feature request is really about 2.
I'm not entirely sure what you mean by "dynamic". Do you mean we set a percentage, so that the ingester checks the total memory and computes its limit from that? That's fine for me as well.
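To make the "percentage of total memory" idea concrete, here is a small Go sketch (Linux-only, not Loki code) that derives a chunk-memory limit from MemTotal in /proc/meminfo; the 50% fraction is an arbitrary example, not a real configuration option.

```go
package main

import (
	"bufio"
	"fmt"
	"os"
	"strconv"
	"strings"
)

// totalMemoryBytes reads MemTotal from /proc/meminfo (Linux only).
func totalMemoryBytes() (uint64, error) {
	f, err := os.Open("/proc/meminfo")
	if err != nil {
		return 0, err
	}
	defer f.Close()

	sc := bufio.NewScanner(f)
	for sc.Scan() {
		// The line looks like: "MemTotal:       16384000 kB"
		fields := strings.Fields(sc.Text())
		if len(fields) >= 2 && fields[0] == "MemTotal:" {
			kb, err := strconv.ParseUint(fields[1], 10, 64)
			if err != nil {
				return 0, err
			}
			return kb * 1024, nil
		}
	}
	return 0, fmt.Errorf("MemTotal not found in /proc/meminfo")
}

func main() {
	// Hypothetical setting: let in-memory chunks use at most 50% of system memory.
	const chunkMemoryFraction = 0.5

	total, err := totalMemoryBytes()
	if err != nil {
		fmt.Println("could not read total memory, falling back to a static limit:", err)
		return
	}
	limit := uint64(float64(total) * chunkMemoryFraction)
	fmt.Printf("dynamic in-memory chunk limit: %d bytes\n", limit)
}
```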
Indeed, like a retention size.
I don't think we should change the way we buffer chunks in memory, but rather address the underlying cause of too many chunks being buffered - which is the WAL append failure behaviour.
From what I see it's not the same root cause: if you are able to write to the WAL but unable to write the data to storage at the end, you will still consume all the memory for chunks. I don't know if the same can happen with other configurations, like Cassandra/S3 or whatever other storage: Loki being unable to write to its storage, so it keeps all chunks in memory (even though it can still write the WAL).
Indeed. The WAL and the chunk storage can be on separate disks, or on the same disk.
It's definitely not recommended (or really even practical) for them to be on the same disk, but for the sake of argument, let's assume they can be.

If the WAL can't be appended to, we should have an option to fail the writes. Having a memory threshold for the chunks is probably not worthwhile here: WAL append failures are pretty catastrophic, and replication should keep a single ingester's full WAL disk from causing data loss.

If the chunk storage can't be written to, then yes, maybe we could have a memory threshold for that case. The problem, of course, is that the threshold could be reached even without a direct problem like the WAL or chunk disk being full, for example during a huge spike in traffic. In a traffic spike you want to try to receive all the data you can, and even if you cause an OOM (there's no way around limited resources) you can still recover from the WAL. However, WAL replay is expensive, so yes, maybe a memory threshold is the right approach here for the chunk storage case.
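For illustration only, here is a rough Go sketch of what a "memory threshold for the chunk storage case" could look like on the push path: reject writes with a retry-later status once buffered chunk bytes exceed a limit. None of this is Loki's actual implementation; the counter, the limit value, and the handler are made up (only the push path `/loki/api/v1/push` is a real Loki endpoint).

```go
package main

import (
	"log"
	"net/http"
	"sync/atomic"
)

// chunkBytes would be updated by the ingester as chunks are created, grown, and flushed.
var chunkBytes atomic.Int64

// maxChunkBytes is the hypothetical "in-memory max size" discussed in this issue.
const maxChunkBytes = 4 << 30 // 4 GiB, made-up value

func pushHandler(w http.ResponseWriter, r *http.Request) {
	if chunkBytes.Load() >= maxChunkBytes {
		// "Retry later": 429 lets well-behaved clients back off and retry
		// instead of the ingester buffering until it is OOM-killed.
		w.Header().Set("Retry-After", "5")
		http.Error(w, "in-memory chunks over limit, retry later", http.StatusTooManyRequests)
		return
	}
	// ...decode the push request, append to chunks, and account for the growth:
	// chunkBytes.Add(int64(bytesAppended))
	w.WriteHeader(http.StatusNoContent)
}

func main() {
	http.HandleFunc("/loki/api/v1/push", pushHandler)
	log.Fatal(http.ListenAndServe(":3100", nil))
}
```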
Fine, I agree. To me, we should have both, as you said.
I was wrong about the behaviour of the ingester, sorry for the confusion. The ingester will receive samples into memory and then produce "chunks", which are then flushed to object storage. When this happens (and after a configurable retention period), the chunks are removed from memory. Here's what I think we could do moving forward:
I'm not yet convinced this is an idea worth exploring, but I'm open to suggestions. Solving the problem that grows ingester memory in an unacceptable way (chunks cannot flush to storage) seems like the preferable first course of action.
Also adding this here as it's related: #3136
Hi! This issue has been automatically marked as stale because it has not had any recent activity.

We use a stalebot among other tools to help manage the state of issues in this project. Stalebots are also emotionless and cruel and can close issues which are still very relevant.

If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry. We regularly sort closed issues that have a stale label by thumbs-ups. We may also:

We are doing our best to respond, organize, and prioritize all issues, but it can be a challenging task.
I just noticed the
Is your feature request related to a problem? Please describe.
If I configure Loki to store on the filesystem and the disk it uses gets full, then since it cannot write to the disk, Loki will keep receiving data and fill up memory until it gets OOM-killed (or the machine just crashes).
Describe the solution you'd like
To me, we should be able to configure the Loki ingester's "in-memory max size" (the maximum size the in-memory chunks may take) so that once Loki reaches this limit, it just rejects new data with the right "retry later" return code.
Describe alternatives you've considered
If Loki is deployed as a Pod in Kubernetes we could just set a memory limit, but IMHO that's not a good solution, since it means that once Loki reaches the limit, the Pod just gets OOM-killed and we likely lose all in-memory data.
Additional context