Wait on all ingesters to complete processing before returning success #6241
Conversation
./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+ ingester 0%
- distributor -0.1%
+ querier 0%
+ querier/queryrange 0%
+ iter 0%
+ storage 0%
+ chunkenc 0%
+ logql 0%
+ loki 0%
Force-pushed from 7dd0f7f to 7df0598
./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+ ingester 0%
- distributor -0.4%
+ querier 0%
+ querier/queryrange 0%
+ iter 0%
+ storage 0%
+ chunkenc 0%
+ logql 0%
+ loki 0%
@splitice The implementation looks to me like it defeats the mechanics of Loki's
That's right, this is intentional, based on the replication factor and the DynamoDB paper: https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf
Nothing should have changed with regard to replication. A replication factor of 3 means send to 3 but be successful if 2 (or more) succeed. This was true before, and it is still true in this PR.

What has changed is that if a single ingester is struggling with the load and takes longer than the other 2, the request won't complete until that ingester finishes processing (it doesn't matter if it errors). Without this, it's possible to build up tens or hundreds of thousands of push log requests between the distributor and a slower ingester, which leads to a blow-out of memory.

The simplest way to resolve this is to still require two successful ingesters. This does not change replication behaviour; it just adds a requirement to wait before answering the caller (and hence the gRPC request) until all ingesters have processed the request, be it success or failure. The resulting state once all ingesters are complete will reflect the replication conditions. Any replication error condition (more than 1 failure for a replication factor of 3) exits early. It may still be advisable to wait on all ingesters in that case too, but this would slow down retries (so perhaps not ideal).
Basically, in pseudo code the logic is:
Replication failure is the reverse of the replication guarantee: the number of failures required to make certain the replication guarantee can't be met (for a replication factor of 3, that's anything more than a single failure).
I think I'm understanding the intent of this PR, however as Cyril and Peri have both mentioned, it breaks a very much intended behavior of how Loki works today. Specifically, Loki doesn't wait for all ingesters in a replication set to respond before sending a successful response; this is the intended design. As soon as a stream meets the replication requirements the channel is notified, and when all streams have met the replication requirements the successful response is sent. I don't think we have a lot of appetite for changing this behavior. That being said, I can't say we've seen, noticed, or had any problems similar to the one you are describing.

loki/pkg/distributor/distributor.go, line 338 in b90c460
The timeout configured here is 5 seconds by default. I am wondering whether lowering it to something like 1s or 2s would help alleviate the problems you are describing without having to change the current designed behavior?
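Assuming the timeout referred to at that line is the distributor's ingester client timeout, lowering it might look like the following configuration fragment (a sketch; check the configuration reference for your Loki version before relying on the block or field names):

```yaml
# Hypothetical override: shrink the distributor -> ingester push timeout
# from its 5s default so a slow ingester fails requests quickly instead
# of letting them queue up in memory.
ingester_client:
  remote_timeout: 1s
```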
I'm not going to go into how many clusters I manage that have seen this issue; it's moot if it won't merge.

There is no world where spikes to 100GB+ of RAM (or swap) at the ingester level are expected by users. Until you bring this under control, the natural lifecycle of any slow ingester will be to crash. You may call that a feature; I'll call it a bug. We can have a difference of opinion, and you may close this PR.

Then you may close this. It will either be you or (as is the fate of most PRs) the stalebot. To anyone who finds this in the future: perhaps check out our fork. If we are still running Loki then this will likely be a maintained patch, along with a bunch of other patches including mmap for chunks, block idle cutting, and a more advanced memory management model with disk support inspired by KeyDB (all victims of the stalebot). Since we aren't maintaining it to merge, you will likely need to pick out the changes you desire from the larger patchset.
I doubt it. Going from 5 to 1 second would, back of the envelope, mean peaks of at least 20% of 100GB (20GB). That's still swapping out useless data for most people and slowing down ingestion massively.
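The back-of-envelope estimate above appears to assume that queued-request memory scales roughly linearly with how long requests can remain in flight, i.e.:

```latex
\text{peak}_{1\,\mathrm{s}} \approx \frac{1\,\mathrm{s}}{5\,\mathrm{s}} \times 100\,\mathrm{GB} = 20\,\mathrm{GB}
```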
I'm trying to better understand how your situation differs from ours. The numbers you describe for memory peaks don't resemble behaviors we see, and I wonder if we run Loki in very different ways? Would you be interested in chatting in real life? I'd love to better understand how you are using Loki. Feel free to DM me on grafana.slack.com @ewelch or on Twitter @edapted.
I doubt it. Probably the only thing we do differently from Grafana Cloud (but the same as many other users) is run a shared cluster: shared with other services, and hosted on shared-CPU instances of a managed K8s provider. This means somewhat variable individual node performance (even more so if the memory completely blows out due to bugs like this). From my point of view, when I saw the same troublesome behaviour on a second cluster under management, it became clear it was systemic (and not specific to that one cluster's usage patterns). This prompted the necessary investigation, and this fix has resolved the issue completely.

Anyway, if you ever want to replicate the bug, simply take a cluster, limit one node to a small CPU limit (at an extreme, say 1000m for 2000m of work), and load it up. You will see it: memory on the slow node will blow out to massive proportions, further slowing the ingestion rate of that node, leading to more blow-out, until eventually it OOMs (if swap is enabled, it will first fill swap).

Anyway, you may close this PR. I'm not going to argue the case for an unwanted bug fix. We contribute upstream to give back, nothing more. It's no skin off my back for this to be closed.
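The reproduction described above could look roughly like this on Kubernetes (names and image tag are illustrative, not from the thread):

```yaml
# Illustrative only: throttle one ingester to well under the CPU its
# workload needs (e.g. 1000m for ~2000m of work), drive load at the
# cluster, and watch that pod's memory grow until it OOMs.
apiVersion: v1
kind: Pod
metadata:
  name: ingester-throttled   # hypothetical pod name
spec:
  containers:
    - name: ingester
      image: grafana/loki:latest   # any affected version
      resources:
        limits:
          cpu: "1000m"
```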
Hi! This issue has been automatically marked as stale because it has not had any activity.

We use a stalebot among other tools to help manage the state of issues in this project. Stalebots are also emotionless and cruel and can close issues which are still very relevant. If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry. We regularly sort for closed issues which have a

We may also:

We are doing our best to respond, organize, and prioritize all issues, but it can be a challenging task,
Go away |
It looks like there is some room to chat about usage and expectations, but this breaks our intended behavior. It's also been stale for quite some time, so I'm going to close this.
Wait on all ingesters to complete processing before returning success
Error conditions still return immediately.
What this PR does / why we need it:

This fixes a build-up of in-flight requests between the distributor and an ingester when one of the ingesters in the set is slower than the rest.

Which issue(s) this PR fixes:
This also allows adaptive request concurrency in Vector to actually get the feedback it needs to prevent peaks in log volume from overloading the cluster.
Special notes for your reviewer:
This also helps performance, as Loki tends to get even slower at processing work when the number of in-flight requests is massive (likely due to increased allocations making garbage collection take longer and longer).