
Wait on all ingesters to complete processing before returning success #6241

Closed
wants to merge 1 commit

Conversation

splitice
Contributor

Wait on all ingesters to complete processing before returning success

Errors still cause the request to return an error immediately

What this PR does / why we need it:

Which issue(s) this PR fixes:
This fixes a build-up of in-flight requests between the distributor and ingesters when one of the ingesters in the set is slower than the rest.

This also allows adaptive request concurrency in Vector to actually get the feedback it needs to prevent log peaks from overloading the cluster.

Special notes for your reviewer:

This also helps performance, as Loki tends to get even slower at processing work when the number of in-flight requests is massive (likely due to increased allocations leading to garbage collection taking longer and longer).

@splitice splitice requested a review from a team as a code owner May 25, 2022 03:44
@grafanabot
Collaborator

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
-        distributor	-0.1%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

@splitice splitice marked this pull request as draft May 25, 2022 04:45
@splitice splitice force-pushed the feature/wait-on-all-ingesters branch from 7dd0f7f to 7df0598 Compare May 25, 2022 04:51
@grafanabot
Collaborator

./tools/diff_coverage.sh ../loki-main/test_results.txt test_results.txt ingester,distributor,querier,querier/queryrange,iter,storage,chunkenc,logql,loki

Change in test coverage per package. Green indicates 0 or positive change, red indicates that test coverage for a package fell.

+           ingester	0%
-        distributor	-0.4%
+            querier	0%
+ querier/queryrange	0%
+               iter	0%
+            storage	0%
+           chunkenc	0%
+              logql	0%
+               loki	0%

@splitice splitice marked this pull request as ready for review May 25, 2022 05:32
@periklis
Collaborator

@splitice The implementation looks to me like it defeats the mechanics of Loki's replication_factor, i.e. a replication factor of 2 means wait for 2 ingesters at most to succeed processing the logs. Waiting for all ingesters at all times sounds to me like full replication of every stream. WDYT? (cc @cyriltovena @owen-d )

@cyriltovena
Contributor

That's right, this is intentional; it's based on the replication factor and the Dynamo paper.

https://www.allthingsdistributed.com/files/amazon-dynamo-sosp2007.pdf

@splitice
Contributor Author

splitice commented May 25, 2022

i.e. a replication factor of 2 means wait for 2 ingesters at most to succeed processing the logs.

Nothing should have changed in regards to replication.

A replication factor of 3 means send to 3 ingesters but report success if 2 (or more) succeed. This was true before, and it is still true in this PR.

What has changed is that, in the event a single ingester is struggling with the load and takes longer than the other 2, the request won't complete until that ingester finishes processing (it doesn't matter if it errors).

Without this, it's possible to build up tens or hundreds of thousands of log push requests between the distributor and a slower ingester, which leads to a blow-out of memory.

The simplest method to resolve this is to still require two successful ingesters. This does not change replication behaviour. It just adds a requirement to wait, before telling the caller (and hence completing the gRPC request), until all ingesters have processed the request (whether successfully or not).

The subsequent state once all ingesters are complete will reflect the replication conditions. Any replication error condition (more than 1 failure for a replication factor of 3) exits early. It may still be advisable to wait on all ingesters in this case too, but that would slow down retries (so perhaps not ideal).
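
For illustration only, the thresholds described here can be written as a tiny Go helper; this is a sketch with assumed names, not code from the Loki distributor:

package main

import "fmt"

// quorum is the number of successful ingesters required for a write, and
// maxFailures is how many failures break the replication guarantee.
// These mirror the description above (e.g. RF=3: 2 successes, 1 tolerated failure).
func quorum(replicationFactor int) int {
	return replicationFactor/2 + 1
}

func maxFailures(replicationFactor int) int {
	return replicationFactor - quorum(replicationFactor)
}

func main() {
	rf := 3
	fmt.Printf("replication factor %d: need %d successes, tolerate %d failure(s)\n",
		rf, quorum(rf), maxFailures(rf))
}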

@splitice
Contributor Author

Basically, in pseudocode, the logic is:

for each ingester:
   if any sample fails enough to result in replication failure:
     send error
   else loop through samples until done
   when done, if we are the last ingester send done signal

Replication failure is the inverse of the replication guarantee: the number of failures required to make sure the replication guarantee can't be met (for a replication factor of 3, it's anything more than a single failure).
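
A hedged Go translation of that pseudocode might look like the sketch below. It is illustrative only: the types, names, and per-ingester (rather than per-sample) failure accounting are assumptions, not the actual distributor code.

package main

import (
	"errors"
	"fmt"
	"sync/atomic"
)

// Hypothetical types, for illustration only.
type sample struct{ line string }

type ingesterClient interface {
	push(samples []sample) error
}

// pushToIngesters sends samples to every ingester in the replication set.
// It returns an error as soon as replication can no longer be met, but only
// returns success once every ingester has finished processing.
func pushToIngesters(ingesters []ingesterClient, samples []sample, maxFailures int) error {
	var (
		failed   int32
		finished int32
		errCh    = make(chan error, 1)
		doneCh   = make(chan struct{})
	)
	for _, ing := range ingesters {
		go func(ing ingesterClient) {
			// "loop through samples until done"
			err := ing.push(samples)
			if err != nil && atomic.AddInt32(&failed, 1) > int32(maxFailures) {
				// "if enough fails to result in replication failure: send error"
				select {
				case errCh <- err:
				default:
				}
			}
			// "when done, if we are the last ingester send done signal"
			if atomic.AddInt32(&finished, 1) == int32(len(ingesters)) {
				close(doneCh)
			}
		}(ing)
	}
	select {
	case err := <-errCh: // replication guarantee broken: fail immediately
		return err
	case <-doneCh: // all ingesters finished: safe to report success
		return nil
	}
}

// fakeIngester is a trivial in-memory implementation used only for the example run.
type fakeIngester struct{ err error }

func (f fakeIngester) push(_ []sample) error { return f.err }

func main() {
	ingesters := []ingesterClient{
		fakeIngester{},
		fakeIngester{},
		fakeIngester{err: errors.New("slow ingester failed")},
	}
	// Replication factor 3 tolerates 1 failure, so this still succeeds, but
	// only after all three ingesters (including the failing one) return.
	err := pushToIngesters(ingesters, []sample{{line: "hello"}}, 1)
	fmt.Println("push result:", err)
}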

@slim-bean
Collaborator

I think I understand the intent of this PR; however, as Cyril and Peri have both mentioned, it breaks a very much intended behavior of how Loki works today.

Specifically, Loki doesn't wait for all ingesters in a replication set to respond before sending a successful response; this is the intended design. As soon as a stream meets the replication requirements the channel is notified, and when all streams have met the replication requirements the successful response is sent.

I don't think we have a lot of appetite for changing this behavior.

That being said, I can't say we've seen, noticed, or had any problems here similar to the one you are describing.

localCtx, cancel := context.WithTimeout(context.Background(), d.clientCfg.RemoteTimeout)

The timeout configured here defaults to 5 seconds. I am wondering whether lowering it to something like 1s or 2s would help alleviate the problems you are describing without having to change the current designed behavior?
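
For anyone who wants to experiment with that suggestion, the change would be a config tweak along these lines (a sketch only; verify the ingester_client block and remote_timeout key against your Loki version's configuration reference):

# Hypothetical Loki config excerpt: shorten the distributor -> ingester push
# timeout from its 5s default to 2s.
ingester_client:
  remote_timeout: 2s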

@splitice
Contributor Author

splitice commented Jun 3, 2022

I'm not going to go into how many clusters I manage that have seen this issue; it's moot if this won't be merged.

Specifically, Loki doesn't wait for all ingesters in a replication set to respond before sending a successful response; this is the intended design. As soon as a stream meets the replication requirements the channel is notified, and when all streams have met the replication requirements the successful response is sent.

There is no world in which users expect spikes to 100GB+ of RAM (or swap) at the ingester level. Until this is brought under control, the natural lifecycle of any slow ingester will be to crash.

You may call that a feature; I'll call it a bug. We can have a difference of opinion, and you may close this PR.

I don't think we have a lot of appetite for changing this behavior.

Then you may close this. It will either be you or (as is the fate of most PRs) the stalebot.

To anyone who finds this in the future: perhaps check out our fork. If we are still running Loki, this will likely be a maintained patch, along with a bunch of other patches including mmap for chunks, idle block cutting, and a more advanced memory management model with disk support inspired by KeyDB (all victims of the stalebot).

Since we aren't maintaining it for merging upstream, you will likely need to pick out the changes you want from the larger patchset.

The timeout configured here defaults to 5 seconds. I am wondering whether lowering it to something like 1s or 2s would help alleviate the problems you are describing without having to change the current designed behavior?

I doubt it. Going from 5 seconds to 1 second would, back of the envelope, mean peaks of at least 20% of 100GB (20GB). That's still swapping out useless data for most people and slowing down ingestion massively.

@slim-bean
Collaborator

I'm trying to better understand how your situation differs from ours. The numbers you describe for memory peaks don't resemble the behavior we see, and I wonder if we run Loki in very different ways?

Would you be interested in chatting in real life? I'd love to better understand how you are using Loki. Feel free to DM me on grafana.slack.com @ewelch or on twitter @edapted

@splitice
Contributor Author

splitice commented Jun 3, 2022

I doubt it. Probably the only thing we do differently from Grafana Cloud (but the same as many other users) is run a shared cluster, both in that it's shared with other services and in that it's hosted on shared-CPU instances of a managed K8s provider. This means somewhat variable individual node performance (even more so if memory completely blows out due to bugs like this).

From my point of view, when I saw the same troublesome behaviour on a second cluster under management it became clear it was systemic (and not specific to that one cluster's usage patterns). This prompted the necessary investigation, and this fix has resolved the issue completely.

Anyway, if you ever want to reproduce the bug, simply take a cluster, limit one node to a small CPU limit (at an extreme, say 1000m for 2000m of work), and load it up. You will see it: memory on the slow node blows out to massive proportions, further slowing that node's ingestion rate and leading to more blow-out, until it eventually OOMs (if swap is enabled, it will fill swap first).
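
For anyone wanting to try that reproduction, the CPU starvation described here could be approximated with a Kubernetes limit on a single ingester pod, roughly as below (illustrative only; the names, image tag, and flags are assumptions, not a recommended deployment):

# Starve one ingester of CPU to reproduce the slow-ingester build-up described above.
apiVersion: v1
kind: Pod
metadata:
  name: loki-ingester-slow    # placeholder name
spec:
  containers:
    - name: ingester
      image: grafana/loki:2.5.0
      args: ["-target=ingester", "-config.file=/etc/loki/config.yaml"]
      resources:
        limits:
          cpu: "1000m"        # roughly half the CPU the ingest workload actually needs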

Anyway you may close this PR. I'm not going to argue the case for an unwanted bug fix. We contribute upstream to give back, nothing more. It's no skin off my back for this to be closed.

@stale

stale bot commented Jul 10, 2022

Hi! This issue has been automatically marked as stale because it has not had any
activity in the past 30 days.

We use a stalebot among other tools to help manage the state of issues in this project.
A stalebot can be very useful in closing issues in a number of cases; the most common
is closing issues or PRs where the original reporter has not responded.

Stalebots are also emotionless and cruel and can close issues which are still very relevant.

If this issue is important to you, please add a comment to keep it open. More importantly, please add a thumbs-up to the original issue entry.

We regularly review closed issues that have a stale label, sorted by thumbs-up.

We may also:

  • Mark issues as revivable if we think it's a valid issue but isn't something we are likely
    to prioritize in the future (the issue will still remain closed).
  • Add a keepalive label to silence the stalebot if the issue is very common/popular/important.

We are doing our best to respond to, organize, and prioritize all issues, but it can be a challenging task;
our sincere apologies if you find yourself at the mercy of the stalebot.

@stale stale bot added the stale label Jul 10, 2022
@splitice
Contributor Author

Go away

@stale stale bot removed the stale label Jul 10, 2022
@stale

stale bot commented Aug 13, 2022

Hi! This issue has been automatically marked as stale because it has not had any activity in the past 30 days. (The remainder of this notice is identical to the stale bot comment above.)

@stale stale bot added the stale label Aug 13, 2022
@splitice
Contributor Author

Go away

@stale stale bot removed the stale label Aug 13, 2022
@stale

stale bot commented Sep 21, 2022

Hi! This issue has been automatically marked as stale because it has not had any activity in the past 30 days. (The remainder of this notice is identical to the stale bot comment above.)

@stale stale bot added the stale label Sep 21, 2022
@MasslessParticle
Contributor

It looks like there is some room to chat about usage and expectations, but this breaks our intended behavior. It's also been stale for quite some time, so I'm going to close this.

Labels: size/S, stale
6 participants