Improve Span Rate Limiting #264
Comments
I hit this same error while investigating the memory leak issue in the compactors. It prevented me from creating a trace with more than 100k spans. Even though the rate-limiting strategy was local, not global, I got the same confusing message with the configuration below:
and the resulting message after trying to push 10k messages:
After some debugging, I fixed the configuration and stopped getting this problem:
Basically, the bucket size needs to be large enough to accommodate 10k spans (10000 + 1). In my case, the rate limit isn't that important.
I believe what you're describing is a slightly different issue. There are two different limits in play at the same time:
You were hitting the spans/batch limit and, yes, the error message could definitely be improved. I was hitting the spans/second limit due to bursts in spans. Under the hood, Tempo uses https://godoc.org/golang.org/x/time/rate and I believe this calculates on a per-second basis. So even though we were well below the limit when averaged over 15 seconds, we were still getting rate limited due to spikes in ingestion. I'm wondering if we should smooth out the rate limit calculation over several seconds to avoid this.
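As a rough illustration of the spike problem, the sketch below replays a bursty load against a simplified token bucket model. Everything in it is an assumption made for the example: the `batch` type, the `simulate` function, the batch sizes, and the bucket size of 25k are hypothetical, and the model is not the golang.org/x/time/rate implementation. The load averages 10k spans/second over 15 seconds, well under the 25k spans/second rate, yet most spans in each spike are rejected.

```go
package main

import "fmt"

// batch is a hypothetical arrival: n spans pushed at time `at` (seconds).
type batch struct {
	at float64
	n  int
}

// simulate replays batches (sorted by time, starting at or after 0)
// against a token bucket that starts full, refills at `rate` tokens per
// second, and holds at most `burst` tokens. It returns the number of
// spans accepted and rejected.
func simulate(rate, burst float64, batches []batch) (accepted, rejected int) {
	tokens, last := burst, 0.0
	for _, b := range batches {
		// Refill for the time elapsed since the previous batch.
		tokens += (b.at - last) * rate
		if tokens > burst {
			tokens = burst
		}
		last = b.at

		if float64(b.n) <= tokens {
			tokens -= float64(b.n)
			accepted += b.n
		} else {
			rejected += b.n
		}
	}
	return accepted, rejected
}

func main() {
	// Every 5 seconds, five 10k-span batches arrive at the same instant:
	// 150k spans over 15 seconds, an average of 10k spans/second.
	var load []batch
	for _, t := range []float64{0, 5, 10} {
		for i := 0; i < 5; i++ {
			load = append(load, batch{at: t, n: 10000})
		}
	}

	// Assumed bucket size equal to one second of tokens.
	accepted, rejected := simulate(25000, 25000, load)
	fmt.Println("burst 25k:", accepted, "accepted,", rejected, "rejected")

	// Same rate, but a bucket large enough to absorb a whole spike.
	accepted, rejected = simulate(25000, 50000, load)
	fmt.Println("burst 50k:", accepted, "accepted,", rejected, "rejected")
}
```

In this model, a larger bucket absorbs the spikes without raising the long-run rate, which is one way of smoothing the limit over several seconds.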
Hmm, OK, maybe I can research a bit more. Right now, both are using the same bucket rate limiter.
Both are used as input to the rate limiter I just described. This is where I was looking:
Do you have an idea where that other rate limiter lives?
It's the same rate limiter. When you call https://godoc.org/golang.org/x/time/rate#Limiter.Burst
Is your feature request related to a problem? Please describe.
The span rate limiter in the distributor correctly rejects spans when they cross the rate limit threshold. The Grafana Agent/OpenTelemetry Collector will then batch and resend. The problem is that the rate limit regularly trips due to spiky load, while the rate as calculated over a minute or so is well below the configured threshold.
The result is that spans are unnecessarily sent multiple times.
Describe the solution you'd like
Investigate tracking spans/second limit over a longer time frame.
Describe alternatives you've considered
Additional context
The limit on this environment is 200k spans/second, but spans are still refused when the average spans/second, calculated over a period of a minute or more, is well below that limit.
Distributors are logging:
The 25k span rate limit is from 200k / 8 distributors.
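The arithmetic behind that figure can be sketched as follows, assuming the overall limit is divided evenly across distributors. `localLimit` is a hypothetical helper written for this example, not a Tempo function.

```go
package main

import "fmt"

// localLimit splits an overall spans/second limit evenly across the
// given number of distributors. Hypothetical helper for illustration;
// it assumes an even split and returns the overall limit unchanged if
// the distributor count is not positive.
func localLimit(overallLimit float64, distributors int) float64 {
	if distributors <= 0 {
		return overallLimit
	}
	return overallLimit / float64(distributors)
}

func main() {
	// 200k spans/second shared by 8 distributors.
	fmt.Println(localLimit(200000, 8))
}
```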