Improve Span Rate Limiting #264

Closed
joe-elliott opened this issue Oct 26, 2020 · 4 comments
@joe-elliott (Member)

Is your feature request related to a problem? Please describe.
The span rate limiter in the distributor correctly rejects spans when they cross the rate limit threshold. The Grafana Agent or OpenTelemetry Collector will then batch and resend them. The problem is that the rate limit regularly trips on spiky load even though the rate, averaged over a minute or so, is well below the configured threshold.

The result is that spans are unnecessarily sent multiple times.

Describe the solution you'd like
Investigate tracking the spans/second limit over a longer time frame.

Describe alternatives you've considered

Additional context
The limit in this environment is 200k spans/second, but spans are still refused even though the average spans/second, calculated over a minute-plus window, is well below that limit.


Distributors are logging:

err="rpc error: code = ResourceExhausted desc = ingestion rate limit (25000 spans) exceeded while adding 295 spans"

The 25k span rate limit comes from 200k / 8 distributors.
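For illustration, here's a standalone sketch using a plain token bucket from golang.org/x/time/rate (the 25000 burst value and the traffic shape are assumptions for this sketch, not our actual configuration). It shows how a spike inside a single second gets refused even though the per-minute average is far below the limit:

package main

import (
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	// One distributor's share of the limit: 25000 spans/s.
	// The burst value of 25000 is an assumption for this sketch.
	lim := rate.NewLimiter(rate.Limit(25000), 25000)

	// Spiky producer: 200 batches of 500 spans land in the first second,
	// then nothing for the rest of the minute. That is 100000 spans per
	// minute, an average of roughly 1700 spans/s, far under the 25000/s limit.
	start := time.Now()
	rejected := 0
	for i := 0; i < 200; i++ {
		at := start.Add(time.Duration(i) * 5 * time.Millisecond)
		if !lim.AllowN(at, 500) {
			rejected++
		}
	}
	fmt.Printf("%d of 200 batches rejected within the hot second\n", rejected)
}

Roughly half of the batches in the hot second get rejected, even though the minute-long average never comes close to the limit.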

@calvernaz (Contributor)

I was hitting this same error while investigating the memory leak issue in the compactors. It was preventing me from creating a trace with 100k+ spans.

Even though the rate-limiting strategy was local, not global, I got the same confusing message with the configuration below:

overrides:
  ingestion_rate_limit: 100000
  ingestion_max_batch_size: 1000

and the resulting message after trying to push 10k spans:

err="rpc error: code = ResourceExhausted desc = ingestion rate limit (100000 spans) exceeded while adding 10001 spans"

After some debugging, I fixed the configuration and stopped getting this problem:

overrides:
  ingestion_rate_limit: 1000
  ingestion_max_batch_size: 10001

Basically, the bucket size needs to be large enough to accommodate the whole push of 10k spans (10000 + 1). In my case the rate limit itself isn't that important.
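You can see the same behavior with golang.org/x/time/rate directly. A standalone sketch (just the raw limiter, not Tempo's wiring) using the two configs above:

package main

import (
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	now := time.Now()

	// Original overrides: rate 100000 spans/s, bucket (burst) of 1000.
	// A single push of 10001 spans can never fit in a 1000-token bucket,
	// so it is rejected no matter how generous the rate is.
	before := rate.NewLimiter(rate.Limit(100000), 1000)
	fmt.Println(before.AllowN(now, 10001)) // false: n > burst

	// Fixed overrides: rate 1000 spans/s, bucket of 10001.
	// The bucket is now large enough to hold the whole push.
	after := rate.NewLimiter(rate.Limit(1000), 10001)
	fmt.Println(after.AllowN(now, 10001)) // true: the bucket starts full
}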

@joe-elliott (Member, Author)

I believe what you're describing is a slightly different issue. There are two different limits in play at the same time:

  • spans/second
  • spans/batch

You were hitting the spans/batch limit and, yes, the error message could definitely be improved.

I was hitting the spans/second limit due to bursts of spans. Under the hood Tempo uses https://godoc.org/golang.org/x/time/rate, and I believe this calculates the rate on a per-second basis. So even though we were well below the limit over a 15-second window, we were still getting rate limited by spikes in ingestion. I'm wondering if we should smooth the rate limit calculation out over several seconds to avoid this.
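A minimal sketch of what that smoothing could look like, assuming a simple trailing-window average (hypothetical only; this is not how Tempo or x/time/rate work today, and the 15-second window is just an example):

package limiter

import (
	"sync"
	"time"
)

type event struct {
	t time.Time
	n int
}

// windowedLimiter admits a push as long as the average spans/second over
// the trailing window stays at or under the configured limit.
type windowedLimiter struct {
	mu     sync.Mutex
	limit  float64       // allowed spans per second, averaged over the window
	window time.Duration // e.g. 15 * time.Second
	events []event
}

func (w *windowedLimiter) AllowN(now time.Time, n int) bool {
	w.mu.Lock()
	defer w.mu.Unlock()

	// Drop events that have fallen out of the window and sum the rest.
	cutoff, total, kept := now.Add(-w.window), 0, w.events[:0]
	for _, e := range w.events {
		if e.t.After(cutoff) {
			kept = append(kept, e)
			total += e.n
		}
	}
	w.events = kept

	// Reject only if admitting n would push the windowed average over the limit.
	if float64(total+n) > w.limit*w.window.Seconds() {
		return false
	}
	w.events = append(w.events, event{now, n})
	return true
}

With something like this, a one-second spike would only be rejected if it also blew the budget for the whole window, which is closer to how we reason about the configured limit.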

@calvernaz (Contributor)

Hmm, OK, maybe I can research a bit more. Both values feed the same token-bucket rate limiter.
The error message I got is the same as yours, and the configuration comments seem to point at what I described:

f.IntVar(&l.IngestionRateSpans, "distributor.ingestion-rate-limit", 100000, "Per-user ingestion rate limit in spans per second.")
f.IntVar(&l.IngestionMaxBatchSize, "distributor.ingestion-max-batch-size", 1000, "Per-user allowed ingestion max batch size (in number of spans).")

Both are used as input to the rate limiter I just described. This is where I was looking:

	now := time.Now()
	if !d.ingestionRateLimiter.AllowN(now, userID, spanCount) {
		// Return a 4xx here to have the client discard the data and not retry. If a client
		// is sending too much data consistently we will unlikely ever catch up otherwise.
		metricDiscardedSpans.WithLabelValues(rateLimited, userID).Add(float64(spanCount))

		return nil, status.Errorf(codes.ResourceExhausted, "ingestion rate limit (%d spans) exceeded while adding %d spans", int(d.ingestionRateLimiter.Limit(now, userID)), spanCount)
	}

Do you have an idea where that other rate-limiter lives?

@joe-elliott (Member, Author)

It's the same rate limiter. When you call AllowN it does both checks at once. spanCount is not allowed to be greater than IngestionMaxBatchSize; this corresponds to the limiter's "burst size":

https://godoc.org/golang.org/x/time/rate#Limiter.Burst

IngestionRateSpans corresponds to the limiter's actual per-second limit:

https://godoc.org/golang.org/x/time/rate#Limiter.Limit
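For illustration, a sketch of how those two values plausibly map onto a single limiter and how one AllowN call enforces both limits (the construction below is assumed, not copied from Tempo's code; the numbers are the flag defaults):

package main

import (
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	// IngestionRateSpans -> Limit (refill rate, spans/s)
	// IngestionMaxBatchSize -> Burst (bucket size)
	lim := rate.NewLimiter(rate.Limit(100000), 1000)

	now := time.Now()
	fmt.Println(lim.AllowN(now, 2000)) // false: batch check, spanCount > Burst
	fmt.Println(lim.AllowN(now, 800))  // true:  800 tokens taken from the bucket
	fmt.Println(lim.AllowN(now, 800))  // false: rate check, only 200 tokens left at this instant
}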
