Improve Span Rate Limiting #264

Closed
joe-elliott opened this issue Oct 26, 2020 · 4 comments
@joe-elliott (Member)

Is your feature request related to a problem? Please describe.
The span rate limiter in the distributor correctly rejects spans when they cross the rate limit threshold. The Grafana Agent or OpenTelemetry Collector will then batch and resend them. The problem is that the rate limit regularly trips on spiky load even though the rate, averaged over a minute or so, is well below the configured threshold.

The result is that spans are unnecessarily sent multiple times.

Describe the solution you'd like
Investigate tracking the spans/second limit over a longer time frame.

Describe alternatives you've considered

Additional context
The limit in this environment is 200k spans/second, but spans are still refused even though the average spans/second, calculated over a minute-plus window, is well below that limit.


Distributors are logging:

err="rpc error: code = ResourceExhausted desc = ingestion rate limit (25000 spans) exceeded while adding 295 spans"

The 25k span rate limit comes from 200k / 8 distributors.
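For illustration, here's a standalone sketch using a plain token bucket from golang.org/x/time/rate (the 25000 burst value and the traffic shape are assumptions for this sketch, not our actual configuration). It shows how a spike inside a single second gets refused even though the per-minute average is far below the limit:

package main

import (
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	// One distributor's share of the limit: 25000 spans/s.
	// The burst value of 25000 is an assumption for this sketch.
	lim := rate.NewLimiter(rate.Limit(25000), 25000)

	// Spiky producer: 200 batches of 500 spans land in the first second,
	// then nothing for the rest of the minute. That is 100000 spans per
	// minute, an average of roughly 1700 spans/s, far under the 25000/s limit.
	start := time.Now()
	rejected := 0
	for i := 0; i < 200; i++ {
		at := start.Add(time.Duration(i) * 5 * time.Millisecond)
		if !lim.AllowN(at, 500) {
			rejected++
		}
	}
	fmt.Printf("%d of 200 batches rejected within the hot second\n", rejected)
}

Roughly half of the batches in the hot second get rejected, even though the minute-long average never comes close to the limit.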

@calvernaz (Contributor)

I was hitting this same error while investigating the memory leak issue in the compactors. It was preventing me from creating a trace with 100k+ spans.

Even though the rate-limiting strategy was local, not global, I got the same confusing message with the configuration below:

overrides:
  ingestion_rate_limit: 100000
  ingestion_max_batch_size: 1000

and the resulting message after trying to push 10k spans:

err="rpc error: code = ResourceExhausted desc = ingestion rate limit (100000 spans) exceeded while adding 10001 spans"

After some debugging, I fixed the configuration and stopped getting this problem:

overrides:
  ingestion_rate_limit: 1000
  ingestion_max_batch_size: 10001

Basically, the bucket size needs to be large enough to accommodate the whole push of 10k spans (10000 + 1). In my case the rate limit itself isn't that important.
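You can see the same behavior with golang.org/x/time/rate directly. A standalone sketch (just the raw limiter, not Tempo's wiring) using the two configs above:

package main

import (
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	now := time.Now()

	// Original overrides: rate 100000 spans/s, bucket (burst) of 1000.
	// A single push of 10001 spans can never fit in a 1000-token bucket,
	// so it is rejected no matter how generous the rate is.
	before := rate.NewLimiter(rate.Limit(100000), 1000)
	fmt.Println(before.AllowN(now, 10001)) // false: n > burst

	// Fixed overrides: rate 1000 spans/s, bucket of 10001.
	// The bucket is now large enough to hold the whole push.
	after := rate.NewLimiter(rate.Limit(1000), 10001)
	fmt.Println(after.AllowN(now, 10001)) // true: the bucket starts full
}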

@joe-elliott (Member, Author)

I believe what you're describing is a slightly different issue. There are two different limits in play at the same time:

  • spans/second
  • spans/batch

You were hitting the spans/batch limit and, yes, the error message could definitely be improved.

I was hitting the spans/second limit due to bursts of spans. Under the hood Tempo uses https://godoc.org/golang.org/x/time/rate, and I believe this calculates the rate on a per-second basis. So even though we were well below the limit over a 15-second window, we were still getting rate limited by spikes in ingestion. I'm wondering if we should smooth the rate limit calculation out over several seconds to avoid this.
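A minimal sketch of what that smoothing could look like, assuming a simple trailing-window average (hypothetical only; this is not how Tempo or x/time/rate work today, and the 15-second window is just an example):

package limiter

import (
	"sync"
	"time"
)

type event struct {
	t time.Time
	n int
}

// windowedLimiter admits a push as long as the average spans/second over
// the trailing window stays at or under the configured limit.
type windowedLimiter struct {
	mu     sync.Mutex
	limit  float64       // allowed spans per second, averaged over the window
	window time.Duration // e.g. 15 * time.Second
	events []event
}

func (w *windowedLimiter) AllowN(now time.Time, n int) bool {
	w.mu.Lock()
	defer w.mu.Unlock()

	// Drop events that have fallen out of the window and sum the rest.
	cutoff, total, kept := now.Add(-w.window), 0, w.events[:0]
	for _, e := range w.events {
		if e.t.After(cutoff) {
			kept = append(kept, e)
			total += e.n
		}
	}
	w.events = kept

	// Reject only if admitting n would push the windowed average over the limit.
	if float64(total+n) > w.limit*w.window.Seconds() {
		return false
	}
	w.events = append(w.events, event{now, n})
	return true
}

With something like this, a one-second spike would only be rejected if it also blew the budget for the whole window, which is closer to how we reason about the configured limit.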

@calvernaz (Contributor)

Hmm, OK, maybe I can research a bit more. Both values feed the same token-bucket rate limiter.
The error message I got is the same as yours, and the configuration comments seem to point at what I described:

f.IntVar(&l.IngestionRateSpans, "distributor.ingestion-rate-limit", 100000, "Per-user ingestion rate limit in spans per second.")
f.IntVar(&l.IngestionMaxBatchSize, "distributor.ingestion-max-batch-size", 1000, "Per-user allowed ingestion max batch size (in number of spans).")

Both are used as input to the rate limiter I just described. This is where I was looking:

	now := time.Now()
	if !d.ingestionRateLimiter.AllowN(now, userID, spanCount) {
		// Return a 4xx here to have the client discard the data and not retry. If a client
		// is sending too much data consistently we will unlikely ever catch up otherwise.
		metricDiscardedSpans.WithLabelValues(rateLimited, userID).Add(float64(spanCount))

		return nil, status.Errorf(codes.ResourceExhausted, "ingestion rate limit (%d spans) exceeded while adding %d spans", int(d.ingestionRateLimiter.Limit(now, userID)), spanCount)
	}

Do you have an idea where that other rate-limiter lives?

@joe-elliott (Member, Author)

It's the same rate limiter. When you call AllowN it does both checks at once. spanCount is not allowed to be greater than IngestionMaxBatchSize; this corresponds to the limiter's "burst size":

https://godoc.org/golang.org/x/time/rate#Limiter.Burst

IngestionRateSpans corresponds to the limiter's actual per-second limit:

https://godoc.org/golang.org/x/time/rate#Limiter.Limit
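For illustration, a sketch of how those two values plausibly map onto a single limiter and how one AllowN call enforces both limits (the construction below is assumed, not copied from Tempo's code; the numbers are the flag defaults):

package main

import (
	"fmt"
	"time"

	"golang.org/x/time/rate"
)

func main() {
	// IngestionRateSpans -> Limit (refill rate, spans/s)
	// IngestionMaxBatchSize -> Burst (bucket size)
	lim := rate.NewLimiter(rate.Limit(100000), 1000)

	now := time.Now()
	fmt.Println(lim.AllowN(now, 2000)) // false: batch check, spanCount > Burst
	fmt.Println(lim.AllowN(now, 800))  // true:  800 tokens taken from the bucket
	fmt.Println(lim.AllowN(now, 800))  // false: rate check, only 200 tokens left at this instant
}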
