
Sample ratio does not align with configured SampleRate during Stress Relief #1391

Closed
VinozzZ opened this issue Oct 18, 2024 · 1 comment · Fixed by #1433
Labels
type: bug Something isn't working
Milestone

Comments

@VinozzZ
Contributor

VinozzZ commented Oct 18, 2024

Description:

When Stress Relief mode is activated, a fraction of traffic is sampled through a deterministic sampler based on a wyhash of the trace ID. When observing the kept_from_stress and dropped_from_stress metrics, the ratio between the two does not always align with the SampleRate configured for stress relief.

Potential Cause
The test below shows that with a smaller iteration count n, the wyhash results can be less evenly distributed. This may explain why more traces are kept than the configured SampleRate would suggest.

```go
func TestWyhash(t *testing.T) {
	// hashSeed is the package-level seed constant used by the real sampler.
	n := 10000
	const frac = 100
	var upperBound uint64 = math.MaxUint64 / frac
	for run := 0; run < 10; run++ {
		// Include the run index so the subtest names don't collide.
		t.Run(fmt.Sprintf("frac=%d/run=%d", frac, run), func(t *testing.T) {
			count := 0
			for i := 0; i < n; i++ {
				traceID := fmt.Sprintf("%016x%016x", rand.Int63(), rand.Int63())
				hash := wyhash.Hash([]byte(traceID), hashSeed)
				if hash <= upperBound {
					count++
				}
			}
			// Expect the kept count to be within 10% of n/frac.
			assert.InDelta(t, count, n/frac, 0.1*float64(n/frac))
		})
	}
}
```
@VinozzZ VinozzZ added the type: bug Something isn't working label Oct 18, 2024
@kentquirk kentquirk added this to the v2.9 milestone Nov 7, 2024
@kentquirk
Contributor

kentquirk commented Nov 15, 2024

I've been analyzing this data and experimenting with different hash functions, and the problem turns out to be the often-surprising nature of sampling statistics.

The short version is: the larger the sample rate you're trying to achieve, the more samples it takes to land reliably close to it.

The other thing people forget to account for: the number of spans per trace usually varies.

I ran a test with several different hash algorithms (wyhash, murmur3, and sha1). There was a slight but consistent difference between them, with wyhash (the algorithm we use) yielding the best results, but the difference was minor.

The test generated random traceIDs, hashed them, and then decided to "keep" or "drop" them based on the value of the hash, the same way Refinery does, using a target sample rate of 100. It then calculated the actual achieved sample rate.

It did this for different numbers of samples, repeating each test 100 times and tracking the minimum, maximum, average, and standard deviation of the actual sample rate achieved in each test. The table below shows the results.

| sampleCount | minrate | maxrate | avgrate | stddev% |
|------------:|--------:|--------:|--------:|--------:|
| 500         | 31.25   | 500.00  | 127.86  | 83%     |
| 1000        | 45.45   | 333.33  | 108.62  | 37%     |
| 5000        | 71.43   | 156.25  | 101.54  | 14%     |
| 10000       | 78.74   | 144.93  | 100.72  | 10%     |
| 20000       | 83.68   | 129.03  | 101.05  | 7%      |
| 50000       | 88.34   | 116.82  | 99.98   | 4%      |

If you think of the sampleCount column as the number of samples in one granularity bucket of a Honeycomb query, you need something like 10000 samples per bucket before the graph looks even close to "smooth" when your sampleRate is 100.
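The keep/drop simulation described above can be sketched as follows. This is an assumed reconstruction, not the actual test harness: `hash/fnv` stands in for wyhash so the snippet is self-contained, and the function name `achievedRate` is illustrative.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math"
	"math/rand"
)

// achievedRate runs n keep/drop decisions at the given target rate and
// returns the sample rate actually achieved (traces seen / traces kept).
func achievedRate(n, targetRate int, seed int64) float64 {
	rng := rand.New(rand.NewSource(seed))
	upperBound := math.MaxUint64 / uint64(targetRate)
	kept := 0
	for i := 0; i < n; i++ {
		// Random 32-hex-character trace ID, as in the test above.
		traceID := fmt.Sprintf("%016x%016x", rng.Int63(), rng.Int63())
		h := fnv.New64a() // stand-in for wyhash; any uniform 64-bit hash behaves similarly
		h.Write([]byte(traceID))
		if h.Sum64() <= upperBound {
			kept++
		}
	}
	return float64(n) / float64(kept)
}

func main() {
	fmt.Printf("achieved rate: %.2f\n", achievedRate(50000, 100, 1))
}
```

The binomial model predicts this behavior: with keep probability p = 1/rate, the kept count has relative standard deviation sqrt((1-p)/(np)), which at rate 100 works out to about 10% for n = 10000 and about 4.5% for n = 50000, roughly matching the table.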

Now we need to take the second factor into account. This is subtle, but: in any collection of traces with a distribution of span counts, different trace IDs carry different weights when counting spans kept vs. dropped, and the number of distinct trace IDs is much smaller than the span count. So randomly selecting a subset of trace IDs gives skewed results, particularly at small numbers of traces.

The results below were from the same test as above, but now each trace represented from 1-20 spans in a bell curve around 11. Note how much less stable these results are (lower min, higher max, farther from the target):

| sampleCount | minrate | maxrate | avgrate | stddev% |
|------------:|--------:|--------:|--------:|--------:|
| 500         | 14.60   | 255.00  | 55.76   | 80%     |
| 1000        | 16.97   | 506.00  | 89.27   | 90%     |
| 5000        | 37.07   | 1254.00 | 144.98  | 135%    |
| 10000       | 45.32   | 556.00  | 118.38  | 59%     |
| 20000       | 58.32   | 285.73  | 107.48  | 29%     |
| 50000       | 61.20   | 195.32  | 103.03  | 17%     |
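The span-weighted variant can be sketched the same way. Again an assumed reconstruction: the 1-20 span bell curve is approximated as a sum of two uniform draws, and `hash/fnv` stands in for wyhash.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math"
	"math/rand"
)

// spanWeightedRate repeats the keep/drop experiment, but weights each trace
// by a span count drawn from a rough bell curve over 1..20 (mean ~10.5),
// and measures the achieved rate in spans rather than traces.
func spanWeightedRate(n, targetRate int, seed int64) float64 {
	rng := rand.New(rand.NewSource(seed))
	upperBound := math.MaxUint64 / uint64(targetRate)
	totalSpans, keptSpans := 0, 0
	for i := 0; i < n; i++ {
		traceID := fmt.Sprintf("%016x%016x", rng.Int63(), rng.Int63())
		spans := 1 + rng.Intn(10) + rng.Intn(11) // 1..20, bell-ish distribution
		totalSpans += spans
		h := fnv.New64a() // stand-in for wyhash
		h.Write([]byte(traceID))
		if h.Sum64() <= upperBound {
			keptSpans += spans
		}
	}
	return float64(totalSpans) / float64(keptSpans)
}

func main() {
	fmt.Printf("span-weighted achieved rate: %.2f\n", spanWeightedRate(50000, 100, 1))
}
```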

One last thing -- the current implementation of stress relief is not binary -- it sends only a fraction of traffic through the deterministic sampler, so the effective sample rate will be a blend of the normal and stress rates.
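As a hypothetical illustration of that blending (the names `f`, `stressRate`, and `normalRate` are illustrative, not Refinery config fields): the overall keep probability is a weighted average of the two samplers' keep probabilities, and the effective sample rate is its inverse.

```go
package main

import "fmt"

// effectiveRate computes the blended sample rate when fraction f of traffic
// goes through a deterministic sampler at stressRate and the remaining
// (1-f) goes through the normal sampler at normalRate.
func effectiveRate(f, stressRate, normalRate float64) float64 {
	keepProb := f/stressRate + (1-f)/normalRate
	return 1 / keepProb
}

func main() {
	// e.g. half the traffic at rate 100, half at rate 10
	fmt.Printf("%.2f\n", effectiveRate(0.5, 100, 10)) // prints 18.18
}
```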

In short, I think we're seeing "expected behavior" -- it's just that we didn't actually expect it until we did the math. Because sampling is hard, yo.

MikeGoldsmith added a commit that referenced this issue Nov 18, 2024
…ief activated (#1433)

## Which problem is this PR solving?

- fix: #1391 

## Short description of the changes

- remove logic for sending fraction of all traffic through normal
operation during stress

Co-authored-by: Mike Goldsmith <goldsmith.mike@gmail.com>