
Sample ratio does not align with configured SampleRate during Stress Relief #1391

Closed
VinozzZ opened this issue Oct 18, 2024 · 1 comment · Fixed by #1433
Labels
type: bug Something isn't working
Milestone

Comments

@VinozzZ
Contributor

VinozzZ commented Oct 18, 2024

Description:

When Stress Relief mode is activated, a fraction of traffic is sampled through a deterministic sampler based on a wyhash of the trace ID. When observing the kept_from_stress and dropped_from_stress metrics, the ratio between the two does not always align with the SampleRate configured for stress relief.

Potential Cause
The test below shows that with a smaller iteration count n, the wyhash results can be less evenly distributed. This may explain why more traces are kept than the configured SampleRate would suggest.

```go
func TestWyhash(t *testing.T) {
	// hashSeed is the package-level seed constant used by the real sampler.
	n := 10000
	const frac = 100
	var upperBound uint64 = math.MaxUint64 / frac
	for run := 0; run < 10; run++ {
		// Include the run index so the subtest names don't collide.
		t.Run(fmt.Sprintf("frac=%d/run=%d", frac, run), func(t *testing.T) {
			count := 0
			for i := 0; i < n; i++ {
				traceID := fmt.Sprintf("%016x%016x", rand.Int63(), rand.Int63())
				hash := wyhash.Hash([]byte(traceID), hashSeed)
				if hash <= upperBound {
					count++
				}
			}
			// Expect the kept count to be within 10% of n/frac.
			assert.InDelta(t, count, n/frac, 0.1*float64(n/frac))
		})
	}
}
```
@VinozzZ VinozzZ added the type: bug Something isn't working label Oct 18, 2024
@kentquirk kentquirk added this to the v2.9 milestone Nov 7, 2024
@kentquirk
Contributor

kentquirk commented Nov 15, 2024

I've been analyzing this data and experimenting with different hash functions, and the problem turns out to be the often-surprising nature of sampling statistics.

The short version is: the larger the sample rate you're trying to achieve, the more samples it takes to land reliably close to it.

The other thing people forget to account for: the number of spans per trace usually varies.

I ran a test with several different hash algorithms (wyhash, murmur3, and sha1). There was a slight but consistent difference between them, with wyhash (the algorithm we use) yielding the best results, but the difference was minor.

The test generated random traceIDs, hashed them, and then decided to "keep" or "drop" them based on the value of the hash, the same way Refinery does, using a target sample rate of 100. It then calculated the actual achieved sample rate.

It did this for different numbers of samples, repeating each test 100 times and tracking the minimum, maximum, average, and standard deviation of the actual sample rate achieved in each test. The table below shows the results.

| sampleCount | minrate | maxrate | avgrate | stddev% |
|------------:|--------:|--------:|--------:|--------:|
| 500         | 31.25   | 500.00  | 127.86  | 83%     |
| 1000        | 45.45   | 333.33  | 108.62  | 37%     |
| 5000        | 71.43   | 156.25  | 101.54  | 14%     |
| 10000       | 78.74   | 144.93  | 100.72  | 10%     |
| 20000       | 83.68   | 129.03  | 101.05  | 7%      |
| 50000       | 88.34   | 116.82  | 99.98   | 4%      |

If you think of the sampleCount column as the number of samples in one granularity bucket of a Honeycomb query, you need something like 10000 samples per bucket before the graph looks even close to "smooth" when your sampleRate is 100.
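The keep/drop simulation described above can be sketched as follows. This is an assumed reconstruction, not the actual test harness: `hash/fnv` stands in for wyhash so the snippet is self-contained, and the function name `achievedRate` is illustrative.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math"
	"math/rand"
)

// achievedRate runs n keep/drop decisions at the given target rate and
// returns the sample rate actually achieved (traces seen / traces kept).
func achievedRate(n, targetRate int, seed int64) float64 {
	rng := rand.New(rand.NewSource(seed))
	upperBound := math.MaxUint64 / uint64(targetRate)
	kept := 0
	for i := 0; i < n; i++ {
		// Random 32-hex-character trace ID, as in the test above.
		traceID := fmt.Sprintf("%016x%016x", rng.Int63(), rng.Int63())
		h := fnv.New64a() // stand-in for wyhash; any uniform 64-bit hash behaves similarly
		h.Write([]byte(traceID))
		if h.Sum64() <= upperBound {
			kept++
		}
	}
	return float64(n) / float64(kept)
}

func main() {
	fmt.Printf("achieved rate: %.2f\n", achievedRate(50000, 100, 1))
}
```

The binomial model predicts this behavior: with keep probability p = 1/rate, the kept count has relative standard deviation sqrt((1-p)/(np)), which at rate 100 works out to about 10% for n = 10000 and about 4.5% for n = 50000, roughly matching the table.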

Now we need to take the second factor into account. This is subtle, but: in any collection of traces with a distribution of span counts, different trace IDs carry different weights when counting spans kept vs. dropped, and the number of distinct trace IDs is much smaller than the span count. So randomly selecting a subset of trace IDs gives skewed results, particularly at small numbers of traces.

The results below were from the same test as above, but now each trace represented from 1-20 spans in a bell curve around 11. Note how much less stable these results are (lower min, higher max, farther from the target):

| sampleCount | minrate | maxrate | avgrate | stddev% |
|------------:|--------:|--------:|--------:|--------:|
| 500         | 14.60   | 255.00  | 55.76   | 80%     |
| 1000        | 16.97   | 506.00  | 89.27   | 90%     |
| 5000        | 37.07   | 1254.00 | 144.98  | 135%    |
| 10000       | 45.32   | 556.00  | 118.38  | 59%     |
| 20000       | 58.32   | 285.73  | 107.48  | 29%     |
| 50000       | 61.20   | 195.32  | 103.03  | 17%     |
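The span-weighted variant can be sketched the same way. Again an assumed reconstruction: the 1-20 span bell curve is approximated as a sum of two uniform draws, and `hash/fnv` stands in for wyhash.

```go
package main

import (
	"fmt"
	"hash/fnv"
	"math"
	"math/rand"
)

// spanWeightedRate repeats the keep/drop experiment, but weights each trace
// by a span count drawn from a rough bell curve over 1..20 (mean ~10.5),
// and measures the achieved rate in spans rather than traces.
func spanWeightedRate(n, targetRate int, seed int64) float64 {
	rng := rand.New(rand.NewSource(seed))
	upperBound := math.MaxUint64 / uint64(targetRate)
	totalSpans, keptSpans := 0, 0
	for i := 0; i < n; i++ {
		traceID := fmt.Sprintf("%016x%016x", rng.Int63(), rng.Int63())
		spans := 1 + rng.Intn(10) + rng.Intn(11) // 1..20, bell-ish distribution
		totalSpans += spans
		h := fnv.New64a() // stand-in for wyhash
		h.Write([]byte(traceID))
		if h.Sum64() <= upperBound {
			keptSpans += spans
		}
	}
	return float64(totalSpans) / float64(keptSpans)
}

func main() {
	fmt.Printf("span-weighted achieved rate: %.2f\n", spanWeightedRate(50000, 100, 1))
}
```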

One last thing -- the current implementation of stress relief is not binary -- it sends only a fraction of traffic through the deterministic sampler, so the effective sample rate will be a blend of the normal and stress rates.
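As a hypothetical illustration of that blending (the names `f`, `stressRate`, and `normalRate` are illustrative, not Refinery config fields): the overall keep probability is a weighted average of the two samplers' keep probabilities, and the effective sample rate is its inverse.

```go
package main

import "fmt"

// effectiveRate computes the blended sample rate when fraction f of traffic
// goes through a deterministic sampler at stressRate and the remaining
// (1-f) goes through the normal sampler at normalRate.
func effectiveRate(f, stressRate, normalRate float64) float64 {
	keepProb := f/stressRate + (1-f)/normalRate
	return 1 / keepProb
}

func main() {
	// e.g. half the traffic at rate 100, half at rate 10
	fmt.Printf("%.2f\n", effectiveRate(0.5, 100, 10)) // prints 18.18
}
```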

In short, I think we're seeing "expected behavior" -- it's just that we didn't actually expect it until we did the math. Because sampling is hard, yo.

MikeGoldsmith added a commit that referenced this issue Nov 18, 2024
…ief activated (#1433)

## Which problem is this PR solving?

- fix: #1391 

## Short description of the changes

- remove logic for sending fraction of all traffic through normal
operation during stress

Co-authored-by: Mike Goldsmith <goldsmith.mike@gmail.com>