
[processor/tailsampling] probabilistic policy hash distribution not good enough #27044

Closed
jiekun opened this issue Sep 21, 2023 · 14 comments
Labels
bug (Something isn't working), processor/tailsampling (Tail sampling processor)

Comments

@jiekun
Member

jiekun commented Sep 21, 2023

Component(s)

No response

What happened?

Description

When testing the probabilistic policy with auto-incrementing trace IDs, the effective sampling rate does not match the expected probability.

Steps to Reproduce

The trace IDs used in probabilistic_test.go for the probabilistic policy are random:

// genRandomTraceIDs generates num fully random 128-bit trace IDs from a
// fixed-seed PRNG.
func genRandomTraceIDs(num int) (ids []pcommon.TraceID) {
	r := rand.New(rand.NewSource(1))
	ids = make([]pcommon.TraceID, 0, num)
	for i := 0; i < num; i++ {
		traceID := [16]byte{}
		binary.BigEndian.PutUint64(traceID[:8], r.Uint64())
		binary.BigEndian.PutUint64(traceID[8:], r.Uint64())
		ids = append(ids, pcommon.TraceID(traceID))
	}
	return ids
}

However, if we use special (auto-incrementing) trace IDs, most test cases fail:

var id atomic.Uint64

// genRandomTraceIDs is modified to return auto-incrementing trace IDs instead
// of random ones: the low 64 bits come from a global counter, the high 64 bits
// are zero.
func genRandomTraceIDs(num int) (ids []pcommon.TraceID) {
	ids = make([]pcommon.TraceID, 0, num)
	for i := 0; i < num; i++ {
		r := id.Add(1)
		tid := idutils.UInt64ToTraceID(0, r)
		ids = append(ids, tid)
	}
	return ids
}

Clearly, the hash result is strongly correlated with the original input in this case. The policy works well when the trace IDs are fully random, but fails when a special trace ID pattern is used.

It is still unclear whether other (auto-incrementing or otherwise patterned) trace IDs are affected. But I think we need to discuss whether we should keep this algorithm or switch to another one.
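
For illustration, here is a minimal standalone sketch of the effect (the FNV-1a hash, the salt string, and the threshold math are assumptions used purely for illustration; the processor's actual code may differ):

package main

import (
	"encoding/binary"
	"fmt"
	"hash/fnv"
	"math"
)

// hashTraceID is an illustrative stand-in for the policy's hash: FNV-1a over
// a salt followed by the 16 trace ID bytes. The salt and hash choice are assumptions.
func hashTraceID(salt string, traceID [16]byte) uint64 {
	h := fnv.New64a()
	h.Write([]byte(salt))
	h.Write(traceID[:])
	return h.Sum64()
}

func main() {
	percentage := 25.0
	// Threshold-style decision: sample when hash < percentage of the uint64 range.
	threshold := uint64(percentage / 100.0 * float64(math.MaxUint64))

	const n = 100_000
	sampled := 0
	for i := uint64(1); i <= n; i++ {
		var traceID [16]byte
		binary.BigEndian.PutUint64(traceID[8:], i) // auto-incrementing low 64 bits
		if hashTraceID("example-salt", traceID) < threshold {
			sampled++
		}
	}
	// The gap between this number and the configured 25% illustrates the issue;
	// the exact deviation depends on the salt and on the ID pattern.
	fmt.Printf("effective sampling: %.2f%% (expected %.2f%%)\n",
		100*float64(sampled)/n, percentage)
}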

Pinging the original author @kvrhdn as well (#3876).

Expected Result

=== RUN   TestProbabilisticSampling
--- PASS: TestProbabilisticSampling (2.98s)
=== RUN   TestProbabilisticSampling/100%
    --- PASS: TestProbabilisticSampling/100% (0.41s)
=== RUN   TestProbabilisticSampling/0%
    --- PASS: TestProbabilisticSampling/0% (0.43s)
=== RUN   TestProbabilisticSampling/25%
    --- PASS: TestProbabilisticSampling/25% (0.42s)
=== RUN   TestProbabilisticSampling/33%
    --- PASS: TestProbabilisticSampling/33% (0.43s)
=== RUN   TestProbabilisticSampling/33%_-_custom_salt
    --- PASS: TestProbabilisticSampling/33%_-_custom_salt (0.45s)
=== RUN   TestProbabilisticSampling/-%50
    --- PASS: TestProbabilisticSampling/-%50 (0.43s)
=== RUN   TestProbabilisticSampling/150%
    --- PASS: TestProbabilisticSampling/150% (0.40s)
PASS

Actual Result

=== RUN   TestProbabilisticSampling
--- FAIL: TestProbabilisticSampling (3.22s)
=== RUN   TestProbabilisticSampling/100%
    --- PASS: TestProbabilisticSampling/100% (0.46s)
=== RUN   TestProbabilisticSampling/0%
    --- PASS: TestProbabilisticSampling/0% (0.46s)
=== RUN   TestProbabilisticSampling/25%
    probabilistic_test.go:86: 
        	Error Trace:	/home/jiekun/repo/github.com/opentelemetry-collector-contrib/processor/tailsamplingprocessor/internal/sampling/probabilistic_test.go:86
        	Error:      	Max difference between 25 and 50.963997 allowed is 0.2, but difference was -25.96399688720703
        	Test:       	TestProbabilisticSampling/25%
        	Messages:   	Effective sampling percentage is 50.963997, expected 25.000000
    --- FAIL: TestProbabilisticSampling/25% (0.47s)

=== RUN   TestProbabilisticSampling/33%
    probabilistic_test.go:86: 
        	Error Trace:	/home/jiekun/repo/github.com/opentelemetry-collector-contrib/processor/tailsamplingprocessor/internal/sampling/probabilistic_test.go:86
        	Error:      	Max difference between 33 and 14.571799 allowed is 0.2, but difference was 18.428200721740723
        	Test:       	TestProbabilisticSampling/33%
        	Messages:   	Effective sampling percentage is 14.571799, expected 33.000000
    --- FAIL: TestProbabilisticSampling/33% (0.46s)

=== RUN   TestProbabilisticSampling/33%_-_custom_salt
    probabilistic_test.go:86: 
        	Error Trace:	/home/jiekun/repo/github.com/opentelemetry-collector-contrib/processor/tailsamplingprocessor/internal/sampling/probabilistic_test.go:86
        	Error:      	Max difference between 33 and 0 allowed is 0.2, but difference was 33
        	Test:       	TestProbabilisticSampling/33%_-_custom_salt
        	Messages:   	Effective sampling percentage is 0.000000, expected 33.000000
    --- FAIL: TestProbabilisticSampling/33%_-_custom_salt (0.47s)

=== RUN   TestProbabilisticSampling/-%50
    --- PASS: TestProbabilisticSampling/-%50 (0.46s)
=== RUN   TestProbabilisticSampling/150%
    --- PASS: TestProbabilisticSampling/150% (0.45s)

TraceID examples

Format: TraceID ([]byte) -> Hash Result (uint64)

[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1]     16254957729900898954
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 2]     16254956630389270743
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 3]     16254955530877642532
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 4]     16254954431366014321
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 5]     16254953331854386110
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 6]     16254952232342757899
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 7]     16254951132831129688
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 8]     16254950033319501477
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 9]     16254948933807873266
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 10]    16254947834296245055
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 11]    16254946734784616844
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 12]    16254945635272988633
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 13]    16254944535761360422
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 14]    16254943436249732211
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 15]    16254942336738104000
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 16]    16254976421598578541
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 17]    16254975322086950330
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 18]    16254974222575322119
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 19]    16254973123063693908
[0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 20]    16254972023552065697

Collector version

since #3876

Environment information

No response

OpenTelemetry Collector configuration

No response

Log output

No response

Additional context

No response

@jiekun added the bug (Something isn't working) and needs triage (New item requiring triage) labels on Sep 21, 2023
@github-actions bot added the processor/tailsampling (Tail sampling processor) label on Sep 21, 2023
@github-actions
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@jpkrohling
Member

@jiekun, do you have a proposal to improve the current one?

@jiekun
Member Author

jiekun commented Oct 5, 2023

@jpkrohling I am thinking about using:

value := currentHash(traceID)

// and then, comparing against the configured probability scaled to the same
// 0-10000 range:
if value%10000 < probability {
	// Sampled
} else {
	// NotSampled
}

But I think it's important to understand the limitations of (and the reasoning behind) the previous design.

@jpkrohling
Member

I don't think that's going to cut it either. We should probably wait for open-telemetry/oteps#235. I think @jmacd is keeping track of this and will work on the collector side of it.

@jiekun
Member Author

jiekun commented Oct 5, 2023

Thanks for mentioning that. Let me discuss this with @jmacd in Slack later. I think that sampling OTEP does help with this issue. It should be easy to reuse some of its code and align with the OTEP implementation.

Next update ETA: before Oct 14th.

@Frapschen removed the needs triage (New item requiring triage) label on Oct 16, 2023
@jpkrohling
Member

Do you have an update on this, @jiekun?

@jiekun
Member Author

jiekun commented Oct 24, 2023

Apologies for the delay; I am still tracking this issue and will have an update soon. @jpkrohling

@jmacd
Contributor

jmacd commented Oct 26, 2023

I am not sure why this "auto-incrementing" trace ID policy would be used, ever. What is the use-case for auto-incrementing trace IDs?

@jmacd
Contributor

jmacd commented Oct 26, 2023

In the sampling SIG we often refer to https://github.com/rurban/smhasher, an extensive test suite for hash functions which shows, unfortunately, that most of them fail at the task you're setting up for them when you use auto-incrementing IDs.

I also want to correct the statement I made -- I know about uses of auto-incrementing TraceIDs -- there is a performance problem when TraceIDs require so many bits of randomness. That is why the above-referenced specification (working with W3C) has settled on only 56 bits of randomness. I've seen cases where thread-local TraceID generators are used that would auto-increment let's say 12 bits while using 44 bits of randomness. You could experiment with that approach and see how far you can take it.
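
A sketch of the kind of generator described above, purely to illustrate the bit budget (the 12/44 split, the field layout, and all names are illustrative assumptions, not a specification):

package main

import (
	"encoding/binary"
	"fmt"
	"math/rand"
)

// nextTraceID sketches a per-worker generator of the kind described above:
// the low 56 bits (the part the W3C proposal treats as random) are split into
// 12 auto-incrementing bits and 44 random bits.
func nextTraceID(r *rand.Rand, counter *uint64) [16]byte {
	*counter++
	seq := *counter & 0xFFF              // 12 auto-incrementing bits
	random := r.Uint64() & 0xFFFFFFFFFFF // 44 random bits
	low56 := seq<<44 | random            // 56 "random-ish" bits in total

	var id [16]byte
	binary.BigEndian.PutUint64(id[:8], r.Uint64()) // high 64 bits: anything
	binary.BigEndian.PutUint64(id[8:], low56)
	return id
}

func main() {
	r := rand.New(rand.NewSource(1))
	var counter uint64
	for i := 0; i < 3; i++ {
		fmt.Printf("%x\n", nextTraceID(r, &counter))
	}
}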

Also, one could argue that the tests here are the problem: they're not good enough. The sampler is definitely outside of its error bounds, so to speak, but the test bakes in hard-coded assumptions: it relies on the randomness from rand.New(rand.NewSource(1)) in addition to the performance of the hash function. You could relax the error bounds, given less randomness, and make those tests pass. It's difficult to deterministically test random algorithms; that is a problem of its own nature. One approach I can point at is https://github.com/open-telemetry/opentelemetry-go-contrib/blob/main/samplers/probability/consistent/statistical_test.go, which is also specified in https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/tracestate-probability-sampling.md#appendix-statistical-test-requirements.
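
As a rough illustration of what relaxing the error bounds could mean quantitatively, here is a back-of-the-envelope binomial bound (a sketch only, not the statistical procedure defined in the linked specification):

package main

import (
	"fmt"
	"math"
)

// toleranceInPercentagePoints returns a k-standard-deviation bound on the
// observed sampling percentage for n independent trials at probability p.
func toleranceInPercentagePoints(n int, p, k float64) float64 {
	stddev := math.Sqrt(p * (1 - p) / float64(n)) // std dev of the sample proportion
	return k * stddev * 100
}

func main() {
	// e.g. 100000 traces at 25%: a 3-sigma bound is about 0.41 percentage
	// points; compare with the fixed 0.2 tolerance in the failing test above.
	fmt.Printf("%.2f\n", toleranceInPercentagePoints(100000, 0.25, 3))
}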

Thanks. I suggest closing this issue, but I'll leave it for the original author to decide.

@jiekun
Member Author

jiekun commented Oct 27, 2023

@jmacd Hi, thanks for your comment! I feel like I need to clarify a few things.

  1. I agree that auto-incrementing trace IDs are unlikely to be used in production.
  2. In the original Slack discussion (https://cloud-native.slack.com/archives/C01N6P7KR6W/p1695267233079099), we received user feedback stating that the number of traces retained by the processor does not match the configured probability (that user is not using auto-incrementing trace IDs).

So what I'm questioning is the design of the current implementation:

  1. Calculate a threshold from the configured probability, e.g. 50% -> 123456789.
  2. When a trace comes in, calculate hash(trace_id), e.g. 234567.
  3. If 234567 < 123456789, return Sampled; otherwise return NotSampled.

I think the hash algorithm is fine, but what is the best practice after calculating the hash value? For example (options 1 and 2 are sketched side by side below):

  1. hash value % 10000 < sampling rate * 10000?
  2. hash value < pre-calculated threshold? (This is the current implementation.)
  3. Something else entirely?
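
To make the question concrete, here are options 1 and 2 side by side as a sketch (the function names, the 10000-bucket scale, and the threshold math are illustrative assumptions):

package main

import (
	"fmt"
	"math"
)

// Option 2 (the current scheme as described above): compare the full 64-bit
// hash against a threshold pre-computed from the configured percentage.
func sampledByThreshold(hash uint64, percentage float64) bool {
	return hash < uint64(percentage/100.0*float64(math.MaxUint64))
}

// Option 1: map the hash into 10000 buckets and compare the bucket index
// against the scaled percentage (25% -> 2500).
func sampledByBucket(hash uint64, percentage float64) bool {
	return hash%10000 < uint64(percentage*100)
}

func main() {
	hash := uint64(16254957729900898954) // first hash value from the table above
	fmt.Println(sampledByThreshold(hash, 25), sampledByBucket(hash, 25))
}

Neither rule fixes a poorly mixed hash by itself; if the hash values for sequential IDs are biased, both decisions can deviate from the configured rate.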

If we agree that the implementation is fine, I can close this issue and leave it as is. Thanks for all your points of view.

@jiekun
Member Author

jiekun commented Oct 27, 2023

A few additional points to consider:

  • If we simply use "hash(incremental ID) % 10000 < configured probability", the sampled results should at least be relatively close to the configured probability.
  • With "hash(incremental ID) < threshold", incremental IDs result in either 0% or 100% sampling probability. This seems highly unreasonable because it goes to an extreme.

For the current implementation, I think the acceptable behaviour is:

  • When using auto-incrementing trace IDs (or any other special ID pattern), sampling should still fit within the relaxed error bounds, at least not collapsing to 0% or 100%.

@jpkrohling
Member

jpkrohling commented Nov 1, 2023

we have received user feedback stating that the number of traces retained by the processor does not match the configured probability

I'd be interested in seeing the metrics showing that this is the case. Small variations on big workloads are expected. For small workloads, a higher variance is expected.

@jiekun
Member Author

jiekun commented Nov 2, 2023

we have received user feedback stating that the number of traces retained by the processor does not match the configured probability

I'd be interested in seeing the metrics showing that this is the case. Small variations on big workloads are expected. For small workloads, a higher variance is expected.

I have some further feedback from @jmacd via Slack. I would like to move this issue to a pending status and see if we get further feedback from the original user. If not, I will close it later.

@jiekun
Member Author

jiekun commented Nov 7, 2023

I'm closing this issue now. But if someone comes up with a new issue or idea (and, more importantly, examples and metrics), feel free to reopen and discuss it here.
