Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Specify how to propagate consistent head sampling probability #168

Merged
merged 46 commits into from
Sep 29, 2021

Conversation

jmacd
Copy link
Contributor

@jmacd jmacd commented Jul 23, 2021

This OTEP specifies how to propagate head trace sampling probability, so that child spans can be counted on the fly. This specification would allow ParentBased samplers to output sampling probabilities in a way that supports span-to-metrics pipelines, for example.

This OTEP is paired with #170 which discusses additions to the span data model.

@oertl
Copy link
Contributor

oertl commented Jul 28, 2021

@jmacd

This makes sense. It makes me wonder, though, if we have a way to generate the correct randomness, what's really keeping us from using random bits in TraceID? If we have random bits in the TraceID, the number-of-leading-zeros test can be applied directly, right?

Yes, if the trace ID is specified to be truly random, it would be possible to use it without hashing.

...and I gather that the answer is -- it requires less randomness with your proposal. In the version where we add a single byte to convey randomness, the expected number of random bits required is 2 in this case (for the reader, I believe we're talking about the mean of a https://en.wikipedia.org/wiki/Geometric_distribution with p=0.5). Can you confirm?

Yes, the given pseudocode generates a geometrically distributed random number with p=0.5 and takes just 2 random bits on average.

What I was trying to say is that all samplers (AlwaysOn, AlwaysOff, Parent, and TraceIDRatio) can be implemented using the same logic as the TraceIDRatio sampler, provided that the sampling rate of the parent is propagated. They only differ how the sampling rate is chosen for each span. For AlwaysOn it is 1, for AlwaysOff it is 0, for Parent it is the parent sampling rate, and for TraceIDRatio it is the fixed sampling rate. One could therefore say that a sampler should only be responsible for the selection of the sampling rate. The sampling decision is then always made according to the trace ID ratio logic.

To better reflect that in the interface, I would even redefine the sampler interface which is currently a function mapping to a boolean indicating the sampling decision. If it is defined as a function that maps each individual span to the corresponding sampling rate, the sampler implementations would be very simple. For example, TraceIDRatio would just return its fixed sampling rate and Parent would return the sampling rate of its parent. The final sampling decision would be made outside of the sampler following a specified and fixed logic based on the shared random number (or trace ID) and the sampling rate only.

The proposed redefined sampler interface would have following benefits:

  • Any sampling decision will be consistent, because the same logic is used to get a sampling decision based on the sampling rate and the shared random number (or trace ID).
  • The sampling rate and therefore the adjusted count is known for all spans as the interface enforces returning a sampling rate for every span.
  • For estimation and further processing it is sufficient to know the sampling rate. It is not necessary to know which sampler was used to determine the sampling rate. What matters is just the sampling rate. This makes recording the sampler name (which would introduce additional overhead and which was mentioned here Probability sampling basics for telemetry events #148 (comment)) obsolete.
  • Furthermore, new samplers which, for example, choose sampling rates more dynamically than TraceIDRatio based on span attributes, are easy to implement and would not break estimation algorithms.

@jmacd
Copy link
Contributor Author

jmacd commented Jul 28, 2021

@oertl Thank you. I definitely understand your proposal. You've described a simplification in the Sampler interface, which is an SDK specification topic that I was hoping to avoid. I agree with your summary, the overall Sampler outcome would be easier to reason about with the change you suggest, but I am not sure that legacy users of the API will agree. The amount of energy it will take to make this happen may exceed the benefit produced.

I am trying to keep these OTEPs separate:

#170 is about the data model for Span (formerly known as #148)
#168 is about uses of the W3C trace context (this PR)

Changing the Sampler API would be another issue entirely.

I will revise this PR based on the discussion in #168 (comment), which means combining the two pieces of additional information that go into the traceparent to both include the head probability and the randomly distributed value, as they are both apparently needed to complete a functional span-to-metrics pipeline.

@yurishkuro
Copy link
Member

yurishkuro commented Jul 28, 2021

Personally I would prefer to separate the propagation proposal from the W3C format discussion. Whatever happens with this OTEP is not going to have any (formal) impact on changing the W3C spec, the discussion would need to happen there. On the other hand, the actual functionality is not dependent on changing the W3C format because it can be achieved by using tracestate, which would be completely in the hands of OTEL & if adopted & successful would serve as a strong motivation for upgrading the traceparent spec.

@jmacd
Copy link
Contributor Author

jmacd commented Jul 28, 2021

@oertl and @yurishkuro

This makes recording the sampler name (which would introduce additional overhead and which was mentioned here -- #148) obsolete.

In my proposal, in the text of #170 (now), the sampler.name attribute is only a MUST when the adjusted count is unknown, which would only result from legacy trace contexts (if we succeed). I see no reason to record the sampler.name otherwise, especially if as you suggest the TraceIDRatio decision is used consistently.

@jmacd jmacd changed the title Specify how to propagate head sampling probability Specify how to propagate consistent head sampling probability Jul 28, 2021
@jmacd
Copy link
Contributor Author

jmacd commented Jul 28, 2021

@yurishkuro Thank you for your guidance. I have revised this proposal only to use tracestate.

@oertl I have revised this to include your proposal with a syntax like:

tracestate: otelprob=PPRR

where PP is the base16 probability value and RR is the base16 randomness value. Please take another look.

Copy link
Member

@bogdandrutu bogdandrutu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One big question that I don't get the answer from this OTEP is if this is a required feature. I feel that you need this to be required to work (especially when the trace travers multiple customers), but I also feel this is an extra overhead and complicated logic for cases when the customer does not care about this.

text/trace/0168-sampling-propagation.md Outdated Show resolved Hide resolved
text/trace/0168-sampling-propagation.md Outdated Show resolved Hide resolved
text/trace/0168-sampling-propagation.md Outdated Show resolved Hide resolved
text/trace/0168-sampling-propagation.md Show resolved Hide resolved
text/trace/0168-sampling-propagation.md Show resolved Hide resolved
text/trace/0168-sampling-propagation.md Show resolved Hide resolved
@bogdandrutu
Copy link
Member

@jmacd please fix the check-errors.

@jmacd
Copy link
Contributor Author

jmacd commented Sep 15, 2021

One big question that I don't get the answer from this OTEP is if this is a required feature.

Discussed in "Default behavior". I'd prefer this were on-by-default, but with the present tracestate solution it's expensive. Maybe after there's a W3C traceparent, where this costs 1-3 bytes per context, this can be on by default. I'll accept off-by-default for now-- Lightstep's OTel launchers would turn this on by default. I imagine "yet another" environment variable to say whether the TraceIDRatioBased sampler generates r or not.

@bogdandrutu
Copy link
Member

@jmacd where should we discuss/agree if this is or not the default behavior?

@carlosalberto
Copy link
Contributor

@bogdandrutu Just FYI there's a SIG call on 8 minutes, although don't know who package are you today ;)

Copy link
Member

@bogdandrutu bogdandrutu left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Unblocking, since @jmacd promised that the default behaviors will be discussed in specs.

@jmacd
Copy link
Contributor Author

jmacd commented Sep 28, 2021

default behaviors will be discussed in specs

Yes. FWIW, I've already changed this OTEP to state that the default will be opt-in. We're just going to debate the name and form of the option.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

9 participants