Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Probability sampling: Encode Span's head-adjusted count #170

Merged
merged 23 commits into from
Sep 9, 2021

Conversation

jmacd
Copy link
Contributor

@jmacd jmacd commented Jul 27, 2021

This text is reproduced from #148, a review that became too long after several revisions to the document.

This proposes specification text for a sampler.name and sampler.adjusted_count attribute, used to convey sampling probability in recorded SpanData messages.

A companion OTEP #168 discusses how to propagate sampling probability when using the Parent sampler.

@jmacd jmacd requested review from a team July 27, 2021 19:35
@jmacd
Copy link
Contributor Author

jmacd commented Jul 27, 2021

@oertl @yurishkuro @paulosman this document is unchanged from the most recent discussion in #148. I thought that opening a new PR would improve our chances of getting this specification merged.

This specification includes a lot of background to establish that there are many ways to sample and count spans, but all of them involve recording an adjusted count.

Copy link
Member

@yurishkuro yurishkuro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am in favor of merging the narrative, but I do not full agree with the proposed spec changes.

text/trace/0170-sampling-probability.md Outdated Show resolved Hide resolved
text/trace/0170-sampling-probability.md Outdated Show resolved Hide resolved
text/trace/0170-sampling-probability.md Outdated Show resolved Hide resolved
text/trace/0170-sampling-probability.md Outdated Show resolved Hide resolved
@paulosman
Copy link
Member

@oertl @yurishkuro @paulosman this document is unchanged from the most recent discussion in #148. I thought that opening a new PR would improve our chances of getting this specification merged.

This specification includes a lot of background to establish that there are many ways to sample and count spans, but all of them involve recording an adjusted count.

Thank you. I can at least provide a vendor perspective, as a consumer of sampled telemetry events.

Adding an adjusted count solves a real problem for us: At the moment we depend on the return value from ShouldSample including attributes that are added to the span. We have custom plugins that add a sample rate to this list of attributes, which is then read at ingest and used to make estimates about the population. If no adjusted count is included, we assume a value of 1.

This works but requires customers to use custom sampling plugins that provide no real value above the ability to communicate the adjusted count (the plugins use a deterministic algorithm that is similar to the TraceIdRatioBased sampler). Having the adjusted count be part of the spec would allow us to get rid of the need for these plugins.

For tail sampling, we accept OTLP data to our sampling proxy and then convert the data to our HTTP API which includes an adjusted count. We'd prefer to be able to do end-to-end OTLP, which the spec change described in this OTEP would allow us to do.

So 👍 from me. Thanks again for writing this up.

@jmacd
Copy link
Contributor Author

jmacd commented Aug 9, 2021

@open-telemetry/specs-trace-approvers Please take a look.

I've completed a prototype of this proposal in https://github.com/open-telemetry/opentelemetry-go/compare/main...jmacd:jmacd/propagate?expand=1. The text in this PR matches the prototype.

@jmacd
Copy link
Contributor Author

jmacd commented Aug 10, 2021

There is an error condition implied by #168 that would occur when the random variable R is not received in the trace state, here.

This illustrates the benefits of deriving R from the bits of the TraceID. If we depend on tracestate to make a consistent decision and do not receive that field, the TraceIDRatio Sampler will not know how to make its sampling decision. We could fall back to whatever behavior was being implemented by each SDK before this specification, which might be labeled ProbabilitySampler. This would lead to a situation where non-root spans are sampled with known adjusted count, but without having been selected using a consistent decision.

As specified, @oertl, the sampler.adjusted_count attribute can be correctly unbiased without requiring the sampling decision to be made in a consistent way. If an arbitrary Sampler records adjusted counts in this way, how much do you care to know that another sampler was used? None of the built-in samplers would do this, except for a hypothetical TraceIDRatio Sampler when the random variable R is not received in the trace state. We can avoid any built-in Samplers doing this by adding requirements to the TraceID, probably via W3C traceparent.

@yurishkuro
Copy link
Member

Suggestion: we may want to consider adding top-level Span fields for count and sampler name, given how important these concepts are and the space savings we'd achieve by recording values only.

@jmacd
Copy link
Contributor Author

jmacd commented Aug 20, 2021

I will update this PR based on the discussion in Thursday's Sampling SIG meeting, where we concluded that Tail sampling is interfering with our objectives and can be dropped at this time. Any kind of tail sampling that wants to be involved in span-to-metrics will require further specification, probably can be based off the proposals that were developed in this OTEP.

@jmacd
Copy link
Contributor Author

jmacd commented Aug 21, 2021

I revised this OTEP based on the Thursday Sampling SIG discussion, in which it was proposed that we introduce just a single integer field to encode head sampling probability, ignoring (for the time being) the topic of tail sampling. The proposed field would take 65 possible values, as follows:

Value Head Adjusted Count
0 Unknown
1 1
2 2
3 4
4 8
5 16
6 32
... ...
X 2^(X-1)
... ...
63 2^62
64 0

and the SpanData would be updated with a new field in v1/trace.proto.

  // Log-head-adjusted count is the logarithm of adjusted count for
  // this span as calculated at the head, offset by +1, with the
  // following recognized values.
  //
  // 0: The zero value represents an UNKNOWN adjusted count.
  //    Consumers of these Spans cannot cannot compute span metrics.
  //
  // 1: An adjusted count of 1.
  // 
  // 2-63: Values 2 through 63 represent an adjusted count of 2^(Value-1)
  //
  // 64: Value 64 represents an adjusted count of zero.
  //
  // Values greater than 64 are unrecognized.
  uint32 log_head_adjusted_count = <next_tag>;

@jmacd jmacd changed the title Probability sampling: Sampler Name and Adjusted Count attributes Probability sampling: Encode Span's head-adjusted count Aug 21, 2021
@oertl
Copy link
Contributor

oertl commented Aug 21, 2021

@jmacd, maybe it is better to sacrify the sampling rate 1/2^62 (which will never be used in practice I guess) in order to fit the state into 6 bits?

@jmacd
Copy link
Contributor Author

jmacd commented Aug 23, 2021

@jmacd, maybe it is better to sacrifice the sampling rate 1/2^62 (which will never be used in practice I guess) in order to fit the state into 6 bits?

I agree, and have applied this change in both OTEPs.

@jmacd
Copy link
Contributor Author

jmacd commented Aug 25, 2021

@yurishkuro I responded to your commend above. This PR is reduced to proposing to a single uint32 field in the protocol, defined as:

  // Log-head-adjusted count is the logarithm of adjusted count for
  // this span as calculated at the head, offset by +1, with the
  // following recognized values.
  //
  // 0: The zero value represents an UNKNOWN adjusted count.
  //    Consumers of these Spans cannot cannot compute span metrics.
  //
  // 1: An adjusted count of 1.
  // 
  // 2-62: Values 2 through 62 represent an adjusted count of 2^(Value-1)
  //
  // 63: Value 63 represents an adjusted count of zero.
  //
  // Values greater than 64 are unrecognized.
  uint32 log_head_adjusted_count = <next_tag>;

This matches OTEP #168, which is also ready for another look, please.

Copy link
Member

@yurishkuro yurishkuro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

overall lgtm (minor nit on v=63), but I think this OTEP is not actionable in the current form, it needs to propose changes to the Sampler SPI to return inclusion probability (or something) to the Tracer.

text/trace/0170-sampling-probability.md Show resolved Hide resolved
text/trace/0170-sampling-probability.md Show resolved Hide resolved
text/trace/0170-sampling-probability.md Show resolved Hide resolved
@oertl
Copy link
Contributor

oertl commented Aug 26, 2021

@jmacd, @yurishkuro I have implemented an abstract Sampler base class prototype in Java, see here. Any derived implementation will meet the requirements for consistent sampling. It takes care of generating the geometric random value and propagates it together with the exponent of the sampling probability. There are currently two prototype implementations: ConsistentParentRateSampler and ConsistentFixedRateSampler.

@jmacd
Copy link
Contributor Author

jmacd commented Aug 27, 2021

To make this more actionable, I've added a section describing a new SamplingResult which is same as the new Span field proposed here. 4e6c69a

This is also the same as @oertl's prototype:

https://github.com/dynatrace-research/opentelemetry-sampling-poc/blob/087741b7bf4a33353ba7d4dccac4c99d82b10de7/poc/src/main/java/com/dynatrace/research/otelsampling/sampling/AbstractConsistentSampler.java#L166

Copy link
Contributor

@tedsuo tedsuo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This OTEP looks GTM, any small changes could be worked out during the spec process.

@jmacd
Copy link
Contributor Author

jmacd commented Sep 9, 2021

The only remaining question here, possibly, is whether the Span field name should be log_head_adjusted_count or something similar. I propose that we merge this OTEP. 😀

@jsuereth jsuereth merged commit eee3cb8 into open-telemetry:main Sep 9, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

8 participants