From fcd0d6f96a6e8bde90a1e6d94c9c4297694228b0 Mon Sep 17 00:00:00 2001 From: Joshua MacDonald Date: Tue, 27 Jul 2021 12:33:13 -0700 Subject: [PATCH 01/23] draft from OTEP 148 --- text/trace/0000-sampling-probability.md | 690 ++++++++++++++++++++++++ 1 file changed, 690 insertions(+) create mode 100644 text/trace/0000-sampling-probability.md diff --git a/text/trace/0000-sampling-probability.md b/text/trace/0000-sampling-probability.md new file mode 100644 index 000000000..6ce72dae3 --- /dev/null +++ b/text/trace/0000-sampling-probability.md @@ -0,0 +1,690 @@ +s# Probability sampling of telemetry events + + + +- [Motivation](#motivation) +- [Examples](#examples) + * [Span sampling](#span-sampling) + + [Sample spans to Counter Metric](#sample-spans-to-counter-metric) + + [Sample spans to Histogram Metric](#sample-spans-to-histogram-metric) + + [Sample span rate limiting](#sample-span-rate-limiting) + * [Metric sampling](#metric-sampling) + + [Statsd Counter](#statsd-counter) + + [Metric exemplars with adjusted counts](#metric-exemplars-with-adjusted-counts) + + [Metric cardinality limiter](#metric-cardinality-limiter) +- [Explanation](#explanation) + * [Model and terminology](#model-and-terminology) + + [Sampling without replacement](#sampling-without-replacement) + + [Adjusted sample count](#adjusted-sample-count) + + [Sampling and variance](#sampling-and-variance) + * [Conveying the sampling probability](#conveying-the-sampling-probability) + + [Encoding adjusted count](#encoding-adjusted-count) + + [Encoding inclusion probability](#encoding-inclusion-probability) + + [Encoding negative base-2 logarithm of inclusion probability](#encoding-negative-base-2-logarithm-of-inclusion-probability) + + [Multiply the adjusted count into the data](#multiply-the-adjusted-count-into-the-data) + * [Trace Sampling](#trace-sampling) + + [Counting spans and traces](#counting-spans-and-traces) + + [Head sampling for traces](#head-sampling-for-traces) + - [`Parent` 
Sampler](#parent-sampler) + - [`TraceIDRatio` Sampler](#traceidratio-sampler) + - [Dapper's "Inflationary" Sampler](#dappers-inflationary-sampler) + * [Working with adjusted counts](#working-with-adjusted-counts) + + [Merging samples](#merging-samples) + + [Maintaining "Probability proportional to size"](#maintaining-probability-proportional-to-size) + + [Zero adjusted count](#zero-adjusted-count) +- [Proposed specification text](#proposed-specification-text) +- [Recommended reading](#recommended-reading) +- [Acknowledgements](#acknowledgements) + + + +Objective: Specify a foundation for sampling techniques in OpenTelemetry. + +## Motivation + +Probability sampling allows consumers of sampled telemetry data to +collect a fraction of telemetry events and use them to estimate total +quantities about the population of events, such as the total rate of +events with a particular attribute. Sampling is a general-purpose +facility for lowering cost at the expense of lower data quality. + +These techniques enable reducing the cost of telemetry collection, +both for producers (i.e., SDKs) and for processors (i.e., Collectors), +without losing the ability to (at least coarsely) monitor the whole +system. + +Sampling builds on results from probability theory, most significantly +the concept of expected value. Estimates drawn from probability +samples are *random variables* that, when correct procedures are +followed, accurately reflect their true value, making them unbiased. +Unbiased samples can be used for after-the-fact analysis. We can +answer questions such as "what fraction of events had property X?" +using the fraction of events in the sample that have property X. + +This document outlines how producers and consumers of sample telemetry +data can convey estimates about the total count of telemetry events, +without conveying information about how the sample was computed, using +a quantity known as **adjusted count**. 
In common language, a +"one-in-N" sampling scheme emits events with adjusted count equal to +N. Adjusted count is the expected value of the number of events in +the population represented by an individual sample event. + +## Examples + +These examples use the proposed attribute `sampler.adjusted_count` to +convey sampling probability. Consumers of spans, metrics, and logs +annotated with adjusted counts are able to calculate accurate +statistics about the whole population of events, at a basic level, +without knowing details about the sampling configuration. + +### Span sampling + +Example use-cases for probability sampling of spans +generally involve generating metrics from spans. + +#### Sample spans to Counter Metric + +For every complete span it receives, the example processor will synthesize +metric data as though a Counter instrument named `S.count` for span +named `S` had been incremented once per span at the original `Start()` +call site. + +This processor will add the adjusted count of each span to the +instrument (e.g., `Add(adjusted_count, labels...)`) for every span it +receives, logically taking place at the start or end time of the span. + +#### Sample spans to Histogram Metric + +For every span it receives, the example processor will synthesize +metric data as though a Histogram instrument named `S.duration` for +span named `S` had been observed once per span at the original `End()` +call site. + +The OpenTelemetry Metric data model does not support histogram buckets +with non-integer counts, which forces the use of integer adjusted +counts here (i.e., 1-in-N sampling rates where N is an integer). + +Logically speaking, this processor will observe the span's duration its +adjusted count number of times for every span it receives, at the end +time of the span. 
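The two example span processors above can be sketched together. This is a minimal illustration, not an OpenTelemetry API: the span dictionaries, the `synthesize_metrics` function, and the `BOUNDS_MS` bucket boundaries are all hypothetical, and a span without a `sampler.adjusted_count` attribute is assumed to count as one span.

```python
import bisect
from collections import defaultdict

# Hypothetical histogram bucket upper bounds in milliseconds (an assumption).
BOUNDS_MS = [10, 100, 1000]


def synthesize_metrics(spans):
    """Return (counters, histograms) keyed by synthesized metric name."""
    counters = defaultdict(int)
    histograms = defaultdict(lambda: [0] * (len(BOUNDS_MS) + 1))
    for span in spans:
        # Assumption: a span with no adjusted count attribute counts once.
        adjusted = span.get("attributes", {}).get("sampler.adjusted_count", 1)
        if adjusted != int(adjusted):
            # Histogram buckets hold integer counts, so require 1-in-N rates.
            raise ValueError("histogram buckets need integer adjusted counts")
        adjusted = int(adjusted)
        # Counter: behaves like Add(adjusted_count) on `S.count`.
        counters[span["name"] + ".count"] += adjusted
        # Histogram: behaves like observing the duration adjusted_count times.
        bucket = bisect.bisect_left(BOUNDS_MS, span["duration_ms"])
        histograms[span["name"] + ".duration"][bucket] += adjusted
    return dict(counters), dict(histograms)
```

Adding the adjusted count to one bucket is equivalent to observing the same duration that many times, which is why only integer adjusted counts are accepted in this sketch.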
+ +#### Sample span rate limiting + +A collector processor will introduce a slight delay in order to ensure +it has received a complete frame of data, during which time it +maintains a fixed-size buffer of complete input spans. If the number of spans +received exceeds the size of the buffer before the end of the +interval, begin weighted sampling using the adjusted count of each +span as input weight. + +This processor drops spans when the configured rate threshold is +exceeded; otherwise it passes spans through with unmodified adjusted +counts. + +When the interval expires and the sample frame is considered complete, +the selected sample spans are output with possibly updated adjusted +counts. + +### Metric sampling + +Example use-cases for probability sampling of metrics +are aimed at lowering cost and addressing high cardinality. + +#### Statsd Counter + +A Statsd counter event appears as a line of text, describing a +number-valued event with optional attributes and inclusion probability +("sample rate"). + +For example, a metric named `name` is incremented by `increment` using +a counter event (`c`) with the given `sample_rate`. + +``` +name:increment|c|@sample_rate +``` + +For example, a count of 100 that was selected for a 1-in-10 simple +random sampling scheme will arrive as: + +``` +counter:100|c|@0.1 +``` + +Events in the example with 0.1 inclusion probability have an +adjusted count of 10. Assuming the sample was selected using an +unbiased algorithm, we can interpret this event as having an expected +count of `100/0.1 = 1000`. + +#### Metric exemplars with adjusted counts + +The OTLP protocol for metrics includes a repeated exemplars field in +every data point. This is a place where Metric aggregators (e.g., +histograms) are able to provide example context to correlate metrics +with traces. + +OTLP exemplars support additional attributes, those that were present +on the API event and were dropped during aggregation. 
Exemplars that +are selected probabilistically and recorded with their adjusted counts +make it possible to approximately count events using dimensions that +were dropped during metric aggregation. + +An end-to-end pipeline of sampled metrics events can be constructed +based on exemplars with adjusted counts, one capable of supporting +approximate-count queries over sampled metric events at high +cardinality. + +#### Metric cardinality limiter + +A metrics processor can be configured to limit cardinality for a +single metric name, allowing no more than K distinct label sets per +export interval. The export interval is fixed to a short interval so +that a complete set of distinct labels can be stored temporarily. + +Caveats: as presented, this works for Sum and Histogram points +received with Delta aggregation temporality and where the Sum is +monotonic (see +[opentelemetry-proto/issues/303](https://github.com/open-telemetry/opentelemetry-proto/issues/303)). + +Considering data points received during the interval, when the number +of points exceeds K, select a probability proportional to size sample +of points, output every point with a `sampler.adjusted_count` attribute. + +## Explanation + +Consider a hypothetical telemetry signal in which a stream of +data items is produced containing one or more associated numbers. +Using the OpenTelemetry Metrics data model terminology, we have two +scenarios in which sampling is common. + +1. _Counter events:_ Each event represents a count, signifying the change in a sum. +2. _Histogram events:_ Each event represents an individual variable, signifying membership in a distribution. + +A Tracing Span event qualifies as both of these cases simultaneously. +One span can be interpreted as at least one Counter event (e.g., one +request, the number of bytes read) and at least one Histogram event +(e.g., request latency, request size). 
+ +In Metrics, [Statsd Counter and Histogram events meet this definition](https://github.com/statsd/statsd/blob/master/docs/metric_types.md#sampling). + +In both cases, the goal in sampling is to estimate the count of events +in the whole population, meaning all the events, using only the events +that were selected in the sample. + +### Model and terminology + +This model is meant to apply in telemetry collection situations where +individual events at an API boundary are sampled for collection. Once +the process of sampling individual API-level events is understood, we +will learn to apply these techniques for sampling aggregated data. + +In sampling, the term _sampling design_ refers to how sampling +probability is decided and the term _sample frame_ refers to how +events are organized into discrete populations. The design of a +sampling strategy dictates how the population is framed. + +For example, a simple design uses uniform probability, and a simple +framing technique is to collect one sample per distinct span name per +hour. A different sample framing could collect one sample across all +span names every 10 minutes. + +After executing a sampling design over a frame, each item selected in +the sample will have a known _inclusion probability_, which determines +how likely the item was to be selected. Implicitly, all the items +that were not selected for the sample have zero inclusion probability. + +Descriptive words that are often used to describe sampling designs: + +- *Fixed*: the sampling design is the same from one frame to the next +- *Adaptive*: the sampling design changes from one frame to the next based on the observed data +- *Equal-Probability*: the sampling design uses a single inclusion probability per frame +- *Unequal-Probability*: the sampling design uses multiple inclusion probabilities per frame +- *Reservoir*: the sampling design uses fixed space, has fixed-size output. 
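To make the terms concrete, here is a minimal sketch of one design that is both *Equal-Probability* and *Reservoir*. It uses the classic Algorithm R reservoir technique, which is brought in for illustration and is not prescribed by this document; the function name `reservoir_sample` and its return shape are assumptions.

```python
import random


def reservoir_sample(frame, k, rng=random):
    """Select at most k items from one sample frame using Algorithm R.

    Returns (item, inclusion_probability) pairs. Every selected item
    shares the same inclusion probability: min(1, k / frame_size).
    """
    sample = []
    seen = 0
    for item in frame:
        seen += 1
        if len(sample) < k:
            # The reservoir is not full yet, so keep every item.
            sample.append(item)
        else:
            # Replace a random slot with probability k / seen.
            slot = rng.randrange(seen)
            if slot < k:
                sample[slot] = item
    if not sample:
        return []
    probability = len(sample) / seen
    return [(item, probability) for item in sample]
```

The output size is fixed at `k` once the frame holds at least `k` items, and the reciprocal of each returned probability sums back to the frame size, which is the property the rest of this document relies on.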
+ +Our goal is to support flexibility in choosing sampling designs for +producers of telemetry data, while allowing consumers of sampled +telemetry data to be agnostic to the sampling design used. + +#### Sampling without replacement + +We are interested in the common case in telemetry collection, where +sampling is performed while processing a stream of events and each +event is considered just once. Sampling designs of this form are +referred to as _sampling without replacement_. Unless stated +otherwise, "sampling" in telemetry collection always refers to +sampling without replacement. + +After executing a given sampling design over a complete frame of data, +the result is a set of selected sample events, each having known and +non-zero inclusion probability. There are several other quantities of +interest, after calculating a sample from a sample frame. + +- *Sample size*: the number of events with non-zero inclusion probability +- *True population total*: the exact number of events in the frame, which may be unknown +- *Estimated population total*: the estimated number of events in the frame, which is computed from the sample. + +The sample size is always known after it is calculated, but the size +may or may not be known ahead of time, depending on the design. +Probabilistic sampling schemes require that the expected value of the +estimated population total equals the true population total. + +#### Adjusted sample count + +Following the model above, every event has an +_adjusted count_. + +- _Adjusted count_ is zero if the event was not selected for the sample +- _Adjusted count_ is the reciprocal of its inclusion probability, otherwise. + +The adjusted count of an event represents the expected contribution to +the estimated population total of a sample frame represented by the +individual event. + +The use of a reciprocal inclusion probability matches our intuition +for probabilities. 
Items selected with "one-out-of-N" probability of +inclusion count for N each, approximately speaking. + +This intuition is backed up with statistics. Summing the reciprocal +inclusion probabilities of the sampled events is known +as the Horvitz-Thompson estimator of the population total, a +general-purpose statistical "estimator" that applies to all _without +replacement_ sampling designs. + +Assuming sample data is correctly computed, the consumer of sample +data can treat every sample event as though an identical copy of +itself has occurred _adjusted count_ times. Every sample event is +representative of adjusted count many copies of itself. + +There is one essential requirement for this to work. The selection +procedure must be _statistically unbiased_, a term meaning that the +process is required to give equal consideration to all possible +outcomes. + +#### Sampling and variance + +The use of unbiased sampling outlined above makes it possible to +estimate the population total for arbitrary subsets of the sample, as +every individual sample has been independently assigned an adjusted +count. + +There is a natural relationship between statistical bias and variance. +Approximate counting comes with variance, a matter of fact which can +be controlled for by the sample size. Variance is unavoidable in an +unbiased sample, but variance diminishes with increasing sample size. + +Although this makes it sound like small sample sizes are a problem, +due to expected high variance, this is just a limitation of the +technique. When variance is high, use a larger sample size. + +An easy approach for lowering variance is to aggregate sample frames +together across time, which generally increases the size of the +subpopulations being counted. For example, although the estimates for +the rate of spans by distinct name drawn from a one-minute sample may +have high variance, combining an hour of one-minute sample frames into +an aggregate data set is guaranteed to lower variance (assuming the +number of span names stays fixed). 
It must, because the data remains +unbiased, so more data results in lower variance. + +### Conveying the sampling probability + +Some possibilities for encoding the adjusted count or inclusion +probability are discussed below, depending on the circumstances and +the protocol. Here, the focus is on how to count sampled telemetry +events in general, not a specific kind of event. As we shall see in +the following section, tracing comes with additional complications. + +There are several ways of encoding this adjusted count or inclusion +probability: + +- as a dedicated field in an OTLP protobuf message +- as a non-descriptive Attribute in an OTLP Span, Metric, or Log +- without any dedicated field. + +#### Encoding adjusted count + +We can encode the adjusted count directly as a floating point or +integer number in the range [0, +Inf). This is a conceptually easy +way to understand sampling because larger numbers mean greater +representivity. + +Note that it is possible, given this description, to produce adjusted +counts that are not integers. Adjusted counts are an approximation, +and the expected value of an integer can be a fractional count. +Floating-point adjusted counts can be avoided with the use of +integer-reciprocal inclusion probabilities. + +#### Encoding inclusion probability + +We can encode the inclusion probability directly as a floating point +number in the range [0, 1]. This is typical of the Statsd format, +where each line includes an optional probability. In this context, +the probability is also commonly referred to as a "sampling rate". In +this case, smaller numbers mean greater representivity. + +#### Encoding base-2 logarithm of adjusted count + +We can encode the base-2 logarithm of adjusted count (i.e., negative +base-2 logarithm of inclusion probability). By using an integer +field, restricting adjusted counts and inclusion probabilities to +powers of two, this allows the use of small non-negative integers to +encode the adjusted count. 
In this case, larger numbers mean +exponentially greater representivity. + +#### Multiply the adjusted count into the data + +When the data itself carries counts, such as for the Metrics Sum and +Histogram points, the adjusted count can be multiplied into the data. + +This technique is less desirable because, while it preserves the +expected value of the count or sum, the data loses information about +variance. This may also lead to rounding errors, when adjusted counts +are not integer valued. + +### Trace Sampling + +Sampling techniques are always about lowering the cost of data +collection and analysis, but in trace collection and analysis +specifically, approaches can be categorized by whether they reduce +Tracer overhead. Tracer overhead is reduced by not recording spans +for unsampled traces, which requires making the sampling decision for a +trace before all of its attributes are known. + +Traces are expected to be complete, meaning that a tree or sub-tree of +spans branching from a certain root is expected to be fully +collected. When sampling is applied to reduce Tracer overhead, there +is generally an expectation that complete traces will still be +produced. Sampling techniques that lower Tracer overhead and produce +complete traces are known as _Head-based trace sampling_ techniques. + +The decision to produce and collect a sample trace has to be made when +the root span starts, to avoid incomplete traces. Then, assuming +complete traces can be collected, the adjusted count of the root span +determines an adjusted count for every span in the trace. 
+ +#### Counting child spans using root span adjusted counts + +The adjusted count of a root span determines the adjusted count of +each of its children based on the following logic: + +- The root span is considered representative of `adjusted_count` many + identical root spans, because it was selected using unbiased sampling +- Context propagation conveys _causation_, the fact that one span produces + another +- A root span causes each of the child spans in its trace to be produced +- A sampled root span represents `adjusted_count` many traces, representing + the cause of `adjusted_count` many occurrences per child span in the + sampled trace. + +Using this reasoning, we can define a sample collected from all root +spans in the system, which allows estimating the count of all spans in +the population. Take a simple probability sample of root spans: + +1. In the `Sampler` decision for root spans, use the initial span properties + to determine the inclusion probability `P` +2. Make a pseudo-random selection with probability `P`; if true, return + `RECORD_AND_SAMPLE` (so that the W3C Trace Context `is-sampled` + flag is set in all child contexts) +3. Encode a span attribute `sampler.adjusted_count` equal to `1/P` on the root span +4. Collect all spans where the W3C Trace Context `is-sampled` flag is set. + +After collecting all sampled spans, locate the root span for each. +Apply the root span's adjusted count to every child in the associated +trace. The sum of adjusted counts on all sampled spans is expected to +equal the population total number of spans. + +Now, having stored the sample spans with their adjusted counts, and +assuming the source of randomness is good, we can extrapolate counts +for the population using arbitrary queries over the sampled spans. +Sampled spans can be translated into approximate metrics over the +population of spans, after their adjusted counts are known. 
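The procedure above can be sketched in a few lines. This is an illustration under stated assumptions rather than SDK code: spans are plain dictionaries with hypothetical `trace_id`, `parent_id`, `name`, and `attributes` keys, a root span is one whose `parent_id` is `None`, and the function names are invented for this example. Spans whose root span was not collected are skipped, because nothing tells us how to count them.

```python
import random


def sample_root(probability, rng=random):
    """Root decision: return (sampled, attributes to set on the root span)."""
    if rng.random() < probability:
        # Selected: record the reciprocal of the inclusion probability.
        return True, {"sampler.adjusted_count": 1.0 / probability}
    return False, {}


def apply_root_counts(spans):
    """Copy each root span's adjusted count onto every span in its trace."""
    root_counts = {
        span["trace_id"]: span["attributes"]["sampler.adjusted_count"]
        for span in spans
        if span["parent_id"] is None
    }
    return [
        {**span, "adjusted_count": root_counts[span["trace_id"]]}
        for span in spans
        if span["trace_id"] in root_counts  # skip spans with no known root
    ]


def estimate_count(spans, predicate=lambda span: True):
    """Estimate how many spans in the population match the predicate."""
    return sum(
        span["adjusted_count"]
        for span in apply_root_counts(spans)
        if predicate(span)
    )
```

For instance, `estimate_count(spans, lambda s: s["name"] == "db")` would estimate the population count of spans named `db`. The lookup table built in `apply_root_counts` corresponds to the step of locating the root span for each collected span.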
+ +The cost of this analysis, using only the root span's adjusted count, +is that all root spans have to be collected before we can count +non-root spans. The cost of indexing and looking up the root span +adjusted counts makes this analysis relatively expensive to perform in +real time. + +#### Using head trace probability to count all spans + +If the W3C `is-sampled` flag will be used to determine whether +`RECORD_AND_SAMPLE` is returned in a Sampler, then counting +sample spans without first locating the root span requires propagating +the _head trace sampling probability_ through the context. + +Head trace sampling probability may be thought of as the probability +of causing a child span to be sampled. Propagators that maintain +this variable MUST obey the rules of conditional probability. In this +model, the adjusted count of each span depends on the adjusted count +of its parent, not of the root in a trace. Still, the sum of adjusted +counts of all sampled spans is expected to equal the population total +number of spans. + +This applies to other forms of telemetry that happen (i.e., are +caused) within a context carrying head trace sampling probability. +For example, we may record log events and metrics exemplars with +adjusted counts equal to the inverse of the current head trace +sampling probability when they are produced. + +This technique allows translating spans and logs to metrics without +first locating their root span, a significant performance advantage +compared with first collecting and indexing root spans. + +Several head sampling techniques are discussed in the following +sections and evaluated in terms of their ability to meet all of the +following criteria: + +- Reduces Tracer overhead +- Produces complete traces +- Spans are countable. + +#### Head sampling for traces + +This section gives details about Sampler implementations that meet +the requirements stated above. 
+ +##### `Parent` Sampler + +The `Parent` Sampler ensures complete traces, provided all spans are +successfully recorded. A downside of `Parent` sampling is that it +takes away control over Tracer overhead from non-roots in the trace. +To support real-time span-to-metrics applications, this Sampler +requires propagating the sampling probability or adjusted count of +the context in effect when starting child spans. This is expanded +upon in [OTEP 168 (WIP)](https://github.com/open-telemetry/oteps/pull/168). + +When propagating head sampling probability, spans recorded by the +`Parent` sampler MAY encode the adjusted count in the corresponding +`SpanData` using a non-descriptive Span attribute named +`sampler.adjusted_count`. + +##### `TraceIDRatio` Sampler + +The OpenTelemetry tracing specification includes a built-in Sampler +designed for probability sampling using a deterministic sampling +decision based on the TraceID. This Sampler was not finished before +the OpenTelemetry version 1.0 specification was released; it was left +in place, with [a TODO and the recommendation to use it only for trace +roots](https://github.com/open-telemetry/opentelemetry-specification/issues/1413). +[OTEP 135 proposed a solution](https://github.com/open-telemetry/oteps/pull/135). + +The goal of the `TraceIDRatio` Sampler is to coordinate the tracing +decision, but give each service control over Tracer overhead. Each +service sets its sampling probability independently, and the +coordinated decision ensures that some traces will be complete. +Traces are complete when the TraceID ratio falls below the minimum +Sampler probability across the whole trace. Techniques have been +developed for [analysis of partial traces that are compatible with +TraceID ratio sampling](https://arxiv.org/pdf/2107.07703.pdf). + +The `TraceIDRatio` Sampler has another difficulty with testing for +completeness. 
It is impossible to know whether there are missing leaf +spans in a trace without using external information. One approach, +[lost in the transition from OpenCensus to OpenTelemetry is to count +the number of children of each +span](https://github.com/open-telemetry/opentelemetry-specification/issues/355). + +Lacking the number of expected children, we require a way to know the +minimum Sampler probability across traces to ensure they are complete. + +To count TraceIDRatio-sampled spans, each span MAY encode its adjusted +count in the corresponding `SpanData` using a non-descriptive Span +attribute named `sampler.adjusted_count`. + +##### Dapper's "Inflationary" Sampler + +Google's [Dapper](https://research.google/pubs/pub36356/) tracing +system describes the use of sampling to control the cost of trace +collection at scale. Dapper's early Sampler algorithm, referred to as +an "inflationary" approach (although not published in the paper), is +reproduced here. + +This kind of Sampler allows non-root spans in a trace to raise the +probability of tracing, using a conditional probability formula shown +below. Traces produced in this way are complete sub-trees, not +necessarily complete. This technique is successful especially in +systems where a high-throughput service on occasion calls a +low-throughput service. Low-throughput services are meant to inflate +their sampling probability. + +The use of this technique requires propagating the head inclusion +probability (as discussed for the `Parent` sampler) of the incoming +Context and whether it was sampled, in order to calculate the +probability of starting to sample a new "sub-root" in the trace. + +Using standard notation for conditional probability, `P(x)` indicates +the probability of `x` being true, and `P(x|y)` indicates the +probability of `x` being true given that `y` is true. 
The axioms of +probability establish that: + +``` +P(x)=P(x|y)*P(y)+P(x|not y)*P(not y) +``` + +The variables are: + +- **`H`**: The head inclusion probability of the parent context that + is in effect, independent of whether the parent context was sampled +- **`I`**: The inflationary sampling probability for the span being + started. +- **`D`**: The decision probability for whether to start a new sub-root. + +This Sampler cannot lower sampling probability, so if the new span is +started with `H >= I` or when the context is already sampled, no new +sampling decisions are made. If the incoming context is already +sampled, the adjusted count of the new span is `1/H`. + +Assuming `H < I` and the incoming context was not sampled, we have the +following probability equations: + +``` +P(span sampled) = I +P(parent sampled) = H +P(span sampled | parent sampled) = 1 +P(span sampled | parent not sampled) = D +``` + +Using the formula above, + +``` +I = 1*H + D*(1-H) +``` + +solve for D: + +``` +D = (I - H) / (1 - H) +``` + +Now the Sampler makes a decision with probability `D`. Whether the +decision is true or false, propagate `I` as the new head inclusion +probability. If the decision is true, begin recording a sub-rooted +trace with adjusted count `1/I`. + +#### Zero adjusted count + +An adjusted count with zero value carries meaningful information, +specifically that the item participated in a probabilistic sampling +scheme and was not selected. A zero value can be be useful to record +events outside of a sample, when they provide useful information +despite their effective count. We can use this to record error +exemplars, for example, even when they are not selected by the +Sampler. + +## Proposed specification text + +The following text will be added to the semantic conventions for +recording the Sampler name and adjusted count (if known) as +OpenTelemetry Span attributes. 
+ +``` +# Semantic conventions for Sampled spans + +This document defines how to describe an what sampling was performed +when recording a span that has had sampling logic applied. + +Span sampling attributes support computing metrics about spans that +are part of a sampled trace from knowing their sampling inclusion +probability. + +The _adjusted count_ of a span is defined as follows: + +- Adjusted count equals zero when inclusion probability equals zero +- Adjusted count equals the mathematical inverse (i.e., reciprocal) of sampling inclusion probability when inclusion probability is non-zero. + +Consumers of spans carrying an adjusted count attribute are able to +use the adjusted count of the span to increment a counter of matching +spans. + +## Probability Sampling Attributes + +The `sampler.adjusted_count` attribute MUST reflect an unbiased +estimate of the number of representative spans in the population of +spans being produced. + +When built-in Samplers are used, the name of the effective Sampler +that computed the adjusted count is included to indicate how the sample +was computed, which may give additional information. + +| Attribute | Type | Description | Examples | Required | +|---|---|---|---|---| +| `sampler.adjusted_count` | number | Effective count of the associated span. | 10 | No | +| `sampler.name` | string | The name of the Sampler that determined the adjusted count. | `Parent` | Yes | + +For the built-in samplers, the following names are specified: + +| Built-in Sampler | Sets `sampler.adjusted_count`? 
| `sampler.name` | Notes | +| -- | -- | -- | +| AlwaysOn | No | Not applicable | Sampling attributes are not used | +| AlwaysOff | No | Not applicable | Spans are not recorded | +| ParentBased | Maybe | `Parent` | Adjusted count requires propagation | +| TraceIDRatio | Yes | `TraceIDRatio` | | +``` + +Note that the `AlwaysOn` and `AlwaysOff` Samplers do not need to +recorder their names, since they are indistinguishable from not having +a stampler configured. When there is no `sampler.name` attribute +present and a Span is recorded, it should be counted as one span +(i.e., count == adjusted_count). + +See [OTEP 168 (WIP)](https://github.com/open-telemetry/oteps/pull/168) +for details about how to report sampling probability when using the +`Parent` Sampler. + +## Recommended reading + +[Sampling, 3rd Edition, by Steven +K. Thompson](https://www.wiley.com/en-us/Sampling%2C+3rd+Edition-p-9780470402313). + +[A Generalization of Sampling Without Replacement From a Finite Universe](https://www.jstor.org/stable/2280784), JSTOR (1952) + +[Performance Is A Shape. Cost Is A Number: Sampling](https://docs.lightstep.com/otel/performance-is-a-shape-cost-is-a-number-sampling), 2020 blog post, Joshua MacDonald + +[Priority sampling for estimation of arbitrary subset sums](https://dl.acm.org/doi/abs/10.1145/1314690.1314696) + +[Stream sampling for variance-optimal estimation of subset sums](https://arxiv.org/abs/0803.0473). + +[Estimation from Partially Sampled Distributed Traces](https://arxiv.org/pdf/2107.07703.pdf), 2021 Dynatrace Research report, Otmar Ertl + +## Acknowledgements + +Thanks to [Neena Dugar](https://github.com/neena) and [Alex +Kehlenbeck](https://github.com/akehlenbeck) for their help +reconstructing the Dapper Sampler algorithm. 
From f2a7efc716dab7f69ba64493ffbe8b151506ab9e Mon Sep 17 00:00:00 2001 From: Joshua MacDonald Date: Tue, 27 Jul 2021 12:36:05 -0700 Subject: [PATCH 02/23] renumber --- ...{0000-sampling-probability.md => 0170-sampling-probability.md} | 0 1 file changed, 0 insertions(+), 0 deletions(-) rename text/trace/{0000-sampling-probability.md => 0170-sampling-probability.md} (100%) diff --git a/text/trace/0000-sampling-probability.md b/text/trace/0170-sampling-probability.md similarity index 100% rename from text/trace/0000-sampling-probability.md rename to text/trace/0170-sampling-probability.md From fefb309b65919854d6c7b7476398cc5eec2b24c7 Mon Sep 17 00:00:00 2001 From: Joshua MacDonald Date: Tue, 27 Jul 2021 12:38:12 -0700 Subject: [PATCH 03/23] typo in header --- text/trace/0170-sampling-probability.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/text/trace/0170-sampling-probability.md b/text/trace/0170-sampling-probability.md index 6ce72dae3..30a962ec9 100644 --- a/text/trace/0170-sampling-probability.md +++ b/text/trace/0170-sampling-probability.md @@ -1,4 +1,4 @@ -s# Probability sampling of telemetry events +# Probability sampling of telemetry events From df6b3eebec0029bdeaaf2454790b97dd2c9fcbde Mon Sep 17 00:00:00 2001 From: Joshua MacDonald Date: Tue, 27 Jul 2021 12:41:26 -0700 Subject: [PATCH 04/23] typos --- text/trace/0170-sampling-probability.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/text/trace/0170-sampling-probability.md b/text/trace/0170-sampling-probability.md index 30a962ec9..eb7f2650d 100644 --- a/text/trace/0170-sampling-probability.md +++ b/text/trace/0170-sampling-probability.md @@ -659,8 +659,8 @@ For the built-in samplers, the following names are specified: ``` Note that the `AlwaysOn` and `AlwaysOff` Samplers do not need to -recorder their names, since they are indistinguishable from not having -a stampler configured. 
When there is no `sampler.name` attribute +record their names, since they are indistinguishable from not having +a Sampler configured. When there is no `sampler.name` attribute present and a Span is recorded, it should be counted as one span (i.e., count == adjusted_count). From b1c2f83ea0c907276fefb427ed852b2dee69a548 Mon Sep 17 00:00:00 2001 From: Joshua MacDonald Date: Tue, 27 Jul 2021 12:43:46 -0700 Subject: [PATCH 05/23] formatting --- text/trace/0170-sampling-probability.md | 18 +++++++++--------- 1 file changed, 9 insertions(+), 9 deletions(-) diff --git a/text/trace/0170-sampling-probability.md b/text/trace/0170-sampling-probability.md index eb7f2650d..12aac76ca 100644 --- a/text/trace/0170-sampling-probability.md +++ b/text/trace/0170-sampling-probability.md @@ -644,18 +644,18 @@ that computed the adjusted count is included to indicate how the sample was computed, which may give additional information. | Attribute | Type | Description | Examples | Required | -|---|---|---|---|---| -| `sampler.adjusted_count` | number | Effective count of the associated span. | 10 | No | -| `sampler.name` | string | The name of the Sampler that determined the adjusted count. | `Parent` | Yes | +|---------- | ---- | ----------- | -------- | -------- | +| `sampler.adjusted_count` | number | Effective count of the span. | 10 | No | +| `sampler.name` | string | The name of the Sampler. | `Parent` | Yes | For the built-in samplers, the following names are specified: -| Built-in Sampler | Sets `sampler.adjusted_count`? | `sampler.name` | Notes | -| -- | -- | -- | -| AlwaysOn | No | Not applicable | Sampling attributes are not used | -| AlwaysOff | No | Not applicable | Spans are not recorded | -| ParentBased | Maybe | `Parent` | Adjusted count requires propagation | -| TraceIDRatio | Yes | `TraceIDRatio` | | +| Built-in Sampler | Sets `sampler.adjusted_count`? 
| `sampler.name` | Notes | +| ---------------- | ------------------------------ | -------------- | ------ | +| AlwaysOn | No | Not applicable | Sampling attributes are not used | +| AlwaysOff | No | Not applicable | Spans are not recorded | +| ParentBased | Maybe | `Parent` | Adjusted count requires propagation | +| TraceIDRatio | Yes | `TraceIDRatio` | | ``` Note that the `AlwaysOn` and `AlwaysOff` Samplers do not need to From 159f9cfe1555e091700e076e28fc67bdafe53303 Mon Sep 17 00:00:00 2001 From: Joshua MacDonald Date: Tue, 27 Jul 2021 13:45:39 -0700 Subject: [PATCH 06/23] clean TOC --- text/trace/0170-sampling-probability.md | 14 -------------- 1 file changed, 14 deletions(-) diff --git a/text/trace/0170-sampling-probability.md b/text/trace/0170-sampling-probability.md index 12aac76ca..21b87382f 100644 --- a/text/trace/0170-sampling-probability.md +++ b/text/trace/0170-sampling-probability.md @@ -28,10 +28,6 @@ - [`Parent` Sampler](#parent-sampler) - [`TraceIDRatio` Sampler](#traceidratio-sampler) - [Dapper's "Inflationary" Sampler](#dappers-inflationary-sampler) - * [Working with adjusted counts](#working-with-adjusted-counts) - + [Merging samples](#merging-samples) - + [Maintaining "Probability proportional to size"](#maintaining-probability-proportional-to-size) - + [Zero adjusted count](#zero-adjusted-count) - [Proposed specification text](#proposed-specification-text) - [Recommended reading](#recommended-reading) - [Acknowledgements](#acknowledgements) @@ -598,16 +594,6 @@ decision is true or false, propagate `I` as the new head inclusion probability. If the decision is true, begin recording a sub-rooted trace with adjusted count `1/I`. -#### Zero adjusted count - -An adjusted count with zero value carries meaningful information, -specifically that the item participated in a probabilistic sampling -scheme and was not selected. 
A zero value can be be useful to record -events outside of a sample, when they provide useful information -despite their effective count. We can use this to record error -exemplars, for example, even when they are not selected by the -Sampler. - ## Proposed specification text The following text will be added to the semantic conventions for From ee6252a47c130d8c06234a92ed2d462d6f9e7312 Mon Sep 17 00:00:00 2001 From: Joshua MacDonald Date: Tue, 27 Jul 2021 13:48:29 -0700 Subject: [PATCH 07/23] TOC edit --- text/trace/0170-sampling-probability.md | 5 +++-- 1 file changed, 3 insertions(+), 2 deletions(-) diff --git a/text/trace/0170-sampling-probability.md b/text/trace/0170-sampling-probability.md index 21b87382f..048a8cd4a 100644 --- a/text/trace/0170-sampling-probability.md +++ b/text/trace/0170-sampling-probability.md @@ -20,10 +20,11 @@ * [Conveying the sampling probability](#conveying-the-sampling-probability) + [Encoding adjusted count](#encoding-adjusted-count) + [Encoding inclusion probability](#encoding-inclusion-probability) - + [Encoding negative base-2 logarithm of inclusion probability](#encoding-negative-base-2-logarithm-of-inclusion-probability) + + [Encoding base-2 logarithm of adjusted count](#encoding-base-2-logarithm-of-adjusted-count) + [Multiply the adjusted count into the data](#multiply-the-adjusted-count-into-the-data) * [Trace Sampling](#trace-sampling) - + [Counting spans and traces](#counting-spans-and-traces) + + [Counting child spans using root span adjusted counts](#counting-child-spans-using-root-span-adjusted-counts) + + [Using head trace probability to count all spans](#using-head-trace-probability-to-count-all-spans) + [Head sampling for traces](#head-sampling-for-traces) - [`Parent` Sampler](#parent-sampler) - [`TraceIDRatio` Sampler](#traceidratio-sampler) From f76d11eb926fd57b05fef4d59dc05617968cfe8b Mon Sep 17 00:00:00 2001 From: Joshua MacDonald Date: Wed, 28 Jul 2021 11:46:16 -0700 Subject: [PATCH 08/23] Clarify the counting 
algorithm --- text/trace/0170-sampling-probability.md | 104 ++++++++++++++++-------- 1 file changed, 70 insertions(+), 34 deletions(-) diff --git a/text/trace/0170-sampling-probability.md b/text/trace/0170-sampling-probability.md index 048a8cd4a..991e08f83 100644 --- a/text/trace/0170-sampling-probability.md +++ b/text/trace/0170-sampling-probability.md @@ -601,60 +601,96 @@ The following text will be added to the semantic conventions for recording the Sampler name and adjusted count (if known) as OpenTelemetry Span attributes. -``` -# Semantic conventions for Sampled spans +### Semantic conventions for Sampled spans [Proposed text] -This document defines how to describe an what sampling was performed -when recording a span that has had sampling logic applied. +This document defines conventions for counting spans in a sample taken +over all spans created in all contexts in a distributed system. These +conventions support accurate counting of system-wide events using only +the fraction of spans that were collected in a probability sampling +scheme. With these conventions, consumers of OTLP Span data are able +to compute approximate metrics about the system using only the sample +Spans that was collected, thus we refer to these conventions as +supporting Span-to-Metrics pipelines. -Span sampling attributes support computing metrics about spans that -are part of a sampled trace from knowing their sampling inclusion -probability. +The _sampling rate_, also known as _inclusion probability_, is the +probability that a Span is included in the Sample that is collected. +Sampling rate is conveyed in a form known as _adjusted count_, which +tells the receiver how many events in the population are represented +by the individual Span as a result of sampling. 
-The _adjusted count_ of a span is defined as follows: +The adjusted count of a span is defined as follows: - Adjusted count equals zero when inclusion probability equals zero -- Adjusted count equals the mathematical inverse (i.e., reciprocal) of sampling inclusion probability when inclusion probability is non-zero. +- Adjusted count equals the mathematical inverse (i.e., reciprocal) of inclusion probability when inclusion probability is non-zero. Consumers of spans carrying an adjusted count attribute are able to use the adjusted count of the span to increment a counter of matching spans. -## Probability Sampling Attributes +#### Probability Sampling Attributes [Proposed text] + +The `sampler.adjusted_count` attribute, when set, MUST equal an +unbiased estimate of the number of representative spans in the +population of spans in the system. + +The _exported count_ associated with a span is defined as either 1 or +0, depending on whether the span is exported and thus counted. The +exported count is 1 if the span is exported (because it will be +counted) and 0 if the span is not exported. + +To avoid recording redundent information, both the `sampler.name` and +`sampler.adjusted_count` attributes MAY be omitted when the counting +algorithm given below produces a correct result. + +There are scenarios where the adjusted count is unknown, such as when +using the `ParentBased` Sampler with a W3C version-0 `traceparent` +context. + +The `sampler.adjusted_count` SHOULD be omitted when its value is 1 or +unknown. The adjusted count can be safely omitted when it is 1 +because that is exactly the number of events associated the span in +that case. In case the adjusted count is unknown, `sampler.name` MUST +be set with a Sampler name to signify an unknown adjusted count. -The `sampler.adjusted_count` attribute MUST reflect an unbiased -estimate of the number of representative spans in the population of -spans being produced. 
+The presence of a `sampler.name` without a `sampler.adjusted_count` +SHOULD be taken as a signal that a span-to-metrics pipeline cannot be +established without external information. Otherwise, `sampler.name` +SHOULD be set when the adjusted count is not equal to 1. -When built-in Samplers are used, the name of the effective Sampler -that computed the adjusted count is included to indicate how the sample -was computed, which may give additional information. +The algorithm for spans-to-metrics is as follows: + +``` +// Calls `span_to_metrics(span, C)` for the effective count `C` of +// every `span` received. +for _, span := <-spans_received { + if count, has := span.attributes['sampler.adjusted_count']; has { + span_to_metrics(span, count) + } else if name, has := span.attributes['sampler.name]; has { + log.Error("span requires trace assembly before counting") + } else { + span_to_metrics(span, 1) + } +} +``` + +To summarize, these two attributes convey information about the +Sampler that recorded a span. | Attribute | Type | Description | Examples | Required | |---------- | ---- | ----------- | -------- | -------- | -| `sampler.adjusted_count` | number | Effective count of the span. | 10 | No | -| `sampler.name` | string | The name of the Sampler. | `Parent` | Yes | +| `sampler.adjusted_count` | number | Effective count of the span. | 10 | Yes, when adjusted count is not equal to 1 | +| `sampler.name` | string | The name of the Sampler. | `Parent` | Yes, when adjusted count is not equal to the exported count | For the built-in samplers, the following names are specified: -| Built-in Sampler | Sets `sampler.adjusted_count`? 
| `sampler.name` | Notes | -| ---------------- | ------------------------------ | -------------- | ------ | -| AlwaysOn | No | Not applicable | Sampling attributes are not used | -| AlwaysOff | No | Not applicable | Spans are not recorded | -| ParentBased | Maybe | `Parent` | Adjusted count requires propagation | -| TraceIDRatio | Yes | `TraceIDRatio` | | +| Built-in Sampler | Sets `sampler.adjusted_count`? | `sampler.name` | Notes | +| ---------------- | ------------------------------ | -------------- | ------------------------- | +| AlwaysOn | Not set | Not set | Adjusted count equals exported count | +| AlwaysOff | Don't care | Don't care | Exported count is zero, spans are not counted | +| ParentBased | Maybe | `Parent` | Adjusted count is known when it is propagated | +| TraceIDRatio | Yes | `TraceIDRatio` | Adjusted count is known | ``` -Note that the `AlwaysOn` and `AlwaysOff` Samplers do not need to -record their names, since they are indistinguishable from not having -a Sampler configured. When there is no `sampler.name` attribute -present and a Span is recorded, it should be counted as one span -(i.e., count == adjusted_count). - -See [OTEP 168 (WIP)](https://github.com/open-telemetry/oteps/pull/168) -for details about how to report sampling probability when using the -`Parent` Sampler. 
- ## Recommended reading [Sampling, 3rd Edition, by Steven From b847d468d5378e6eb1d150e66a4ab3b8770624cd Mon Sep 17 00:00:00 2001 From: Joshua MacDonald Date: Wed, 28 Jul 2021 11:47:37 -0700 Subject: [PATCH 09/23] typos --- text/trace/0170-sampling-probability.md | 5 ++--- 1 file changed, 2 insertions(+), 3 deletions(-) diff --git a/text/trace/0170-sampling-probability.md b/text/trace/0170-sampling-probability.md index 991e08f83..004a64ebc 100644 --- a/text/trace/0170-sampling-probability.md +++ b/text/trace/0170-sampling-probability.md @@ -601,7 +601,7 @@ The following text will be added to the semantic conventions for recording the Sampler name and adjusted count (if known) as OpenTelemetry Span attributes. -### Semantic conventions for Sampled spans [Proposed text] +### Semantic conventions for Sampled spans (Proposed text) This document defines conventions for counting spans in a sample taken over all spans created in all contexts in a distributed system. These @@ -627,7 +627,7 @@ Consumers of spans carrying an adjusted count attribute are able to use the adjusted count of the span to increment a counter of matching spans. 
-#### Probability Sampling Attributes [Proposed text] +#### Probability Sampling Attributes (Proposed text) The `sampler.adjusted_count` attribute, when set, MUST equal an unbiased estimate of the number of representative spans in the @@ -689,7 +689,6 @@ For the built-in samplers, the following names are specified: | AlwaysOff | Don't care | Don't care | Exported count is zero, spans are not counted | | ParentBased | Maybe | `Parent` | Adjusted count is known when it is propagated | | TraceIDRatio | Yes | `TraceIDRatio` | Adjusted count is known | -``` ## Recommended reading From 398649c20fc06b4a0cc931378b1a57c9c3522f61 Mon Sep 17 00:00:00 2001 From: Joshua MacDonald Date: Wed, 28 Jul 2021 11:51:55 -0700 Subject: [PATCH 10/23] grammar --- text/trace/0170-sampling-probability.md | 17 ++++++++--------- 1 file changed, 8 insertions(+), 9 deletions(-) diff --git a/text/trace/0170-sampling-probability.md b/text/trace/0170-sampling-probability.md index 004a64ebc..ed7694bd4 100644 --- a/text/trace/0170-sampling-probability.md +++ b/text/trace/0170-sampling-probability.md @@ -603,17 +603,16 @@ OpenTelemetry Span attributes. ### Semantic conventions for Sampled spans (Proposed text) -This document defines conventions for counting spans in a sample taken -over all spans created in all contexts in a distributed system. These -conventions support accurate counting of system-wide events using only -the fraction of spans that were collected in a probability sampling -scheme. With these conventions, consumers of OTLP Span data are able -to compute approximate metrics about the system using only the sample -Spans that was collected, thus we refer to these conventions as -supporting Span-to-Metrics pipelines. +This document defines conventions for counting system-wide span events +using sampled spans. These conventions support accurate counting of +system-wide events using only the fraction of spans that were +collected in a probability sampling scheme. 
With these conventions, +consumers of OTLP Span data are able to compute approximate metrics +about the system using only the sample data that was collected, thus +we refer to these conventions as supporting Span-to-Metrics pipelines. The _sampling rate_, also known as _inclusion probability_, is the -probability that a Span is included in the Sample that is collected. +probability that a Span is included in the Sample being collected. Sampling rate is conveyed in a form known as _adjusted count_, which tells the receiver how many events in the population are represented by the individual Span as a result of sampling. From 00e64ef7d1d2a9e276974d658dee92dd4ebf4eb5 Mon Sep 17 00:00:00 2001 From: Joshua MacDonald Date: Wed, 28 Jul 2021 11:53:42 -0700 Subject: [PATCH 11/23] grammar --- text/trace/0170-sampling-probability.md | 2 +- 1 file changed, 1 insertion(+), 1 deletion(-) diff --git a/text/trace/0170-sampling-probability.md b/text/trace/0170-sampling-probability.md index ed7694bd4..217eca4c2 100644 --- a/text/trace/0170-sampling-probability.md +++ b/text/trace/0170-sampling-probability.md @@ -684,7 +684,7 @@ For the built-in samplers, the following names are specified: | Built-in Sampler | Sets `sampler.adjusted_count`? 
| `sampler.name` | Notes | | ---------------- | ------------------------------ | -------------- | ------------------------- | -| AlwaysOn | Not set | Not set | Adjusted count equals exported count | +| AlwaysOn | No | Not set | Adjusted count equals exported count | | AlwaysOff | Don't care | Don't care | Exported count is zero, spans are not counted | | ParentBased | Maybe | `Parent` | Adjusted count is known when it is propagated | | TraceIDRatio | Yes | `TraceIDRatio` | Adjusted count is known | From ab95faa7f5e77d6d0c7186a4c944e2d5a9d5129f Mon Sep 17 00:00:00 2001 From: Joshua MacDonald Date: Wed, 28 Jul 2021 12:20:21 -0700 Subject: [PATCH 12/23] two paragraphs --- text/trace/0170-sampling-probability.md | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/text/trace/0170-sampling-probability.md b/text/trace/0170-sampling-probability.md index 217eca4c2..32f181d3e 100644 --- a/text/trace/0170-sampling-probability.md +++ b/text/trace/0170-sampling-probability.md @@ -622,9 +622,15 @@ The adjusted count of a span is defined as follows: - Adjusted count equals zero when inclusion probability equals zero - Adjusted count equals the mathematical inverse (i.e., reciprocal) of inclusion probability when inclusion probability is non-zero. +The zero value for adjusted count can be used when recording a Span +that was not selected by the Sampler, as a means of conveying +exceptional events while maintaining accurate accounting. + Consumers of spans carrying an adjusted count attribute are able to use the adjusted count of the span to increment a counter of matching -spans. +spans. This probabilistic counting method is will be accurate as long +as the Sampler produces unbiased adjusted counts that are expected to +equal true population counts. 
#### Probability Sampling Attributes (Proposed text) From d38c7192429c700d6a343338a108ccf349523a75 Mon Sep 17 00:00:00 2001 From: Joshua MacDonald Date: Mon, 9 Aug 2021 15:26:14 -0700 Subject: [PATCH 13/23] Summarize from the prototype --- text/trace/0170-sampling-probability.md | 25 +++++++++++++++++-------- 1 file changed, 17 insertions(+), 8 deletions(-) diff --git a/text/trace/0170-sampling-probability.md b/text/trace/0170-sampling-probability.md index 32f181d3e..6eede4668 100644 --- a/text/trace/0170-sampling-probability.md +++ b/text/trace/0170-sampling-probability.md @@ -686,14 +686,23 @@ Sampler that recorded a span. | `sampler.adjusted_count` | number | Effective count of the span. | 10 | Yes, when adjusted count is not equal to 1 | | `sampler.name` | string | The name of the Sampler. | `Parent` | Yes, when adjusted count is not equal to the exported count | -For the built-in samplers, the following names are specified: - -| Built-in Sampler | Sets `sampler.adjusted_count`? | `sampler.name` | Notes | -| ---------------- | ------------------------------ | -------------- | ------------------------- | -| AlwaysOn | No | Not set | Adjusted count equals exported count | -| AlwaysOff | Don't care | Don't care | Exported count is zero, spans are not counted | -| ParentBased | Maybe | `Parent` | Adjusted count is known when it is propagated | -| TraceIDRatio | Yes | `TraceIDRatio` | Adjusted count is known | +For the built-in samplers, the specified behavior for setting +`sampler.adjusted_count` and `sampler.name` is as follows. + +| Built-in Sampler | Sets `sampler.adjusted_count`? | `sampler.name` | Notes | +| ---------------- | --------------------------- | -------------- | ------------------------- | +| AlwaysOn | No | Not set | Adjusted count equals exported count | +| AlwaysOff | Don't care | Don't care | Exported count is zero, spans are never counted | +| ParentBased | Yes | Not set | In case the adjusted count is known. 
| +| ParentBased | No | `Parent` | In case the adjusted count is unknown. | +| TraceIDRatio | Yes | Not set | In case the adjusted count is known. | +| TraceIDRatio | No | `TraceIDRatio` | In case of unspecified behavior. | + +When this proposal is adopted across a system using built-in samplers, +probability sampling can be applied and spans can be unambiguously +counted by the receiver. In the case where a Sampler name is set +because the adjusted count is unknown, the reciever will have to +assemble the trace in order to count it properly. ## Recommended reading From 4ab3df639c2e3beda93f46539c712e063ae69a05 Mon Sep 17 00:00:00 2001 From: Joshua MacDonald Date: Tue, 10 Aug 2021 00:15:56 -0700 Subject: [PATCH 14/23] Remove exported count from proposed spec language --- text/trace/0170-sampling-probability.md | 15 +++++---------- 1 file changed, 5 insertions(+), 10 deletions(-) diff --git a/text/trace/0170-sampling-probability.md b/text/trace/0170-sampling-probability.md index 6eede4668..7d249d32c 100644 --- a/text/trace/0170-sampling-probability.md +++ b/text/trace/0170-sampling-probability.md @@ -608,7 +608,7 @@ using sampled spans. These conventions support accurate counting of system-wide events using only the fraction of spans that were collected in a probability sampling scheme. With these conventions, consumers of OTLP Span data are able to compute approximate metrics -about the system using only the sample data that was collected, thus +about the system using only the sample span data that was collected, thus we refer to these conventions as supporting Span-to-Metrics pipelines. The _sampling rate_, also known as _inclusion probability_, is the @@ -628,7 +628,7 @@ exceptional events while maintaining accurate accounting. Consumers of spans carrying an adjusted count attribute are able to use the adjusted count of the span to increment a counter of matching -spans. This probabilistic counting method is will be accurate as long +spans. 
This probabilistic counting method will be accurate as long as the Sampler produces unbiased adjusted counts that are expected to equal true population counts. @@ -638,18 +638,13 @@ The `sampler.adjusted_count` attribute, when set, MUST equal an unbiased estimate of the number of representative spans in the population of spans in the system. -The _exported count_ associated with a span is defined as either 1 or -0, depending on whether the span is exported and thus counted. The -exported count is 1 if the span is exported (because it will be -counted) and 0 if the span is not exported. - To avoid recording redundent information, both the `sampler.name` and `sampler.adjusted_count` attributes MAY be omitted when the counting algorithm given below produces a correct result. There are scenarios where the adjusted count is unknown, such as when -using the `ParentBased` Sampler with a W3C version-0 `traceparent` -context. +using the `ParentBased` Sampler without the `tracestate` specified in +this proposal. The `sampler.adjusted_count` SHOULD be omitted when its value is 1 or unknown. The adjusted count can be safely omitted when it is 1 @@ -684,7 +679,7 @@ Sampler that recorded a span. | Attribute | Type | Description | Examples | Required | |---------- | ---- | ----------- | -------- | -------- | | `sampler.adjusted_count` | number | Effective count of the span. | 10 | Yes, when adjusted count is not equal to 1 | -| `sampler.name` | string | The name of the Sampler. | `Parent` | Yes, when adjusted count is not equal to the exported count | +| `sampler.name` | string | The name of the Sampler. | `Parent` | Yes, when the true adjusted count is unknown | For the built-in samplers, the specified behavior for setting `sampler.adjusted_count` and `sampler.name` is as follows. 
From 43c661fc76f7444711a848dda6c635492736631a Mon Sep 17 00:00:00 2001 From: Joshua MacDonald Date: Tue, 10 Aug 2021 12:27:41 -0700 Subject: [PATCH 15/23] statement about not dropping sampler attributes --- text/trace/0170-sampling-probability.md | 3 +++ 1 file changed, 3 insertions(+) diff --git a/text/trace/0170-sampling-probability.md b/text/trace/0170-sampling-probability.md index 7d249d32c..57abe4569 100644 --- a/text/trace/0170-sampling-probability.md +++ b/text/trace/0170-sampling-probability.md @@ -657,6 +657,9 @@ SHOULD be taken as a signal that a span-to-metrics pipeline cannot be established without external information. Otherwise, `sampler.name` SHOULD be set when the adjusted count is not equal to 1. +Implementations SHOULD avoid dropping attributes that begin with the +`sampler.` prefix when [limiting the number of span attributes](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/sdk.md#span-limits). + The algorithm for spans-to-metrics is as follows: ``` From 7559fda96b0752c0994030b802dd151e5a292aea Mon Sep 17 00:00:00 2001 From: Joshua MacDonald Date: Fri, 20 Aug 2021 22:54:01 -0700 Subject: [PATCH 16/23] from Thursday's SIG, limit proposal to head sampling probability --- text/trace/0170-sampling-probability.md | 229 +++++++++--------------- 1 file changed, 81 insertions(+), 148 deletions(-) diff --git a/text/trace/0170-sampling-probability.md b/text/trace/0170-sampling-probability.md index 57abe4569..c5add7d37 100644 --- a/text/trace/0170-sampling-probability.md +++ b/text/trace/0170-sampling-probability.md @@ -42,18 +42,19 @@ Objective: Specify a foundation for sampling techniques in OpenTelemetry. Probability sampling allows consumers of sampled telemetry data to collect a fraction of telemetry events and use them to estimate total quantities about the population of events, such as the total rate of -events with a particular attribute. 
Sampling is a general-purpose -facility for lowering cost at the expense of lower data quality. +events with a particular attribute. These techniques enable reducing the cost of telemetry collection, both for producers (i.e., SDKs) and for processors (i.e., Collectors), without losing the ability to (at least coarsely) monitor the whole system. -Sampling builds on results from probability theory, most significantly -the concept of expected value. Estimates drawn from probability -samples are *random variables* that, when correct procedures are -followed, accurately reflect their true value, making them unbiased. +Sampling builds on results from probability theory. Estimates drawn +from probability samples are *random variables* that are expected to +equal their true value. When all outcomes are equally likely, meaning +all the potential combinations of items used to compute a sample of +the sampling logic are equally likely, we say the sample is _unbiased_. + Unbiased samples can be used for after-the-fact analysis. We can answer questions such as "what fraction of events had property X?" using the fraction of events in the sample that have property X. @@ -68,25 +69,29 @@ the population represented by an individual sample event. ## Examples -These examples use the proposed attribute `sampler.adjusted_count` to +These examples use an attribute named `sampler.adjusted_count` to convey sampling probability. Consumers of spans, metrics, and logs annotated with adjusted counts are able to calculate accurate -statistics about the whole population of events, at a basic level, -without knowing details about the sampling configuration. +statistics about the whole population of events, without knowing +details about the sampling configuration. 
+ +The hypothetical `sampler.adjusted_count` attribute is used throughout +these examples to demonstrate this concept, although the proposal +below for OpenTelemetry `Span` messages introduces a dedicated field +with specific interpretation for conveying head sampling probability. ### Span sampling -Example use-cases for probability sampling of spans -generally involve generating metrics from spans. +Example use-cases for probability sampling of spans generally involve +generating metrics from spans. #### Sample spans to Counter Metric For every complete span it receives, the example processor will synthesize -metric data as though a Counter instrument named `S.count` for span -named `S` had been incremented once per span at the original `Start()` -call site. +metric data as though a Counter named `S.count` corresponding to a span +named `S` had been incremented once per original span. -This processor will add the adjusted count of each span to the +This processor will add the span's adjusted count to the instrument (e.g., `Add(adjusted_count, labels...)`) for every span it receives, logically taking place at the start or end time of the span. @@ -94,8 +99,7 @@ receives, logically taking place at the start or end time of the span. For every span it receives, the example processor will synthesize metric data as though a Histogram instrument named `S.duration` for -span named `S` had been observed once per span at the original `End()` -call site. +span named `S` had been observed once per original span. The OpenTelemetry Metric data model does not support histogram buckets with non-integer counts, which forces the use of integer adjusted @@ -152,24 +156,6 @@ adjusted count of 10. Assuming the sample was selected using an unbiased algorithm, we can interpret this event as having an expected count of `100/0.1 = 1000`. -#### Metric exemplars with adjusted counts - -The OTLP protocol for metrics includes a repeated exemplars field in -every data point. 
This is a place where Metric aggregators (e.g., -histograms) are able to provide example context to correlate metrics -with traces. - -OTLP exemplars support additional attributes, those that were present -on the API event and were dropped during aggregation. Exemplars that -are selected probabilistically and recorded with their adjusted counts -make it possible to approximately count events using dimensions that -were dropped during metric aggregation. - -An end-to-end pipeline of sampled metrics events can be constructed -based on exemplars with adjusted counts, one capable of supporting -approximate-count queries over sampled metric events at high -cardinality. - #### Metric cardinality limiter A metrics processor can be configured to limit cardinality for a @@ -184,7 +170,7 @@ monotonic (see Considering data points received during the interval, when the number of points exceeds K, select a probability proportional to size sample -of points, output every point with a `sampler.adjusted_count` attribute. +of points, output every point with an adjusted count attribute. ## Explanation @@ -419,7 +405,7 @@ the population. Take a simple probability sample of root spans: 2. Make a pseudo-random selection with probability `P`, if true return `RECORD_AND_SAMPLE` (so that the W3C Trace Context `is-sampled` flag is set in all child contexts) -3. Encode a span attribute `sampler.adjusted_count` equal to `1/P` on the root span +3. Encode a span adjusted count attribute equal to `1/P` on the root span 4. Collect all spans where the W3C Trace Context `is-sampled` flag is set. After collecting all sampled spans, locate the root span for each. @@ -488,9 +474,8 @@ the context in effect when starting child spans. This is expanded upon in [OTEP 168 (WIP)](https://github.com/open-telemetry/oteps/pull/168). 
When propagating head sampling probability, spans recorded by the -`Parent` sampler MAY encode the adjusted count in the corresponding -`SpanData` using a non-descriptive Span attribute named -`sampler.adjusted_count`. +`Parent` sampler could encode the adjusted count in the corresponding +`SpanData` using a Span attribute named `sampler.adjusted_count`. ##### `TraceIDRatio` Sampler @@ -521,9 +506,9 @@ span](https://github.com/open-telemetry/opentelemetry-specification/issues/355). Lacking the number of expected children, we require a way to know the minimum Sampler probability across traces to ensure they are complete. -To count TraceIDRatio-sampled spans, each span MAY encode its adjusted -count in the corresponding `SpanData` using a non-descriptive Span -attribute named `sampler.adjusted_count`. +To count TraceIDRatio-sampled spans, each span could encode its +adjusted count in the corresponding `SpanData` using a Span attribute +named `sampler.adjusted_count`. ##### Dapper's "Inflationary" Sampler @@ -595,112 +580,60 @@ decision is true or false, propagate `I` as the new head inclusion probability. If the decision is true, begin recording a sub-rooted trace with adjusted count `1/I`. -## Proposed specification text - -The following text will be added to the semantic conventions for -recording the Sampler name and adjusted count (if known) as -OpenTelemetry Span attributes. - -### Semantic conventions for Sampled spans (Proposed text) - -This document defines conventions for counting system-wide span events -using sampled spans. These conventions support accurate counting of -system-wide events using only the fraction of spans that were -collected in a probability sampling scheme. With these conventions, -consumers of OTLP Span data are able to compute approximate metrics -about the system using only the sample span data that was collected, thus -we refer to these conventions as supporting Span-to-Metrics pipelines. 
- -The _sampling rate_, also known as _inclusion probability_, is the -probability that a Span is included in the Sample being collected. -Sampling rate is conveyed in a form known as _adjusted count_, which -tells the receiver how many events in the population are represented -by the individual Span as a result of sampling. - -The adjusted count of a span is defined as follows: - -- Adjusted count equals zero when inclusion probability equals zero -- Adjusted count equals the mathematical inverse (i.e., reciprocal) of inclusion probability when inclusion probability is non-zero. - -The zero value for adjusted count can be used when recording a Span -that was not selected by the Sampler, as a means of conveying -exceptional events while maintaining accurate accounting. - -Consumers of spans carrying an adjusted count attribute are able to -use the adjusted count of the span to increment a counter of matching -spans. This probabilistic counting method will be accurate as long -as the Sampler produces unbiased adjusted counts that are expected to -equal true population counts. - -#### Probability Sampling Attributes (Proposed text) - -The `sampler.adjusted_count` attribute, when set, MUST equal an -unbiased estimate of the number of representative spans in the -population of spans in the system. - -To avoid recording redundent information, both the `sampler.name` and -`sampler.adjusted_count` attributes MAY be omitted when the counting -algorithm given below produces a correct result. - -There are scenarios where the adjusted count is unknown, such as when -using the `ParentBased` Sampler without the `tracestate` specified in -this proposal. - -The `sampler.adjusted_count` SHOULD be omitted when its value is 1 or -unknown. The adjusted count can be safely omitted when it is 1 -because that is exactly the number of events associated the span in -that case. 
In case the adjusted count is unknown, `sampler.name` MUST -be set with a Sampler name to signify an unknown adjusted count. - -The presence of a `sampler.name` without a `sampler.adjusted_count` -SHOULD be taken as a signal that a span-to-metrics pipeline cannot be -established without external information. Otherwise, `sampler.name` -SHOULD be set when the adjusted count is not equal to 1. - -Implementations SHOULD avoid dropping attributes that begin with the -`sampler.` prefix when [limiting the number of span attributes](https://github.com/open-telemetry/opentelemetry-specification/blob/main/specification/trace/sdk.md#span-limits). - -The algorithm for spans-to-metrics is as follows: - -``` -// Calls `span_to_metrics(span, C)` for the effective count `C` of -// every `span` received. -for _, span := <-spans_received { - if count, has := span.attributes['sampler.adjusted_count']; has { - span_to_metrics(span, count) - } else if name, has := span.attributes['sampler.name]; has { - log.Error("span requires trace assembly before counting") - } else { - span_to_metrics(span, 1) - } -} -``` +## Proposed `Span` protocol + +Earlier drafts of this document had proposed the use of Span +attributes to convey a the combined effects of head- and tail-sampling +in the form of an (optional) adjusted count and (optional) sampler +name. The group did not reach agreement on whether and/or how to +convey tail sampling. + +Following the proposal for propagating consistent head trace sampling +probability developed in [OTEP +168](https://github.com/open-telemetry/oteps/pull/168), this proposal +is limited to adding a field to encode the head sampling probability. +The OTEP 168 proposal for propagation limits head sampling +probabilities to powers of two, hence we are able to encode the +corresponding adjusted count using a small non-negative integer. 
+ +Interoperability with existing Propagators and Span data means +recognizing Spans with unknown adjusted count when the new field is +unset. Thus, the 0 value shall mean unknown adjusted count. + +The OTEP 168 proposal for propagating head sampling probability uses 6 +bits of information, with 63 ordinary values and one zero value. +Here, we propose a biased encoding for head sampling probability equal +to 1 plus the `P` value as proposed in OTEP 168. The proposed span +field, a biased base-2 logarithm of the adjusted count, is named +simply `log_adjusted_count` and requires 7 bits of information: + +| Value | Head Adjusted Count | +| ----- | ---------------- | +| 0 | _Unknown_ | +| 1 | 1 | +| 2 | 2 | +| 3 | 4 | +| 4 | 8 | +| 5 | 16 | +| 6 | 32 | +| ... | ... | +| X | 2^(X-1) | +| ... | ... | +| 63 | 2^62 | +| 64 | 0 | + +Combined with the proposal for propagating head sampling probability +in OTEP 168, the result is that Sampling can be enabled in an +up-to-date system and all Spans, roots and children alike, will have a +non-zero values in the `log_adjusted_count` field. Consumers of a +stream of Span data with non-zero values in the `log_adjusted_count` +field can approximately and accurately count Spans using adjusted +counts. -To summarize, these two attributes convey information about the -Sampler that recorded a span. - -| Attribute | Type | Description | Examples | Required | -|---------- | ---- | ----------- | -------- | -------- | -| `sampler.adjusted_count` | number | Effective count of the span. | 10 | Yes, when adjusted count is not equal to 1 | -| `sampler.name` | string | The name of the Sampler. | `Parent` | Yes, when the true adjusted count is unknown | - -For the built-in samplers, the specified behavior for setting -`sampler.adjusted_count` and `sampler.name` is as follows. - -| Built-in Sampler | Sets `sampler.adjusted_count`? 
| `sampler.name` | Notes | -| ---------------- | --------------------------- | -------------- | ------------------------- | -| AlwaysOn | No | Not set | Adjusted count equals exported count | -| AlwaysOff | Don't care | Don't care | Exported count is zero, spans are never counted | -| ParentBased | Yes | Not set | In case the adjusted count is known. | -| ParentBased | No | `Parent` | In case the adjusted count is unknown. | -| TraceIDRatio | Yes | Not set | In case the adjusted count is known. | -| TraceIDRatio | No | `TraceIDRatio` | In case of unspecified behavior. | - -When this proposal is adopted across a system using built-in samplers, -probability sampling can be applied and spans can be unambiguously -counted by the receiver. In the case where a Sampler name is set -because the adjusted count is unknown, the reciever will have to -assemble the trace in order to count it properly. +Non-probabilistic Samplers such as the [Leaky-bucket rate-limited +sampler](https://github.com/open-telemetry/opentelemetry-specification/issues/1769) +SHOULD set the `log_adjusted_count` field to zero to indicate an +unknown adjusted count. ## Recommended reading From 84acc9448af0b8913f56a5cf84389ba0fd40e1d3 Mon Sep 17 00:00:00 2001 From: Joshua MacDonald Date: Fri, 20 Aug 2021 23:04:36 -0700 Subject: [PATCH 17/23] log_head_adjusted_count --- text/trace/0170-sampling-probability.md | 31 +++++++++++++++++++++---- 1 file changed, 27 insertions(+), 4 deletions(-) diff --git a/text/trace/0170-sampling-probability.md b/text/trace/0170-sampling-probability.md index c5add7d37..0a6a84e16 100644 --- a/text/trace/0170-sampling-probability.md +++ b/text/trace/0170-sampling-probability.md @@ -605,7 +605,7 @@ bits of information, with 63 ordinary values and one zero value. Here, we propose a biased encoding for head sampling probability equal to 1 plus the `P` value as proposed in OTEP 168. 
The proposed span field, a biased base-2 logarithm of the adjusted count, is named -simply `log_adjusted_count` and requires 7 bits of information: +simply `log_head_adjusted_count` and requires 7 bits of information: | Value | Head Adjusted Count | | ----- | ---------------- | @@ -625,16 +625,39 @@ simply `log_adjusted_count` and requires 7 bits of information: Combined with the proposal for propagating head sampling probability in OTEP 168, the result is that Sampling can be enabled in an up-to-date system and all Spans, roots and children alike, will have a -non-zero values in the `log_adjusted_count` field. Consumers of a -stream of Span data with non-zero values in the `log_adjusted_count` +non-zero values in the `log_head_adjusted_count` field. Consumers of a +stream of Span data with non-zero values in the `log_head_adjusted_count` field can approximately and accurately count Spans using adjusted counts. Non-probabilistic Samplers such as the [Leaky-bucket rate-limited sampler](https://github.com/open-telemetry/opentelemetry-specification/issues/1769) -SHOULD set the `log_adjusted_count` field to zero to indicate an +SHOULD set the `log_head_adjusted_count` field to zero to indicate an unknown adjusted count. +### Proposed `Span` field documentation + +The following text will be added to the `Span` message in +`opentelemetry/proto/trace/v1/trace.proto`: + +``` + // Log-head-adjusted count is the logarithm of adjusted count for + // this span as calculated at the head, offset by +1, with the + // following recognized values. + // + // 0: The zero value represents an UNKNOWN adjusted count. + // Consumers of these Spans cannot cannot compute span metrics. + // + // 1: An adjusted count of 1. + // + // 2-63: Values 2 through 63 represent an adjusted count of 2^(Value-1) + // + // 64: Value 64 represents an adjusted count of zero. + // + // Values greater than 64 are unrecognized. 
+ uint32 log_head_adjusted_count = ; +``` + ## Recommended reading [Sampling, 3rd Edition, by Steven From fb8563cb1ef4f379fd330f9d9779994e589e4c74 Mon Sep 17 00:00:00 2001 From: Joshua MacDonald Date: Mon, 23 Aug 2021 13:13:16 -0700 Subject: [PATCH 18/23] Use 6 bits --- text/trace/0170-sampling-probability.md | 13 ++++++++----- 1 file changed, 8 insertions(+), 5 deletions(-) diff --git a/text/trace/0170-sampling-probability.md b/text/trace/0170-sampling-probability.md index 0a6a84e16..711a20197 100644 --- a/text/trace/0170-sampling-probability.md +++ b/text/trace/0170-sampling-probability.md @@ -600,12 +600,15 @@ Interoperability with existing Propagators and Span data means recognizing Spans with unknown adjusted count when the new field is unset. Thus, the 0 value shall mean unknown adjusted count. -The OTEP 168 proposal for propagating head sampling probability uses 6 -bits of information, with 63 ordinary values and one zero value. +The OTEP 168 proposal for _propagating_ head sampling probability uses +6 bits of information, with 62 ordinary values, one zero value, and a +single unused value. + Here, we propose a biased encoding for head sampling probability equal to 1 plus the `P` value as proposed in OTEP 168. The proposed span field, a biased base-2 logarithm of the adjusted count, is named -simply `log_head_adjusted_count` and requires 7 bits of information: +simply `log_head_adjusted_count` and still requires 6 bits of +information. | Value | Head Adjusted Count | | ----- | ---------------- | @@ -619,8 +622,8 @@ simply `log_head_adjusted_count` and requires 7 bits of information: | ... | ... | | X | 2^(X-1) | | ... | ... 
| -| 63 | 2^62 | -| 64 | 0 | +| 62 | 2^61 | +| 63 | 0 | Combined with the proposal for propagating head sampling probability in OTEP 168, the result is that Sampling can be enabled in an From 02b06d092b968055c02cef008d1cf32b600bd113 Mon Sep 17 00:00:00 2001 From: Joshua MacDonald Date: Wed, 25 Aug 2021 10:57:38 -0700 Subject: [PATCH 19/23] update the proto text --- text/trace/0170-sampling-probability.md | 4 ++-- 1 file changed, 2 insertions(+), 2 deletions(-) diff --git a/text/trace/0170-sampling-probability.md b/text/trace/0170-sampling-probability.md index 711a20197..c39870ba5 100644 --- a/text/trace/0170-sampling-probability.md +++ b/text/trace/0170-sampling-probability.md @@ -653,9 +653,9 @@ The following text will be added to the `Span` message in // // 1: An adjusted count of 1. // - // 2-63: Values 2 through 63 represent an adjusted count of 2^(Value-1) + // 2-62: Values 2 through 62 represent an adjusted count of 2^(Value-1) // - // 64: Value 64 represents an adjusted count of zero. + // 63: Value 63 represents an adjusted count of zero. // // Values greater than 64 are unrecognized. uint32 log_head_adjusted_count = ; From 4e6c69aa90f33d8c69599e779b07edc0a9d809af Mon Sep 17 00:00:00 2001 From: Joshua MacDonald Date: Fri, 27 Aug 2021 14:03:43 -0700 Subject: [PATCH 20/23] add detail on SamplerResult --- text/trace/0170-sampling-probability.md | 22 ++++++++++++++++++++++ 1 file changed, 22 insertions(+) diff --git a/text/trace/0170-sampling-probability.md b/text/trace/0170-sampling-probability.md index c39870ba5..1a328b256 100644 --- a/text/trace/0170-sampling-probability.md +++ b/text/trace/0170-sampling-probability.md @@ -661,6 +661,28 @@ The following text will be added to the `Span` message in uint32 log_head_adjusted_count = ; ``` +### Proposed `Sampler` interface changes + +The Trace SDK specification of the `SamplingResult` will be extended +with a new field to be returned by all Samplers. 
+ +``` +- The sampling probability of the span is encoded as one plus the + inverse of head inclusion probability, known as "adjusted count", + which is the effective count of the Span for use in Span-to-Metrics + pipelines. The value 0 is used to represent unknown adjusted count, + and the value 63 is used to represent known-zero adjusted count. + For values >0 and <63, the adjusted count of the Span is + 2^(value-1), representing power-of-two probabilities between + 1 and 2^-61. + + The corresonding `SamplerResult` field SHOULD be named + `log_head_adjusted_count` to match the Span data model. +``` + +See [OTEP 168](https://github.com/open-telemetry/oteps/pull/168) for +details on how each of the built-in Samplers is expected to behave. + ## Recommended reading [Sampling, 3rd Edition, by Steven From f6259eb8639c49702d4e5e60c9a55a14c984c2b5 Mon Sep 17 00:00:00 2001 From: Joshua MacDonald Date: Thu, 9 Sep 2021 15:38:22 -0700 Subject: [PATCH 21/23] remove metrics examples, add to span-to-metrics examples --- text/trace/0170-sampling-probability.md | 104 ++++++++++-------------- 1 file changed, 44 insertions(+), 60 deletions(-) diff --git a/text/trace/0170-sampling-probability.md b/text/trace/0170-sampling-probability.md index 1a328b256..e908465ca 100644 --- a/text/trace/0170-sampling-probability.md +++ b/text/trace/0170-sampling-probability.md @@ -8,10 +8,6 @@ + [Sample spans to Counter Metric](#sample-spans-to-counter-metric) + [Sample spans to Histogram Metric](#sample-spans-to-histogram-metric) + [Sample span rate limiting](#sample-span-rate-limiting) - * [Metric sampling](#metric-sampling) - + [Statsd Counter](#statsd-counter) - + [Metric exemplars with adjusted counts](#metric-exemplars-with-adjusted-counts) - + [Metric cardinality limiter](#metric-cardinality-limiter) - [Explanation](#explanation) * [Model and terminology](#model-and-terminology) + [Sampling without replacement](#sampling-without-replacement) @@ -87,27 +83,61 @@ generating metrics from spans. 
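
The `log_head_adjusted_count` encoding proposed in this series (0 for unknown, 1 through 62 for an adjusted count of 2^(value-1), 63 for a known-zero adjusted count) could be decoded roughly as in the following Go sketch. The function name `HeadAdjustedCount` and both error values are hypothetical and are not part of any OpenTelemetry API. Treating every value above 63 as unrecognized is an assumption, since the draft only assigns meanings to values 0 through 63.

```go
package main

import (
	"errors"
	"fmt"
)

// Hypothetical sentinel errors, introduced only for this illustration.
var (
	ErrUnknownAdjustedCount = errors.New("unknown adjusted count")
	ErrUnrecognizedValue    = errors.New("unrecognized log_head_adjusted_count value")
)

// HeadAdjustedCount decodes the biased base-2 logarithm carried in the
// proposed log_head_adjusted_count field into a head adjusted count.
//
//	0      -> unknown (ErrUnknownAdjustedCount)
//	1..62  -> 2^(value-1)
//	63     -> 0 (known-zero adjusted count)
//	>63    -> unrecognized (assumption of this sketch)
func HeadAdjustedCount(value uint32) (uint64, error) {
	switch {
	case value == 0:
		return 0, ErrUnknownAdjustedCount
	case value <= 62:
		// The largest result is 2^61, which fits in a uint64.
		return uint64(1) << (value - 1), nil
	case value == 63:
		return 0, nil
	default:
		return 0, ErrUnrecognizedValue
	}
}

func main() {
	for _, v := range []uint32{0, 1, 2, 6, 62, 63, 64} {
		count, err := HeadAdjustedCount(v)
		fmt.Printf("value=%d count=%d err=%v\n", v, count, err)
	}
}
```

A consumer building a span-to-metrics pipeline would add the decoded count to its counter only when the error is nil, and would skip or separately account for spans whose adjusted count is unknown.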
#### Sample spans to Counter Metric -For every complete span it receives, the example processor will synthesize -metric data as though a Counter named `S.count` corresponding to a span -named `S` had been incremented once per original span. +In this example, an OpenTelemetry SDK for tracing is configured with a +`SpanProcessor` that counts sample spans as they are processed based +on their adjusted counts. The SDK could be used to monitor request +rates using Prometheus, for example. -This processor will add the span's adjusted count to the -instrument (e.g., `Add(adjusted_count, labels...)`) for every span it -receives, logically taking place at the start or end time of the span. +For every complete sample span it receives, the example +`SpanProcessor` will synthesize metric data as though a Counter named +`S_count` corresponding to a span named `S` had been incremented once +per original span. Using the adjusted count of sampled spans instead, +the value of `S_count` is expected to equal to equal the true number +of spans. + +This `SpanProcessor` will for every span it receives add the span's +adjusted count to a corresponding metric Counter instrument. For +example using the OpenTelemetry Metrics API directly, + +``` +func (p *spanToMetricsProcessor) OnEnd(span trace.ReadOnlySpan) { + ctx := context.Background() + counter := p.meter.NewInt64Counter(span.Name() + "_count") + counter.Add( + ctx, + span.AdjustedCount(), + span.Attributes()..., + ) +} +``` #### Sample spans to Histogram Metric For every span it receives, the example processor will synthesize -metric data as though a Histogram instrument named `S.duration` for +metric data as though a Histogram instrument named `S_duration` for span named `S` had been observed once per original span. The OpenTelemetry Metric data model does not support histogram buckets with non-integer counts, which forces the use of integer adjusted counts here (i.e., 1-in-N sampling rates where N is an integer). 
-Logically speaking, this processor will observe the span's duration its -adjusted count number of times for every span it receives, at the end -time of the span. +Logically speaking, this processor will observe the span's duration +_adjusted count_ number of times for every sample span it receives. +This example, therefore, uses a hypothetical `RecordMany()` method to +capture multiple observations of a Histogram measurement at once: + +``` + histogram := p.meter.NewFloat64Histogram( + span.Name() + "_duration", + metric.WithUnits("ms"), + ) + histogram.RecordMany( + ctx, + span.Duration().Milliseconds(), + span.AdjustedCount(), + span.Attributes()..., + ) +``` #### Sample span rate limiting @@ -126,52 +156,6 @@ When the interval expires and the sample frame is considered complete, the selected sample spans are output with possibly updated adjusted counts. -### Metric sampling - -Example use-cases for probability sampling of metrics -are aimed at lowering cost and addressing high cardinality. - -#### Statsd Counter - -A Statsd counter event appears as a line of text, describing a -number-valued event with optional attributes and inclusion probability -("sample rate"). - -For example, a metric named `name` is incremented by `increment` using -a counter event (`c`) with the given `sample_rate`. - -``` -name:increment|c|@sample_rate -``` - -For example, a count of 100 that was selected for a 1-in-10 simple -random sampling scheme will arrive as: - -``` -counter:100|c|@0.1 -``` - -Events in the example have with 0.1 inclusion probability have -adjusted count of 10. Assuming the sample was selected using an -unbiased algorithm, we can interpret this event as having an expected -count of `100/0.1 = 1000`. - -#### Metric cardinality limiter - -A metrics processor can be configured to limit cardinality for a -single metric name, allowing no more than K distinct label sets per -export interval. 
The export interval is fixed to a short interval so -that a complete set of distinct labels can be stored temporarily. - -Caveats: as presented, this works for Sum and Histogram points -received with Delta aggregation temporality and where the Sum is -monotonic (see -[opentelemetry-proto/issues/303](https://github.com/open-telemetry/opentelemetry-proto/issues/303)). - -Considering data points received during the interval, when the number -of points exceeds K, select a probability proportional to size sample -of points, output every point with an adjusted count attribute. - ## Explanation Consider a hypothetical telemetry signal in which a stream of From ec3b41d28bd5d223ac194fb477ad8fd8d0a9b747 Mon Sep 17 00:00:00 2001 From: Joshua MacDonald Date: Thu, 9 Sep 2021 15:48:03 -0700 Subject: [PATCH 22/23] whitespace --- Makefile | 1 - text/trace/0170-sampling-probability.md | 54 ++++++++++++------------- 2 files changed, 27 insertions(+), 28 deletions(-) diff --git a/Makefile b/Makefile index 5de766ee5..a3277a8a9 100644 --- a/Makefile +++ b/Makefile @@ -43,4 +43,3 @@ install-markdown-lint: .PHONY: markdown-lint markdown-lint: markdownlint -c .markdownlint.yaml '**/*.md' - diff --git a/text/trace/0170-sampling-probability.md b/text/trace/0170-sampling-probability.md index e908465ca..36fb24587 100644 --- a/text/trace/0170-sampling-probability.md +++ b/text/trace/0170-sampling-probability.md @@ -20,7 +20,7 @@ + [Multiply the adjusted count into the data](#multiply-the-adjusted-count-into-the-data) * [Trace Sampling](#trace-sampling) + [Counting child spans using root span adjusted counts](#counting-child-spans-using-root-span-adjusted-counts) - + [Using head trace probability to count all spans](#using-head-trace-probability-to-count-all-spans) + + [Using head trace probability to count all spans](#using-head-trace-probability-to-count-all-spans) + [Head sampling for traces](#head-sampling-for-traces) - [`Parent` Sampler](#parent-sampler) - [`TraceIDRatio` 
Sampler](#traceidratio-sampler) @@ -101,13 +101,13 @@ example using the OpenTelemetry Metrics API directly, ``` func (p *spanToMetricsProcessor) OnEnd(span trace.ReadOnlySpan) { - ctx := context.Background() - counter := p.meter.NewInt64Counter(span.Name() + "_count") - counter.Add( - ctx, - span.AdjustedCount(), - span.Attributes()..., - ) + ctx := context.Background() + counter := p.meter.NewInt64Counter(span.Name() + "_count") + counter.Add( + ctx, + span.AdjustedCount(), + span.Attributes()..., + ) } ``` @@ -127,16 +127,16 @@ This example, therefore, uses a hypothetical `RecordMany()` method to capture multiple observations of a Histogram measurement at once: ``` - histogram := p.meter.NewFloat64Histogram( - span.Name() + "_duration", - metric.WithUnits("ms"), - ) - histogram.RecordMany( - ctx, - span.Duration().Milliseconds(), - span.AdjustedCount(), - span.Attributes()..., - ) + histogram := p.meter.NewFloat64Histogram( + span.Name() + "_duration", + metric.WithUnits("ms"), + ) + histogram.RecordMany( + ctx, + span.Duration().Milliseconds(), + span.AdjustedCount(), + span.Attributes()..., + ) ``` #### Sample span rate limiting @@ -186,7 +186,7 @@ will learn to apply these techniques for sampling aggregated data. In sampling, the term _sampling design_ refers to how sampling probability is decided and the term _sample frame_ refers to how -events are organized into discrete populations. The design of a +events are organized into discrete populations. The design of a sampling strategy dictates how the population is framed. 
For example, a simple design uses uniform probability, and a simple @@ -373,18 +373,18 @@ each of its children based on the following logic: - The root span is considered representative of `adjusted_count` many identical root spans, because it was selected using unbiased sampling -- Context propagation conveys _causation_, the fact the one span produces +- Context propagation conveys _causation_, the fact the one span produces another - A root span causes each of the child spans in its trace to be produced - A sampled root span represents `adjusted_count` many traces, representing - the cause of `adjusted_count` many occurances per child span in the + the cause of `adjusted_count` many occurances per child span in the sampled trace. Using this reasoning, we can define a sample collected from all root spans in the system, which allows estimating the count of all spans in the population. Take a simple probability sample of root spans: -1. In the `Sampler` decision for root spans, use the initial span properties +1. In the `Sampler` decision for root spans, use the initial span properties to determine the inclusion probability `P` 2. Make a pseudo-random selection with probability `P`, if true return `RECORD_AND_SAMPLE` (so that the W3C Trace Context `is-sampled` @@ -547,7 +547,7 @@ P(span sampled | parent sampled) = 1 P(span sampled | parent not sampled) = D ``` -Using the formula above, +Using the formula above, ``` I = 1*H + D*(1-H) @@ -636,7 +636,7 @@ The following text will be added to the `Span` message in // Consumers of these Spans cannot cannot compute span metrics. // // 1: An adjusted count of 1. - // + // // 2-62: Values 2 through 62 represent an adjusted count of 2^(Value-1) // // 63: Value 63 represents an adjusted count of zero. @@ -657,14 +657,14 @@ with a new field to be returned by all Samplers. pipelines. The value 0 is used to represent unknown adjusted count, and the value 63 is used to represent known-zero adjusted count. 
For values >0 and <63, the adjusted count of the Span is - 2^(value-1), representing power-of-two probabilities between + 2^(value-1), representing power-of-two probabilities between 1 and 2^-61. - + The corresonding `SamplerResult` field SHOULD be named `log_head_adjusted_count` to match the Span data model. ``` -See [OTEP 168](https://github.com/open-telemetry/oteps/pull/168) for +See [OTEP 168](https://github.com/open-telemetry/oteps/pull/168) for details on how each of the built-in Samplers is expected to behave. ## Recommended reading From 1d1020374ab89925557e51aa0faed643ca3297ca Mon Sep 17 00:00:00 2001 From: Joshua MacDonald Date: Thu, 9 Sep 2021 15:50:10 -0700 Subject: [PATCH 23/23] lint --- text/trace/0170-sampling-probability.md | 6 +++--- 1 file changed, 3 insertions(+), 3 deletions(-) diff --git a/text/trace/0170-sampling-probability.md b/text/trace/0170-sampling-probability.md index 36fb24587..c71409b7c 100644 --- a/text/trace/0170-sampling-probability.md +++ b/text/trace/0170-sampling-probability.md @@ -296,7 +296,7 @@ Some possibilities for encoding the adjusted count or inclusion probability are discussed below, depending on the circumstances and the protocol. Here, the focus is on how to count sampled telemetry events in general, not a specific kind of event. As we shall see in -the following section, tracing comes with addional complications. +the following section, tracing comes with additional complications. There are several ways of encoding this adjusted count or inclusion probability: @@ -338,7 +338,7 @@ exponentially greater representivity. #### Multiply the adjusted count into the data When the data itself carries counts, such as for the Metrics Sum and -Histogram points, the adjusted count can be multipled into the data. +Histogram points, the adjusted count can be multiplied into the data. 
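
As a minimal illustration of multiplying the adjusted count into the data, the Go sketch below scales a Sum point by its adjusted count. The `SumPoint` type and `ScaleByAdjustedCount` function are hypothetical stand-ins invented for this example and are not OTLP or SDK types.

```go
package main

import "fmt"

// SumPoint is a hypothetical stand-in for a delta Sum data point.
type SumPoint struct {
	Value float64
}

// ScaleByAdjustedCount multiplies the adjusted count into the point's
// value. The result estimates the population total, but it no longer
// records which sampling probability was applied.
func ScaleByAdjustedCount(p SumPoint, adjustedCount float64) SumPoint {
	return SumPoint{Value: p.Value * adjustedCount}
}

func main() {
	// A sampled count of 100 with inclusion probability 0.1 has an
	// adjusted count of 10.
	sampled := SumPoint{Value: 100}
	scaled := ScaleByAdjustedCount(sampled, 10)
	fmt.Println(scaled.Value)
}
```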
This technique is less desirable because, while it preserves the expected value of the count or sum, the data loses information about @@ -377,7 +377,7 @@ each of its children based on the following logic: another - A root span causes each of the child spans in its trace to be produced - A sampled root span represents `adjusted_count` many traces, representing - the cause of `adjusted_count` many occurances per child span in the + the cause of `adjusted_count` many occurrences per child span in the sampled trace. Using this reasoning, we can define a sample collected from all root