[probabilistic sampling processor] encoded sampling probability (support OTEP 235) #31894
Conversation
Apart from small details, I think this looks good. I would also ask you to think about which telemetry you'd want from this component in case of a bug report, and add metrics for them. Do we need to keep a histogram of Rs and Ts?
logger.Debug(description, zap.Error(err))
var se samplerError
if errors.As(err, &se) {
	logger.Info(description, zap.Error(err))
this is part of the hot path, I'd rather have a metric recording the number of errors that happened, with samplerError being a dimension of the metric (error=sampler, error=other).
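For illustration only, a rough sketch of that suggestion using the OTel Go metrics API; the instrument name, the `samplerError` stub, and the helper function are made up for this sketch and are not taken from the PR:

```go
package sampler

import (
	"context"
	"errors"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/attribute"
	"go.opentelemetry.io/otel/metric"
)

// samplerError stands in for the processor's own error type (illustrative).
type samplerError struct{ msg string }

func (e samplerError) Error() string { return e.msg }

// errorCounter counts hot-path errors, with the error kind as a dimension.
var errorCounter, _ = otel.Meter("probabilisticsampler").Int64Counter(
	"sampler_errors",
	metric.WithDescription("errors observed on the sampling hot path"),
)

// recordSamplerError increments the counter instead of logging per item.
func recordSamplerError(ctx context.Context, err error) {
	kind := "other"
	var se samplerError
	if errors.As(err, &se) {
		kind = "sampler"
	}
	errorCounter.Add(ctx, 1, metric.WithAttributes(attribute.String("error", kind)))
}
```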
Let me suggest something different: let them all be (debug) logs for this PR, but then convert the metrics for this component as a whole from OC to OTel, using mdatagen. On that PR, you could add the new metric for the error conditions.
I like the idea of making the logs into Debug() and Info()-level. I also do not expect any of the Info()-level log statements to be common, for they require a misbehaving SDK or some other form of data corruption, as opposed to misconfigured sampling. Because these messages signal corruption of some kind, I think they should be logs. My expectation is that logs are sampled at runtime (or they should be) to avoid impacting the hot path.
With that said, I feel that a new metric is not helpful -- it just means the customer has to monitor something new when we have metric signals already. There is a single metric output by this code specifically, which will have a "policy=missing_randomness" attribute when these errors arise.
We also have (or at least desire) standard pipeline metrics, which ought to provide a standard way to count how many spans succeed or fail. If the sampler is configured with FailClosed=true and these missing_randomness conditions are happening, the result will be loss of spans. I do not want the user to have to discover a new metric for this, because there ought to be a standard metric for rejected items. All processors should be able to count the number of items that are rejected for malformed data.
if errors.Is(err, sampling.ErrInconsistentSampling) {
	// This is working-as-intended. You can't lower
	// the threshold, it's illogical.
	logger.Debug(description, zap.Error(err))
same comment here about the metric vs. log
To summarize: I think the existing metric's "policy=missing_randomness" is a useful-enough signal. Personally, I want standard pipeline metrics so that every component doesn't have to invent a bespoke metric definition.
	}
}
if err := carrier.reserialize(); err != nil {
	logger.Info(description, zap.Error(err))
and same here :-)
This is some sort of grievous corruption, and I personally do not want every component inventing new metrics to monitor for things we never expect to happen.
(partial)
@@ -18,6 +18,7 @@ require (
	go.opentelemetry.io/otel/metric v1.27.0
	go.opentelemetry.io/otel/trace v1.27.0
	go.uber.org/goleak v1.3.0
	go.uber.org/multierr v1.11.0
Thanks, now using errors.Join (95ecbae).
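A quick illustration of the standard library's errors.Join (Go 1.20+) referenced in the reply above; the error values here are made up:

```go
package main

import (
	"errors"
	"fmt"
)

// errors.Join aggregates multiple errors without an extra dependency
// such as go.uber.org/multierr.
func main() {
	errA := errors.New("missing randomness")
	errB := errors.New("tracestate reserialize failed")

	joined := errors.Join(errA, errB)
	fmt.Println(joined)                  // prints both messages, one per line
	fmt.Println(errors.Is(joined, errA)) // true: joined errors still unwrap
}
```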
@jmacd, please let me know once this is ready for another round.
Co-authored-by: Juraci Paixão Kröhling <juraci.github@kroehling.de>
My personal opinion is that we should not invent new metrics to monitor for bugs; that is what logs are good at, and if we think logs do not perform well enough, we should reconsider -- metrics are not clearly better for performance, compared with sampled logs. Moreover, I want us to encourage use of standard pipeline metrics. Practically all real processors are going to encounter errors that would cause data to be dropped or malformed in some way, and we shouldn't need new metrics for every one of them.

For your question about histograms of R and T: there is something I would recommend, but not a histogram of T or R values (both of these could be high cardinality), and probably not at default-verbosity level. What we do care about, and I'm open to suggestions, is that the sum of adjusted counts after sampling is expected to equal the sum of adjusted counts before sampling. This is a probabilistic argument, so the match is not exact. We should have a metric that counts items by their adjusted count. I would argue that such a metric should be standardized and comparable with standard pipeline metrics.

Another direction to take this question is that the span-to-metrics connector should be able to use the adjusted counts so that it counts the number of representative spans, not the number of actual spans. This sounds more useful to me than a pipeline metric of adjusted count, and anyway I do not prefer to use metrics as a debugging signal. If there are bugs, I would recommend users connect a debugging exporter and review the output data.
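As a rough illustration of the adjusted-count argument (not code from this PR): with p = 25% sampling, each surviving item carries an adjusted count of 1/p = 4, so the post-sampling sum approximates the original item count.

```go
package main

import (
	"fmt"
	"math/rand"
)

// Simulate p=25% sampling and compare the sum of adjusted counts
// (items kept, each weighted 1/p) against the original item count.
func main() {
	const p = 0.25
	const total = 100000

	var sampled int
	for i := 0; i < total; i++ {
		if rand.Float64() < p {
			sampled++
		}
	}
	adjustedSum := float64(sampled) * (1 / p)
	fmt.Printf("items in: %d, sum of adjusted counts out: %.0f\n", total, adjustedSum)
}
```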
These test failures are independent, #33520.
// Convert the accept threshold to a reject threshold,
// then shift it into 56-bit value.
reject := numHashBuckets - scaledSamplerate
reject56 := uint64(reject) << 42

threshold, _ := sampling.UnsignedToThreshold(reject56)

return &hashingSampler{
	tvalueThreshold: threshold,
	hashSeed:        cfg.HashSeed,

	// Logs specific:
	logsTraceIDEnabled:            cfg.AttributeSource == traceIDAttributeSource,
	logsRandomnessSourceAttribute: cfg.FromAttribute,
}
Sampler SIG, see here. Reference: open-telemetry/semantic-conventions#793 (comment)
### Equalizing

This mode uses the same randomness mechanism as the propotional
Suggested change:
-This mode uses the same randomness mechanism as the propotional
+This mode uses the same randomness mechanism as the proportional
@@ -105,16 +106,16 @@ func Test_tracesamplerprocessor_SamplingPercentageRange(t *testing.T) {
	},
	numBatches:        1e5,
	numTracesPerBatch: 2,
	acceptableDelta:   0.01,
If these adjustments are to get a faster but less-flaky test, what do you think of the idea of adding a loop? You can try a few times to get a result in the acceptable range; it will finish as soon as it sees an acceptable result.
This is the idea behind the "Eventually" function in the stretchr/testify library (we don't use it here but the concept is sound).
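A minimal sketch of that retry loop; the helper and constant names (runOnce, acceptableDelta) are made up here and are not the test's actual code:

```go
package sampletest

import (
	"math/rand"
	"testing"
)

// Rerun a randomized trial a few times and pass as soon as one result
// lands in the acceptable range, similar in spirit to testify's Eventually.
func TestSamplingPercentageRetry(t *testing.T) {
	const (
		acceptableDelta = 0.01
		maxAttempts     = 5
	)
	// runOnce stands in for one randomized sampling trial; it returns the
	// observed difference between measured and expected sampling rate.
	runOnce := func() float64 { return rand.Float64() * 0.02 }

	var delta float64
	for attempt := 0; attempt < maxAttempts; attempt++ {
		delta = runOnce()
		if delta <= acceptableDelta {
			return // acceptable result seen, finish early
		}
	}
	t.Errorf("observed delta %f exceeded %f after %d attempts", delta, acceptableDelta, maxAttempts)
}
```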
- `sampling_percentage` (32-bit floating point, required): Percentage at which items are sampled; >= 100 samples all items, 0 rejects all items.
- `hash_seed` (32-bit unsigned integer, optional, default = 0): An integer used to compute the hash algorithm. Note that all collectors for a given tier (e.g. behind the same load balancer) should have the same hash_seed.
- `fail_closed` (boolean, optional, default = true): Whether to reject items with sampling-related errors.
- `sampling_precision` (integer, optional, default = 4): Determines the number of hexadecimal digits used to encode the sampling threshold.
maybe include the range of valid values here (1-14)?
@@ -45,6 +75,14 @@ type Config struct {
	// despite errors using priority.
	FailClosed bool `mapstructure:"fail_closed"`

	// SamplingPrecision is how many hex digits of sampling
	// threshold will be encoded, from 1 up to 14. Default is 4.
	// 0 is treated as full precision.
Not according to invalid_zero.yaml.
@@ -230,46 +376,82 @@ func consistencyCheck(rnd randomnessNamer, _ samplingCarrier) error {
//
// Extending this logic, we round very small probabilities up to the
// minimum supported value(s) which varies according to sampler mode.
-func makeSampler(cfg *Config) dataSampler {
+func makeSampler(cfg *Config, isLogs bool) dataSampler {
please add an explanation of 'isLogs' to the comment
My comments are mainly cosmetic; in general, we need this; approving to start to push the train out of the station.
Now that older and newer collectors can co-exist and come to the same decisions, I think this is ready to go. The logs vs. metrics matter can be addressed later, and changed if we think it's easier to respond to possible bug reports.
Description: Creates new sampler modes named "equalizing" and "proportional". Preserves existing functionality under the mode named "hash_seed".
Fixes #31918
This is the final step in a sequence; the whole of this work was factored into 3+ PRs, including the new `pkg/sampling` package and the previous step, #31946. The two new sampler modes enable mixing OTel sampling SDKs with Collectors in a consistent way. The existing hash_seed mode is also a consistent sampling mode, which makes it possible to have a 1:1 mapping between its decisions and the OTEP 235 randomness and threshold values. Specifically, the 14-bit hash value and sampling probability are mapped into 56-bit R-value and T-value encodings, so that all sampling decisions in all modes include threshold information.
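For a concrete, illustrative example of that mapping, following the shift shown in the diff above (56 - 14 = 42 bits); the names and the 25% figure are made up for this sketch:

```go
package main

import "fmt"

// Map a hash_seed-style sampling percentage onto a 56-bit OTEP 235
// rejection threshold by placing the reject-bucket count in the top
// 14 bits of the 56-bit space.
func main() {
	const numHashBuckets = 0x4000 // 2^14 buckets used by the hash_seed mode

	samplingPercentage := 25.0
	scaled := uint64(samplingPercentage / 100 * numHashBuckets) // 4096 accepted buckets

	// Convert the accept count to a reject count, then shift it into
	// the top 14 bits of the 56-bit rejection threshold.
	reject := uint64(numHashBuckets) - scaled // 12288
	reject56 := reject << 42                  // 0xc0000000000000

	fmt.Printf("reject56 = 0x%014x\n", reject56) // T-value encodes as "c" at 1-digit precision
}
```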
This implements the semantic conventions of open-telemetry/semantic-conventions#793, namely the `sampling.randomness` and `sampling.threshold` attributes used for logs where there is no tracestate. The default sampling mode remains HashSeed. We consider a future change of default to Proportional to be desirable, because:
Link to tracking Issue:
Draft for open-telemetry/opentelemetry-specification#3602.
Previously #24811, see also open-telemetry/oteps#235
Part of #29738
Testing: New testing has been added.
Documentation: ✅