
[probabilistic sampling processor] encoded sampling probability (support OTEP 235) #31894

Merged · 99 commits · Jun 13, 2024

Conversation

@jmacd (Contributor) commented Mar 21, 2024

Description: Creates new sampler modes named "equalizing" and "proportional". Preserves existing functionality under the mode named "hash_seed".

Fixes #31918

This is the final step in a sequence; the work was factored into 3+ PRs, including the new pkg/sampling package and the previous step, #31946. The two new sampler modes enable mixing OTel sampling SDKs with Collectors in a consistent way.

The existing hash_seed mode is also a consistent sampling mode, which makes it possible to have a 1:1 mapping between its decisions and the OTEP 235 randomness and threshold values. Specifically, the 14-bit hash value and sampling probability are mapped into 56-bit R-value and T-value encodings, so that all sampling decisions in all modes include threshold information.
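To make that mapping concrete, here is a rough, illustrative sketch (not code from this PR) of how an accept count in the 14-bit hash-bucket space becomes a 56-bit rejection threshold; the pkg/sampling import path and the numHashBuckets/scaledSamplerate names are assumptions borrowed from the hunk quoted later in this conversation, and the 25% figure is just an example. The 14-bit hash value itself is shifted into the 56-bit R-value in the same way.

```go
// Illustrative sketch only: maps a hash_seed-style sampling rate,
// expressed in 14-bit hash buckets, into the 56-bit OTEP 235
// threshold space.
package main

import (
	"fmt"

	"github.com/open-telemetry/opentelemetry-collector-contrib/pkg/sampling"
)

func main() {
	const numHashBuckets = uint64(0x4000) // 2^14 buckets in hash_seed mode

	// Hypothetical 25% sampling: 0x1000 of the 0x4000 buckets accept.
	scaledSamplerate := uint64(0x1000)

	// Convert the accept count to a reject count, then shift it left by
	// 42 bits so the 14-bit value occupies the top of 56 bits (14 + 42 = 56).
	reject := numHashBuckets - scaledSamplerate
	reject56 := reject << 42

	// For 25% this is 0xC0000000000000, i.e. the T-value "c".
	threshold, err := sampling.UnsignedToThreshold(reject56)
	if err != nil {
		panic(err)
	}
	_ = threshold // carried with the sampled item as its OTEP 235 threshold

	fmt.Printf("reject56 = %014x\n", reject56) // c0000000000000
}
```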

This implements the semantic conventions of open-telemetry/semantic-conventions#793, namely the sampling.randomness and sampling.threshold attributes used for logs where there is no tracestate.

The default sampling mode remains HashSeed. We consider a future change of default to Proportional to be desirable, because:

  1. Sampling probability is the same; only the hashing algorithm changes.
  2. Proportional respects and preserves information about earlier sampling decisions, which HashSeed cannot do, so it interoperates better with OTel SDKs that may also adopt OTEP 235 samplers.

Link to tracking Issue:

Draft for open-telemetry/opentelemetry-specification#3602.
Previously #24811, see also open-telemetry/oteps#235
Part of #29738

Testing: New tests have been added.

Documentation:

@jpkrohling (Member) left a comment:

Apart from small details, I think this looks good. I would also ask you to think about which telemetry you'd want from this component in case of a bug report, and add metrics for them. Do we need to keep a histogram of Rs and Ts?

(Resolved review threads on processor/probabilisticsamplerprocessor/README.md and processor/probabilisticsamplerprocessor/sampler_mode.go.)
	logger.Debug(description, zap.Error(err))
	var se samplerError
	if errors.As(err, &se) {
		logger.Info(description, zap.Error(err))
Member:

this is part of the hot path, I'd rather have a metric recording the number of errors that happened, with samplerError being a dimension of the metric (error=sampler, error=other).

Member:

Let me suggest something different: let them all be (debug) logs for this PR, but then convert the metrics for this component as a whole from OC to OTel, using mdatagen. On that PR, you could add the new metric for the error conditions.

Contributor Author (@jmacd):

I like the idea of making the logs into Debug() and Info()-level. I also do not expect any of the Info()-level log statements to be common, for they require a misbehaving SDK or some other form of data corruption, as opposed to misconfigured sampling. Because these messages signal corruption of some kind, I think they should be logs. My expectation is that logs are sampled at runtime (or they should be) to avoid impacting the hot path.

With that said, I feel that a new metric is not helpful -- it just means the customer has to monitor something new when we have metric signals already. There is a single metric output by this code specifically, which will have a "policy=missing_randomness" when these errors arise.

We also have (or at least desire) standard pipeline metrics, which ought to provide a standard way to count how many spans succeed or fail. If the sampler is configured with FailClosed=true and these missing_randomness conditions are happening, the result will be loss of spans. I do not want the user to have to discover a new metric for this, because there ought to be a standard metric for rejected items. All processors should be able to count the number of items that are rejected for malformed data.

	if errors.Is(err, sampling.ErrInconsistentSampling) {
		// This is working-as-intended. You can't lower
		// the threshold, it's illogical.
		logger.Debug(description, zap.Error(err))
Member:

same comment here about the metric vs. log

Contributor Author (@jmacd):

To summarize: I think the existing metric's "policy=missing_randomness" is a useful-enough signal. Personally, I want standard pipeline metrics so that every component doesn't have to invent a bespoke metric definition.

		}
	}
	if err := carrier.reserialize(); err != nil {
		logger.Info(description, zap.Error(err))
Member:

and same here :-)

Contributor Author (@jmacd):

This is some sort of grievous corruption, and I personally do not want every component inventing new metrics to monitor for things we never expect to happen.

@jmacd (Contributor Author) left a comment:

(partial)

(Resolved review threads on processor/probabilisticsamplerprocessor/README.md.)
@@ -18,6 +18,7 @@ require (
	go.opentelemetry.io/otel/metric v1.27.0
	go.opentelemetry.io/otel/trace v1.27.0
	go.uber.org/goleak v1.3.0
	go.uber.org/multierr v1.11.0
Contributor Author (@jmacd):

Thanks, now using errors.Join. 95ecbae
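For reference, a minimal sketch of the standard-library pattern that replaces the go.uber.org/multierr dependency (Go 1.20+); this is illustrative only, not the actual call site changed in that commit:

```go
package main

import (
	"errors"
	"fmt"
)

func main() {
	errA := errors.New("missing randomness")
	errB := errors.New("invalid tracestate")

	// errors.Join returns nil when every argument is nil; otherwise it
	// returns a single error wrapping all non-nil arguments.
	combined := errors.Join(errA, errB)

	fmt.Println(combined)                  // both messages, newline-separated
	fmt.Println(errors.Is(combined, errA)) // true: Is/As still see the wrapped errors
}
```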

(Resolved review thread on processor/probabilisticsamplerprocessor/sampler_mode.go.)
@jpkrohling (Member) commented:
@jmacd, please let me know once this is ready for another round.

jmacd and others added 2 commits June 11, 2024 08:27
Co-authored-by: Juraci Paixão Kröhling <juraci.github@kroehling.de>
@jmacd (Contributor Author) commented Jun 11, 2024:

> I would also ask you to think about which telemetry you'd want from this component in case of a bug report, and add metrics for them. Do we need to keep a histogram of Rs and Ts?

My personal opinion is that we should not invent new metrics to monitor for bugs; that is what logs are good at, and if we think logs do not perform well enough, we should reconsider -- metrics are not clearly better for performance than sampled logs.

Moreover, I want us to encourage use of standard pipeline metrics. Practically all real processors are going to encounter errors that would cause data to be dropped or malformed in some way, and we shouldn't need new metrics for every one of them.

For your question about histograms of R and T: there is something I would recommend, but not a histogram of T or R values (both of which could be high-cardinality), and probably not at default verbosity. What we do care about, and I'm open to suggestions, is that the sum of adjusted counts after sampling is expected to equal the sum of adjusted counts before sampling. This is a probabilistic argument, so the match is not exact. We should have a metric that counts items by their adjusted count. I would argue that such a metric should be standardized and comparable with standard pipeline metrics, so if otelcol_incoming_items and otelcol_outgoing_items are the standard pipeline metrics, sampling processors could emit otelcol_outgoing_items_adjusted and otelcol_incoming_items_adjusted. There would be some additional expense and code complexity to this.
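As a rough illustration of why those two sums are expected to match (the probability and counts below are hypothetical, not measurements from this component):

```go
// Illustrative sketch only: each item kept at sampling probability p
// carries an adjusted count of 1/p, so the expected sum of adjusted
// counts after sampling equals the item count before sampling.
package main

import "fmt"

func main() {
	const p = 0.25            // hypothetical sampling probability
	const incoming = 100000.0 // hypothetical incoming span count

	kept := incoming * p   // ~25,000 spans survive on average
	adjustedCount := 1 / p // each survivor represents 4 original spans

	fmt.Println(kept * adjustedCount) // 100000: matches the incoming count in expectation
}
```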

Another direction to take this question is that the span-to-metrics connector should be able to use the adjusted counts so that it counts the number of representative spans, not the number of actual spans. This sounds more useful to me than a pipeline metric of adjusted count, and in any case I prefer not to use metrics as a debugging signal. If there are bugs, I would recommend users connect a debugging exporter and review the output data.

@jmacd (Contributor Author) commented Jun 12, 2024:

These test failures are independent, #33520.

Comment on lines +440 to +454
	// Convert the accept threshold to a reject threshold,
	// then shift it into 56-bit value.
	reject := numHashBuckets - scaledSamplerate
	reject56 := uint64(reject) << 42

	threshold, _ := sampling.UnsignedToThreshold(reject56)

	return &hashingSampler{
		tvalueThreshold: threshold,
		hashSeed:        cfg.HashSeed,

		// Logs specific:
		logsTraceIDEnabled:            cfg.AttributeSource == traceIDAttributeSource,
		logsRandomnessSourceAttribute: cfg.FromAttribute,
	}
@jmacd (Contributor Author) commented Jun 12, 2024:

Sampler SIG, see here. Reference: open-telemetry/semantic-conventions#793 (comment)


### Equalizing

This mode uses the same randomness mechanism as the propotional
Member:

Suggested change:
- This mode uses the same randomness mechanism as the propotional
+ This mode uses the same randomness mechanism as the proportional

@@ -105,16 +106,16 @@ func Test_tracesamplerprocessor_SamplingPercentageRange(t *testing.T) {
			},
			numBatches:        1e5,
			numTracesPerBatch: 2,
			acceptableDelta:   0.01,
Member:

If these adjustments are to get a faster but less-flaky test, what do you think of the idea of adding a loop? You can try a few times to get a result in the acceptable range; it will finish as soon as it sees an acceptable result.

This is the idea behind the "Eventually" function in the stretchr/testify library (we don't use it here but the concept is sound).
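A minimal sketch of that retry idea, using hypothetical helper names and constants rather than the actual test in this PR:

```go
package samplertest

import (
	"math"
	"math/rand"
	"testing"
)

// runSamplingTrial is a stand-in for the real test body: it samples
// numTraces synthetic items at probability p and reports the observed
// keep ratio.
func runSamplingTrial(p float64, numTraces int) float64 {
	kept := 0
	for i := 0; i < numTraces; i++ {
		if rand.Float64() < p {
			kept++
		}
	}
	return float64(kept) / float64(numTraces)
}

// TestSamplingPercentageWithRetry re-runs the trial a few times and
// passes as soon as one attempt lands within the acceptable delta,
// echoing the "Eventually" idea from the review comment above.
func TestSamplingPercentageWithRetry(t *testing.T) {
	const (
		p               = 0.25
		numTraces       = 100000
		acceptableDelta = 0.01
		maxAttempts     = 5
	)
	for attempt := 1; attempt <= maxAttempts; attempt++ {
		if math.Abs(runSamplingTrial(p, numTraces)-p) <= acceptableDelta {
			return
		}
	}
	t.Fatalf("no attempt fell within %v of %v after %d attempts", acceptableDelta, p, maxAttempts)
}
```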

- `sampling_percentage` (32-bit floating point, required): Percentage at which items are sampled; >= 100 samples all items, 0 rejects all items.
- `hash_seed` (32-bit unsigned integer, optional, default = 0): An integer used to compute the hash algorithm. Note that all collectors for a given tier (e.g. behind the same load balancer) should have the same hash_seed.
- `fail_closed` (boolean, optional, default = true): Whether to reject items with sampling-related errors.
- `sampling_precision` (integer, optional, default = 4): Determines the number of hexadecimal digits used to encode the sampling threshold.
Member:

maybe include the range of valid values here (1-14)?

@@ -45,6 +75,14 @@ type Config struct {
	// despite errors using priority.
	FailClosed bool `mapstructure:"fail_closed"`

	// SamplingPrecision is how many hex digits of sampling
	// threshold will be encoded, from 1 up to 14. Default is 4.
	// 0 is treated as full precision.
Member:

Not according to invalid_zero.yaml.

@@ -230,46 +376,82 @@ func consistencyCheck(rnd randomnessNamer, _ samplingCarrier) error {
//
// Extending this logic, we round very small probabilities up to the
// minimum supported value(s) which varies according to sampler mode.
-func makeSampler(cfg *Config) dataSampler {
+func makeSampler(cfg *Config, isLogs bool) dataSampler {
Member:

please add an explanation of 'isLogs' to the comment

@kentquirk (Member) left a comment:

My comments are mainly cosmetic; in general, we need this; approving to start to push the train out of the station.

@jpkrohling (Member) left a comment:

Now that older and newer collectors can co-exist and come to the same decisions, I think this is ready to go. The logs vs. metrics matter can be addressed later, and changed if we think it's easier to respond to possible bug reports.

Labels: pkg/sampling, processor/probabilisticsampler (Probabilistic Sampler processor)

Successfully merging this pull request may close these issues:

- Update probabilistic sampler processor with OTEP 235 support

4 participants