
Define Exemplar requirements in the Metrics SDK spec #1797

Closed
reyang opened this issue Jul 6, 2021 · 7 comments · Fixed by #1828
@reyang
Member

reyang commented Jul 6, 2021

What are you trying to achieve?

The metrics data model specification has covered Exemplar here.

The goal is to have the SDK specification support exemplars.

Related to #1260.

@reyang reyang added spec:metrics Related to the specification/metrics directory area:sdk Related to the SDK labels Jul 6, 2021
@reyang reyang assigned jsuereth and unassigned bogdandrutu Jul 6, 2021
@jsuereth
Contributor

jsuereth commented Jul 8, 2021

Requirements for SDK + Metric Exemplars

Here's a set of requirements for Metric Exemplars, based on some prototype exemplar sampling work I've done as well as a look at existing Exemplar implementations. This is for discussion (for now); I'll formalize it into a PR once the aggregator section in the SDK spec is a bit more fleshed out, as this relies on aggregators.

Basics

  • MeasurementProcessor should be able to sample incoming measurements as exemplars
  • Sampled exemplars are NOT cumulative. The list of sampled exemplars may change during every metric stream export.
  • Exemplars should pull a recording timestamp with the measurement. This can be done when the sampling decision is made, if that decision is done "synchronously".
  • Exemplars should automatically pull TraceId/SpanId information from associated context on a measurement.
  • When configuring an SDK (or MeterProvider), the user MUST be able to configure exemplar sampling. See the sampling header for more details.
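
The basics above can be sketched roughly as follows. This is a minimal illustration with hypothetical names (`SpanContext`, `Exemplar`, `record_exemplar`), not the actual SDK API:

```python
import time
from dataclasses import dataclass
from typing import Optional

@dataclass
class SpanContext:
    """Hypothetical stand-in for the trace context associated with a measurement."""
    trace_id: str
    span_id: str
    sampled: bool

@dataclass
class Exemplar:
    value: float
    time_unix_nano: int
    trace_id: Optional[str] = None
    span_id: Optional[str] = None

def record_exemplar(value: float, ctx: Optional[SpanContext]) -> Exemplar:
    # The recording timestamp is captured at the moment the (synchronous)
    # sampling decision is made.
    ex = Exemplar(value=value, time_unix_nano=time.time_ns())
    # TraceId/SpanId are pulled automatically from the associated context.
    if ctx is not None:
        ex.trace_id = ctx.trace_id
        ex.span_id = ctx.span_id
    return ex
```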

Sampling

  • Aggregators should be able to influence exemplar sampling, e.g. histogram leveraging bucket boundaries for exemplar selection, and attempting to keep exemplars per-bucket.
  • Exemplar samplers need access to context (Span/Trace), and can leverage Span information in sampling decisions.
  • Exemplar sampling can't simply reuse Trace sampling (the memory overhead would be too high).
  • Exemplar samplers should be able to leverage trace sampling decisions.

Built-in Implementations

The following built-in samplers SHOULD be provided with easy configuration:

  • No-Sampling - This sampler never selects exemplars.
  • Preserve-latest-with-sampled-trace
    • Only samples measurements that are recorded in a Context with a sampled Span.
    • Only keeps "latest" exemplar and drops any past history.
    • For histogram aggregation, this should keep "latest exemplar per bucket".
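
A minimal sketch of the "preserve latest per bucket" behavior, assuming a hypothetical `LatestPerBucketReservoir` (the class and method names here are illustrative, not from any SDK):

```python
import bisect
from typing import Dict, List

class LatestPerBucketReservoir:
    """Keeps only the most recent exemplar value per histogram bucket,
    and only for measurements recorded under a sampled span (sketch)."""

    def __init__(self, boundaries: List[float]):
        self._boundaries = boundaries
        self._latest: Dict[int, float] = {}

    def offer(self, value: float, span_sampled: bool) -> None:
        # Only sample measurements recorded in a Context with a sampled Span.
        if not span_sampled:
            return
        bucket = bisect.bisect_right(self._boundaries, value)
        # Keep only the "latest" exemplar; past history for the bucket is dropped.
        self._latest[bucket] = value

    def collect(self) -> Dict[int, float]:
        # Exemplars are NOT cumulative: the set is reset on every export.
        out, self._latest = self._latest, {}
        return out
```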

Prometheus Exporter

When exporting to Prometheus, the following should happen:

  • The latest sampled exemplar should be reported with any "Sample" point.
  • For Histograms, this should additionally be restricted based on the bucket being reported such that the exemplar chosen is unique to the currently reported metric sample. While histograms are reported in cumulative "less than or equal" count sums, the exemplar for a particular bucket should not be one that could be included in previous buckets.

A prototype implementation can be found here.

@reyang
Member Author

reyang commented Jul 9, 2021

@jsuereth nice summary! Here is my feedback:

No-Sampling - consider Always-off Sampling. When people hear "no-sampling", they might have different interpretations - 1) "there is no sampler, so I will get everything" 2) "there is a sampler and it is taking nothing" 3) "there is a sampler and it is not filtering out anything, so I will get everything".

I think, similar to Span Limits, we will have some limits on how many samples we allow at maximum (per bucket, per time series data point, etc.). This could be useful for sync instruments where users are taking too many samples, or for the pull exporter scenario where we don't want to hold the samples for too long (e.g. if the scraper stops pulling for hours).

When we need to "merge" histograms based on interpolation (whenever lossless merge is not available), samples can actually go to the new buckets with 100% confidence (because we have the raw information such as the duration).
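
The point about merging can be illustrated with a small sketch: because each exemplar retains its raw measurement value, exemplars can be assigned to new bucket boundaries exactly, even when the bucket counts themselves require interpolation. The function name here is hypothetical:

```python
import bisect
from typing import Dict, List

def rebucket_exemplars(exemplar_values: List[float],
                       new_boundaries: List[float]) -> Dict[int, List[float]]:
    """Place raw exemplar values into new bucket indices with 100% confidence.
    Unlike merged bucket counts, no interpolation is needed, since each
    exemplar carries its original measurement value (sketch only)."""
    buckets: Dict[int, List[float]] = {}
    for v in exemplar_values:
        idx = bisect.bisect_right(new_boundaries, v)
        buckets.setdefault(idx, []).append(v)
    return buckets
```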

@jmacd
Contributor

jmacd commented Jul 12, 2021

I prefer when "Sampling" means something statistical is taking place, and the word "Exemplar" explicitly suggests a selection technique that is not sampling. Thus, Sampling should be an option and instead of "Always-off sampling" or "No-sampling", maybe just "No exemplars".

Instead of "Preserve-latest-with-sampled-trace", maybe "Latest exemplars".

When it comes to sampling, open-telemetry/oteps#148 has recommendations for using exemplars to convey sample events with a sampling.adjusted_count attribute. To compute a sample (i.e., Exemplars with Probabilities) probably means using a reservoir sampling algorithm and picking more than 1 exemplar per stream point per period, and there are simple algorithmic options available.
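
One simple algorithmic option is Algorithm R. The sketch below is an illustration of reservoir sampling with an adjusted-count calculation in the spirit of OTEP 148; the class and attribute names are hypothetical, not from any SDK:

```python
import random
from typing import List, Optional

class FixedSizeReservoir:
    """Algorithm R: a uniform random sample of up to k exemplars per
    stream point per collection period (sketch)."""

    def __init__(self, k: int, rng: Optional[random.Random] = None):
        self.k = k
        self.n = 0                      # measurements seen this period
        self.sample: List[float] = []   # the reservoir
        self._rng = rng or random.Random()

    def offer(self, value: float) -> None:
        self.n += 1
        if len(self.sample) < self.k:
            self.sample.append(value)
        else:
            # Replace an existing entry with probability k/n.
            j = self._rng.randrange(self.n)
            if j < self.k:
                self.sample[j] = value

    def adjusted_count(self) -> float:
        # Each kept exemplar statistically represents n/k measurements;
        # this would be conveyed as a sampling.adjusted_count attribute.
        return self.n / len(self.sample) if self.sample else 0.0
```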

@reyang
Member Author

reyang commented Jul 13, 2021

Do we prefer to model MIN/MAX as exemplars (e.g. a cumulative sum of 100, with MAX 5 and MIN 2)?
Or do we think there are many cases where people just want to know MIN/MAX without all the other details (e.g. trace id, span id, all the attributes, etc.), so they should be modeled as a separate aggregation?

@reyang
Member Author

reyang commented Jul 13, 2021

Do we allow users to control what data to report with exemplars (e.g. I want the trace id / span id and all the items in the baggage vs. I just need trace id / span id)?

@jsuereth
Contributor

jsuereth commented Jul 13, 2021

@jmacd

I prefer when "Sampling" means something statistical is taking place, and the word "Exemplar" explicitly suggests a selection technique that is not sampling.

I like this phrasing. When proposing defaults I'll use this.

To compute a sample (i.e., Exemplars with Probabilities) probably means using a reservoir sampling algorithm and picking more than 1 exemplar per stream point per period

Yes, I'm working on reservoir sampling in the Java Metrics prototype right now so we can see how well it does in practice. Specifically, right now Prometheus (and OpenCensus) sample with a "take-latest-per-histogram-bucket" approach (for histogram aggregation). I like the idea of reservoir sampling, and I like the idea of it being the default. The only question in my mind is if we should have a "sample like OpenCensus/Prometheus" hook here.

@reyang

Do we allow users to control what data to report with exemplars (e.g. I want the trace id / span id and all the items in the baggage vs. I just need trace id / span id)?

This is a good point. Want to call out a few things:

  1. Views can specify which baggage attributes to preserve in a metric, so THAT does exist if necessary.
  2. Exemplars only display "difference" attributes (i.e. those the aggregator removed), so in the base case they will not report anything.

So, I don't think baggage-labels on Exemplar is initially important here, but it's a good use case to follow up with. From my view, that's some kind of Measurement => Exemplar function, likely something we should specify on the MeasurementProcessor interface / API.

Do we prefer to model MIN/MAX as exemplars (e.g. a cumulative sum of 100, with MAX 5 and MIN 2)?
Or do we think there are many cases where people just want to know MIN/MAX without all the other details (e.g. trace id, span id, all the attributes, etc.), so they should be modeled as a separate aggregation?

I think MIN/MAX could be exemplars (possibly where we add labels denoting this). However, I don't think that should be the default behavior, and it makes consuming the data a bit harder. I'd prefer reservoir sampling and knowing your min/max are min/max, BUT we could encode min/max into exemplars.
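
The "separate aggregation" alternative can be sketched as plain aggregation state alongside the sum, with no trace context attached. The class name is hypothetical:

```python
from dataclasses import dataclass

@dataclass
class SumWithMinMax:
    """Tracks MIN/MAX as ordinary aggregation fields next to a cumulative
    sum, rather than encoding them as exemplars (illustrative sketch)."""
    total: float = 0.0
    minimum: float = float("inf")
    maximum: float = float("-inf")

    def record(self, value: float) -> None:
        # No trace id, span id, or attributes are captured here:
        # min/max are just summary statistics of the stream.
        self.total += value
        self.minimum = min(self.minimum, value)
        self.maximum = max(self.maximum, value)
```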
