Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Integrate Exemplars with Metrics SDK #113

Merged
merged 8 commits into from
Jul 10, 2020
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
104 changes: 104 additions & 0 deletions text/metrics/0113-exemplars.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,104 @@
# Integrate Exemplars with Metrics

This OTEP adds exemplar support to aggregations defined in the Metrics SDK.

## Definition

Exemplars are example data points for aggregated data. They provide specific context to otherwise general aggregations. For histogram-type metrics, exemplars are points associated with each bucket in the histogram giving an example of what was aggregated into the bucket. Exemplars are augmented beyond just measurements with references to the sampled trace where the measurement was recorded and labels that were attached to the measurement.
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Exemplars are example data points for aggregated data.

Are they this "generic" thing, or are they "traces"? The proto schema below suggests the latter. For example, could I store "customer id" as exemplar, so that I could answer the question "which sample customer IDs have latencies in this bucket"?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The RawValue representation is a generic way to represent sampled metric events. There will some SDK-specific/custom selection logic that may decide to select only exemplars that have trace context, or they can decide to focus on the distribution of customer IDs. The customer ID would be represented by a label value.

When the aggregator is a histogram:
The SDK can select samples using fixed-size uniform selection on a per-bucket basis, or it can select items probabilistically so as to produce an expected number of exemplars per bucket that is equal (the latter is likely to have better coverage in the case where there are empty buckets--this can be accomplished using Weighted Sampling and inverse-probability weights, for example).

Let's suppose you configure exemplar selection to choose 100 exemplars per bucket per collection period. If the selection is unbiased and the sample_count fields are accurate, you will be able to summarize the contribution to each bucket by up to 100 customers.

Suppose it's a Counter producing a Sum aggregation, instead of a histogram. You could use exemplars selected from the Counter to summarize the contribution to a sum by customer ID. There are lots of ways to sample, and I believe this representation will support a large number of useful approaches.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Customer ID as a label value would kill most metric backends. To me that's the whole point of exemplars, to allow associating samples of high-cardinality values with metrics. I am fine if we limit this high-cardinality dimension to trace IDs for now, but I am not seeing a "generic" solution here that would support exemplars on other dimensions.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the comment above had the same concern as you? Would adding correlation context as an attribute on RawValue that can have the customer ID solve the problem?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The SDK has built-in support for aggregation so that high-cardinality labels can be eliminated before they reach most metric backends. The Sum, Histogram, or Summary that you export can be aggregated so that customer_id does not appear in the aggregation value.

Exemplars selected from the same series of events (that were summarized without customer_id) can include the customer_id, and the exemplars may be used to approximate the distribution of customer_id and other dimensions that were aggregated away in the Sum, Histogram or Summary value. One of the nice properties of the approach described here is that by limiting the number of exemplars, we limit cardinality reported in a single collection interval. For example, you could select 100 exemplars and even if there are 1000 actual customer_ids, you will collect at most 100 distinct values, and if chosen probabilistically, we can expect to recover the customer_ids that were most representative of the actual distribution (i.e., the "heavy hitters", the top of the distribution).

I want to emphasize that the API and the Protocol should not discourage the use of high-cardinality metrics. Given that I see exemplars as exactly the tool for addressing high cardinality, I'm confused by:

I am fine if we limit this high-cardinality dimension to trace IDs for now

What do you think we should restrict?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The issue I have with this schema is that it makes no distinction between regular label dimensions (which should survive aggregation) and the exemplar dimensions like customer-id. Only trace id is explicitly separated as exemplar dimensions. That makes it very easy for a user to shoot themselves in the foot and send an explosion of dimensions to the backend. The only way to avoid it is by carefully defining custom aggregation rules in the SDK and explicitly defining which labels should be treated as time series dimensions vs. exemplar attributes. While it minimizes the API surface, I feel that it makes the API more dangerous to use. Why not allow specifying exemplar labels explicitly from the beginning, and keep them separate from regular labels?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for clarifying. I understand the concern, now.

By the way, an earlier draft of the metrics API allowed the application writer to recommend aggregation dimensions, by the name "Recommended Keys". It was removed: open-telemetry/opentelemetry-specification#463. The reason these keys were recommended is that we do not believe the author of the code knows which labels the system or the operator wants to monitor. If we ask the developer decide which dimensions are for aggregation and which are for exemplars, we make a semantic distinction out of a performance limitation (and not a universal one, as far as I know).

There is a practical reason to support arbitrary labels and deal with them through configuration: this is the natural thing to do when creating Metric events from Spans. Span attributes simply become Metric labels. We are adding a semantic convention to cover duration measurements: open-telemetry/opentelemetry-specification#657

The span-to-metrics issue is discussed here: open-telemetry/opentelemetry-specification#381

One way to address your concern would be to set the default to aggregate over zero dimensions, so that all labels are exemplar labels unless configured otherwise.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bogdandrutu I would like your opinion on this topic. We introduced RecommendedKeys() to address a perceived need in Prometheus, since Prometheus clients actually enforce pre-declared label keys. We discovered that the Prometheus protocol does not have any such restriction, which made it appealing to remove a feature. In the (dog)statsd world, it's common to add labels as needed. In modern terminology, we had created (proposed) Metrics Processors named "defaultkeys" that would use the developer-provided recommended keys, and named "ungrouped" that would use all the keys when exporting metrics. Removing recommended keys brought us back to a single basic metrics processor.

With this proposal, we begin to see a "Sampler API" for metrics, that is one that takes a full set of labels, applies a sampling decision (whether to select an exemplar or not) and then returns the set of labels to use for aggregation

If we have a choice between:

(1) asking the user to choose which labels are significant for aggregation and which are not
(2) making it really easy to configure which labels are used for aggregation

I would absolutely prefer the second choice--whether aggregation is configured by a dynamic configuration API, by a static configuration API, or by hard-coding a View in your main() function, any of these should be viable and relatively easy, and all of these are more appealing to me than asking the user to distinguish two kinds of label.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@bogdandrutu Thanks for merging, but I think we should capture this discussion or at least address the question.


## Motivation

Defining exemplar behaviour for aggregations allows OpenTelemetry to support exemplars in Google Cloud Monitoring.

Exemplars provide a link between metrics and traces. Consider a user using a Histogram aggregation to track response latencies over time for a high QPS server. The histogram is composed of buckets based on the speed of the request, for example, "there were 55 requests that took 400-500 milliseconds". The user wants to troubleshoot slow requests, so they would need to find a trace where the latency was high. With exemplars, the user is able to get an exemplar trace from a high latency bucket, an exemplar trace from a low latency bucket, and compare them to figure out the reason for the high latency.

Exemplars are meaningful for all aggregations where relevant traces can provide more context to the aggregation, as well as when exemplars can display specific information not otherwise shown in the aggregation (for example, the full set of labels where they otherwise might be aggregated away).

## Internal details

An exemplar is a `RawValue`, which is defined as:

```
message RawValue {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

s/RawValue/Exemplar?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We redefined Exemplar as RawValue so we could use the same data type for all measurements instead of just exemplars (eg for exact aggregator). Will try to reword An Exemplar is defined as better

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The WIP open-telemetry/opentelemetry-proto#162 specifies that RawValue messages may be used in two ways.

  1. As exemplars that add additional information to SCALAR, HISTOGRAM, and SUMMARY data points.
  2. As raw data in its own right.

The SDK spec discusses three types of aggregation that can be represented as a scalar value: Sum, LastValue, and Exact. The exact representation would use RawValues with sample_count == 1.

// Numerical value of the measurement that was recorded. Only one of these two fields is
// used for the data, depending on its type
double double_value = 0;
int64 int64_value = 1;
jmacd marked this conversation as resolved.
Show resolved Hide resolved

// Exact time that the measurement was recorded
fixed64 time_unix_nano = 2;

// 'label:value' map of all labels that were provided by the user recording the measurement
repeated opentelemetry.proto.common.v1.StringKeyValue labels = 3;
jmacd marked this conversation as resolved.
Show resolved Hide resolved

// Span ID of the current trace
optional bytes span_id = 4;

// Trace ID of the current trace
optional bytes trace_id = 5;

// When sample_count is non-zero, this exemplar has been chosen in a statistically
// unbiased way such that the exemplar is representative of `sample_count` individual events
optional double sample_count = 6;
}
jmacd marked this conversation as resolved.
Show resolved Hide resolved
```

Exemplar collection should be enabled through an optional parameter (disabled by default), and when not enabled, there should be no collection/logic performed related to exemplars. This is to ensure that when necessary, aggregators are as high performance as possible. Aggregators should also have a parameter to determine whether exemplars should only be collected if they are recorded during a sampled trace, or if tracing should have no effect on which exemplars are sampled. This allows aggregations to prioritize either the link between metrics and traces or the statistical significance of exemplars, when necessary.

[#347](https://github.com/open-telemetry/opentelemetry-specification/pull/347) describes a set of standard aggregators in the metrics SDK. Here we describe how exemplars could be implemented for each aggregator.

### Exemplar behaviour for standard aggregators

#### HistogramAggregator

The HistogramAggregator MUST (when enabled) maintain a list of exemplars whose values are distributed across all buckets of the histogram (there should be one or more exemplars in every bucket that has a population of at least one sample-able measurement). Implementations SHOULD NOT retain an unbounded number of exemplars.

#### Sketch

A Sketch aggregator SHOULD maintain a list of exemplars whose values are spaced out across the distribution. There is no specific number of exemplars that should be retained (although the amount SHOULD NOT be unbounded), but the implementation SHOULD pick exemplars that represent as much of the distribution as possible. (Specific details not defined, see open questions.)

#### Last-Value

Most (if not all) Last-Value aggregators operate asynchronously and do not ever interact with context. Since the value of a Last-Value is the last measurement (essentially the other parts of an exemplar), exemplars are not worth implementing for Last-Value.

#### Exact

The Exact aggregator will function by maintaining a list of `RawValue`s, which contain all of the information exemplars would carry. Therefore the Exact aggregator will not need to maintain any exemplars.
jmacd marked this conversation as resolved.
Show resolved Hide resolved

#### Counter

Exemplars give value to counter aggregations in two ways: One, by tying metric and trace data together, and two, by providing necessary information to re-create the input distribution. When enabled, the aggregator will retain a bounded list of exemplars at each checkpoint, sampled from across the distribution of the data. Exemplars should be sampled in a statistically significant way.

#### MinMaxSumCount

Similar to Counter, MinMaxSumCount should retain a bounded list of exemplars that were sampled from across the input distribution in a statistically significant way.
jmacd marked this conversation as resolved.
Show resolved Hide resolved

#### Custom Aggregators

Custom aggregators MAY support exemplars by maintaining a list of exemplars that can be retrieved by exporters. Custom aggregators should select exemplars based on their usage by the connected exporter (for example, exemplars recorded for Google Cloud Monitoring should only be retained if they were recorded within a sampled trace).

Exemplars will always be retrieved from aggregations (by the exporter) as a list of RawValue objects. They will be communicated via a

```
optional repeated RawValue exemplars = 6
```

attribute on the `Metric` object.

## Trade-offs and mitigations

Performance (in terms of memory usage and to some extent time complexity) is the main concern of implementing exemplars. However, by making recording exemplars optional, there should be minimal overhead when exemplars are not enabled.

## Prior art and alternatives

Exemplars are implemented in [OpenCensus](https://github.com/census-instrumentation/opencensus-specs/blob/master/stats/Exemplars.md#exemplars), but only for HistogramAggregator. This OTEP is largely a port from the OpenCensus definition of exemplars, but it also adds exemplar support to other aggregators.

[Cloud monitoring API doc for exemplars](https://cloud.google.com/monitoring/api/ref_v3/rpc/google.api#google.api.Distribution.Exemplar)

## Open questions

- Exemplars usually refer to a span in a sampled trace. While using the collector to perform tail-sampling, the sampling decision may be deferred until after the metric would be exported. How do we create exemplars in this case?
jmacd marked this conversation as resolved.
Show resolved Hide resolved
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we address this question from the resolved discussion?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't feel that this is the place to describe a fancy SDK approach to this problem. This question leads to arbitrarily complex approaches that are also found in the discussion about tail sampling itself. How should we decide to propagate a trace-is-sampled bit in-band when making child spans during a span lifetime? It's almost the same question.

A simple approach would be to maintain a per-span sample of metric events and buffer metric data until the span ends.

Another approach would use the statistics of the spans that are being selected by the tail sampler to form an unequal probability sampling scheme. Select sample metric events that are likely to be associated with spans that match the tail-sampling decision. E.g., if tail latency is used to select exemplars, and a high correlation is observed between latency and label X, then use label X to boost sample weight on metric events. This leads to a speculative approach where you try to choose exemplars that will have interesting traces.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@cnnradams Would you be willing to add an answer to this question? If head sampling, the logic for selecting trace contexts that are also being sampled is simple. If tail sampling, the logic for selecting metric samples has to be coordinated with tracing, delayed, or somehow speculative--and this decision is practically the same as deciding to what to tell your child in a span before the span is finished.

Copy link
Member Author

@cnnradams cnnradams Jul 10, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I missed this 😬

To the depth that this OTEP goes, yes, this question is answered by "either the tail sampler needs to pick traces with exemplar choices in mind, or exemplars will need to be picked without a guarantee that they will have a trace". But the actual details of this still need to be worked out as far as I'm aware. I can't really mark this as answered now that its merged, so this will have to do 🤷

Another approach would use the statistics of the spans that are being selected by the tail sampler to form an unequal probability sampling scheme.

How would you have knowledge of the spans that were chosen to be sampled when that decision was made in a different process without your input?


- We don’t have a strong grasp on how the sketch aggregator works in terms of implementation - so we don’t have enough information to design how exemplars should work properly.
jmacd marked this conversation as resolved.
Show resolved Hide resolved

- The spec doesn't yet define a standard set of aggregations, just default aggregations for standard metric instruments. Since exemplars are always attached to particular aggregations, it's impossible to fully specify the behavior of exemplars.
jmacd marked this conversation as resolved.
Show resolved Hide resolved