Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Metrics Proto Does Not Support the OpenTelemetry Specification #125

Closed
MrAlias opened this issue Mar 13, 2020 · 15 comments
Closed

Metrics Proto Does Not Support the OpenTelemetry Specification #125

MrAlias opened this issue Mar 13, 2020 · 15 comments
Labels
release:required-for-ga Must be resolved before GA release, or nice to have before GA spec:metrics

Comments

@MrAlias
Copy link
Contributor

MrAlias commented Mar 13, 2020

As language SIGs include support to export metrics using the OTLP it is being discovered that the native export format they are exporting from do not fit nicely or at all into the OTLP structure. This issue is intended to track these friction points, proposed changes, and action items to resolution.

Ideally the OTLP would natively support the output of all combination of Instruments and Aggregators defined in the OpenTelemetry Specification. Currently, this is not the case.

Identified Issues

First Class Support For Min and Max Values of a MinMaxSumCount Aggregation

There is no metric kind for a MinMaxSumCount aggregator that outputs the minimum, maximum, sum, and count of events observed.

It has been suggested that a Summary metric kind can be used to transport these values. The Summary kind does have fields for the count and sum, however, it does not have dedicated fields for the maximum nor minimum.

The suggested work around here is to send the maximum as the 100th percentile and the minimum as the 0th percentile. Both of which are not obvious and the latter is mathematically incorrect (the 0th percentile is the value which 0% of events occured, where as the minimum is the minimal value where at least 1 event occured).

HistogramValue.Bucket.Exemplar is inadequate

As outlined in the linked ticket:

  1. It's not possible to provide exemplars for non-histograms
  2. It's not possible to provide more than one exemplar per bucket
  3. It requires converting certain data types to a string, which will require additional specification

Instrumentation Instead of Aggregation

The OTLP metric kinds are centered around instruments not output from Aggregation:

    ...
    GAUGE_INT64 = 1;
    GAUGE_DOUBLE = 2;
    GAUGE_HISTOGRAM = 3;
    COUNTER_INT64 = 4;
    COUNTER_DOUBLE = 5;
    CUMULATIVE_HISTOGRAM = 6;
    SUMMARY = 7;

The incongruence of the instruments these kinds were modeled after and the actual instruments of the OpenTelemetry Specification is secondary to the fact that they are modeled after instruments in the first place.

If the goal of the OTLP is to transport the output of Instrument -> Aggregator, it should be modeled after the output of the Aggregators, aggregations. And while, yes, the Histogram is one of these aggregations, the included histogram kinds are conflated with instrument qualifiers (Gauge, Cumulative).

Nebulously Defined Need to be Compatible with OpenMetrics

A common refrain for the design of the OTLP is "We are trying to also stay compatible with OpenMetrics protocol".

This desire to remain compatible is an admirable one, it is negatively impacting this project. The OTLP in its pursuit to maintain easy translation to OpenMetrics has made it difficult to translate from OpenTelemetry into it. Additionally, it has led to a design that is bloated with OpenMetrics metric kinds (GAUGE_INT64, GAUGE_DOUBLE) that OpenTelemetry does not use.

It seems as though if support for the OpenMetrics protocol is desired, the implementations of the OpenTelemetry Specification should support that protocol. Instead of having the OTLP try to match that projects decisions.

Steps Forward

I'm hoping to have this Issue start the discussion on how we can resolve these issues. I see the metrics proto in need of a major overhaul to provide first class support for OpenTelemetry. Possibly being done in a v2 of the proto.

cc @bogdandrutu, @jmacd, @jkwatson, @c24t

@tigrannajaryan
Copy link
Member

@MrAlias our approach so far has been that we file individual issues and PRs that fix them, gradually moving towards the desirable final proto. We had more time to do this for traces and it looks like it went well. With metric we had less time and there are issues that need to be addressed.

If you know what the right fixes to the identified issues are it would be great if you could submit corresponding PRs.

major overhaul to provide first class support for OpenTelemetry. Possibly being done in a v2 of the proto.

Is there such a disconnect that requires a major overhaul? I am not a metric spec/SDK contributor, so I cannot assess myself how big is the disconnect.

@open-telemetry/specs-approvers whatever changes we want to do to the proto it is very desirable to do them quickly. I know multiple people who are currently ramping up to write code that assumes the current proto. If we do not finalize the proto quickly we will cause pain by introducing breaking changes.

If we are unable to have the final metric proto version within the next few days it is best that we announce the metric part of the protocol as not ready for implementation.

@c24t
Copy link
Member

c24t commented Apr 9, 2020

If we are unable to have the final metric proto version within the next few days it is best that we announce the metric part of the protocol as not ready for implementation.

@tigrannajaryan what did you decide on this? As it's written now, the types in the metrics proto are very different than the instruments and aggregations in the spec or OTEPS (e.g. open-telemetry/oteps#88 and open-telemetry/oteps#93).

Unless we expect to change the spec to match the proto this looks to me like it's not ready for implementation.

@tigrannajaryan
Copy link
Member

@c24t I do not know enough about the metrics to make a call on this. @bogdandrutu @jmacd can you help?

@MrAlias
Copy link
Contributor Author

MrAlias commented Apr 13, 2020

@tigrannajaryan: I started in on a proposal refactor of the proto, but have not found the time to get it into a presentable state. The rough structure of the proposal is to switch to something closer to the Stackdriver proto model for metrics and update the data points to be more along the lines of instantaneous, delta, and cumulative values. I'm hoping to get this more formalized and will share when it is.

I agree with @c24t in that it might be wasted work to implement at this point as a I think there needs to be a refactor to support the OpenTelemetry metric data.

@jmacd
Copy link
Contributor

jmacd commented Apr 13, 2020

I look forward to seeing this proposal @MrAlias.
I've updated OTEPs 93 and 96 with the terms "delta", "cumulative", and "instantaneous", fwiw.

@tigrannajaryan
Copy link
Member

Do note that we have an implementation of metrics proto in Collector already and increasingly large amount of code depends on it because it is the foundation of internal representation.

The longer we wait for the proposal the more code needs to be rewritten. If the proposal doesn't happen quickly them we have a problem, we do not have time rewrite thousands of lines of cod (depending on how big the changes are).

@tigrannajaryan
Copy link
Member

If this gets delayed more we may end up staying with OpenCensus representation in the Collector, since I don't think we will have time or people to rewrite all that code. The consequence will be that metrics will work significantly slower in the Collector.

@jmacd
Copy link
Contributor

jmacd commented Apr 14, 2020 via email

@jkwatson
Copy link
Contributor

@jmacd do you think the current proto definition is robust enough to encode all of the aggregations you envision for OTEP-88 and following?

@jmacd
Copy link
Contributor

jmacd commented Apr 14, 2020 via email

MrAlias added a commit to MrAlias/opentelemetry-proto that referenced this issue Apr 15, 2020
Update metric descriptor to contain information on the data qualities.
This includes:

 - The meaning of the time interval measurements were made over is now
   described by the `interval` value.
 - The monotonic nature of the data is captured if known.
 - If the data values are restricted to a subset of numbers.

The `SummaryDataPoint` and `HistogramDataPoint` were merged into a
unified `Distribution` message that contains statistics about the
observed population of values.

Adds support for multiple histogram bucket definitions and adds a linear
bucket definition message.

Includes an explicit minimum and maximum for a `Distribution` (open-telemetry#122).

Removes the exemplar code (open-telemetry#81). This can be added back in as a more
generalized case now given the new `Data` structure of the Metrics. But,
it has been left for a separate PR.

Unifies the common parts of the previous `*DataPoints` into a unified
`Data` message.

This is not complete. It likely needs better names and it certainly
needs comments.

Related to open-telemetry#125
Fixes to open-telemetry#81
This was referenced Apr 24, 2020
@jmacd
Copy link
Contributor

jmacd commented May 28, 2020

After some progress in both the set of instruments and the terminology used to explain them, I've come to think that the ideal set of properties for describing an aggregation are:

  • Collective vs Additive representation
  • Made synchronously vs asynchronously
  • Monotonic output.

These can be stored using 3 bits. These bits are independent of whether a unit of metric data is considered as "cumulative" or "delta", the so-called temporal quality. In my thinking, both kinds of data are characterized by two timestamps: the start of the interval, and the end of the interval over which the data applies. A cumulative expression of metric data will re-use the same start time, as in: [T0, T1], [T0, T2], [T0, T3], ...

A delta expression of metric data will not re-use the start time:
[T0, T1], [T1, T2], [T2, T3], ...

@bogdandrutu
Copy link
Member

@MrAlias this is very hard to track, do you mind splitting in smaller PRs so we make sure every issue is track accordingly.

@bogdandrutu bogdandrutu added the release:required-for-ga Must be resolved before GA release, or nice to have before GA label Jul 30, 2020
@bogdandrutu
Copy link
Member

If the goal of the OTLP is to transport the output of Instrument -> Aggregator, it should be modeled after the output of the Aggregators, aggregations. And while, yes, the Histogram is one of these aggregations, the included histogram kinds are conflated with instrument qualifiers (Gauge, Cumulative).

I think the goal of OTLP should be:

  1. Support Aggregator output - in a generic way and not having 100s of combinations like SumCount, SumCountMin, etc.
  2. There is a request to also support the output for synchronous Instruments -> to a possible Aggregator. This is still under discussion but will be addressed.

@bogdandrutu
Copy link
Member

@MrAlias is there anything left in this issue, I feel like more than 80-90% of the concerns were addressed. Would like to close this and recommend open a new issue if there is anything else left.

@MrAlias
Copy link
Contributor Author

MrAlias commented Aug 5, 2020

@MrAlias is there anything left in this issue, I feel like more than 80-90% of the concerns were addressed. Would like to close this and recommend open a new issue if there is anything else left.

SGTM. Was just going to recommend the same 👍

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
release:required-for-ga Must be resolved before GA release, or nice to have before GA spec:metrics
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants