-
Notifications
You must be signed in to change notification settings - Fork 264
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Metrics Proto Does Not Support the OpenTelemetry Specification #125
Comments
@MrAlias our approach so far has been that we file individual issues and PRs that fix them, gradually moving towards the desirable final proto. We had more time to do this for traces and it looks like it went well. With metric we had less time and there are issues that need to be addressed. If you know what the right fixes to the identified issues are it would be great if you could submit corresponding PRs.
Is there such a disconnect that requires a major overhaul? I am not a metric spec/SDK contributor, so I cannot assess myself how big is the disconnect. @open-telemetry/specs-approvers whatever changes we want to do to the proto it is very desirable to do them quickly. I know multiple people who are currently ramping up to write code that assumes the current proto. If we do not finalize the proto quickly we will cause pain by introducing breaking changes. If we are unable to have the final metric proto version within the next few days it is best that we announce the metric part of the protocol as not ready for implementation. |
@tigrannajaryan what did you decide on this? As it's written now, the types in the metrics proto are very different than the instruments and aggregations in the spec or OTEPS (e.g. open-telemetry/oteps#88 and open-telemetry/oteps#93). Unless we expect to change the spec to match the proto this looks to me like it's not ready for implementation. |
@c24t I do not know enough about the metrics to make a call on this. @bogdandrutu @jmacd can you help? |
@tigrannajaryan: I started in on a proposal refactor of the proto, but have not found the time to get it into a presentable state. The rough structure of the proposal is to switch to something closer to the Stackdriver proto model for metrics and update the data points to be more along the lines of I agree with @c24t in that it might be wasted work to implement at this point as a I think there needs to be a refactor to support the OpenTelemetry metric data. |
I look forward to seeing this proposal @MrAlias. |
Do note that we have an implementation of metrics proto in Collector already and increasingly large amount of code depends on it because it is the foundation of internal representation. The longer we wait for the proposal the more code needs to be rewritten. If the proposal doesn't happen quickly them we have a problem, we do not have time rewrite thousands of lines of cod (depending on how big the changes are). |
If this gets delayed more we may end up staying with OpenCensus representation in the Collector, since I don't think we will have time or people to rewrite all that code. The consequence will be that metrics will work significantly slower in the Collector. |
I see. Using a data-transfer-object as the primary internal representation
for a large server has caused this headache for me in the past. I wonder if
we could use a protocol-independent type internally.
…On Tue, Apr 14, 2020 at 7:25 AM Tigran Najaryan ***@***.***> wrote:
If this gets delayed more we may end up staying with OpenCensus
representation in the Collector, since I don't think we will have time or
people to rewrite all that code. The consequence will be that metrics will
work significantly slower in the Collector.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#125 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AA3WFCIFRZT7GR6WIC5AHRTRMRW6PANCNFSM4LHJALVA>
.
|
@jmacd do you think the current proto definition is robust enough to encode all of the aggregations you envision for OTEP-88 and following? |
I've been thinking about this today. :-)
The descriptor structure worries me. The fact that MinMaxSumCount has to be
expressed as a Summary doesn't worry me.
I haven't reached any conclusions, but I started looking into the collector
codebase.
…On Tue, Apr 14, 2020 at 12:47 PM John Watson ***@***.***> wrote:
@jmacd <https://github.com/jmacd> do you think the current proto
definition is robust enough to encode all of the aggregations you envision
for OTEP-88 and following?
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#125 (comment)>,
or unsubscribe
<https://github.com/notifications/unsubscribe-auth/AA3WFCNPSBTZAAZYG6PM4TLRMS4ULANCNFSM4LHJALVA>
.
|
Update metric descriptor to contain information on the data qualities. This includes: - The meaning of the time interval measurements were made over is now described by the `interval` value. - The monotonic nature of the data is captured if known. - If the data values are restricted to a subset of numbers. The `SummaryDataPoint` and `HistogramDataPoint` were merged into a unified `Distribution` message that contains statistics about the observed population of values. Adds support for multiple histogram bucket definitions and adds a linear bucket definition message. Includes an explicit minimum and maximum for a `Distribution` (open-telemetry#122). Removes the exemplar code (open-telemetry#81). This can be added back in as a more generalized case now given the new `Data` structure of the Metrics. But, it has been left for a separate PR. Unifies the common parts of the previous `*DataPoints` into a unified `Data` message. This is not complete. It likely needs better names and it certainly needs comments. Related to open-telemetry#125 Fixes to open-telemetry#81
After some progress in both the set of instruments and the terminology used to explain them, I've come to think that the ideal set of properties for describing an aggregation are:
These can be stored using 3 bits. These bits are independent of whether a unit of metric data is considered as "cumulative" or "delta", the so-called temporal quality. In my thinking, both kinds of data are characterized by two timestamps: the start of the interval, and the end of the interval over which the data applies. A cumulative expression of metric data will re-use the same start time, as in: [T0, T1], [T0, T2], [T0, T3], ... A delta expression of metric data will not re-use the start time: |
@MrAlias this is very hard to track, do you mind splitting in smaller PRs so we make sure every issue is track accordingly. |
I think the goal of OTLP should be:
|
@MrAlias is there anything left in this issue, I feel like more than 80-90% of the concerns were addressed. Would like to close this and recommend open a new issue if there is anything else left. |
SGTM. Was just going to recommend the same 👍 |
As language SIGs include support to export metrics using the OTLP it is being discovered that the native export format they are exporting from do not fit nicely or at all into the OTLP structure. This issue is intended to track these friction points, proposed changes, and action items to resolution.
Ideally the OTLP would natively support the output of all combination of
Instrument
s andAggregator
s defined in the OpenTelemetry Specification. Currently, this is not the case.Identified Issues
First Class Support For Min and Max Values of a MinMaxSumCount Aggregation
There is no metric kind for a
MinMaxSumCount
aggregator that outputs the minimum, maximum, sum, and count of events observed.It has been suggested that a
Summary
metric kind can be used to transport these values. TheSummary
kind does have fields for the count and sum, however, it does not have dedicated fields for the maximum nor minimum.The suggested work around here is to send the maximum as the 100th percentile and the minimum as the 0th percentile. Both of which are not obvious and the latter is mathematically incorrect (the 0th percentile is the value which 0% of events occured, where as the minimum is the minimal value where at least 1 event occured).
HistogramValue.Bucket.Exemplar is inadequate
As outlined in the linked ticket:
Instrumentation Instead of Aggregation
The OTLP metric kinds are centered around instruments not output from Aggregation:
The incongruence of the instruments these kinds were modeled after and the actual instruments of the OpenTelemetry Specification is secondary to the fact that they are modeled after instruments in the first place.
If the goal of the OTLP is to transport the output of
Instrument
->Aggregator
, it should be modeled after the output of theAggregator
s, aggregations. And while, yes, theHistogram
is one of these aggregations, the included histogram kinds are conflated with instrument qualifiers (Gauge
,Cumulative
).Nebulously Defined Need to be Compatible with OpenMetrics
A common refrain for the design of the OTLP is "We are trying to also stay compatible with OpenMetrics protocol".
This desire to remain compatible is an admirable one, it is negatively impacting this project. The OTLP in its pursuit to maintain easy translation to OpenMetrics has made it difficult to translate from OpenTelemetry into it. Additionally, it has led to a design that is bloated with OpenMetrics metric kinds (
GAUGE_INT64
,GAUGE_DOUBLE
) that OpenTelemetry does not use.It seems as though if support for the OpenMetrics protocol is desired, the implementations of the OpenTelemetry Specification should support that protocol. Instead of having the OTLP try to match that projects decisions.
Steps Forward
I'm hoping to have this Issue start the discussion on how we can resolve these issues. I see the metrics proto in need of a major overhaul to provide first class support for OpenTelemetry. Possibly being done in a
v2
of the proto.cc @bogdandrutu, @jmacd, @jkwatson, @c24t
The text was updated successfully, but these errors were encountered: