
statsdreceiver: fix start timestamp / temporality for counters #5714

Merged

Conversation

@mwear (Member) commented Oct 11, 2021

Description:
Fixes #5667. The statsdreceiver was not setting the start timestamp for counter metrics, and was conditionally setting the temporality as cumulative for some counters. This PR fixes both issues. The metric parser now tracks the last interval time and uses that as the start timestamp for counters. The translation code was also updated to only produce counters with delta temporality.
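
For illustration, here is a minimal Go sketch of the behavior the description lays out: counters carry delta temporality and a start timestamp taken from the previous interval boundary. The function name buildDeltaCounter and its parameters are illustrative rather than the receiver's actual code, and the calls assume the collector's model/pdata package as it existed around this PR.

package sketch

import (
	"time"

	"go.opentelemetry.io/collector/model/pdata"
)

// buildDeltaCounter is an illustrative helper, not the receiver's real code.
func buildDeltaCounter(name string, value int64, isMonotonic bool, lastIntervalTime, now time.Time) pdata.Metric {
	m := pdata.NewMetric()
	m.SetName(name)
	m.SetDataType(pdata.MetricDataTypeSum)
	m.Sum().SetIsMonotonic(isMonotonic)
	// Counters are always reported with delta temporality.
	m.Sum().SetAggregationTemporality(pdata.MetricAggregationTemporalityDelta)

	dp := m.Sum().DataPoints().AppendEmpty()
	dp.SetIntVal(value)
	// The start timestamp is the end of the previous aggregation interval.
	dp.SetStartTimestamp(pdata.NewTimestampFromTime(lastIntervalTime))
	dp.SetTimestamp(pdata.NewTimestampFromTime(now))
	return m
}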

Link to tracking Issue: #5667

Testing: The metric_translator_test and statsd_parser_test were updated accordingly.

Documentation:

@mwear mwear requested a review from jmacd as a code owner October 11, 2021 22:05
@mwear mwear requested review from a team and bogdandrutu October 11, 2021 22:05
Co-authored-by: Joshua MacDonald <jmacd@users.noreply.github.com>
@bogdandrutu bogdandrutu merged commit 21f0ded into open-telemetry:main Oct 19, 2021
@locmai (Contributor) commented Jan 13, 2022

Hi, the lines

if isMonotonicCounter {
  nm.Sum().SetAggregationTemporality(pdata.MetricAggregationTemporalityCumulative)
}

were introduced by me because the Prometheus exporter drops metrics that are not MetricAggregationTemporalityCumulative. I'm wondering if you had the same issue with statsdreceiver -> prometheusexporter, or if it was just me? And if it is a real issue, would the proposal be to fix it on the exporter side or the receiver side?

@locmai (Contributor) commented Jan 13, 2022

In this case, the metric's AggregationTemporality is now always MetricAggregationTemporalityDelta by the end of the pipeline, so it will always hit the line in prometheusexporter that drops it:

// Drop metrics with non-cumulative aggregations
if doubleSum.AggregationTemporality() != pdata.MetricAggregationTemporalityCumulative {
	return
}

I'm assuming this is correct, and it's just that the specifications for the two sides differ?

@jmacd (Contributor) commented Jan 14, 2022

@locmai Thanks for asking about this. The behavior was indeed going against the specifications for statsd and Prometheus (though if we're nit-picking, statsd isn't really specified). Statsd represents one of the major reasons for requiring Delta temporality in OTLP, but Prometheus accepts only Cumulative temporality.

I apologize that something appears to be missing here. There should be some sort of processor that takes delta temporality and converts it to cumulative temporality. (I am a bit surprised; this ought to exist somewhere, and I thought it did. Maybe it's been factored into another metrics processor? I'm looking for a deltatocumulativeprocessor, whereas today we have deltatorateprocessor and cumulativetodeltaprocessor, which goes the other direction.)

(In the Prometheus ecosystem, the "statsd_exporter" is equivalent to the combination of a statsd receiver, a delta-to-cumulative processor, and a Prometheus exporter.)

Studying the statsdreceiver code suggests another approach: if it simply did not reset its state on each collection interval, you'd have cumulative behavior. This looks like a simple change we could make; are you interested?
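
As a rough sketch of that idea (names here are illustrative, not the receiver's actual structure): the parser keeps per-metric counter state and clears it on each flush, which is what makes the output a delta; skipping the clear would make each flush report a running total.

package sketch

// counterState holds a running value per metric description.
type counterState map[string]int64

type parser struct {
	counters     counterState
	resetOnFlush bool // true today (delta); false would give cumulative behavior
}

// flush reports the current values and, in delta mode, starts over from zero.
func (p *parser) flush() counterState {
	out := make(counterState, len(p.counters))
	for desc, v := range p.counters {
		out[desc] = v
	}
	if p.resetOnFlush {
		p.counters = make(counterState)
	}
	return out
}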

@jmacd (Contributor) commented Jan 14, 2022

@locmai As I read through the change again, what the old code was doing now makes sense. (I had been confused!)

The suggestion above should help. We can add a new configuration option to the receiver called "isCumulative", in which case we will never reset the map. This option is orthogonal to the existing "isMonotonic", which dictates how to translate the statsd data. If we set both isMonotonic and isCumulative, you'll get the Prometheus data you want, but new problems will emerge. The handling of Histogram/Timing points will create problems at least until the statsd receiver is updated to use an exponential histogram, and after that point we'll have a minor issue to resolve related to histogram entropy; see the Prometheus handling of histogram resets in their corresponding histogram structure, e.g. open-telemetry/opentelemetry-specification#2252 (comment).
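
Sketching the shape of such an option in Go (hypothetical: "is_cumulative" does not exist in the receiver, and the existing is_monotonic_counter key is shown from memory for comparison):

package sketch

// Config sketches the proposed receiver option next to the existing one.
// IsCumulative is hypothetical and does not exist in the statsdreceiver today.
type Config struct {
	// Existing: report statsd counters as monotonic sums.
	IsMonotonicCounter bool `mapstructure:"is_monotonic_counter"`
	// Proposed: never reset the counter map and mark sums as cumulative
	// rather than delta, so Prometheus-style exporters can consume them.
	IsCumulative bool `mapstructure:"is_cumulative"`
}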

@locmai (Contributor) commented Jan 14, 2022

Hi! Thanks for the great explanation and suggestions.

Yeah, I also thought we had a deltatocumulativeprocessor before. Just to verify my understanding of your suggestions, we could either:

  1. Implement a deltatocumulativeprocessor as the opposite of cumulativetodeltaprocessor, OR
  2. Add an isCumulative attribute, similar to isMonotonic, and stop the reset (the reset actually hurts us from a different angle as well when used with a Prometheus Counter). I'm not sure how this would relate to the Histogram and Timing handling if we put it in the buildCounterMetric function only?

@jmacd (Contributor) commented Jan 14, 2022

Re: (2), if you only change the behavior for counters, then summaries will still behave oddly, but you're right that it will be less problematic. I'm trying to avoid building a list of every observation from the start of time.

It looks like a third option is to change the way the prometheus exporter accumulates deltas. Currently it drops them; it could sum them instead.
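
A rough sketch of that third option, assuming an accumulator keyed by metric identity (the real prometheusexporter code is organized differently, and double-valued points are assumed for brevity):

package sketch

import "go.opentelemetry.io/collector/model/pdata"

// accumulator keeps a cumulative total per metric identity.
type accumulator struct {
	totals map[string]float64
}

func (a *accumulator) accumulateSum(key string, sum pdata.Sum, dp pdata.NumberDataPoint) {
	switch sum.AggregationTemporality() {
	case pdata.MetricAggregationTemporalityCumulative:
		// Existing behavior: the point already carries the running total.
		a.totals[key] = dp.DoubleVal()
	case pdata.MetricAggregationTemporalityDelta:
		// Proposed behavior: fold the delta into the stored total instead of
		// dropping the point.
		a.totals[key] += dp.DoubleVal()
	}
}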

Development

Successfully merging this pull request may close these issues.

Statsdreceiver outputs cumulative temporality (incorrectly) for Sums w/o start timestamp