Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

OpenTelemetry stats reports histograms incorrectly #31016

Closed
dashpole opened this issue Nov 22, 2023 · 10 comments · Fixed by #31197
Closed

OpenTelemetry stats reports histograms incorrectly #31016

dashpole opened this issue Nov 22, 2023 · 10 comments · Fixed by #31197

Comments

@dashpole
Copy link

dashpole commented Nov 22, 2023

Title: OpenTelemetry stats reports histograms incorrectly

Description:

Sending envoy OpenTelemetry metrics to an OpenTelemetry collector, and using the logging exporter, I observed a histogram where the Count did not match the count of the buckets (see below). From the OTLP proto definition:

// bucket_counts is an optional field contains the count values of histogram
// for each bucket.
//
// The sum of the bucket_counts must equal the value in the count field.
//
// The number of elements in bucket_counts array must be by one greater than
// the number of elements in explicit_bounds array.
repeated fixed64 bucket_counts = 6;

The number of bucket_counts also appears to be the same as the number of explicit bounds, rather than one greater.

Reading through the implementation, it looks like we are using computedBuckets():

data_point->add_bucket_counts(histogram_stats.computedBuckets()[i]);

... which appears to be the count of the number below the threshold:

/**
* Returns computed bucket values during the period. The vector contains an approximation
* of samples below each quantile bucket defined in supportedBuckets(). This vector is
* guaranteed to be the same length as supportedBuckets().
*/
virtual const std::vector<uint64_t>& computedBuckets() const PURE;

computeDisjointBuckets() seems like it potentially does what we are looking for.

/**
* Returns version of computedBuckets() with disjoint buckets. This vector is
* guaranteed to be the same length as supportedBuckets().
*/
virtual std::vector<uint64_t> computeDisjointBuckets() const PURE;

Collector logging exporter output:

StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2023-11-22 00:54:12.184643877 +0000 UTC
Count: 1
Sum: 375.000000
ExplicitBounds #0: 0.500000
ExplicitBounds #1: 1.000000
ExplicitBounds #2: 5.000000
ExplicitBounds #3: 10.000000
ExplicitBounds #4: 25.000000
ExplicitBounds #5: 50.000000
ExplicitBounds #6: 100.000000
ExplicitBounds #7: 250.000000
ExplicitBounds #8: 500.000000
ExplicitBounds #9: 1000.000000
ExplicitBounds #10: 2500.000000
ExplicitBounds #11: 5000.000000
ExplicitBounds #12: 10000.000000
ExplicitBounds #13: 30000.000000
ExplicitBounds #14: 60000.000000
ExplicitBounds #15: 300000.000000
ExplicitBounds #16: 600000.000000
ExplicitBounds #17: 1800000.000000
ExplicitBounds #18: 3600000.000000
Buckets #0, Count: 0
Buckets #1, Count: 0
Buckets #2, Count: 0
Buckets #3, Count: 0
Buckets #4, Count: 0
Buckets #5, Count: 0
Buckets #6, Count: 0
Buckets #7, Count: 0
Buckets #8, Count: 1
Buckets #9, Count: 1
Buckets #10, Count: 1
Buckets #11, Count: 1
Buckets #12, Count: 1
Buckets #13, Count: 1
Buckets #14, Count: 1
Buckets #15, Count: 1
Buckets #16, Count: 1
Buckets #17, Count: 1
Buckets #18, Count: 1

The sum of buckets is 10, but the count is 1.

Repro steps:
Run envoy configured with the OpenTelemetry stats sync and send to an OpenTelemetry collector with the logging exporter, with logLevel: debug to print out the OTLP.

Note: The Envoy_collect tool
gathers a tarball with debug logs, config and the following admin
endpoints: /stats, /clusters and /server_info. Please note if there are
privacy concerns, sanitize the data prior to sharing the tarball/pasting.

Admin and Stats Output:

Include the admin output for the following endpoints: /stats,
/clusters, /routes, /server_info. For more information, refer to the
admin endpoint documentation.

Note: If there are privacy concerns, sanitize the data prior to
sharing.

Config:

Include the config used to configure Envoy.

Logs:

Include the access logs and the Envoy logs.

Note: If there are privacy concerns, sanitize the data prior to
sharing.

Call Stack:

If the Envoy binary is crashing, a call stack is required.
Please refer to the Bazel Stack trace documentation.

cc @ohadvano

@dashpole dashpole added bug triage Issue requires triage labels Nov 22, 2023
@KBaichoo KBaichoo added area/opentelemetry and removed triage Issue requires triage labels Nov 23, 2023
@bhavisubu
Copy link

This is the output from otel collector which includes the metric name.

Metric #184
Descriptor:
-> Name: tls_inspector.bytes_processed
-> Description:
-> Unit:
-> DataType: Histogram
-> AggregationTemporality: Cumulative
HistogramDataPoints #0
Data point attributes:
-> hostname: Str(,region:us-east1)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2023-11-22 00:51:12.180321127 +0000 UTC
Count: 1
Sum: 375.000000
ExplicitBounds #0: 0.500000
ExplicitBounds #1: 1.000000
ExplicitBounds #2: 5.000000
ExplicitBounds #3: 10.000000
ExplicitBounds #4: 25.000000
ExplicitBounds #5: 50.000000
ExplicitBounds #6: 100.000000
ExplicitBounds #7: 250.000000
ExplicitBounds #8: 500.000000
ExplicitBounds #9: 1000.000000
ExplicitBounds #10: 2500.000000
ExplicitBounds #11: 5000.000000
ExplicitBounds #12: 10000.000000
ExplicitBounds #13: 30000.000000
ExplicitBounds #14: 60000.000000
ExplicitBounds #15: 300000.000000
ExplicitBounds #16: 600000.000000
ExplicitBounds #17: 1800000.000000
ExplicitBounds #18: 3600000.000000
Buckets #0, Count: 0
Buckets #1, Count: 0
Buckets #2, Count: 0
Buckets #3, Count: 0
Buckets #4, Count: 0
Buckets #5, Count: 0
Buckets #6, Count: 0
Buckets #7, Count: 0
Buckets #8, Count: 1
Buckets #9, Count: 1
Buckets #10, Count: 1
Buckets #11, Count: 1
Buckets #12, Count: 1
Buckets #13, Count: 1
Buckets #14, Count: 1
Buckets #15, Count: 1
Buckets #16, Count: 1
Buckets #17, Count: 1
Buckets #18, Count: 1
{"kind": "exporter", "data_type": "metrics", "name": "logging"}
2023-11-22T00:51:13.349Z error exporterhelper/queued_retry.go:364 Exporting failed. Try enabling retry_on_failure config option to retry on retryable errors {"kind": "exporter", "data_type": "metrics", "name": "googlecloud", "error": "rpc error: code = InvalidArgument desc = One or more TimeSeries could not be written: Field timeSeries[164].points[0].distributionValue had an invalid value: Distribution value has |bucket_counts| fields that sum to 19 which does not equal the |count| field value of 1.; Field timeSeries[163].points[0].distributionValue had an invalid value: Distribution value has |bucket_counts| fields that sum to 5447 which does not equal the |count| field value of 480.; Field timeSeries[167].points[0].distributionValue had an invalid value: Distribution value has |bucket_counts| fields that sum to 19 which does not equal the |count| field value of 1.\nerror details: name = Unknown desc = total_point_count:168 success_point_count:165 errors:{status:{code:3} point_count:3}"}
go.opentelemetry.io/collector/exporter/exporterhelper.(*retrySender).send
go.opentelemetry.io/collector/exporter@v0.84.0/exporterhelper/queued_retry.go:364
go.opentelemetry.io/collector/exporter/exporterhelper.(*metricsSenderWithObservability).send
go.opentelemetry.io/collector/exporter@v0.84.0/exporterhelper/metrics.go:191
go.opentelemetry.io/collector/exporter/exporterhelper.(*queuedRetrySender).start.func1
go.opentelemetry.io/collector/exporter@v0.84.0/exporterhelper/queued_retry.go:202
go.opentelemetry.io/collector/exporter/exporterhelper/internal.(*boundedMemoryQueue).StartConsumers.func1
go.opentelemetry.io/collector/exporter@v0.84.0/exporterhelper/internal/bounded_memory_queue.go:47

@sundarms
Copy link

sundarms commented Dec 1, 2023

@dashpole Thanks for raising this issue.
As per openmetrics spec https://github.com/OpenObservability/OpenMetrics/blob/main/specification/OpenMetrics.md#histogram-1

# TYPE foo histogram
foo_bucket{le="0.0"} 0
foo_bucket{le="1e-05"} 0
foo_bucket{le="0.0001"} 5
foo_bucket{le="0.1"} 8
foo_bucket{le="1.0"} 10
foo_bucket{le="10.0"} 11
foo_bucket{le="100000.0"} 11
foo_bucket{le="1e+06"} 15
foo_bucket{le="1e+23"} 16
foo_bucket{le="1.1e+23"} 17
foo_bucket{le="+Inf"} 17
foo_count 17
foo_sum 324789.3
foo_created 1520430000.123

Count should be no of observed events and sum should be sum of all observed values correct?

In case of envoy output

Timestamp: 2023-11-22 00:54:12.184643877 +0000 UTC
Count: 1
Sum: 375.000000
ExplicitBounds #0: 0.500000
ExplicitBounds #1: 1.000000
ExplicitBounds #2: 5.000000
ExplicitBounds #3: 10.000000
ExplicitBounds #4: 25.000000
ExplicitBounds #5: 50.000000
ExplicitBounds #6: 100.000000
ExplicitBounds #7: 250.000000
ExplicitBounds #8: 500.000000
ExplicitBounds #9: 1000.000000
ExplicitBounds #10: 2500.000000
ExplicitBounds #11: 5000.000000
ExplicitBounds #12: 10000.000000
ExplicitBounds #13: 30000.000000
ExplicitBounds #14: 60000.000000
ExplicitBounds #15: 300000.000000
ExplicitBounds #16: 600000.000000
ExplicitBounds #17: 1800000.000000
ExplicitBounds #18: 3600000.000000
Buckets #0, Count: 0
Buckets #1, Count: 0
Buckets #2, Count: 0
Buckets #3, Count: 0
Buckets #4, Count: 0
Buckets #5, Count: 0
Buckets #6, Count: 0
Buckets #7, Count: 0
Buckets #8, Count: 1
Buckets #9, Count: 1
Buckets #10, Count: 1
Buckets #11, Count: 1
Buckets #12, Count: 1
Buckets #13, Count: 1
Buckets #14, Count: 1
Buckets #15, Count: 1
Buckets #16, Count: 1
Buckets #17, Count: 1
Buckets #18, Count: 1

I see it is giving correct output
No of observed events (Count = 1)
Sum of observed values (Sum = 375.000000)

Is exporter expecting count as no of buckets? Isn't that incorrect? Did i miss anything?

@dashpole
Copy link
Author

dashpole commented Dec 1, 2023

Hi @sundarms, OpenTelemetry (opentelemetry.io) differs from OpenMetrics (openmetrics.io) in this regard. OpenMetrics buckets are cumulative (count includes all observations below the threshold), as you point out, but OpenTelemetry buckets are not:

https://opentelemetry.io/docs/specs/otel/metrics/data-model/#histogram

The correct output should have been:

Timestamp: 2023-11-22 00:54:12.184643877 +0000 UTC
Count: 1
Sum: 375.000000
ExplicitBounds #0: 0.500000
ExplicitBounds #1: 1.000000
ExplicitBounds #2: 5.000000
ExplicitBounds #3: 10.000000
ExplicitBounds #4: 25.000000
ExplicitBounds #5: 50.000000
ExplicitBounds #6: 100.000000
ExplicitBounds #7: 250.000000
ExplicitBounds #8: 500.000000
ExplicitBounds #9: 1000.000000
ExplicitBounds #10: 2500.000000
ExplicitBounds #11: 5000.000000
ExplicitBounds #12: 10000.000000
ExplicitBounds #13: 30000.000000
ExplicitBounds #14: 60000.000000
ExplicitBounds #15: 300000.000000
ExplicitBounds #16: 600000.000000
ExplicitBounds #17: 1800000.000000
ExplicitBounds #18: 3600000.000000
Buckets #0, Count: 0
Buckets #1, Count: 0
Buckets #2, Count: 0
Buckets #3, Count: 0
Buckets #4, Count: 0
Buckets #5, Count: 0
Buckets #6, Count: 0
Buckets #7, Count: 0
Buckets #8, Count: 1
- Buckets #9, Count: 1
+ Buckets #9, Count: 0
- Buckets #10, Count: 1
+ Buckets #10, Count: 0
- Buckets #11, Count: 1
+ Buckets #11, Count: 0
- Buckets #12, Count: 1
+ Buckets #12, Count: 0
- Buckets #13, Count: 1
+ Buckets #13, Count: 0
- Buckets #14, Count: 1
+ Buckets #14, Count: 0
- Buckets #15, Count: 1
+ Buckets #15, Count: 0
- Buckets #16, Count: 1
+ Buckets #16, Count: 0
- Buckets #17, Count: 1
+ Buckets #17, Count: 0
- Buckets #18, Count: 1
+ Buckets #18, Count: 0

I.e. a single Count: 1 at Bucket #8. The sum of all bucket counts is then 1, which equals the Count field at the top.

See, for example, the conversion of prometheus histogram bucket counts to OpenTelemetry histogram buckets here. It subtracts the previous bucket's count of a prometheus histogram from the current bucket's count to get the OpenTelemetry histogram bucket's count.

@sundarms
Copy link

sundarms commented Dec 1, 2023

@dashpole thanks for the explanation.

@owensulei3841
Copy link

Hi @ohadvano , could I ask what's the estimated timeline for this issue be resolved?
Thank you

@ohadvano
Copy link
Contributor

ohadvano commented Dec 5, 2023

Hi, I'll try to get to this soon. I am now noticing two problems actually. First is as described in this issue, the second is that the bucket_counts array needs to be 1 element bigger than the explicit_bounds array (according to the specs). This causes a problem where if you set bounds to be [1,2,3], and a histogram record of '200' is inserted, the count would be 1, the sum would be 200, but there won't be a bucket_count element indicating this (i.e., all values of the array will be 0) - [0, 0, 0]. I believe that in this case, it should be [0, 0, 0, 1] as the last element indicates the explicit_bounds[i-1], +infinity) for i == size(explicit_bounds)?

Fixing the first is trivial, but I'm trying to see how to fix the second problem. I'm not sure there's an API to get the count of elements out of the explicit bounds, so might need to get this in another way

@dashpole
Copy link
Author

dashpole commented Dec 5, 2023

Thanks @ohadvano!

I believe that in this case, it should be [0, 0, 0, 1] as the last element indicates the explicit_bounds[i-1], +infinity) for i == size(explicit_bounds)?

Yes, thats correct.

@jmarantz
Copy link
Contributor

jmarantz commented Dec 5, 2023

@ohadvano I'm not sure what level of API you are looking at, but you can get all of the data out of the histogram structures. E.g. see the implementation of the 'detailed' histogram admin endpoint used for rendering, here:

ParentHistogramImpl::detailedlBucketsHelper(const histogram_t& histogram) {

@sundarms
Copy link

sundarms commented Dec 6, 2023

@ohadvano thanks for fix pr. Is there any specific release version/date this will be pushed out?

@ohadvano
Copy link
Contributor

ohadvano commented Dec 6, 2023

You can either wait for the next official release, which is next month, or cherry pick the commit

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging a pull request may close this issue.

7 participants