Exporter prometheusremotewrite keeps sending data for 5m while receiver has only 1 data point #27893

cbos · 2023-10-20T19:49:01Z

Component(s)

exporter/prometheusremotewrite

What happened?

Description

prometheusremotewrite keeps sending data while the receiver only provides data once in a while or stopped delivering data.

What is the problem with that:

You cannot see how often a certain data point is actually send
If a instrumented app stops sending metrics data to the OLTP endpoint, the prometheusremotewrite keeps sending data. You will only notice the crash after 5 minutes! Only after that the data is missing in the graphs.
If you use the httpreceiver is will first report:

httpcheck.status{http.status_class:2xx, http.status_code:200,...} = 1
httpcheck.status{http.status_class:5xx, http.status_code:200,...} = 0

As soon as it gives problems the new data is from the httpreceiver will be like:

httpcheck.status{http.status_class:2xx, http.status_code:500,...} = 0
httpcheck.status{http.status_class:5xx, http.status_code:500,...} = 1

But the actual send by prometheusremotewrite for 5 minutes.

httpcheck.status{http.status_class:2xx, http.status_code:200,...} = 1
httpcheck.status{http.status_class:5xx, http.status_code:200,...} = 0
httpcheck.status{http.status_class:2xx, http.status_code:500,...} = 0
httpcheck.status{http.status_class:5xx, http.status_code:500,...} = 1

As soon as this is flacky, you will not see the switches either.

Steps to Reproduce

It is easy to reproduce with influxdb receiver (as provided in the separate config)

curl --request POST "http://localhost:8086/api/v2/write?precision=ns" \
  --header "Content-Type: text/plain; charset=utf-8" \
  --header "Accept: application/json" \
  --data-binary "
    airSensors,sensor_id=TLM0201 temperature=75.97038159354763 $(date +%s)000000000
    "

Expected Result

It is unexpected behaviour to keep it sending. Send a single datapoint only once.
If there no other option, then at least make this configurable how long it will keep sending stale data.

Actual Result

Debug log shows only 1 entry.
prometheus endpoint shows this data for the period as defined with metric_expiration. For prometheus endpoint I can understand that you don't know if the endpoint has been scraped. prometheus has the setting send_timestamps: true, then you can see when that last value is updated. A scraper can detect old/stale data.
prometheusremotewrite keeps sending the data for 5 minutes

Collector version

0.87.0

Environment information

OpenTelemetry Collector 0.87.0 docker container: otel/opentelemetry-collector-contrib:0.87.0

OpenTelemetry Collector configuration

receivers:
  otlp:
    protocols:
      grpc:
      http:

  influxdb:
    endpoint: 0.0.0.0:8086

processors:
  batch:

exporters:
  prometheusremotewrite/grafana_cloud_metrics:
    endpoint: "https://....grafana.net/api/prom/push"
    auth:
      authenticator: basicauth/grafana_cloud_prometheus

  prometheus:
    endpoint: "0.0.0.0:8889"
    send_timestamps: true
    metric_expiration: 23s
    resource_to_telemetry_conversion:
      enabled: true

  debug:
    verbosity: detailed
    sampling_initial: 5
    sampling_thereafter: 200

service:
    metrics:
      receivers: [otlp, influxdb]
      processors: []
      exporters: [prometheusremotewrite/grafana_cloud_metrics, prometheus, debug]

Log output

2023-10-20T19:23:25.717Z        info    MetricsExporter {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 1, "data points": 1}
2023-10-20T19:23:25.717Z        info    ResourceMetrics #0
Resource SchemaURL: 
ScopeMetrics #0
ScopeMetrics SchemaURL: 
InstrumentationScope  
Metric #0
Descriptor:
     -> Name: airSensors_temperature
     -> Description: 
     -> Unit: 
     -> DataType: Gauge
NumberDataPoints #0
Data point attributes:
     -> sensor_id: Str(TLM0201)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2023-10-20 19:23:25 +0000 UTC
Value: 75.970382
        {"kind": "exporter", "data_type": "metrics", "name": "debug"}

Additional context

No response

The text was updated successfully, but these errors were encountered:

github-actions · 2023-10-20T19:49:20Z

Pinging code owners:

exporter/prometheusremotewrite: @Aneurysm9 @rapphil

See Adding Labels via Comments if you do not have permissions to add labels yourself.

cbos · 2023-10-21T17:50:55Z

I have looked a bit more into this problem. As I use Grafana Cloud, I thought this might be a problem with Mimir.
So I did a new test with a local prometheus and enabled remote write on the that instance.
But that gives the same behaviour.

I did a reach for this and found this article:
https://www.robustperception.io/staleness-and-promql/

If prometheus scrape detects that an instance is down, if marks all related time series stale with a stale marker.
If a time serie is not marked as stale and does not have updates, prometheus gives the results for 5 minutes.
Prometheus remote write specs has covered that https://prometheus.io/docs/concepts/remote_write_spec/#stale-markers

But how can that be applied?

The prometheusreceiver can implement this probably.
But how does that work for the otlpreceiver, if an (java) instrumented application crashes, it does not send the metrics anymore.
How fast will it be marked as stale?

Same as the httpreceiver, old time series should be marked stale somehow.
Or a configuration option for the prometheusremotewrite should have a configuration option to mark time series stale if there are no updates after xxx time.

5 minutes is a long time to detect if an application has stopped.

dashpole · 2023-10-25T16:05:24Z

I believe the PRW exporter will not send points more than once unless it is retrying a failure. I strongly suspect what is happening is that prometheus displays a line for 5 minutes after it receives a point unless it receives a staleness marker. But since staleness markers are prometheus-specific, you won't get them when receiving data from non-prometheus sources.

The prometheusreceiver can implement this probably.

The prometheus receiver does implement this, and it should work correctly. It uses the OTLP data point flag for "no recorded value" to indicate that a series is stale. The PRW exporter should send a staleness marker when it sees that data point flag.

Overall, this is WAI, although the current UX isn't ideal. There are two potential paths forward:

Push exporters (e.g. OTLP, influx) start sending a version of staleness markers (e.g. by sending "no recorded value" points on shutdown).
The prometheus server uses service discovery to determine which applications it expects data to be pushed from, and generates staleness markers when the "discovered entity" disappears.

jwcesign · 2023-11-22T13:59:19Z

/cc @Aneurysm9 @rapphil

jwcesign · 2023-11-22T14:57:00Z

I solved by setting the Mimir parameter: lookback_delta: 1s

jmichalek132 · 2023-11-22T20:59:34Z

I don't think this should be marked as bug, nor it's an issue with the remote write exporter, as metion in #27893 (comment), this is due to the other types of receivers not having staleness markers. When the metrics are ingested via a receiver that supports them, the remote write exporter sends it to the prometheus backend. What do you think @dashpole?

dashpole · 2023-11-27T15:09:57Z

Agreed. I consider this a feature request to add a notion of staleness to OTel, which would is presumably blocked on such a thing existing in the specification.

github-actions · 2024-02-27T03:29:18Z

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

exporter/prometheusremotewrite: @Aneurysm9 @rapphil

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions · 2024-04-27T05:18:41Z

This issue has been closed as inactive because it has been stale for 120 days with no activity.

cbos added bug Something isn't working needs triage New item requiring triage labels Oct 20, 2023

github-actions bot added the exporter/prometheusremotewrite label Oct 20, 2023

github-actions bot mentioned this issue Oct 24, 2023

Weekly Report: 2023-10-17 - 2023-10-24 #28557

Closed

crobert-1 removed the needs triage New item requiring triage label Oct 30, 2023

dashpole mentioned this issue Oct 31, 2023

[prometheusremotewrite] Otel keep sending metrics for 5 more minutes after last datapoint #28823

Closed

dashpole added enhancement New feature or request and removed bug Something isn't working labels Nov 28, 2023

github-actions bot added the Stale label Feb 27, 2024

github-actions bot added the closed as inactive label Apr 27, 2024

github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Apr 27, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Exporter prometheusremotewrite keeps sending data for 5m while receiver has only 1 data point #27893

Exporter prometheusremotewrite keeps sending data for 5m while receiver has only 1 data point #27893

cbos commented Oct 20, 2023

github-actions bot commented Oct 20, 2023

cbos commented Oct 21, 2023

dashpole commented Oct 25, 2023

jwcesign commented Nov 22, 2023

jwcesign commented Nov 22, 2023 •

edited

Loading

jmichalek132 commented Nov 22, 2023

dashpole commented Nov 27, 2023

github-actions bot commented Feb 27, 2024

github-actions bot commented Apr 27, 2024

Exporter prometheusremotewrite keeps sending data for 5m while receiver has only 1 data point #27893

Exporter prometheusremotewrite keeps sending data for 5m while receiver has only 1 data point #27893

Comments

cbos commented Oct 20, 2023

Component(s)

What happened?

Description

Steps to Reproduce

Expected Result

Actual Result

Collector version

Environment information

OpenTelemetry Collector configuration

Log output

Additional context

github-actions bot commented Oct 20, 2023

cbos commented Oct 21, 2023

dashpole commented Oct 25, 2023

jwcesign commented Nov 22, 2023

jwcesign commented Nov 22, 2023 • edited Loading

jmichalek132 commented Nov 22, 2023

dashpole commented Nov 27, 2023

github-actions bot commented Feb 27, 2024

github-actions bot commented Apr 27, 2024

jwcesign commented Nov 22, 2023 •

edited

Loading