
Exporter prometheusremotewrite keeps sending data for 5m while receiver has only 1 data point #27893

Closed
cbos opened this issue Oct 20, 2023 · 9 comments

Comments

@cbos commented Oct 20, 2023

Component(s)

exporter/prometheusremotewrite

What happened?

Description

The prometheusremotewrite exporter keeps sending data while the receiver only provides data once in a while or has stopped delivering data.

What is the problem with that:

  • You cannot see how often a certain data point is actually sent.
  • If an instrumented app stops sending metric data to the OTLP endpoint, prometheusremotewrite keeps sending data. You will only notice the crash after 5 minutes; only then does the data go missing from the graphs.
  • If you use the httpcheck receiver (a minimal configuration is sketched below), it will first report:
httpcheck.status{http.status_class:2xx, http.status_code:200,...} = 1
httpcheck.status{http.status_class:5xx, http.status_code:200,...} = 0

As soon as the target starts failing, the new data from the httpcheck receiver will look like:

httpcheck.status{http.status_class:2xx, http.status_code:500,...} = 0
httpcheck.status{http.status_class:5xx, http.status_code:500,...} = 1

But what prometheusremotewrite actually sends for those 5 minutes is:

httpcheck.status{http.status_class:2xx, http.status_code:200,...} = 1
httpcheck.status{http.status_class:5xx, http.status_code:200,...} = 0
httpcheck.status{http.status_class:2xx, http.status_code:500,...} = 0
httpcheck.status{http.status_class:5xx, http.status_code:500,...} = 1

As soon as this is flaky, you will not see the switches either.
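For reference, a minimal httpcheck receiver configuration that produces httpcheck.status series like the ones above could look like the following sketch; the target endpoint and interval are placeholders, not taken from the original report:

receivers:
  httpcheck:
    targets:
      # placeholder target to be checked
      - endpoint: http://localhost:8080/health
        method: GET
    collection_interval: 10s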

Steps to Reproduce

It is easy to reproduce with the influxdb receiver (as provided in the configuration below):

curl --request POST "http://localhost:8086/api/v2/write?precision=ns" \
  --header "Content-Type: text/plain; charset=utf-8" \
  --header "Accept: application/json" \
  --data-binary "
    airSensors,sensor_id=TLM0201 temperature=75.97038159354763 $(date +%s)000000000
    "

Expected Result

It is unexpected behaviour that it keeps sending; a single data point should be sent only once.
If there is no other option, then at least make it configurable how long stale data keeps being sent.
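Purely as an illustration of such an option, a sketch of what it might look like; note that stale_after is a hypothetical setting and does not exist in the prometheusremotewrite exporter today:

exporters:
  prometheusremotewrite:
    endpoint: "https://....grafana.net/api/prom/push"
    # Hypothetical option (not implemented): emit a Prometheus staleness marker
    # for any series that has received no new data point for this duration.
    stale_after: 30s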

Actual Result

  • Debug log shows only 1 entry.
  • The prometheus exporter endpoint shows this data for the period defined by metric_expiration. For the prometheus endpoint I can understand this, because you don't know whether the endpoint has been scraped; with send_timestamps: true you can see when the last value was updated, so a scraper can detect old/stale data.
  • prometheusremotewrite keeps sending the data for 5 minutes

Collector version

0.87.0

Environment information

OpenTelemetry Collector 0.87.0 docker container: otel/opentelemetry-collector-contrib:0.87.0

OpenTelemetry Collector configuration

receivers:
  otlp:
    protocols:
      grpc:
      http:

  influxdb:
    endpoint: 0.0.0.0:8086

processors:
  batch:

exporters:
  prometheusremotewrite/grafana_cloud_metrics:
    endpoint: "https://....grafana.net/api/prom/push"
    auth:
      authenticator: basicauth/grafana_cloud_prometheus

  prometheus:
    endpoint: "0.0.0.0:8889"
    send_timestamps: true
    metric_expiration: 23s
    resource_to_telemetry_conversion:
      enabled: true

  debug:
    verbosity: detailed
    sampling_initial: 5
    sampling_thereafter: 200

service:
  pipelines:
    metrics:
      receivers: [otlp, influxdb]
      processors: []
      exporters: [prometheusremotewrite/grafana_cloud_metrics, prometheus, debug]

Log output

2023-10-20T19:23:25.717Z        info    MetricsExporter {"kind": "exporter", "data_type": "metrics", "name": "debug", "resource metrics": 1, "metrics": 1, "data points": 1}
2023-10-20T19:23:25.717Z        info    ResourceMetrics #0
Resource SchemaURL: 
ScopeMetrics #0
ScopeMetrics SchemaURL: 
InstrumentationScope  
Metric #0
Descriptor:
     -> Name: airSensors_temperature
     -> Description: 
     -> Unit: 
     -> DataType: Gauge
NumberDataPoints #0
Data point attributes:
     -> sensor_id: Str(TLM0201)
StartTimestamp: 1970-01-01 00:00:00 +0000 UTC
Timestamp: 2023-10-20 19:23:25 +0000 UTC
Value: 75.970382
        {"kind": "exporter", "data_type": "metrics", "name": "debug"}

Additional context

No response

cbos added the bug ("Something isn't working") and needs triage ("New item requiring triage") labels on Oct 20, 2023.
@github-actions (Contributor)

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@cbos (Author) commented Oct 21, 2023

I have looked a bit more into this problem. As I use Grafana Cloud, I thought this might be a problem with Mimir.
So I did a new test with a local Prometheus and enabled remote write on that instance.
But that gives the same behaviour.

[screenshot attached in the original comment]

I did some research on this and found this article:
https://www.robustperception.io/staleness-and-promql/

If a Prometheus scrape detects that an instance is down, it marks all related time series as stale with a stale marker.
If a time series is not marked as stale and does not get updates, Prometheus keeps returning the last value for 5 minutes.
The Prometheus remote write spec covers this: https://prometheus.io/docs/concepts/remote_write_spec/#stale-markers

But how can that be applied?

The prometheusreceiver can probably implement this.
But how does that work for the otlpreceiver? If a (Java) instrumented application crashes, it simply stops sending metrics.
How fast will its series be marked as stale?

The same goes for the httpcheck receiver: old time series should be marked stale somehow.
Or the prometheusremotewrite exporter should have a configuration option to mark time series stale if there are no updates after xxx time.

5 minutes is a long time to detect if an application has stopped.

@dashpole (Contributor)

I believe the PRW exporter will not send points more than once unless it is retrying a failure. I strongly suspect what is happening is that prometheus displays a line for 5 minutes after it receives a point unless it receives a staleness marker. But since staleness markers are prometheus-specific, you won't get them when receiving data from non-prometheus sources.

The prometheusreceiver can implement this probably.

The prometheus receiver does implement this, and it should work correctly. It uses the OTLP data point flag for "no recorded value" to indicate that a series is stale. The PRW exporter should send a staleness marker when it sees that data point flag.

Overall, this is WAI, although the current UX isn't ideal. There are two potential paths forward:

  1. Push exporters (e.g. OTLP, influx) start sending a version of staleness markers (e.g. by sending "no recorded value" points on shutdown).
  2. The prometheus server uses service discovery to determine which applications it expects data to be pushed from, and generates staleness markers when the "discovered entity" disappears.

@jwcesign

/cc @Aneurysm9 @rapphil

@jwcesign commented Nov 22, 2023

I solved it by setting the Mimir parameter lookback_delta: 1s.
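For reference, this presumably corresponds to Mimir's query lookback window (the -querier.lookback-delta flag). A sketch of where it would sit in the Mimir YAML configuration, assuming the querier block accepts it:

# Mimir configuration (sketch): shorten the query lookback window so that
# series without recent samples drop out of query results sooner.
querier:
  lookback_delta: 1s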

@jmichalek132 (Contributor)

I don't think this should be marked as a bug, nor is it an issue with the remote write exporter. As mentioned in #27893 (comment), this is due to the other types of receivers not having staleness markers. When the metrics are ingested via a receiver that does support them, the remote write exporter sends the staleness markers to the Prometheus backend. What do you think @dashpole?

@dashpole (Contributor)

Agreed. I consider this a feature request to add a notion of staleness to OTel, which is presumably blocked on such a thing existing in the specification.

dashpole added the enhancement ("New feature or request") label and removed the bug ("Something isn't working") label on Nov 28, 2023.
@github-actions (Contributor)

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

The github-actions bot added the Stale label on Feb 27, 2024.
@github-actions (Contributor)

This issue has been closed as inactive because it has been stale for 120 days with no activity.

The github-actions bot closed this as not planned (won't fix, can't repro, duplicate, stale) on Apr 27, 2024.