prometheusremotewrite context deadline exceeded #31910

Open · frakev opened this issue Mar 22, 2024 · 15 comments

frakev commented Mar 22, 2024

Component(s)

exporter/prometheusremotewrite

What happened?

Description

If the endpoint is not reachable and the OTel Collector can't send the metrics, I get the error messages shown below.

Steps to Reproduce

  • Configure a prometheus receiver that scrapes some metrics (or the collector itself).
  • Configure an exporter (prometheusremotewrite) with one endpoint.
  • Configure a processor; in my case, the "batch" processor.
  • Finally, configure a service pipeline.
  • Start your OTel Collector so it scrapes and sends metrics.
  • After a few minutes, simulate an endpoint outage. See "Actual Result".

Expected Result

No error messages, but an info log indicating that the collector is queuing the metrics because the endpoint is down.
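
For reference, a minimal sketch of the queue and retry settings this exporter exposes (values are illustrative only; note that the exporter helper does not retry errors it classifies as permanent, which is exactly what the log below reports):

exporters:
  prometheusremotewrite:
    endpoint: http://X.X.X.X/push   # placeholder, matching the config below
    remote_write_queue:
      enabled: true        # buffer batches in memory while the endpoint is down
      queue_size: 100000
      num_consumers: 5
    retry_on_failure:      # retries apply only to errors reported as retryable
      enabled: true
      initial_interval: 5s
      max_interval: 30s
      max_elapsed_time: 300s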

Actual Result

2024-03-22T14:35:06.566Z error exporterhelper/queue_sender.go:97 Exporting failed. Dropping data. {"kind": "exporter", "data_type": "metrics", "name": "prometheusremotewrite", "error": "Permanent error: Permanent error: context deadline exceeded", "dropped_items": 2353}
2024-03-22T14:35:12.008Z error exporterhelper/queue_sender.go:97 Exporting failed. Dropping data. {"kind": "exporter", "data_type": "metrics", "name": "prometheusremotewrite", "error": "Permanent error: Permanent error: context deadline exceeded", "dropped_items": 24}

(Full stack traces are in the "Log output" section below.)

Collector version

0.96.0

Environment information

Environment

Docker image: otel/opentelemetry-collector:0.96.0

OpenTelemetry Collector configuration

receivers:
  prometheus:
    config:
      scrape_configs:
        - job_name: node-exporter
          scrape_interval: 5s
          static_configs:
            - targets: [localhost:9100]
        - job_name: otel-collector
          scrape_interval: 5s
          static_configs:
            - targets: [localhost:8888]

exporters:
    prometheusremotewrite:
      endpoint: http://X.X.X.X/push
      headers:
        Authorization: "Bearer toto"
      remote_write_queue:
        enabled: True
        queue_size: 100000
        num_consumers: 5
    debug:
      verbosity: detailed

processors:
  batch:

service:
  pipelines:
    metrics:
      receivers: [prometheus]
      processors: [batch]
      exporters: [prometheusremotewrite]
  telemetry:
    metrics:
      level: detailed
      address: 0.0.0.0:8888

Log output

2024-03-22T14:35:06.566Z	error	exporterhelper/queue_sender.go:97	Exporting failed. Dropping data.	{"kind": "exporter", "data_type": "metrics", "name": "prometheusremotewrite", "error": "Permanent error: Permanent error: context deadline exceeded", "dropped_items": 2353}
go.opentelemetry.io/collector/exporter/exporterhelper.newQueueSender.func1
	go.opentelemetry.io/collector/exporter@v0.96.0/exporterhelper/queue_sender.go:97
go.opentelemetry.io/collector/exporter/internal/queue.(*boundedMemoryQueue[...]).Consume
	go.opentelemetry.io/collector/exporter@v0.96.0/internal/queue/bounded_memory_queue.go:57
go.opentelemetry.io/collector/exporter/internal/queue.(*Consumers[...]).Start.func1
	go.opentelemetry.io/collector/exporter@v0.96.0/internal/queue/consumers.go:43
2024-03-22T14:35:12.008Z	error	exporterhelper/queue_sender.go:97	Exporting failed. Dropping data.	{"kind": "exporter", "data_type": "metrics", "name": "prometheusremotewrite", "error": "Permanent error: Permanent error: context deadline exceeded", "dropped_items": 24}
go.opentelemetry.io/collector/exporter/exporterhelper.newQueueSender.func1
	go.opentelemetry.io/collector/exporter@v0.96.0/exporterhelper/queue_sender.go:97
go.opentelemetry.io/collector/exporter/internal/queue.(*boundedMemoryQueue[...]).Consume
	go.opentelemetry.io/collector/exporter@v0.96.0/internal/queue/bounded_memory_queue.go:57
go.opentelemetry.io/collector/exporter/internal/queue.(*Consumers[...]).Start.func1
	go.opentelemetry.io/collector/exporter@v0.96.0/internal/queue/consumers.go:43

Additional context

No response

@frakev added the bug and needs triage labels on Mar 22, 2024

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@martinohansen

I'm seeing something similar. I added some details over on this issue: open-telemetry/opentelemetry-collector#8217 (comment)

@martinohansen

I managed to solve (or brute-force?) the issue by setting this in the exporter; the default is 5 consumers.

remote_write_queue:
  num_consumers: 50
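
For context, a sketch of where this setting sits in a full exporter block (the endpoint is a placeholder); more consumers means more batches are sent from the queue in parallel:

exporters:
  prometheusremotewrite:
    endpoint: http://X.X.X.X/push   # placeholder
    remote_write_queue:
      enabled: true
      num_consumers: 50   # default is 5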


frakev commented Apr 11, 2024

Thank you @martinohansen. I'll try, but it's just a workaround and maybe this issue needs to be fixed.

@dhilgarth

I'm seeing a similar error, but increasing num_consumers didn't help.
I'm running otel-collector-contrib 0.98.0 in global mode (so one instance per node) in Docker Swarm.
My configuration generally works, in that 4 out of 6 instances send their metrics to the endpoint without problems.

However, two instances can't, and they fail with:

2024-04-14T21:40:38.279Z	error	exporterhelper/queue_sender.go:101	Exporting failed. Dropping data.	{"kind": "exporter", "data_type": "metrics", "name": "prometheusremotewrite/metrics-infrastructure", "error": "Permanent error: Permanent error: context deadline exceeded", "dropped_items": 1425}
go.opentelemetry.io/collector/exporter/exporterhelper.newQueueSender.func1
	go.opentelemetry.io/collector/exporter@v0.98.0/exporterhelper/queue_sender.go:101
go.opentelemetry.io/collector/exporter/internal/queue.(*boundedMemoryQueue[...]).Consume
	go.opentelemetry.io/collector/exporter@v0.98.0/internal/queue/bounded_memory_queue.go:57
go.opentelemetry.io/collector/exporter/internal/queue.(*Consumers[...]).Start.func1
	go.opentelemetry.io/collector/exporter@v0.98.0/internal/queue/consumers.go:43

No more info is available.

This is my config:

receivers:
  hostmetrics:
    collection_interval: 3s
    root_path: /hostfs
    scrapers:
      cpu:
        metrics:
          system.cpu.logical.count:
            enabled: true
          system.cpu.physical.count:
            enabled: true
          "system.cpu.frequency":
            enabled: true
          "system.cpu.utilization":
            enabled: true
      load: { }
      memory:
        metrics:
          "system.linux.memory.available":
            enabled: true
          "system.memory.limit":
            enabled: true
          "system.memory.utilization":
            enabled: true
      disk: { }
      filesystem:
        metrics:
          "system.filesystem.utilization":
            enabled: true
      paging:
        metrics:
          "system.paging.utilization":
            enabled: true
          "system.paging.usage":
            enabled: true
      network: { }
      process:
        mute_process_io_error: true
        mute_process_exe_error: true
        mute_process_user_error: true
        metrics:
          "process.cpu.utilization":
            enabled: true
          "process.memory.utilization":
            enabled: true
          "process.disk.io":
            enabled: true
          "process.disk.operations":
            enabled: true
          process.threads:
            enabled: true
          process.paging.faults:
            enabled: true
processors:
  batch:
    send_batch_size: 10000
    send_batch_max_size: 11000
    timeout: 10s
exporters:
  prometheusremotewrite/metrics-infrastructure:
    endpoint: http://mimir-lb:9010/api/v1/push
    tls:
      insecure: true
    headers:
      - "X-Scope-OrgID": "infrastructure"
    resource_to_telemetry_conversion:
      enabled: true
    remote_write_queue:
      enabled: true
      queue_size: 100000
      num_consumers: 50

service:
  telemetry:
    logs:
      level: debug
    metrics:
      level: detailed
      address: 0.0.0.0:8888
  pipelines:
    metrics/infrastructure:
      receivers: [ hostmetrics]
      processors: [ batch ]
      exporters: [ prometheusremotewrite/metrics-infrastructure ]

What can we do to further troubleshoot this issue?

@dhilgarth

After 6 hours, I finally figured it out: an nginx config in mimir-lb that doesn't update the IP addresses of the upstream servers. One of the upstream containers restarted and got a new IP address, which wasn't reflected in nginx.
The two containers that exhibited this problem must have been routed to the restarted instance every single time.


This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions bot added the Stale label on Jun 14, 2024
@koenvandesande

I'm encountering the same error. From the error messages, it is not clear to me whether writing to the remote endpoint is failing (i.e., does "Permanent error: context deadline exceeded" come from that server?) or whether some local endpoint is not being scraped properly (i.e., scrape timeouts).

@juniorsdj

I am getting a similar issue here. I am using opentelemetry-collector-contrib 0.103.1.
My collector processes 200k spans/min,
runs them through the spanmetrics connector, and finally sends the metrics to a Prometheus instance.
After a couple of hours running, the collector begins to show this error message.

2024-06-26T20:06:15.292972825Z 2024-06-26T20:06:15.292Z	error	exporterhelper/queue_sender.go:90	Exporting failed. Dropping data.	{"kind": "exporter", "data_type": "metrics", "name": "prometheusremotewrite", "error": "Permanent error: Permanent error: context deadline exceeded; [the same error repeated 15 times in total]", "errorCauses": [15 identical entries of {"error": "Permanent error: Permanent error: context deadline exceeded"}], "dropped_items": 12404}
2024-06-26T20:06:15.293013725Z go.opentelemetry.io/collector/exporter/exporterhelper.newQueueSender.func1
2024-06-26T20:06:15.293019456Z 	go.opentelemetry.io/collector/exporter@v0.103.0/exporterhelper/queue_sender.go:90
2024-06-26T20:06:15.293023534Z go.opentelemetry.io/collector/exporter/internal/queue.(*boundedMemoryQueue[...]).Consume
2024-06-26T20:06:15.293026425Z 	go.opentelemetry.io/collector/exporter@v0.103.0/internal/queue/bounded_memory_queue.go:52
2024-06-26T20:06:15.293029184Z go.opentelemetry.io/collector/exporter/internal/queue.(*Consumers[...]).Start.func1
2024-06-26T20:06:15.293031423Z 	go.opentelemetry.io/collector/exporter@v0.103.0/internal/queue/consumers.go:43
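
For readers unfamiliar with that wiring, a minimal sketch of a spanmetrics-connector pipeline feeding prometheusremotewrite (the otlp receiver and the endpoint are placeholders, not the poster's actual config):

receivers:
  otlp:                 # placeholder trace receiver
    protocols:
      grpc:

connectors:
  spanmetrics:          # turns spans into call/duration metrics

exporters:
  prometheusremotewrite:
    endpoint: http://X.X.X.X/api/v1/push   # placeholder

service:
  pipelines:
    traces:
      receivers: [otlp]
      exporters: [spanmetrics]   # connector used as a traces exporter
    metrics:
      receivers: [spanmetrics]   # ...and as a metrics receiver
      exporters: [prometheusremotewrite]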


This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions bot added the Stale label on Aug 26, 2024
@dashpole (Contributor)

triage:

  • Context deadline exceeded means the client timed out when trying to send to the backend.
  • The error is generic, so it is likely that people posting above have different root causes.
  • One potential way to mitigate would be to increase the timeout of the exporter (see the sketch after this list). If this is a common issue, we may want to increase the default timeout, which is 5 seconds.
  • For people who want to debug this, I would recommend using the HTTP self-observability metrics, but note that those will change over time. The PRW exporter will currently produce metrics if the service metrics level is set to detailed: https://opentelemetry.io/docs/collector/internal-telemetry/#additional-detailed-level-metrics
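
For illustration, a minimal sketch of raising the exporter timeout above the 5-second default (the endpoint and value are placeholders, not a general recommendation):

exporters:
  prometheusremotewrite:
    endpoint: http://X.X.X.X/push   # placeholder endpoint
    # Per-request timeout for the remote-write HTTP call (default 5s).
    # A larger value gives a slow or overloaded backend more time before the
    # client fails with "context deadline exceeded".
    timeout: 30s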


cabrinha commented Aug 27, 2024

I'm also getting this error when trying to use prometheusremotewrite... Can't figure out what the issue is. Error from prometheusremotewrite exporter:

2024-08-28T00:11:46.851Z        error   exporterhelper/queue_sender.go:92       Exporting failed. Dropping data.        {"kind": "exporter", "data_type": "metrics", "name": "prometheusremotewrite", "error": "Permanent error: Permanent error: context deadline exceeded", "dropped_items": 636}
go.opentelemetry.io/collector/exporter/exporterhelper.newQueueSender.func1
        go.opentelemetry.io/collector/exporter@v0.106.1/exporterhelper/queue_sender.go:92
go.opentelemetry.io/collector/exporter/internal/queue.(*boundedMemoryQueue[...]).Consume
        go.opentelemetry.io/collector/exporter@v0.106.1/internal/queue/bounded_memory_queue.go:52
go.opentelemetry.io/collector/exporter/internal/queue.(*Consumers[...]).Start.func1
        go.opentelemetry.io/collector/exporter@v0.106.1/internal/queue/consumers.go:43

Config:

    exporters:
      prometheusremotewrite:
        add_metric_suffixes: false
        endpoint: http://mimir-.../api/v1/push
        headers:
          Authorization: Bearer my-token-here
          X-Scope-OrgID: my-org-id
        max_batch_size_bytes: 30000000
        tls:
          insecure_skip_verify: true

Also, when trying to use the otlphttp exporter I'm getting a 499.

I've tried both exporters to send into Mimir.

UPDATE

I have solved this for the prometheusremotewrite exporter by simplifying the config:

    exporters:
      prometheusremotewrite:
        endpoint: http://mimir-.../api/v1/push
        headers:
          Authorization: Bearer my-token-here
        tls:
          insecure_skip_verify: true


gauravphagrehpe commented Aug 28, 2024

For me, it worked by changing

from this:

tls:
  insecure: true

to this:

tls:
  insecure_skip_verify: true
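
If I understand the collector's TLS client settings correctly, the difference is roughly this (the HTTPS endpoint below is hypothetical):

exporters:
  prometheusremotewrite:
    endpoint: https://mimir.example.internal/api/v1/push   # hypothetical endpoint
    tls:
      # insecure: true only disables transport security for gRPC-based exporters;
      # it does not relax certificate checks for an HTTP exporter like this one.
      # insecure_skip_verify: true keeps TLS but skips certificate chain and
      # hostname verification, which is what matters for self-signed certificates.
      insecure_skip_verify: true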

@Priyanka-src

I am having a similar issue:
error exporterhelper/common.go:95 Exporting failed. Dropping data. {"kind": "exporter", "data_type": "metrics", "name": "prometheusremotewrite", "error": "Permanent error: Permanent error: context deadline exceeded", "dropped_items": 5}
I am running the ADOT collector as a pod in AWS. Any idea what can be done? Any help would be appreciated.

@Shindek77

Same for me. Please help.
