Deadline exceeded in DataDog exporter #1409

Open · 3miliano opened this issue Jun 29, 2024 · 2 comments
Labels: bug (Something isn't working)

3miliano commented Jun 29, 2024

Describe the bug
I am experiencing context deadline exceeded errors in the DataDog exporter, as evidenced by the logs below. This results in failed export attempts and subsequent retries.

Steps to reproduce
1. Build a custom Docker image with a custom collector based on opentelemetry-lambda that includes the DataDog exporter.
2. Initiate data export (traces, logs, metrics).
3. Observe the logs for errors related to context deadlines being exceeded.

What did you expect to see?
I expected the data to be exported successfully to DataDog without any timeout errors.

What did you see instead?
The export requests failed with “context deadline exceeded” errors, resulting in retries and eventual dropping of the payloads. Here are some excerpts from the logs:

1719687286935 {"level":"warn","ts":1719687286.9350078,"caller":"batchprocessor@v0.103.0/batch_processor.go:263","msg":"Sender failed","kind":"processor","name":"batch","pipeline":"logs","error":"no more retries left: Post \"https://http-intake.logs.datadoghq.com/api/v2/logs?ddtags=service%3Akognitos.book.yaml%2Cenv%3Amain%2Cregion%3Aus-west-2%2Ccloud_provider%3Aaws%2Cos.type%3Alinux%2Cotel_source%3Adatadog_exporter\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)"}
1719687286936 {"level":"error","ts":1719687286.9363096,"caller":"datadogexporter@v0.103.0/traces_exporter.go:181","msg":"Error posting hostname/tags series","kind":"exporter","data_type":"traces","name":"datadog","error":"max elapsed time expired Post \"https://api.datadoghq.com/api/v2/series\": context deadline exceeded (Client.Timeout exceeded while awaiting headers)","stacktrace":"github.com/open-telemetry/opentelemetry-collector-contrib/exporter/datadogexporter.(*traceExporter).exportUsageMetrics\n\t/root/go/pkg/mod/github.com/open-telemetry/opentelemetry-collector-contrib/exporter/datadogexporter@v0.103.0/traces_exporter.go:181\ngit.luolix.top/open-telemetry/opentelemetry-collector-contrib/exporter/datadogexporter.(*traceExporter).consumeTraces\n\t/root/go/pkg/mod/github.com/open-telemetry/opentelemetry-collector-contrib/exporter/datadogexporter@v0.103.0/traces_exporter.go:139\ngo.opentelemetry.io/collector/exporter/exporterhelper.(*tracesRequest).Export\n\t/root/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.103.0/exporterhelper/traces.go:59\ngo.opentelemetry.io/collector/exporter/exporterhelper.(*timeoutSender).send\n\t/root/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.103.0/exporterhelper/timeout_sender.go:43\ngo.opentelemetry.io/collector/exporter/exporterhelper.(*baseRequestSender).send\n\t/root/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.103.0/exporterhelper/common.go:37\ngo.opentelemetry.io/collector/exporter/exporterhelper.(*tracesExporterWithObservability).send\n\t/root/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.103.0/exporterhelper/traces.go:159\ngo.opentelemetry.io/collector/exporter/exporterhelper.(*baseRequestSender).send\n\t/root/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.103.0/exporterhelper/common.go:37\ngo.opentelemetry.io/collector/exporter/exporterhelper.(*baseRequestSender).send\n\t/root/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.103.0/exporterhelper/common.go:37\ngo.opentelemetry.io/collector/exporter/exporterhelper.(*baseExporter).send\n\t/root/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.103.0/exporterhelper/common.go:294\ngo.opentelemetry.io/collector/exporter/exporterhelper.NewTracesRequestExporter.func1\n\t/root/go/pkg/mod/go.opentelemetry.io/collector/exporter@v0.103.0/exporterhelper/traces.go:134\ngo.opentelemetry.io/collector/consumer.ConsumeTracesFunc.ConsumeTraces\n\t/root/go/pkg/mod/go.opentelemetry.io/collector/consumer@v0.103.0/traces.go:25\ngo.opentelemetry.io/collector/processor/batchprocessor.(*batchTraces).export\n\t/root/go/pkg/mod/go.opentelemetry.io/collector/processor/batchprocessor@v0.103.0/batch_processor.go:414\ngo.opentelemetry.io/collector/processor/batchprocessor.(*shard).sendItems\n\t/root/go/pkg/mod/go.opentelemetry.io/collector/processor/batchprocessor@v0.103.0/batch_processor.go:261\ngo.opentelemetry.io/collector/processor/batchprocessor.(*shard).startLoop\n\t/root/go/pkg/mod/go.opentelemetry.io/collector/processor/batchprocessor@v0.103.0/batch_processor.go:223"}

What version of collector/language SDK did you use?
Version: Custom layer-collector/0.8.0 + datadogexporter from v0.103.0

What language layer did you use?
Config: None. It is a custom runtime that includes the binary in extensions.

Additional context
Here is my configuration file:

receivers:
  otlp:
    protocols:
      grpc:
        endpoint: "127.0.0.1:4317"
  hostmetrics:
    collection_interval: 60s
    scrapers:
      paging:
        metrics:
          system.paging.utilization:
            enabled: true
      cpu:
        metrics:
          system.cpu.utilization:
            enabled: true
      disk:
      filesystem:
        metrics:
          system.filesystem.utilization:
            enabled: true
      load:
      memory:
      network:
      processes:

exporters:
  datadog:
    api:
      key: ${secretsmanager:infrastructure/datadog_api_key}
    sending_queue:
      enabled: false
    tls:
      insecure: true
      insecure_skip_verify: true

connectors:
  datadog/connector:
      
processors:
  resourcedetection:
    detectors: ["lambda", "system"]
    system:
      hostname_sources: ["os"]
  transform:
    log_statements:
      - context: resource
        statements:
          - delete_key(attributes, "service.version")
          - set(attributes["service"], attributes["service.name"])
          - delete_key(attributes, "service.name")
      - context: log
        statements:
          - set(body, attributes["exception.message"]) where attributes["exception.message"] != nil
          - set(attributes["error.stack"], attributes["exception.stacktrace"]) where attributes["exception.stacktrace"] != nil
          - set(attributes["error.message"], attributes["exception.message"]) where attributes["exception.message"] != nil
          - set(attributes["error.kind"], attributes["exception.kind"]) where attributes["exception.kind"] != nil
service:
  telemetry:
    logs:
      level: "debug"
  pipelines:
    traces:
      receivers: [otlp]
      processors: [resourcedetection]
      exporters: [datadog/connector]
    traces/2:
      receivers: [datadog/connector]
      exporters: [datadog]
    metrics:
      receivers: [hostmetrics, otlp]
      processors: [resourcedetection]
      exporters: [datadog]
    logs:
      receivers: [otlp]
      processors: [resourcedetection, transform]
      exporters: [datadog]

Enabling or disabling sending_queue does not seem to do anything to prevent the errors. I did notice that if I hit the service continuously, some traces do get sent, but only a few.
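For what it is worth, sending_queue is one of the standard exporterhelper settings on the datadog exporter; a minimal sketch of those settings (retry_on_failure and timeout alongside sending_queue) is below, with purely illustrative values rather than anything that resolved the issue:

exporters:
  datadog:
    api:
      key: ${secretsmanager:infrastructure/datadog_api_key}
    sending_queue:
      enabled: false        # toggling this made no observable difference
    retry_on_failure:
      enabled: true         # "max elapsed time expired" in the logs comes from this retry budget
      max_elapsed_time: 60s # illustrative value only
    timeout: 15s            # per-request timeout enforced by the timeoutSender in the stack trace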

What I discarded as potential solutions:

  1. Connectivity issues. DataDog's API key validation call succeeds, and if the service is hit constantly some traces do get through.
3miliano added the bug label on Jun 29, 2024
@tylerbenson (Member) commented:

Any reason you're not using the batch processor? That would probably help.

@serkan-ozal (Contributor) commented:

@3miliano I think it is because the container is frozen right after the invocation completes, and with the config you have shared the collector is not aware of the Lambda lifecycle. So, as @tylerbenson suggested, using the batch processor (which will activate the decouple processor by default) right before the Datadog exporter should resolve your problem.
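A minimal sketch of what that suggestion could look like against the config above, showing only the sections that change: the batch processor is declared and placed last in each pipeline that exports to datadog. Per this comment, opentelemetry-lambda then activates the decouple processor automatically; the exact placement is an assumption for illustration.

processors:
  batch:               # batches telemetry before export; opentelemetry-lambda is said to insert
                       # the decouple processor after it automatically (see comment above)
  resourcedetection:
    detectors: ["lambda", "system"]
    system:
      hostname_sources: ["os"]

service:
  pipelines:
    traces/2:
      receivers: [datadog/connector]
      processors: [batch]
      exporters: [datadog]
    metrics:
      receivers: [hostmetrics, otlp]
      processors: [resourcedetection, batch]
      exporters: [datadog]
    logs:
      receivers: [otlp]
      processors: [resourcedetection, transform, batch]
      exporters: [datadog]

The traces pipeline that feeds datadog/connector is left unchanged, since the advice is specifically about the hop into the datadog exporter.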
