
connectors/datadogconnector: Increasing Memory That Eventually Kills Collector Pods #30908

Closed
NickAnge opened this issue Jan 31, 2024 · 13 comments
Labels: bug (Something isn't working), connector/datadog, needs triage (New item requiring triage), Stale, waiting for author

Comments

@NickAnge

Component(s)

connector/datadog

What happened?

Description

In our setup, we've activated both the Datadog connector and exporter to avoid APM stats sampling. We've been experiencing a continuous increase in memory, eventually leading to the pod reaching an Out-of-Memory (OOM) state after a few hours. We followed the suggested configuration from the README.md and use datadog/connector as the receiver for traces.

Steps to Reproduce

  1. Set up an OpenTelemetry Collector with the configuration below
  2. Publish trace telemetry data through the collector
  3. Evaluate memory through pprof (see the sketch below)
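
A minimal sketch of how the heap profiles in step 3 can be collected, assuming the pprof extension's default endpoint (the port may differ in other setups):

extensions:
  pprof:
    endpoint: localhost:1777   # default endpoint of the pprof extension

With pprof listed under service.extensions, as in the configuration below, a heap profile can then be fetched with go tool pprof http://localhost:1777/debug/pprof/heap.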

Expected Result

No memory increase that kills the pod.

Actual Result

Memory increases until it eventually kills the pod.

(screenshot of memory usage)

Collector version

opentelemetry-collector-contrib:0.88.0

Environment information

Environment

OpenTelemetry Collector configuration

receivers:
  otlp:
    protocols:
      grpc:
      http:

processors:
  batch:
  probabilistic_sampler:
    sampling_percentage: 20

connectors:
  datadog/connector:

# exporter and extension definitions were missing from the original paste;
# filled in minimally here (API key assumed to come from an env var)
exporters:
  datadog:
    api:
      key: ${env:DD_API_KEY}

extensions:
  health_check:
  pprof:

service:
  extensions: [ health_check, pprof ]
  pipelines:
    traces:
      receivers: [ otlp ]
      processors: [ batch ]
      exporters: [ datadog/connector ]

    traces/sampled:
      receivers: [ datadog/connector ]
      processors: [ probabilistic_sampler, batch ]
      exporters: [ datadog ]

Log output

No response

Additional context

No response

@NickAnge added the bug (Something isn't working) and needs triage (New item requiring triage) labels on Jan 31, 2024
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@mackjmr
Member

mackjmr commented Feb 7, 2024

@NickAnge thanks for reporting. We were able to reproduce and identify a memory leak in the code path of the Datadog connector in the trace-to-trace pipeline, which is what is being used in your case. This memory leak was fixed in the following PR, which will be part of the next collector release.

@dmedinag

dmedinag commented Feb 8, 2024

Hey, we see that release 0.94.0 has been available on GitHub for 8 hours, but the image is not yet present on Docker Hub. Are the release schedules for the two artifacts different?

@mackjmr
Member

mackjmr commented Feb 8, 2024

The Docker image should be available once 0.94.0 gets released in https://github.com/open-telemetry/opentelemetry-collector-releases. See open-telemetry/opentelemetry-collector-releases#472.

@diogotorres97

Still happening here too 😢

@mackjmr
Member

mackjmr commented Feb 15, 2024

@diogotorres97 we aren't able to reproduce a memory leak in 0.94.0. Can you please clarify what behaviour you are seeing, share your config, and confirm the collector version you are using? Can you also please generate profiles and output traces in JSON format via the file exporter so we can attempt to reproduce using your traces?
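
For reference, a minimal sketch of how traces could be dumped to JSON with the file exporter; the file/debug name and path are illustrative, not taken from any config in this thread:

exporters:
  file/debug:
    path: /tmp/traces.json   # export payloads are written as JSON

service:
  pipelines:
    traces:
      receivers: [ otlp ]
      exporters: [ datadog/connector, file/debug ]

The resulting file can then be attached here together with the profiles.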

@diogotorres97

diogotorres97 commented Feb 16, 2024

We still receive traces, but the stats become unavailable in Datadog after a few hours. With low load on the system it can take ~24h until we lose stats, but with a spike in requests we can lose them sooner (yesterday it was 12h).

(screenshot, 2024-02-16 10:39)

Screenshots of memory consumption:
(screenshot, 2024-02-16 08:27:13)
(screenshot, 2024-02-16 08:27:59)

The logs from the deployment when we stop seeing stats:

2024-02-15T23:22:13.943Z    info    memorylimiter/memorylimiter.go:222    Memory usage is above soft limit. Forcing a GC.    {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 1238}
2024-02-15T23:22:15.648Z    info    memorylimiter/memorylimiter.go:192    Memory usage after GC.    {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 1228}
2024-02-15T23:22:15.648Z    warn    memorylimiter/memorylimiter.go:229    Memory usage is above soft limit. Refusing data.    {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 1228}
2024-02-15T23:24:23.914Z    info    memorylimiter/memorylimiter.go:215    Memory usage back within limits. Resuming normal operation.    {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 1018}
2024-02-15T23:24:28.943Z    info    memorylimiter/memorylimiter.go:222    Memory usage is above soft limit. Forcing a GC.    {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 1249}
2024-02-15T23:24:30.745Z    info    memorylimiter/memorylimiter.go:192    Memory usage after GC.    {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 1241}
2024-02-15T23:24:30.745Z    warn    memorylimiter/memorylimiter.go:229    Memory usage is above soft limit. Refusing data.    {"kind": "processor", "name": "memory_limiter", "pipeline": "metrics", "cur_mem_mib": 1241}

Also, only after a restart of the collector deployment do we start receiving the stats again (but this is not a solution 😄).

Config:

  - chart: opentelemetry-collector
    helm:
      releaseName: opentelemetry-collector-deployment
      values: |
        config:
          connectors:
            datadog/connector:
          exporters:
            datadog:
              api:
                key: ${env:DD_API_KEY}
              traces:
                trace_buffer: 500
          processors:
            batch:
              timeout: 5s
              send_batch_max_size: 1000
              send_batch_size: 250
            memory_limiter:
              check_interval: 1s
              limit_percentage: 80
              spike_limit_percentage: 25
            tail_sampling/limit:
              decision_wait: 1s
              num_traces: 100000
              policies:
              - name: rate-limit
                rate_limiting:
                  spans_per_second: 10000
                type: rate_limiting
            tail_sampling/logic:
              num_traces: 100000
              policies:
              - name: http-server-errors
                numeric_attribute:
                  key: http.status_code
                  max_value: 599
                  min_value: 500
                type: numeric_attribute
              - name: grpc-unknown-errors
                numeric_attribute:
                  key: rpc.grpc.status_code
                  max_value: 2
                  min_value: 2
                type: numeric_attribute
              - name: grpc-server-errors
                numeric_attribute:
                  key: rpc.grpc.status_code
                  max_value: 15
                  min_value: 12
                type: numeric_attribute
              - latency:
                  threshold_ms: 400
                name: slow
                type: latency
          service:
            pipelines:
              traces:
                exporters: [datadog/connector]
                processors: [memory_limiter, tail_sampling/logic, tail_sampling/limit]
                receivers: [otlp]
              traces/2:
                exporters: [datadog]
                processors: [memory_limiter, batch]
                receivers: [datadog/connector]
              metrics:
                exporters: [datadog]
                processors: [memory_limiter, batch]
                receivers: [datadog/connector]
        image:
          tag: 0.94.0
        mode: deployment
        podAnnotations:
          ad.datadoghq.com/opentelemetry-collector.checks: |
            {
              "openmetrics": {
                "instances": [
                  {
                    "openmetrics_endpoint": "http://%%host%%:%%port_metrics%%/metrics",
                    "namespace": "monitoring",
                    "metrics": [
                      "otelcol_exporter_sent_spans",
                      "otelcol_process_runtime_total_alloc_bytes",
                      "otelcol_process_runtime_total_sys_memory_bytes",
                      "otelcol_processor_tail_sampling_count_traces_sampled",
                      "otelcol_processor_tail_sampling_sampling_decision_latency",
                      "otelcol_processor_tail_sampling_sampling_traces_on_memory"
                    ]
                  }
                ]
              }
            }
        ports:
          metrics:
            enabled: true
        replicaCount: 5
        resources:
          limits:
            cpu: '2'
            memory: 2Gi
          requests:
            cpu: 500m
            memory: 2Gi
        service:
          clusterIP: None
    repoURL: https://open-telemetry.github.io/opentelemetry-helm-charts
    targetRevision: "0.80.0"
  - chart: opentelemetry-collector
    helm:
      releaseName: opentelemetry-collector-agent
      values: |
        config:
          exporters:
            loadbalancing:
              protocol:
                otlp:
                  timeout: 1s
                  tls:
                    insecure: true
              resolver:
                k8s:
                  service: opentelemetry-collector-deployment.monitoring
            otlp:
              endpoint: http://opentelemetry-collector-deployment.monitoring.svc.cluster.local:4317
              tls:
                insecure: true
          processors:
            memory_limiter:
              check_interval: 1s
              limit_percentage: 80
              spike_limit_percentage: 25
          service:
            pipelines:
              traces:
                exporters: [loadbalancing]
                processors: [memory_limiter]
                receivers: [otlp]
        image:
          tag: 0.94.0
        mode: daemonset
        podAnnotations:
          ad.datadoghq.com/opentelemetry-collector.checks: |
            {
              "openmetrics": {
                "instances": [
                  {
                    "openmetrics_endpoint": "http://%%host%%:%%port_metrics%%/metrics",
                    "namespace": "monitoring",
                    "metrics": [
                      "otelcol_exporter_sent_spans",
                      "otelcol_loadbalancer_backend_latency",
                      "otelcol_loadbalancer_backend_outcome",
                      "otelcol_process_runtime_total_alloc_bytes",
                      "otelcol_process_runtime_total_sys_memory_bytes"
                    ]
                  }
                ]
              }
            }
        ports:
          metrics:
            enabled: true
        resources:
          limits:
            cpu: '2'
            memory: 1500Mi
          requests:
            cpu: 500m
            memory: 1500Mi
        service:
          enabled: true
    repoURL: https://open-telemetry.github.io/opentelemetry-helm-charts
    targetRevision: "0.80.0"

We are updating to the latest version every time there is a new release, in the hope that it fixes the problem 😄

@mackjmr
Member

mackjmr commented Feb 16, 2024

@diogotorres97 around 21h20 in the screenshot you shared, was there an increase in data sent to the collectors? Did that time correspond to the spike in requests you mentioned?

@diogotorres97

@diogotorres97 around 21h20 in the screenshot you shared, was there an increase in data sent to the collectors? Did that time correspond to the spike in requests you mentioned?

Yes. Usually without spikes the memory increases over one or two days; with spikes (it depends) it can grow very fast...

@mackjmr
Member

mackjmr commented Feb 16, 2024

@diogotorres97 if higher data volume / cardinality is being sent, higher memory consumption is expected.

Usually without spikes the memory increases over one or two days

Memory increasing under steady traffic / cardinality is unexpected. We've been unable to reproduce a memory leak in 0.94.0 with tests of varying cardinality and traffic.

In the scenario where the memory increases under steady traffic, can you please provide us with output traces in JSON format via the file exporter, graphs showing the steady increase in memory, as well as profiles? Ideally, provide two profiles spaced out in time while memory was increasing; with these two profiles we'll be able to see what is growing in memory.
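
For reference, one assumed way to collect such profiles (the port is the pprof extension's default and the file names are only examples): fetch a first heap profile with curl -o heap1.pprof http://localhost:1777/debug/pprof/heap, wait while memory keeps growing, fetch heap2.pprof the same way, and compare the two with go tool pprof -base heap1.pprof heap2.pprof to see which allocations grew between the snapshots.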

Contributor

github-actions bot commented May 8, 2024

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label May 8, 2024
@NickAnge NickAnge changed the title connectors/datadog/connector: Increasing Memory That Eventually Kills Collector Pods connectors/datadogconnector: Increasing Memory That Eventually Kills Collector Pods May 8, 2024
@NickAnge
Author

NickAnge commented May 8, 2024

Hello @mackjmr. I just wanted to let you know that we have been using the new version and we do not see any memory leaks coming from this component. I am not sure if we should close this issue or wait a bit longer. Thanks in advance.

@mx-psi
Member

mx-psi commented May 8, 2024

Thanks for getting back to us @NickAnge! I think we can close this for now. If the issue comes back, please comment on the issue and we can reopen :)

@mx-psi mx-psi closed this as completed May 8, 2024