[exporter/datadog] Memory leak in trace stats module #30828

sirianni · 2024-01-29T15:36:10Z

Component(s)

exporter/datadog

What happened?

Description

We are observing a memory leak in the Datadog Exporter which appears to be in the trace/apm stats component.

The pattern we see is that when the collector restarts, memory grows steadily until GOMEMLIMIT is reached

We consider this a "leak" vs. a large working set because of the slow growth in heap usage.

Looking at a pprof of the collector process we see the majority of heap is used by the datadog-agent/trace/stats module.

$ go tool pprof -top -cum ~/tmp/mothership-pprof-heap.out
File: collector
Build ID: 38f9c6cdffde4b9f1487841c8b484fa5cf2787fb
Type: inuse_space
Time: Jan 29, 2024 at 9:59am (EST)
Showing nodes accounting for 376.01MB, 93.97% of 400.12MB total
Dropped 263 nodes (cum <= 2MB)
      flat  flat%   sum%        cum   cum%
         0     0%     0%   294.57MB 73.62%  github.com/DataDog/datadog-agent/pkg/trace/stats.(*Concentrator).Add
         0     0%     0%   294.57MB 73.62%  github.com/DataDog/datadog-agent/pkg/trace/stats.(*Concentrator).Run.func1
    0.67MB  0.17%  0.17%   294.57MB 73.62%  github.com/DataDog/datadog-agent/pkg/trace/stats.(*Concentrator).addNow
         0     0%  0.17%   293.90MB 73.45%  github.com/DataDog/datadog-agent/pkg/trace/stats.(*RawBucket).HandleSpan
   23.51MB  5.87%  6.04%   293.90MB 73.45%  github.com/DataDog/datadog-agent/pkg/trace/stats.(*RawBucket).add
         0     0%  6.04%   195.39MB 48.83%  github.com/DataDog/sketches-go/ddsketch.(*DDSketch).Add (inline)
         0     0%  6.04%   195.39MB 48.83%  github.com/DataDog/sketches-go/ddsketch.(*DDSketch).AddWithCount
         0     0%  6.04%   195.39MB 48.83%  github.com/DataDog/sketches-go/ddsketch/store.(*CollapsingLowestDenseStore).AddWithCount
  195.39MB 48.83% 54.88%   195.39MB 48.83%  github.com/DataDog/sketches-go/ddsketch/store.(*CollapsingLowestDenseStore).extendRange
         0     0% 54.88%   195.39MB 48.83%  github.com/DataDog/sketches-go/ddsketch/store.(*CollapsingLowestDenseStore).normalize
   10.50MB  2.62% 57.50%    75.01MB 18.75%  github.com/DataDog/datadog-agent/pkg/trace/stats.newGroupedStats
         0     0% 57.50%    64.50MB 16.12%  github.com/DataDog/sketches-go/ddsketch.LogCollapsingLowestDenseDDSketch
      37MB  9.25% 66.75%       37MB  9.25%  github.com/DataDog/sketches-go/ddsketch/store.NewCollapsingLowestDenseStore (inline)

We have already disabled these flags in the Datadog Exporter.

    traces:
      compute_stats_by_span_kind: false
      peer_tags_aggregation: false

Collector version

v0.90.1

Additional context

Seems similar to #15720

When we send the same workload of traces via the Datadog Agent, we do not see a similar memory leak. I thought this was worth noting since my understanding is that the two use the same library to calculate the trace stats.

The text was updated successfully, but these errors were encountered:

github-actions · 2024-01-29T15:38:31Z

Pinging code owners:

exporter/datadog: @mx-psi @dineshg13 @liustanley @songy23 @mackjmr

See Adding Labels via Comments if you do not have permissions to add labels yourself.

dineshg13 · 2024-01-29T18:00:13Z

Thanks for reporting. Can you please test this with collector after setting [trace buffer] (

opentelemetry-collector-contrib/exporter/datadogexporter/config.go

Line 297 in 4918e96

TraceBuffer int `mapstructure:"trace_buffer"`

) in datadog exporter to 1000.

Could also please share the full collector config.

sirianni · 2024-01-29T19:03:34Z

Our full collector config is massive. Here is the config for the Datadog exporter

  datadog:
    metrics:
      resource_attributes_as_tags: true
      instrumentation_scope_metadata_as_tags: true
      summaries:
        mode: noquantiles
    traces:
      compute_stats_by_span_kind: false
      peer_tags_aggregation: false
      # Buffer traces to avoid "Payload in channel full" errors
      trace_buffer: 10
    host_metadata:
      enabled: false
    sending_queue:
      queue_size: 200

songy23 · 2024-01-29T19:09:24Z

@sirianni 👋 QQ

When we send the same workload of traces via the Datadog Agent, we do not see a similar memory leak. I thought this was worth noting since my understanding is that the two use the same library to calculate the trace stats.

What's the version of Datadog Agent that you use?
You're right Datadog Agent and Datadog exporter use the same library to calculate the trace stats. We want to double check the versions

sirianni · 2024-01-29T19:24:24Z

We are currently running Datadog Agent 6.45.1.

sirianni · 2024-01-29T19:25:16Z

Thanks @dineshg13 - I will experiment with the change. Can you help me understand how increasing trace_buffer would affect memory usage of the DD APM stats?

sirianni · 2024-01-29T19:36:42Z

When we send the same workload of traces via the Datadog Agent, we do not see a similar memory leak.

I'm actually not able to confirm this is the case. We run the DD Agent with larger k8s memory limits so it's possible it's exhibiting a similar issue and we just haven't noticed.

sirianni · 2024-01-30T18:18:00Z

@dineshg13 setting trace_buffer: 1000 did not fix the leak or show any meaningful improvement in memory usage.

dineshg13 · 2024-01-30T21:39:11Z

@sirianni Are you using Datadog connector ? It would be helpful to have share the full collector config.

sirianni · 2024-01-30T22:42:17Z

@sirianni Are you using Datadog connector ? It would be helpful to have share the full collector config.

Yes we are. The full config is quite complicated because we have applications that dual write traces to both our OTel Collector and the DD Agent. We are doing this to allow for a graceful transition from DD Agent to OTel Collector.

What we are trying to accomplish for OpenCensus is:

Convert OpenCensus traces to OTel in the collector
Generate DD APM stats from these converted traces
Send only the APM stats for these traces to DD
We don't send the raw traces to DD because these would cause duplicate SpanIDs (since the applications themselves are dual-exporting to both DD Agent and our collector)

Effectively we're doing something like this.

piplelines:
  traces/opencensus:
    receivers: 
    - opencensus
    exporters:
    - datadog/trace_stats_connector
  traces/otel:
    receivers:
    - otlp
    exporters:
    - datadog
  metrics:
    receivers:
    - datadog/trace_stats_connector
    exporters:
    - datadog

I noticed that the datadogconnector does not support any configuration options compute_stats_by_span_kind or peer_tags_aggregation

dineshg13 · 2024-01-30T22:51:13Z

@sirianni Thanks for the details. We have fixed datadogconnector to support compute_stats_by_span_kind or peer_tags_aggregation. It should be available in the next release.

Are you using the feature gate in the connector ? We should enable connector.datadogconnector.performance if you are using connector for performance reasons. Please checkout the README

Do you expect to compute stats in ur traces/otel pipeline ?

sirianni · 2024-01-31T01:22:56Z

Do you expect to compute stats in ur traces/otel pipeline ?

Yes, but these are computed by the datadog exporter. We're only using the datadogconnector for OpenCensus traces because we only want APM metrics for those, and not to export the traces themselves.

sirianni · 2024-01-31T01:28:50Z

Are you using the feature gate in the connector ?

No, but the feature gate looks to make things more efficient by encoding metrics more efficiently as raw bytes. To be clear, the symptoms here look like a memory leak and not just "high usage". So while enabling the feature gate may make it take longer before the heap is exhausted, it would not seem to fix the root cause of the leak itself.

Also, in our case the heap was consumed by datadog-agent/pkg/trace/stats.(*Concentrator) which wouldn't seem to be affected by the featuregate. As I understand it, the featuregate controls the encoding of metrics to downstream pipelines and the Concentrator is the internal data structure used to maintain the APM stats.

) **Description:** <Describe what has changed.>  Datadog Connector is creating two instances of Trace Agent, one in the trace-to-metrics pipeline and another in the traces-to-traces pipeline. The PR separates the trace-to-trace connector, simplifying the logic, this avoid un-necessary serialization. **Link to tracking Issue:** <Issue number if applicable> #30828 #30487 **Testing:** <Describe what testing was performed and which tests were added.> **Documentation:** <Describe the documentation added.>

kamal-narayan · 2024-02-05T20:15:14Z

@dineshg13 Are there fixes other than #31026 identified for this issue?

mackjmr · 2024-02-06T10:20:36Z

@kamal-narayan, for now this is the only fix that has been shipped. We are still load testing this PR, and will report back here once we have more results.

mackjmr · 2024-02-07T12:36:55Z

Connector Not in Trace to Trace Pipeline

@sirianni the stats code (datadog-agent/pkg/trace/stats.(*Concentrator)) that we see in the profiles you shared will create aggregation keys/buckets based on env, host, version, service...etc. This means that with a higher cardinality being sent to a collector, we expect a higher memory usage.

We've load tested the collector with multiple cardinalities, but were not able to reproduce the memory leak. The two noteworthy points of the load tests were:

The profiles that we collected match the ones that you have sent above, where the stats code (datadog-agent/pkg/trace/stats.(*Concentrator)) / extendRange is the largest consumer of live heap memory.
The memory of the collector never seems to exhibit a leak. A higher cardinality does lead to a higher memory usage by the stats module, but the memory usage always stabilises.

We consider this a "leak" vs. a large working set because of the slow growth in heap usage.

During the slow growth in heap usage from the screenshot, were there any new hosts being spun up sending data to that collector ? I'm trying to understand if there was an increase in cardinality during that time period, which would lead to the stats module using more memory. In the graphs that you shared, the collector memory seems to stabilise at some point. At this point, did the cardinality being sent to this collector reach it's peak ?

I'm actually not able to confirm this is the case. We run the DD Agent with larger k8s memory limits so it's possible it's exhibiting a similar issue and we just haven't noticed.

Are you able to run a test with the collector by giving it more memory ?

I'm interested in whether the memory stabilises. If you experience symptoms of a memory leak, would you be able to generate new profiles and also output traces via the file exporter ? This would allow us to run our load tests using the traces that you shared to try to reproduce the memory leak.

Connector IN Trace to Trace Pipeline

In this case, we were able to identify and reproduce a memory leak when the connector is used in the trace to trace pipeline. This was fixed in the following following PR, which will be part of the next collector release.

@sirianni Based on the config you shared, you are not using connector in trace to trace pipeline so you should not be affected by this memory leak.

diogotorres97 · 2024-02-15T10:57:58Z

Still happening here too 😢

mackjmr · 2024-02-15T12:34:54Z

@diogotorres97 please see: #30908 (comment)

github-actions · 2024-05-08T03:29:17Z

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

exporter/datadog: @mx-psi @dineshg13 @liustanley @songy23 @mackjmr

See Adding Labels via Comments if you do not have permissions to add labels yourself.

github-actions · 2024-07-07T05:20:07Z

This issue has been closed as inactive because it has been stale for 120 days with no activity.

sirianni added bug Something isn't working needs triage New item requiring triage labels Jan 29, 2024

github-actions bot added the exporter/datadog Datadog components label Jan 29, 2024

github-actions bot mentioned this issue Jan 30, 2024

Weekly Report: 2024-01-23 - 2024-01-30 #30848

Closed

sirianni mentioned this issue Jan 31, 2024

Datadog Connector Memory Issues #29755

Closed

dineshg13 mentioned this issue Feb 4, 2024

[datadog/connector] Create a simplified Trace to Trace connector #31026

Merged

github-actions bot mentioned this issue Feb 6, 2024

Weekly Report: 2024-01-30 - 2024-02-06 #31055

Closed

songy23 removed the needs triage New item requiring triage label Feb 7, 2024

mx-psi added data:traces Trace related issues priority:p1 High labels Mar 8, 2024

mx-psi added the waiting for author label Mar 8, 2024

github-actions bot added the Stale label May 8, 2024

github-actions bot added the closed as inactive label Jul 7, 2024

github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Jul 7, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[exporter/datadog] Memory leak in trace stats module #30828

[exporter/datadog] Memory leak in trace stats module #30828

sirianni commented Jan 29, 2024 •

edited

Loading

github-actions bot commented Jan 29, 2024

dineshg13 commented Jan 29, 2024 •

edited

Loading

sirianni commented Jan 29, 2024

songy23 commented Jan 29, 2024

sirianni commented Jan 29, 2024

sirianni commented Jan 29, 2024

sirianni commented Jan 29, 2024

sirianni commented Jan 30, 2024

dineshg13 commented Jan 30, 2024

sirianni commented Jan 30, 2024

dineshg13 commented Jan 30, 2024 •

edited

Loading

sirianni commented Jan 31, 2024

sirianni commented Jan 31, 2024 •

edited

Loading

kamal-narayan commented Feb 5, 2024

mackjmr commented Feb 6, 2024

mackjmr commented Feb 7, 2024

diogotorres97 commented Feb 15, 2024

mackjmr commented Feb 15, 2024

github-actions bot commented May 8, 2024

github-actions bot commented Jul 7, 2024

[exporter/datadog] Memory leak in trace stats module #30828

[exporter/datadog] Memory leak in trace stats module #30828

Comments

sirianni commented Jan 29, 2024 • edited Loading

Component(s)

What happened?

Description

Collector version

Additional context

github-actions bot commented Jan 29, 2024

dineshg13 commented Jan 29, 2024 • edited Loading

sirianni commented Jan 29, 2024

songy23 commented Jan 29, 2024

sirianni commented Jan 29, 2024

sirianni commented Jan 29, 2024

sirianni commented Jan 29, 2024

sirianni commented Jan 30, 2024

dineshg13 commented Jan 30, 2024

sirianni commented Jan 30, 2024

dineshg13 commented Jan 30, 2024 • edited Loading

sirianni commented Jan 31, 2024

sirianni commented Jan 31, 2024 • edited Loading

kamal-narayan commented Feb 5, 2024

mackjmr commented Feb 6, 2024

mackjmr commented Feb 7, 2024

diogotorres97 commented Feb 15, 2024

mackjmr commented Feb 15, 2024

github-actions bot commented May 8, 2024

github-actions bot commented Jul 7, 2024

sirianni commented Jan 29, 2024 •

edited

Loading

dineshg13 commented Jan 29, 2024 •

edited

Loading

dineshg13 commented Jan 30, 2024 •

edited

Loading

sirianni commented Jan 31, 2024 •

edited

Loading