Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[exporter/datadog] Memory leak in trace stats module #30828

Closed
sirianni opened this issue Jan 29, 2024 · 20 comments
Closed

[exporter/datadog] Memory leak in trace stats module #30828

sirianni opened this issue Jan 29, 2024 · 20 comments
Labels

Comments

@sirianni
Copy link
Contributor

sirianni commented Jan 29, 2024

Component(s)

exporter/datadog

What happened?

Description

We are observing a memory leak in the Datadog Exporter which appears to be in the trace/apm stats component.

The pattern we see is that when the collector restarts, memory grows steadily until GOMEMLIMIT is reached
image

We consider this a "leak" vs. a large working set because of the slow growth in heap usage.

Looking at a pprof of the collector process we see the majority of heap is used by the datadog-agent/trace/stats module.

$ go tool pprof -top -cum ~/tmp/mothership-pprof-heap.out
File: collector
Build ID: 38f9c6cdffde4b9f1487841c8b484fa5cf2787fb
Type: inuse_space
Time: Jan 29, 2024 at 9:59am (EST)
Showing nodes accounting for 376.01MB, 93.97% of 400.12MB total
Dropped 263 nodes (cum <= 2MB)
      flat  flat%   sum%        cum   cum%
         0     0%     0%   294.57MB 73.62%  github.com/DataDog/datadog-agent/pkg/trace/stats.(*Concentrator).Add
         0     0%     0%   294.57MB 73.62%  github.com/DataDog/datadog-agent/pkg/trace/stats.(*Concentrator).Run.func1
    0.67MB  0.17%  0.17%   294.57MB 73.62%  github.com/DataDog/datadog-agent/pkg/trace/stats.(*Concentrator).addNow
         0     0%  0.17%   293.90MB 73.45%  github.com/DataDog/datadog-agent/pkg/trace/stats.(*RawBucket).HandleSpan
   23.51MB  5.87%  6.04%   293.90MB 73.45%  github.com/DataDog/datadog-agent/pkg/trace/stats.(*RawBucket).add
         0     0%  6.04%   195.39MB 48.83%  github.com/DataDog/sketches-go/ddsketch.(*DDSketch).Add (inline)
         0     0%  6.04%   195.39MB 48.83%  github.com/DataDog/sketches-go/ddsketch.(*DDSketch).AddWithCount
         0     0%  6.04%   195.39MB 48.83%  github.com/DataDog/sketches-go/ddsketch/store.(*CollapsingLowestDenseStore).AddWithCount
  195.39MB 48.83% 54.88%   195.39MB 48.83%  github.com/DataDog/sketches-go/ddsketch/store.(*CollapsingLowestDenseStore).extendRange
         0     0% 54.88%   195.39MB 48.83%  github.com/DataDog/sketches-go/ddsketch/store.(*CollapsingLowestDenseStore).normalize
   10.50MB  2.62% 57.50%    75.01MB 18.75%  github.com/DataDog/datadog-agent/pkg/trace/stats.newGroupedStats
         0     0% 57.50%    64.50MB 16.12%  github.com/DataDog/sketches-go/ddsketch.LogCollapsingLowestDenseDDSketch
      37MB  9.25% 66.75%       37MB  9.25%  github.com/DataDog/sketches-go/ddsketch/store.NewCollapsingLowestDenseStore (inline)

image

We have already disabled these flags in the Datadog Exporter.

    traces:
      compute_stats_by_span_kind: false
      peer_tags_aggregation: false

Collector version

v0.90.1

Additional context

Seems similar to #15720

When we send the same workload of traces via the Datadog Agent, we do not see a similar memory leak. I thought this was worth noting since my understanding is that the two use the same library to calculate the trace stats.

@sirianni sirianni added bug Something isn't working needs triage New item requiring triage labels Jan 29, 2024
@github-actions github-actions bot added the exporter/datadog Datadog components label Jan 29, 2024
Copy link
Contributor

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@dineshg13
Copy link
Member

dineshg13 commented Jan 29, 2024

Thanks for reporting. Can you please test this with collector after setting [trace buffer] (

TraceBuffer int `mapstructure:"trace_buffer"`
) in datadog exporter to 1000.

Could also please share the full collector config.

@sirianni
Copy link
Contributor Author

Our full collector config is massive. Here is the config for the Datadog exporter

  datadog:
    metrics:
      resource_attributes_as_tags: true
      instrumentation_scope_metadata_as_tags: true
      summaries:
        mode: noquantiles
    traces:
      compute_stats_by_span_kind: false
      peer_tags_aggregation: false
      # Buffer traces to avoid "Payload in channel full" errors
      trace_buffer: 10
    host_metadata:
      enabled: false
    sending_queue:
      queue_size: 200

@songy23
Copy link
Member

songy23 commented Jan 29, 2024

@sirianni 👋 QQ

When we send the same workload of traces via the Datadog Agent, we do not see a similar memory leak. I thought this was worth noting since my understanding is that the two use the same library to calculate the trace stats.

What's the version of Datadog Agent that you use?
You're right Datadog Agent and Datadog exporter use the same library to calculate the trace stats. We want to double check the versions

@sirianni
Copy link
Contributor Author

We are currently running Datadog Agent 6.45.1.

@sirianni
Copy link
Contributor Author

Thanks @dineshg13 - I will experiment with the change. Can you help me understand how increasing trace_buffer would affect memory usage of the DD APM stats?

@sirianni
Copy link
Contributor Author

When we send the same workload of traces via the Datadog Agent, we do not see a similar memory leak.

I'm actually not able to confirm this is the case. We run the DD Agent with larger k8s memory limits so it's possible it's exhibiting a similar issue and we just haven't noticed.

@sirianni
Copy link
Contributor Author

@dineshg13 setting trace_buffer: 1000 did not fix the leak or show any meaningful improvement in memory usage.

@dineshg13
Copy link
Member

@sirianni Are you using Datadog connector ? It would be helpful to have share the full collector config.

@sirianni
Copy link
Contributor Author

@sirianni Are you using Datadog connector ? It would be helpful to have share the full collector config.

Yes we are. The full config is quite complicated because we have applications that dual write traces to both our OTel Collector and the DD Agent. We are doing this to allow for a graceful transition from DD Agent to OTel Collector.

What we are trying to accomplish for OpenCensus is:

  • Convert OpenCensus traces to OTel in the collector
  • Generate DD APM stats from these converted traces
  • Send only the APM stats for these traces to DD
  • We don't send the raw traces to DD because these would cause duplicate SpanIDs (since the applications themselves are dual-exporting to both DD Agent and our collector)

Effectively we're doing something like this.

piplelines:
  traces/opencensus:
    receivers: 
    - opencensus
    exporters:
    - datadog/trace_stats_connector
  traces/otel:
    receivers:
    - otlp
    exporters:
    - datadog
  metrics:
    receivers:
    - datadog/trace_stats_connector
    exporters:
    - datadog

I noticed that the datadogconnector does not support any configuration options compute_stats_by_span_kind or peer_tags_aggregation

@dineshg13
Copy link
Member

dineshg13 commented Jan 30, 2024

@sirianni Thanks for the details. We have fixed datadogconnector to support compute_stats_by_span_kind or peer_tags_aggregation. It should be available in the next release.

Are you using the feature gate in the connector ? We should enable connector.datadogconnector.performance if you are using connector for performance reasons. Please checkout the README

Do you expect to compute stats in ur traces/otel pipeline ?

@sirianni
Copy link
Contributor Author

Do you expect to compute stats in ur traces/otel pipeline ?

Yes, but these are computed by the datadog exporter. We're only using the datadogconnector for OpenCensus traces because we only want APM metrics for those, and not to export the traces themselves.

@sirianni
Copy link
Contributor Author

sirianni commented Jan 31, 2024

Are you using the feature gate in the connector ?

No, but the feature gate looks to make things more efficient by encoding metrics more efficiently as raw bytes. To be clear, the symptoms here look like a memory leak and not just "high usage". So while enabling the feature gate may make it take longer before the heap is exhausted, it would not seem to fix the root cause of the leak itself.

Also, in our case the heap was consumed by datadog-agent/pkg/trace/stats.(*Concentrator) which wouldn't seem to be affected by the featuregate. As I understand it, the featuregate controls the encoding of metrics to downstream pipelines and the Concentrator is the internal data structure used to maintain the APM stats.

mx-psi pushed a commit that referenced this issue Feb 5, 2024
)

**Description:** <Describe what has changed.>
<!--Ex. Fixing a bug - Describe the bug and how this fixes the issue.
Ex. Adding a feature - Explain what this achieves.-->

Datadog Connector is creating two instances of Trace Agent, one in the
trace-to-metrics pipeline and another in the traces-to-traces pipeline.
The PR separates the trace-to-trace connector, simplifying the logic,
this avoid un-necessary serialization.

**Link to tracking Issue:** <Issue number if applicable>

#30828

#30487

**Testing:** <Describe what testing was performed and which tests were
added.>

**Documentation:** <Describe the documentation added.>
@kamal-narayan
Copy link

@dineshg13 Are there fixes other than #31026 identified for this issue?

@mackjmr
Copy link
Member

mackjmr commented Feb 6, 2024

@kamal-narayan, for now this is the only fix that has been shipped. We are still load testing this PR, and will report back here once we have more results.

@mackjmr
Copy link
Member

mackjmr commented Feb 7, 2024

Connector Not in Trace to Trace Pipeline

@sirianni the stats code (datadog-agent/pkg/trace/stats.(*Concentrator)) that we see in the profiles you shared will create aggregation keys/buckets based on env, host, version, service...etc. This means that with a higher cardinality being sent to a collector, we expect a higher memory usage.

We've load tested the collector with multiple cardinalities, but were not able to reproduce the memory leak. The two noteworthy points of the load tests were:

  • The profiles that we collected match the ones that you have sent above, where the stats code (datadog-agent/pkg/trace/stats.(*Concentrator)) / extendRange is the largest consumer of live heap memory.
  • The memory of the collector never seems to exhibit a leak. A higher cardinality does lead to a higher memory usage by the stats module, but the memory usage always stabilises.

We consider this a "leak" vs. a large working set because of the slow growth in heap usage.

During the slow growth in heap usage from the screenshot, were there any new hosts being spun up sending data to that collector ? I'm trying to understand if there was an increase in cardinality during that time period, which would lead to the stats module using more memory. In the graphs that you shared, the collector memory seems to stabilise at some point. At this point, did the cardinality being sent to this collector reach it's peak ?

I'm actually not able to confirm this is the case. We run the DD Agent with larger k8s memory limits so it's possible it's exhibiting a similar issue and we just haven't noticed.

Are you able to run a test with the collector by giving it more memory ?

I'm interested in whether the memory stabilises. If you experience symptoms of a memory leak, would you be able to generate new profiles and also output traces via the file exporter ? This would allow us to run our load tests using the traces that you shared to try to reproduce the memory leak.

Connector IN Trace to Trace Pipeline

In this case, we were able to identify and reproduce a memory leak when the connector is used in the trace to trace pipeline. This was fixed in the following following PR, which will be part of the next collector release.

@sirianni Based on the config you shared, you are not using connector in trace to trace pipeline so you should not be affected by this memory leak.

@songy23 songy23 removed the needs triage New item requiring triage label Feb 7, 2024
@diogotorres97
Copy link

Still happening here too 😢

@mackjmr
Copy link
Member

mackjmr commented Feb 15, 2024

@diogotorres97 please see: #30908 (comment)

@mx-psi mx-psi added data:traces Trace related issues priority:p1 High labels Mar 8, 2024
Copy link
Contributor

github-actions bot commented May 8, 2024

This issue has been inactive for 60 days. It will be closed in 60 days if there is no activity. To ping code owners by adding a component label, see Adding Labels via Comments, or if you are unsure of which component this issue relates to, please ping @open-telemetry/collector-contrib-triagers. If this issue is still relevant, please ping the code owners or leave a comment explaining why it is still relevant. Otherwise, please close it.

Pinging code owners:

See Adding Labels via Comments if you do not have permissions to add labels yourself.

@github-actions github-actions bot added the Stale label May 8, 2024
Copy link
Contributor

github-actions bot commented Jul 7, 2024

This issue has been closed as inactive because it has been stale for 120 days with no activity.

@github-actions github-actions bot closed this as not planned Won't fix, can't repro, duplicate, stale Jul 7, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

7 participants