
Metrics memory leak v1.12.0/v0.35.0 and up #3765

Closed
bthomee opened this issue Feb 22, 2023 · 20 comments · Fixed by open-telemetry/opentelemetry-go-contrib#4277
Labels: bug (Something isn't working)

Comments

bthomee commented Feb 22, 2023

Description

After upgrading from v1.11.2/v0.34.0 to v1.12.0/v0.35.0 I saw our pods' memory consumption increase continuously over time. Reverting to the former version brought memory use back to a stable level. Once v1.13.0/v0.36.0 was released I tried upgrading to that version, but got the same result. Once again, reverting to v1.11.2/v0.34.0 stopped the memory growth.

Refer to the screenshot below that shows memory use of one of our pods on AWS EKS, specifically:

  • On 1/31/23 I upgraded to v1.12.0/v0.35.0, and you can see memory immediately start to grow.
  • On 2/6/23 I made an unrelated release, which reset memory use, but it immediately started growing again.
  • On 2/8/23 I downgraded to v1.11.2/v0.34.0, after which memory use stayed stable again.
  • On 2/14/23 I upgraded to v1.13.0/v0.36.0, and once again saw memory increase. A few releases made afterwards temporarily reset memory use, as before.
  • On 2/21/23 I downgraded to v1.11.2/v0.34.0, restoring stable memory use.

[Grafana screenshot of the pod's memory use over the period described above]

Environment

  • OS: Linux
  • Architecture: AWS EKS
  • Go Version: 1.19
  • opentelemetry-go version: v1.12.0/v0.35.0 and up

Steps To Reproduce

  1. Upgrade to v1.12.0/v0.35.0 or higher.
  2. Update metrics from syncint64 etc. to their corresponding instrument versions (a rough sketch of this migration follows the list).
  3. See memory grow over time.
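
For step 2, the migration looks roughly like this (illustrative only; the instrument name is a placeholder and the exact option types differ between versions):

// Before (v1.11.2 / v0.34.0):
counter, err := meter.SyncInt64().Counter("requests.total")

// After (v1.13.0 / v0.36.0), where instruments live in go.opentelemetry.io/otel/metric/instrument:
counter, err := meter.Int64Counter("requests.total")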

Expected behavior

Memory stays stable.

bthomee added the bug (Something isn't working) label Feb 22, 2023
MrAlias (Contributor) commented Feb 22, 2023

What telemetry are you generating over these time periods? What instruments are you using and what attributes are you using?

MrAlias added the response needed (Waiting on user input before progress can be made) label Feb 22, 2023
bthomee (Author) commented Feb 22, 2023

Every 5s a health/live check is performed, which uses an int64 counter (+1) and an int64 histogram (+duration). Each function call does the same, with differently named counters/histograms.

The non-health/live check functions add a few constant attributes, such as the package name, but those functions are not currently called frequently - perhaps once a day. The health/live check function, which is the one contributing 99.9% of the metrics, adds just the request path (either /health or /live) as an attribute.
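
In sketch form (not the literal code; names are placeholders, and this assumes the v0.36.0-era metric API), the health/live check instrumentation has this shape:

import (
  "context"
  "time"

  "go.opentelemetry.io/otel/attribute"
  "go.opentelemetry.io/otel/metric/global"
)

// recordCheck is called every 5s for the health/live checks.
func recordCheck(ctx context.Context, path string, start time.Time) {
  meter := global.MeterProvider().Meter("server")
  // Instruments would normally be created once and reused; they are created
  // inline here only to keep the sketch short.
  counter, _ := meter.Int64Counter("healthcheck.count")
  histogram, _ := meter.Int64Histogram("healthcheck.duration_ms")

  attrs := []attribute.KeyValue{attribute.String("path", path)}
  counter.Add(ctx, 1, attrs...)
  histogram.Record(ctx, time.Since(start).Milliseconds(), attrs...)
}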

Aneurysm9 (Member) commented:

Are you able to obtain and share pprof heap dumps from the process when memory use is elevated?

bthomee (Author) commented Feb 22, 2023

I'll need a few days to find some time to enable pprof, update the package version, and capture the heap dumps - stay tuned.
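
(The plan is the standard net/http/pprof hookup, roughly the following; the port is arbitrary:)

import (
  "log"
  "net/http"
  _ "net/http/pprof" // registers the /debug/pprof handlers on the default mux
)

func startPprof() {
  go func() {
    // Serve the profiling endpoints on a separate port.
    log.Println(http.ListenAndServe("localhost:6060", nil))
  }()
}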

bthomee (Author) commented Feb 24, 2023

Over the course of some 6 hours I took 5 heap dumps using pprof; see the attachment. In those 6 hours only the live and health checks were called, every 5 seconds, which resulted in ~4 counters and histograms being updated with constant attributes (namely, a string containing the suffix of the URL - either /livez or /healthz - and a boolean indicating whether the call was a success, which it always was).

There are some functions worth looking at, it seems: prometheus.MakeLabelPairs and attribute.computeDistinctFixed.

heap.zip

MrAlias (Contributor) commented Feb 24, 2023

Image of the above data using go tool pprof:

[pprof call-graph image: profile001]

MrAlias removed the response needed label Feb 24, 2023
bthomee (Author) commented Feb 24, 2023

I deployed a fresh instance and will let it run until Monday so I can collect a new sample to give you even more to work with. If there are other pprof extracts you're interested in, let me know and I'll see what I can do.

MrAlias (Contributor) commented Feb 24, 2023

If you could get data with -inuse_objects as well that could be helpful.

MrAlias (Contributor) commented Feb 24, 2023

My initial thoughts:

  • We should look into whether we can provide a reusable Sortable during attribute set collection (see the sketch below).
  • We might need to double-check that we are creating and recording Prometheus data efficiently.
  • The old SDK's ability to record data for a batch of attributes might need to be re-added (this is something Java kept).

I'm interested to see what inuse_objects returns. That should help determine what is being allocated.
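
For the first point, the rough shape is the following (a sketch only, assuming the v1.13.0 attribute API; the SDK would still need to thread the buffer through its aggregation path, and a single buffer is not safe for concurrent use as written):

import "go.opentelemetry.io/otel/attribute"

// A single reusable sort buffer avoids allocating fresh scratch space every
// time a measurement's attributes are converted into an attribute.Set.
var sortable attribute.Sortable

func newAttrSet(kvs ...attribute.KeyValue) attribute.Set {
  return attribute.NewSetWithSortable(kvs, &sortable)
}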

bthomee (Author) commented Feb 25, 2023

I ran the tool with inuse_objects and then top to get the highest memory consumers, and got the following:

go tool pprof --inuse_objects https://[host]:[port]/debug/pprof/heap
Fetching profile over HTTP from https://[host]:[port]/debug/pprof/heap
/pprof.main_image.binary.alloc_objects.alloc_space.inuse_objects.inuse_space.001.pb.gz
File: main_image.binary
Type: inuse_objects
Entering interactive mode (type "help" for commands, "o" for options)
(pprof) top
Showing nodes accounting for 704254, 99.21% of 709865 total
Dropped 46 nodes (cum <= 3549)
Showing top 10 nodes out of 47
      flat  flat%   sum%        cum   cum%
    496994 70.01% 70.01%     496994 70.01%  github.com/prometheus/client_golang/prometheus.MakeLabelPairs
     76459 10.77% 80.78%      76459 10.77%  strings.(*Builder).grow
     32770  4.62% 85.40%      32770  4.62%  go.opentelemetry.io/otel/sdk/metric/internal.newBuckets
     32768  4.62% 90.02%      32768  4.62%  strconv.formatBits
     24588  3.46% 93.48%      24588  3.46%  go.opentelemetry.io/otel/attribute.computeDistinctFixed
     14044  1.98% 95.46%      14044  1.98%  github.com/prometheus/client_golang/prometheus.processMetric
     10923  1.54% 97.00%      10923  1.54%  github.com/prometheus/client_golang/prometheus.populateMetric
      6150  0.87% 97.86%       6150  0.87%  runtime.allocm
      5461  0.77% 98.63%       5461  0.77%  runtime.main
      4097  0.58% 99.21%       4097  0.58%  go.opentelemetry.io/otel/sdk/metric/internal.(*cumulativeHistogram[...]).Aggregation
(pprof) 

bthomee (Author) commented Feb 27, 2023

Another update:

Showing nodes accounting for 1653226, 98.38% of 1680435 total
Dropped 54 nodes (cum <= 8402)
Showing top 10 nodes out of 45
      flat  flat%   sum%        cum   cum%
    988529 58.83% 58.83%     988529 58.83%  github.com/prometheus/client_golang/prometheus.MakeLabelPairs
    316763 18.85% 77.68%     316763 18.85%  strings.(*Builder).grow
     73730  4.39% 82.06%      73730  4.39%  github.com/prometheus/client_golang/prometheus.(*constHistogram).Write
     65537  3.90% 85.96%      65537  3.90%  net/textproto.(*Reader).ReadMIMEHeader
     65536  3.90% 89.86%      65536  3.90%  strconv.formatBits
     45059  2.68% 92.54%      45059  2.68%  go.opentelemetry.io/otel/sdk/metric/internal.newBuckets
     41892  2.49% 95.04%      41892  2.49%  go.opentelemetry.io/otel/attribute.computeDistinctFixed
     23412  1.39% 96.43%      97145  5.78%  github.com/prometheus/client_golang/prometheus.processMetric
     21845  1.30% 97.73%      21845  1.30%  sort.Strings
     10923  0.65% 98.38%      10923  0.65%  github.com/prometheus/client_golang/prometheus.populateMetric

MrAlias (Contributor) commented Feb 27, 2023

This seems to indicate that the memory use is coming from the Prometheus client used by the Prometheus exporter. I'm not sure how much we can do to alleviate that, given that we do not control that code.

We should double-check that we are supporting the optimal label-creation path as best we can.

@dashpole is there anything that seems obvious to you?

dashpole (Contributor) commented:

Are you able to provide the output of curling the Prometheus endpoint? Reading the implementation, I don't see anything that would obviously be the culprit for this.

bthomee (Author) commented Feb 27, 2023

I think we're getting close to the solution. Curling the /metrics endpoint gave me "http_server" counter and histogram metrics repeated thousands upon thousands of times (until I quit the command). See the attached text file, where I grabbed the first 1000 records.

The main difference between each repeated block of the "http_server" metrics is in the attributes: the net_sock_peer_port keeps increasing. These metrics are not created by my app - they seem to be built-in.

metrics.txt

MrAlias (Contributor) commented Feb 27, 2023

This looks related to #3744

The peer port should not be on a server metric.[1]

Footnotes

  1. https://github.com/open-telemetry/opentelemetry-specification/blob/361c9491d1a8c66757739d533162aee68272f629/specification/metrics/semantic_conventions/http-metrics.md#metric-httpserverduration

bthomee (Author) commented Mar 2, 2023

What is your current suggestion in my situation? Hold off on updating until #3744 is fixed, or do I need to change something on my end?

MrAlias (Contributor) commented Mar 2, 2023

You could create a view for the instrument that filters out the attribute. Otherwise, yeah, I think the fix to #3744 is needed to fully remove the high-cardinality attribute.

MrAlias added this to the v1.15.0 milestone Mar 2, 2023
MrAlias modified the milestones: v1.15.0, v1.16.0 Apr 18, 2023
bitomaxsp commented:
@bthomee Would it be possible for you to provide the setup code you use in your service that creates the exporter and meter provider?

bthomee (Author) commented May 15, 2023

The following extract shows how we set up the exporter and meter provider:

// Create the metric exporter.
logger.Info("Creating metric exporter.")
metricExporter, err := prometheus.New()
if err != nil {
  logger.Panics(err)
}
// Create a view that filters out a number of dynamic attributes.
filteredView := metric.NewView(
  metric.Instrument{Name: "http.server.*"},
  metric.Stream{
    AttributeFilter: HttpAttributeFilter,
  },
)
// Create a metric provider using the exporter and filtered view.
metricProvider := metric.NewMeterProvider(
  metric.WithReader(metricExporter),
  metric.WithView(filteredView),
)
// Register the meter provider, so elsewhere we can call `global.MeterProvider()` to access it.
global.SetMeterProvider(metricProvider)
// Wrap the registry with a handler, so that collected metrics can be exported via HTTP.
metricHandler := promhttp.Handler()

The new addition we made, based on @MrAlias' suggestion, is adding that filtered view, which has worked like a charm.
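
HttpAttributeFilter is not shown above; an illustrative version (not our exact filter) simply drops the per-connection socket attributes that caused the cardinality explosion:

import "go.opentelemetry.io/otel/attribute"

// HttpAttributeFilter reports whether an attribute should be kept on the
// http.server.* streams; high-cardinality socket attributes are dropped.
func HttpAttributeFilter(kv attribute.KeyValue) bool {
  switch kv.Key {
  case "net.sock.peer.port", "net.sock.peer.addr":
    return false
  default:
    return true
  }
}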

aboryslawski commented:
Was this resolved in 1.17.0?
