Memory leak in go-metrics library #530
We disabled metrics and let Fabio run for 24 hours. The memory consumption was level.
Thank you for that detailed analysis. I'll try to dig in and review the metrics code and see if I can find the leak over the next few days.
Doing a bit more digging, I think this relates to #476 and provides a bit of a push to get that going.
I agree with pushing #476 up; I think it's slated for 1.6. We would rather see the raw metrics instead of rolled-up ones. I was looking into rcrowley/go-metrics, and my colleague and I think it's related to the churn of node health. We have over 1,200 services registered in Consul, so node health is always updating. That means the blocking query from Fabio to Consul is somewhat pointless in our environment because the index is constantly being bumped, which in turn means the metric sets are updated constantly. We are wondering whether go-metrics fails to clean up the old sets when a node's health refreshes.
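A rough Go sketch of the suspected pattern, with made-up route/target names (this is not fabio's actual code): every health update can replace a target, so a fresh meter gets registered and the old one is dropped, and the open question is whether go-metrics fully releases the dropped meters.

```go
package main

import (
	"fmt"

	metrics "github.com/rcrowley/go-metrics"
)

// churn mimics constant node-health updates: every update can introduce a
// new target for a service, which means a fresh meter under a fresh name,
// while the meter for the previous target becomes stale.
func churn(r metrics.Registry, updates int) {
	for i := 0; i < updates; i++ {
		name := fmt.Sprintf("route.svc-a.target-%d", i)
		r.Register(name, metrics.NewMeter())
		if i > 0 {
			// The open question: does dropping the stale entry actually
			// release everything the library allocated for that meter?
			r.Unregister(fmt.Sprintf("route.svc-a.target-%d", i-1))
		}
	}
}

func main() {
	churn(metrics.NewRegistry(), 10000)
}
```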
I agree it's related to go-metrics. I was looking through their issue feed and found a few older issues that may be related: rcrowley/go-metrics#201, plus another project referenced in the go-metrics issues ... The upstream fixes in 197 and 206 look to have landed in a later version of go-metrics than is currently vendored into fabio. We may be able to get some resolution of the leak by updating the vendored version.
I vendored in the latest from rcrowley/go-metrics and turned metrics back on. I'll watch the memory consumption over the next day and update the issue with findings. I'll PR my fork/branch if it looks good and Fabio performs as expected.
…revision:e2704e165165ec55d062f5919b4b29494e9fa790. Issue - fabiolb#530
After vendoring in the latest go-metrics, it appears the memory leak still persists. Digging further into the issue, the NewMeter() function is consuming most of the memory. The "constructor" function has a new comment stipulating that Stop() must be called once the meter is no longer needed so it can be garbage-collected.
Looking through the fabio code, I don't see where the metrics Stop() func is called. If Stop isn't called, then our assumption about route/node-health churn might hold true based on the comments in the NewMeter function. I'm wondering where that call could be added.
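For reference, here is a minimal, self-contained sketch of the lifecycle those comments appear to require, assuming the newer vendored API where Meter exposes Stop() (names and timings are illustrative):

```go
package main

import (
	"time"

	metrics "github.com/rcrowley/go-metrics"
)

func main() {
	// NewMeter hands back a meter that the library keeps ticking in the
	// background until it is explicitly stopped.
	m := metrics.NewMeter()
	m.Mark(1)
	time.Sleep(time.Second)
	_ = m.Rate1()

	// Without this call the library keeps a reference to the meter even
	// after it has been unregistered from every registry, which is the
	// leak pattern suspected here.
	m.Stop()
}
```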
Right, that was in line with what I was reading.
Digging further, it looks like one of the new method signatures is relevant here.
I definitely think you're on the right path.
I was totally wrong about updating the gmRegistry struct, as that implements the fabio metrics Registry, not the rcrowley/go-metrics Registry. It looks like the appropriate spot to do this is where the route/service gets unregistered.
That sounds logical. Unregistering the service should definitely stop collecting metrics on said service.
So it already does that, but I don't see where the meters actually get stopped. We may have to add that in addition to the Unregister.
Again, I was wrong with my previous statement. Unregister eventually does make that call; see metrics/registry.go and route/table.go.
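As a rough, paraphrased sketch of that cleanup path (not fabio's actual source): stop the metric if it supports Stop, so the library's background ticker drops its reference, then remove it from the registry. The newer go-metrics reportedly performs this kind of check inside Unregister itself, which is why updating the vendored copy matters.

```go
package main

import (
	metrics "github.com/rcrowley/go-metrics"
)

// stoppable mirrors the interface check the newer go-metrics is reported
// to perform before removing a metric from the registry.
type stoppable interface{ Stop() }

// unregisterRoute stops the metric (if it can be stopped) and then removes
// it from the registry so the name can be reused without leaking the old
// meter's background state.
func unregisterRoute(r metrics.Registry, name string) {
	if m, ok := r.Get(name).(stoppable); ok {
		m.Stop()
	}
	r.Unregister(name)
}

func main() {
	r := metrics.NewRegistry()
	r.Register("route.svc-a.target-1", metrics.NewMeter())
	unregisterRoute(r, "route.svc-a.target-1")
}
```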
I redeployed it in the lab to see if this has any impact. I'll update tomorrow if it does.
…ackage to version - revision:e2704e165165ec55d062f5919b4b29494e9fa790
I realized earlier today that I had a mistake in my GOPATH. I am currently re-testing with the updated vendored go-metrics.
Awesome, great news. Thank you for all the work and analytics on this issue.
Thanks @galen0624 for the digging and the subsequent fix. Time to pick up the slack on fabio again. I need to back out the FastCGI change, which currently breaks the build, and then we can push a new release.
No worries. Glad to help.
We are running Fabio 1.5.9 in a production environment that has 1,200+ services. We have observed what appears to be a memory leak (see example graph). The graph shows a lab environment, but the growth rate in production is the same. We have to redeploy/restart the Fabio instance every couple of days due to this ongoing issue. If memory consumption grows above 60% (8 GB total on the box), Fabio starts to experience route-update delays of over 1 minute.
We did a pprof memory profile of Fabio (5 minute and 12 hour). The profiles seem to indicate that the issue is within the metrics code. See attached; a generic sketch of one way to capture such a heap profile follows the attachment links. We are currently running Fabio without metrics to see if that has any impact. I will update when I gather more data.
mem-5min.pdf
mem-12h.pdf
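The exact way the attached profiles were captured isn't stated; as a generic sketch (the output file name is made up), one way to write a heap profile from inside a Go process, for later inspection with go tool pprof, looks like this:

```go
package main

import (
	"log"
	"os"
	"runtime"
	"runtime/pprof"
)

// writeHeapProfile dumps the current heap allocations to path so the file
// can be analyzed afterwards with `go tool pprof`.
func writeHeapProfile(path string) error {
	f, err := os.Create(path)
	if err != nil {
		return err
	}
	defer f.Close()

	runtime.GC() // get up-to-date allocation data before writing
	return pprof.WriteHeapProfile(f)
}

func main() {
	if err := writeHeapProfile("fabio-heap.pprof"); err != nil {
		log.Fatal(err)
	}
}
```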
@murphymj25