
Memory leak in go-metrics library #530

Closed · galen0624 opened this issue Aug 9, 2018 · 19 comments
@galen0624 (Collaborator)

We are running Fabio 1.5.9 in a production environment that has 1,200+ services. We have observed what appears to be a memory leak (see the example graph). The graph shows a lab environment, but the growth rate in production is the same. We have to redeploy/restart the Fabio instances every couple of days because of this ongoing issue. If memory consumption grows above 60% (of 8G total on the box), Fabio starts to experience delays of over a minute in route updates.

We did a pprof memory profile of Fabio (5-minute and 12-hour). The profile seems to indicate that the issue is within metrics; see the attached profiles. We are currently running Fabio without metrics to see whether that has any impact. I will update when I gather more data.

Attachments: mem-5min.pdf, mem-12h.pdf, example memory-growth graph

@murphymj25
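For context, a minimal sketch of how a heap profile like the ones attached above can be captured from a running Go service, assuming the net/http/pprof handlers are exposed (a generic illustration; Fabio's own profiling hooks may differ):

package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers the /debug/pprof/* handlers on the default mux
)

func main() {
	// With this listener up, a heap profile can be pulled with:
	//   go tool pprof http://localhost:6060/debug/pprof/heap
	log.Fatal(http.ListenAndServe("localhost:6060", nil))
}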

@galen0624 (Collaborator, author)

We disabled Metrics and let Fabio run for 24 hours. The memory consumption was level.

@aaronhurt (Member)

Thank you for that detailed analysis. I'll try to dig in and review the metrics code and see if I can find the leak over the next few days.

@aaronhurt (Member)

Doing a bit more digging I think this relates to #476 and provides a bit of a push to get that going.

@galen0624 (Collaborator, author)

I agree with pushing #476 up; I think it's slated for 1.6. We would rather see the raw metrics instead of rolled-up ones. I was looking into rcrowley/go-metrics, and my colleague and I think the problem is related to the churn of node health. We have over 1,200 services registered in Consul, so node health is always updating. That means the blocking query from Fabio to Consul is effectively pointless in our environment because the index is always being updated, which in turn means the metric sets are updated constantly. We are wondering whether go-metrics cleans up the old sets when a node's health refreshes.
To put it another way: if a node's health flaps and the metrics package doesn't reuse the old data set for that node, it might be keeping those stale records. In our environment that would create a lot of stale data and would explain the memory leak.
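A hypothetical sketch of the suspected pattern (not Fabio's actual code, and the metric name is made up): a per-service meter is dropped and re-created as node health flaps, and if the old meter is never stopped, go-metrics keeps a reference to it and the stale data accumulates.

package main

import (
	"fmt"

	metrics "github.com/rcrowley/go-metrics"
)

func main() {
	r := metrics.NewRegistry()
	for flap := 0; flap < 3; flap++ {
		// A health flap re-creates the per-service meter under the same name.
		m := metrics.GetOrRegisterMeter("route.myservice.host1", r)
		m.Mark(1)
		// Older go-metrics releases only removed the name from the registry map
		// here; the meter itself was never stopped, so its samples stayed reachable.
		r.Unregister("route.myservice.host1")
	}
	fmt.Println("three meters created, none explicitly stopped")
}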

@aaronhurt (Member) commented Aug 10, 2018

I agree it's related to go-metrics. I was looking through their issue feed and found a few older issues that may be related:

rcrowley/go-metrics#201
rcrowley/go-metrics#163
rcrowley/go-metrics#197
rcrowley/go-metrics#206

Another project referenced in the go-metrics issues:
IBM/sarama#897

The upstream fixes in 197 and 206 appear to have landed in a later version of go-metrics than the one currently vendored into Fabio. We may be able to resolve the leak by updating the vendored version.

@galen0624 (Collaborator, author) commented Aug 13, 2018

I vendored in the latest from rcrowley/go-metrics and turned metrics back on. I'll watch the memory consumption over the next day and update the issue with my findings. If it looks good and Fabio performs as expected, I'll open a PR from my fork/branch.

{"path":"github.com/rcrowley/go-metrics","checksumSHA1":"an5RM8wjgPPloolUUYkvEncbHu4=","revision":"e2704e165165ec55d062f5919b4b29494e9fa790","revisionTime":"2018-05-03T17:46:38Z"},

galen0624 added a commit to galen0624/fabio that referenced this issue on Aug 13, 2018: …revision:e2704e165165ec55d062f5919b4b29494e9fa790. Issue - fabiolb#530
@galen0624 (Collaborator, author) commented Aug 14, 2018

After vendoring in the latest go-metrics, it appears that the memory leak is still present. Digging further into the issue, the NewMeter() function is consuming most of the memory (github.com/rcrowley/go-metrics/meter.go).

The "constructor" function has a new comment stipulating that Stop() should be called once the meter isn't needed any more. -- see below.

// NewMeter constructs a new StandardMeter and launches a goroutine.
// Be sure to call Stop() once the meter is of no use to allow for garbage collection.
func NewMeter() Meter {
-- SNIP
	m := newStandardMeter()
-- SNIP
}

Looking through the Fabio code, I don't see where the metrics Stop() func is called. If Stop() isn't called, then our assumption about route/node health churn might hold true, based on the comment in the NewMeter function.

I'm wondering if that can be added to func (t Table) delRoute(d *RouteDef) error in route/table.go. I haven't dug into the metrics code enough to see where it would best be added.
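For reference, a minimal sketch of the pairing that doc comment asks for, as a standalone illustration rather than Fabio code:

package main

import (
	metrics "github.com/rcrowley/go-metrics"
)

func main() {
	// NewMeter starts background bookkeeping that holds a reference to the
	// meter, so it stays reachable until Stop() is called.
	m := metrics.NewMeter()
	m.Mark(1)

	// Calling Stop() when the meter is no longer needed is what allows it to
	// be garbage collected, per the comment quoted above.
	m.Stop()
}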

@aaronhurt (Member)

Right, that's in line with what I was reading. The Stop() func for the value returned by NewMeter() was added after go-metrics was pulled into Fabio.

@galen0624 (Collaborator, author)

Digging further, it looks like the rcrowley/go-metrics Registry interface has been updated in the latest release with new method signatures, so the gmRegistry struct methods in metrics/gometrics.go will have to be updated.

One of the new method signatures, GetOrRegister(string, interface{}) interface{}, can return a Meter interface that has the Stop() func, which I believe should be used. I'm a little new to Go, so I can take a shot at this, but I'm looking for confirmation that I'm heading down the right path.
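A minimal sketch of that idea, assuming the updated Registry interface (the metric name is made up): GetOrRegister hands back an interface{} that can be type-asserted to a Meter so Stop() becomes callable when the metric is retired.

package main

import (
	"fmt"

	metrics "github.com/rcrowley/go-metrics"
)

func main() {
	r := metrics.NewRegistry()

	// The second argument may be a constructor; it is only invoked if the
	// name is not registered yet.
	i := r.GetOrRegister("route.myservice", metrics.NewMeter)

	if m, ok := i.(metrics.Meter); ok {
		m.Mark(1)
		m.Stop() // release the meter so it can be garbage collected
		fmt.Println("meter stopped")
	}
}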

@aaronhurt (Member)

I definitely think you're on the right path.

@galen0624 (Collaborator, author)

I was totally wrong about updating the gmRegistry struct, as that implements the Fabio metrics Registry, not the rcrowley/go-metrics Registry.

It looks like the appropriate spot to do this is in func syncRegistry(t Table) in route/table.go, just before the ServiceRegistry.Unregister(name) call.

@aaronhurt (Member)

That sounds logical. Unregistering the service should definitely stop collecting metrics on said service.

@galen0624 (Collaborator, author)

It already does that, but I don't see where the Stop() func is called from the Unregister function, so my guess is that the data set is still hanging around and can't be garbage collected.
// Be sure to call Stop() once the meter is of no use to allow for garbage collection.

We may have to call Stop() in addition to Unregister.

@galen0624 (Collaborator, author)

Again, I was wrong in my previous statement: Unregister eventually does call a Stop() func, so I'm not sure what's wrong. I did update the code as follows.

In metrics/registry.go:

// Timer defines a metric for counting and timing durations for events.
type Timer interface {
	// Stop allows the metric to be garbage collected.
	Stop()

	// (existing Timer methods omitted here)
}

And in route/table.go:

func syncRegistry(t Table) {
--SNIP

	// unregister inactive timers
	for name, active := range timers {
		if !active {
			// stop the timer first so its backing metric can be garbage collected
			tmr := ServiceRegistry.GetTimer(name)
			tmr.Stop()
			ServiceRegistry.Unregister(name)
			log.Printf("[INFO] Unregistered timer %s", name)
		}
	}
}

I redeployed it in the lab to see if this has any impact. I'll update tomorrow if it does.

galen0624 added a commit to galen0624/fabio that referenced this issue on Aug 15, 2018: …ackage to version - revision:e2704e165165ec55d062f5919b4b29494e9fa790
@galen0624 (Collaborator, author)

I realized earlier today that I had a mistake in my GOPATH. I am currently re-testing the updated github.com/rcrowley/go-metrics package. After a few hours things look promising. I will update tomorrow with a definitive answer. If all goes well, I will open a PR.

@galen0624 (Collaborator, author)

After a roughly 24-hour run, it looks like the memory leak is fixed by vendoring in the latest from github.com/rcrowley/go-metrics. Below is an example graph. I'll submit a PR later today.

Attachment: screenshot of the memory-usage graph after the fix (2018-08-16, 8:24 AM)

@aaronhurt (Member)

Awesome, great news. Thank you for all the work and analytics on this issue.

@magiconair (Contributor)

Thanks @galen0624 for the digging and the subsequent fix. Time to pick up the slack on Fabio again. I need to back out the FastCGI change, which currently breaks the build, and then we can push a new release.

aaronhurt changed the title from "Memory Leak In Metrics" to "Memory leak in go-metrics library" on Aug 16, 2018
@galen0624 (Collaborator, author)

No worries. Glad to help.

magiconair added this to the 1.5.10 milestone on Sep 19, 2018