Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Tuning performance and prevent leak due to time.After #484

Closed
wants to merge 7 commits into from

Conversation

linxGnu
Copy link

@linxGnu linxGnu commented Jan 7, 2022

First of all, thank you so much for incredible library.

This PR is an attempt to improve the performance and prevent high memory usage due to timer leak without calling Stop.

Motivation

Many hot paths in Raft library introduce time.After usage. While getting out of scope (but not timeout), those timers are not stopped/recycled by GC yet (until event is fired and pushed to channel); thus pressure to runtime with unnecessary ops.

Moreover, time.After actually allocates new time.Timer. There is no benefit of using time.After over time.Timer.

The patch

Objectives:

  • To make viewer easier to read
  • Reduce code changes as much as possible
  • Reusable

The patch introduces wrapper for timer which acts similar as time.After. Just a note that we need to stop it when getting out of scope (manually or using defer)

I would like to avoid defer timer.Stop() as much as possible. It's fine for some case with recent optimization introduced in go1.16, but not all.

Benchmark

go version: 1.17.6
CPU: AMD Ryzen 5850U
goos: linux
goarch: amd64
pkg: github.com/hashicorp/raft
BenchmarkTimeAfter-16                295           3968892 ns/op         2100006 B/op      30000 allocs/op
BenchmarkTimer-16                    451           2738084 ns/op         2000056 B/op      30000 allocs/op
const timerBenchAttempt = 10000

func BenchmarkTimeAfter(b *testing.B) {
	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		for attempt := 0; attempt < timerBenchAttempt; attempt++ {
			select {
			case <-time.After(50 * time.Millisecond):
			default:
			}
		}
	}
}

func BenchmarkTimer(b *testing.B) {
	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		for attempt := 0; attempt < timerBenchAttempt; attempt++ {
			timer := newTimer(50 * time.Millisecond)
			select {
			case <-timer.C:
			default:
				timer.Stop()
			}
		}
	}
}

@hashicorp-cla
Copy link

hashicorp-cla commented Jan 7, 2022

CLA assistant check
All committers have signed the CLA.

Copy link
Contributor

@dnephin dnephin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for working on this improvement!

Many of these calls to time.After are safe in practice, because the only way for the function to exit without the timer firing is when raft shuts down. I think in general we expect the entire process to exit soon after raft is shutdown, so in most cases that leak is not a concern, and not a risk. But that assumption may not always be true, so it's probably good to fix all of them. I think there is also a good argument to be made for fixing these cases as a way of demonstrating good practices.

There are definitely a couple cases of time.After that are concerning, specifically the ones snapshot.go, replication.go, and raft.go.

My main concern with the changes in this PR is the new wrapper around time.Timer. I think we can fix these calls without introducing the new type, which will be easier to maintain in the long term. The first comment suggests how that might be possible.

I'd also like to request that you split up all the cleanup changes into at least a separate commit, and possibly a separate PR. Those clean ups are appreciated, but it makes the core changes harder to review when they are mixed into the same commit.

api.go Outdated
Comment on lines 740 to 741
timer := newTimer(timeout)
defer timer.Stop()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can accomplish the same thing without wrapping time.Timer

Suggested change
timer := newTimer(timeout)
defer timer.Stop()
var timer <-chan time.Time
if timeout > 0 {
t := time.Newtimer(timeout)
defer t.Stop()
timer = t.C
}

This is a few extra lines, but may actually result in a smaller diff. I think in the long term not having a wrapper around time.Timer will be easier to maintain.

Copy link
Author

@linxGnu linxGnu Jan 13, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hello @dnephin Thank you for your careful review.

Using std timer in the first place is also my consider. However this would result in lot of code duplication. Without new wrapper type, we always have to care about:

if timeout > 0

and

timer = t.C // or something similar

My approach is that: using a wrapper but keep using of channel, by exposing t.C. In the future, new logic code could reuse this pattern instead of implement its own timeout handling and instead of cloning the base logic.

How do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When reviewing this PR, I had to pause and locally test the behaviour of timerWrapper to gain confidence that it would behave how I expect (and found a potential bug with (timerWrapper).Reset).

From a maintainer's perspective, I think code duplication is an acceptable cost here. The wrapper certainly saves many lines of code but may increase developer time spent reading and understanding the wrapper type.

I'd like to request that we go with @dnephin's suggestion for using the std lib directly

err = fmt.Errorf("command timed out")
}

timer.Stop()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's currently safe to call Stop here, but in the future we could add other returns. It might be good to keep this as a defer after creating the timer, both for consistency, and to prevent future regressions. Is there a reason that defer won't work in this case?

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with you.

Not using defer here is just a trade-off between safety (future changes) and performance (tiny improvement). Obviously, not calling timer.Stop would result in the same manner as leaking time.After.

I would fix that as your suggestion. Thank you so much.

Comment on lines 140 to 141
if e := c.GetLog(uint64(i), &out); e != nil {
t.Fatalf("err: %v", e)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What's the reason for this change? Generally err is the correct variable name to use for errors. If it's because it's masking err from the parent scope, I think that's not really a concern in this test.

Comment on lines 92 to 96
atomic.AddInt32(&errCounter, 1)
t.Errorf("err: %v", err)
} else if !reflect.DeepEqual(resp, out) { // Verify the response
atomic.AddInt32(&errCounter, 1)
t.Errorf("command mismatch: %#v %#v", resp, out)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for cleaning up the calls to t.Fatal in a goroutine!

While this is good cleanup, I think it would be better to do this cleanup in a separate PR, and at the very least in a separate commit. Keeping commits isolated to one change makes the git history much easier to read, and can make it easier to cherry-pick changes.

I'd also like to suggest a different approach to fixing this problem. Instead of an atomic counter, how about returning an error, and using https://pkg.go.dev/golang.org/x/sync/errgroup to run the goroutines. That is a pattern I've used before, and it works really well. That might look something like this:

	appendFunc := func() error {
		// return errors instead of t.Fatalf
	}

	g := new(errgroup.Group)
	for i = 0; i < 5; i++ {
		g.Go(appendFunc())
	}
	require.NoError(t, g.Wait())

This will remove the need for a sync.WaitGroup as well.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's great idea to use ErrGroup. I would make the fix in another PR as your suggestion.

util.go Outdated
Comment on lines 31 to 36
func randomTimeout(minVal time.Duration) timerWrapper {
if minVal == 0 {
return nil
return newTimer(0)
}
extra := (time.Duration(rand.Int63()) % minVal)
return time.After(minVal + extra)
return newTimer(minVal + extra)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can make this function work without the timerWrapper by returning a noopTimer implementation instead of nil. The noopTimer can have a method Stop that does nothing, and a field C that is a nil channel. Then we can return the noopTimer or a *time.Timer

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I love the idea of noopTimer. I would fix that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thinking about this more, I'm not sure how to return this from randomTimeout. We could use an interface, but interfaces don't allow fields. I'm not sure if there is any way to make this work.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's possible to use interface and follow SOLID principle here. But the trade-off might look like:

t.C -> t.C() // return channel through func 

If it looks okay to you, then I think interface is a good choice rather than concrete type

@linxGnu
Copy link
Author

linxGnu commented Jan 15, 2022

@dnephin Follow up your comments, I have reverted changes and keep the PR small as it should be. I believe the main concern left is timerWrapper. Could you please take a look.

Thank you very much.

@kisunji kisunji self-assigned this Jan 25, 2022
@@ -287,6 +288,9 @@ func (i *inmemPipeline) decodeResponses() {
case <-i.shutdownCh:
return
}

timer.Stop()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be deferred on L267 in case of an early return in the select block above

Copy link
Author

@linxGnu linxGnu Jan 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@kisunji if we put defer in L267, it will be defer in loop and absolutely bad practice. How do you think?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ah, that's a good catch. We may have to call timer.Stop in all of the branches that return early then

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It depends on each logic. As far as I look at the source code. We just need to invoke stop at the end of select or just inside a branch.

util.go Outdated
Comment on lines 167 to 171
func (t *timerWrapper) Reset(d time.Duration) {
if t.t != nil {
t.t.Reset(d)
}
}
Copy link
Contributor

@kisunji kisunji Jan 26, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think Reset will need to handle the case where t == nil and call newTimer

t := newTimer(0)
t.Reset(1000) // no-op

t2 := time.Timer(0)
t2.Reset(1000) // resets timer to 1s

Currently, calling Reset will not behave like (*time.Timer).Reset in this case

api.go Outdated
Comment on lines 740 to 741
timer := newTimer(timeout)
defer timer.Stop()
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When reviewing this PR, I had to pause and locally test the behaviour of timerWrapper to gain confidence that it would behave how I expect (and found a potential bug with (timerWrapper).Reset).

From a maintainer's perspective, I think code duplication is an acceptable cost here. The wrapper certainly saves many lines of code but may increase developer time spent reading and understanding the wrapper type.

I'd like to request that we go with @dnephin's suggestion for using the std lib directly

@linxGnu
Copy link
Author

linxGnu commented Jan 26, 2022

@kisunji imho, code duplication cost might be a lot. We have to change not only one place in my PR but many (and in many files).

I dont think the wrapper breaks any std lib way. And its quite small. Easy to read and understand, no third party lib.

Dup code might introduce error prone or miss required specs.

@linxGnu
Copy link
Author

linxGnu commented Feb 7, 2022

Kindly ping @dnephin @kisunji, there is a fix and merged similar issue at Nomad. Should we continue on this or you wanna apply same mechanism (newSafeTimer)?

imho its no different between newsafetimer and newtimer here. Both of them require stop func invoking.

@kisunji
Copy link
Contributor

kisunji commented Feb 22, 2022

Hi @linxGnu, while this fix is valuable, more than one maintainer has expressed concerns around timerWrapper. We understand Nomad has implemented a wrapper for this use-case, but for Raft (which doesn't change frequently and is maintained by volunteers) we would like to minimize the introduction of new types to this codebase.

Internally we prefer that this PR uses stdlib directly to make the code obvious at a glance.

@linxGnu
Copy link
Author

linxGnu commented Mar 14, 2022

@kisunji @dnephin I addressed your comments although I am not convinced by copy-paste approach, no-reusing.

My patch also fix some small bugs of missing return when stop/terminate channel is notified. Please take a look.

if timeout > 0 {
timeoutCh = time.After(timeout)
timer = time.NewTimer(timeout)
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Inside a loop -> won't use defer here

@@ -402,9 +401,9 @@ func (r *Raft) heartbeat(s *followerReplication, stopCh chan struct{}) {
r.logger.Error("failed to heartbeat to", "peer", peer.Address, "error", err)
r.observe(FailedHeartbeatObservation{PeerID: peer.ID, LastContact: s.LastContact()})
failures++
select {
case <-time.After(backoff(failureWait, failures, maxFailureScale)):
case <-stopCh:
Copy link
Author

@linxGnu linxGnu Mar 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a bug when received from stopCh but does not return

case <-time.After(backoff(failureWait, s.failures, maxFailureScale)):
case <-r.shutdownCh:
if isTerminated(r.shutdownCh, backoff(failureWait, s.failures, maxFailureScale)) {
return
Copy link
Author

@linxGnu linxGnu Mar 14, 2022

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this is a bug when received from shutdownCh but does not return

@kisunji kisunji dismissed their stale review April 26, 2022 17:48

Concerns were addressed

@stale
Copy link

stale bot commented Sep 21, 2022

Hey there,
We wanted to check in on this request since it has been inactive for at least 90 days.
Have you reviewed the latest godocs?
If you think this is still an important issue in the latest version of the Raft library or
its documentation please feel let us know and we'll keep it open for investigation.
If there is still no activity on this request in 30 days, we will go ahead and close it.
Thank you!

@stale stale bot added the waiting-reply label Sep 21, 2022
@banks
Copy link
Member

banks commented Jan 15, 2024

Hi 👋

Just cleaning up some stale issues and PRs. We want to thank you again @linxGnu for the time you spent on this PR and responding to feedback.

At this point it's not very clear to us that these changes fix any real issues in this library - not ones that our usage has ever seen as a problem anyway. So on balance the cost and risk of checking, testing and merging the PR and releasing it into our products seems to outweigh any benefit we might get from it.

It always feels bad saying no to a contribution but we'd rather not leave this open forever if we're not likely to merge it!

Thanks again!

@banks banks closed this Jan 15, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants