Tuning performance and prevent leak due to time.After #484

linxGnu · 2022-01-07T05:38:56Z

First of all, thank you so much for incredible library.

This PR is an attempt to improve the performance and prevent high memory usage due to timer leak without calling Stop.

Motivation

Many hot paths in Raft library introduce time.After usage. While getting out of scope (but not timeout), those timers are not stopped/recycled by GC yet (until event is fired and pushed to channel); thus pressure to runtime with unnecessary ops.

Moreover, time.After actually allocates new time.Timer. There is no benefit of using time.After over time.Timer.

The patch

Objectives:

To make viewer easier to read
Reduce code changes as much as possible
Reusable

The patch introduces wrapper for timer which acts similar as time.After. Just a note that we need to stop it when getting out of scope (manually or using defer)

I would like to avoid defer timer.Stop() as much as possible. It's fine for some case with recent optimization introduced in go1.16, but not all.

Benchmark

go version: 1.17.6
CPU: AMD Ryzen 5850U
goos: linux
goarch: amd64
pkg: github.com/hashicorp/raft
BenchmarkTimeAfter-16                295           3968892 ns/op         2100006 B/op      30000 allocs/op
BenchmarkTimer-16                    451           2738084 ns/op         2000056 B/op      30000 allocs/op

const timerBenchAttempt = 10000

func BenchmarkTimeAfter(b *testing.B) {
	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		for attempt := 0; attempt < timerBenchAttempt; attempt++ {
			select {
			case <-time.After(50 * time.Millisecond):
			default:
			}
		}
	}
}

func BenchmarkTimer(b *testing.B) {
	b.ReportAllocs()
	b.ResetTimer()
	for i := 0; i < b.N; i++ {
		for attempt := 0; attempt < timerBenchAttempt; attempt++ {
			timer := newTimer(50 * time.Millisecond)
			select {
			case <-timer.C:
			default:
				timer.Stop()
			}
		}
	}
}

hashicorp-cla · 2022-01-07T05:38:59Z

All committers have signed the CLA.

dnephin

Thank you for working on this improvement!

Many of these calls to time.After are safe in practice, because the only way for the function to exit without the timer firing is when raft shuts down. I think in general we expect the entire process to exit soon after raft is shutdown, so in most cases that leak is not a concern, and not a risk. But that assumption may not always be true, so it's probably good to fix all of them. I think there is also a good argument to be made for fixing these cases as a way of demonstrating good practices.

There are definitely a couple cases of time.After that are concerning, specifically the ones snapshot.go, replication.go, and raft.go.

My main concern with the changes in this PR is the new wrapper around time.Timer. I think we can fix these calls without introducing the new type, which will be easier to maintain in the long term. The first comment suggests how that might be possible.

I'd also like to request that you split up all the cleanup changes into at least a separate commit, and possibly a separate PR. Those clean ups are appreciated, but it makes the core changes harder to review when they are mixed into the same commit.

dnephin · 2022-01-13T21:13:21Z

api.go

+	timer := newTimer(timeout)
+	defer timer.Stop()


I think we can accomplish the same thing without wrapping time.Timer

Suggested change

timer := newTimer(timeout)

defer timer.Stop()

var timer <-chan time.Time

if timeout > 0 {

t := time.Newtimer(timeout)

defer t.Stop()

timer = t.C

}

This is a few extra lines, but may actually result in a smaller diff. I think in the long term not having a wrapper around time.Timer will be easier to maintain.

Hello @dnephin Thank you for your careful review.

Using std timer in the first place is also my consider. However this would result in lot of code duplication. Without new wrapper type, we always have to care about:

if timeout > 0

and

timer = t.C // or something similar

My approach is that: using a wrapper but keep using of channel, by exposing t.C. In the future, new logic code could reuse this pattern instead of implement its own timeout handling and instead of cloning the base logic.

How do you think?

When reviewing this PR, I had to pause and locally test the behaviour of timerWrapper to gain confidence that it would behave how I expect (and found a potential bug with (timerWrapper).Reset).

From a maintainer's perspective, I think code duplication is an acceptable cost here. The wrapper certainly saves many lines of code but may increase developer time spent reading and understanding the wrapper type.

I'd like to request that we go with @dnephin's suggestion for using the std lib directly

dnephin · 2022-01-13T21:17:04Z

inmem_transport.go

 		err = fmt.Errorf("command timed out")
 	}
+
+	timer.Stop()


I think it's currently safe to call Stop here, but in the future we could add other returns. It might be good to keep this as a defer after creating the timer, both for consistency, and to prevent future regressions. Is there a reason that defer won't work in this case?

I agree with you.

Not using defer here is just a trade-off between safety (future changes) and performance (tiny improvement). Obviously, not calling timer.Stop would result in the same manner as leaking time.After.

I would fix that as your suggestion. Thank you so much.

dnephin · 2022-01-13T21:20:00Z

log_cache_test.go

+		if e := c.GetLog(uint64(i), &out); e != nil {
+			t.Fatalf("err: %v", e)


What's the reason for this change? Generally err is the correct variable name to use for errors. If it's because it's masking err from the parent scope, I think that's not really a concern in this test.

dnephin · 2022-01-13T21:27:07Z

net_transport_test.go

+				atomic.AddInt32(&errCounter, 1)
+				t.Errorf("err: %v", err)
+			} else if !reflect.DeepEqual(resp, out) { // Verify the response
+				atomic.AddInt32(&errCounter, 1)
+				t.Errorf("command mismatch: %#v %#v", resp, out)


Thank you for cleaning up the calls to t.Fatal in a goroutine!

While this is good cleanup, I think it would be better to do this cleanup in a separate PR, and at the very least in a separate commit. Keeping commits isolated to one change makes the git history much easier to read, and can make it easier to cherry-pick changes.

I'd also like to suggest a different approach to fixing this problem. Instead of an atomic counter, how about returning an error, and using https://pkg.go.dev/golang.org/x/sync/errgroup to run the goroutines. That is a pattern I've used before, and it works really well. That might look something like this:

appendFunc := func() error { // return errors instead of t.Fatalf } g := new(errgroup.Group) for i = 0; i < 5; i++ { g.Go(appendFunc()) } require.NoError(t, g.Wait())

This will remove the need for a sync.WaitGroup as well.

That's great idea to use ErrGroup. I would make the fix in another PR as your suggestion.

dnephin · 2022-01-13T21:31:37Z

util.go

+func randomTimeout(minVal time.Duration) timerWrapper {
 	if minVal == 0 {
-		return nil
+		return newTimer(0)
 	}
 	extra := (time.Duration(rand.Int63()) % minVal)
-	return time.After(minVal + extra)
+	return newTimer(minVal + extra)


I think we can make this function work without the timerWrapper by returning a noopTimer implementation instead of nil. The noopTimer can have a method Stop that does nothing, and a field C that is a nil channel. Then we can return the noopTimer or a *time.Timer

I love the idea of noopTimer. I would fix that.

Thinking about this more, I'm not sure how to return this from randomTimeout. We could use an interface, but interfaces don't allow fields. I'm not sure if there is any way to make this work.

It's possible to use interface and follow SOLID principle here. But the trade-off might look like:

t.C -> t.C() // return channel through func

If it looks okay to you, then I think interface is a good choice rather than concrete type

linxGnu · 2022-01-15T06:47:54Z

@dnephin Follow up your comments, I have reverted changes and keep the PR small as it should be. I believe the main concern left is timerWrapper. Could you please take a look.

Thank you very much.

kisunji · 2022-01-26T21:15:07Z

inmem_transport.go

@@ -287,6 +288,9 @@ func (i *inmemPipeline) decodeResponses() {
 			case <-i.shutdownCh:
 				return
 			}
+
+			timer.Stop()


This should be deferred on L267 in case of an early return in the select block above

@kisunji if we put defer in L267, it will be defer in loop and absolutely bad practice. How do you think?

Ah, that's a good catch. We may have to call timer.Stop in all of the branches that return early then

It depends on each logic. As far as I look at the source code. We just need to invoke stop at the end of select or just inside a branch.

kisunji · 2022-01-26T21:30:50Z

util.go

+func (t *timerWrapper) Reset(d time.Duration) {
+	if t.t != nil {
+		t.t.Reset(d)
+	}
+}


I think Reset will need to handle the case where t == nil and call newTimer

t := newTimer(0) t.Reset(1000) // no-op t2 := time.Timer(0) t2.Reset(1000) // resets timer to 1s

Currently, calling Reset will not behave like (*time.Timer).Reset in this case

kisunji · 2022-01-26T21:57:52Z

api.go

+	timer := newTimer(timeout)
+	defer timer.Stop()


When reviewing this PR, I had to pause and locally test the behaviour of timerWrapper to gain confidence that it would behave how I expect (and found a potential bug with (timerWrapper).Reset).

From a maintainer's perspective, I think code duplication is an acceptable cost here. The wrapper certainly saves many lines of code but may increase developer time spent reading and understanding the wrapper type.

I'd like to request that we go with @dnephin's suggestion for using the std lib directly

linxGnu · 2022-01-26T22:17:29Z

@kisunji imho, code duplication cost might be a lot. We have to change not only one place in my PR but many (and in many files).

I dont think the wrapper breaks any std lib way. And its quite small. Easy to read and understand, no third party lib.

Dup code might introduce error prone or miss required specs.

linxGnu · 2022-02-07T22:39:32Z

Kindly ping @dnephin @kisunji, there is a fix and merged similar issue at Nomad. Should we continue on this or you wanna apply same mechanism (newSafeTimer)?

imho its no different between newsafetimer and newtimer here. Both of them require stop func invoking.

kisunji · 2022-02-22T16:55:36Z

Hi @linxGnu, while this fix is valuable, more than one maintainer has expressed concerns around timerWrapper. We understand Nomad has implemented a wrapper for this use-case, but for Raft (which doesn't change frequently and is maintained by volunteers) we would like to minimize the introduction of new types to this codebase.

Internally we prefer that this PR uses stdlib directly to make the code obvious at a glance.

linxGnu · 2022-03-14T00:48:36Z

@kisunji @dnephin I addressed your comments although I am not convinced by copy-paste approach, no-reusing.

My patch also fix some small bugs of missing return when stop/terminate channel is notified. Please take a look.

linxGnu · 2022-03-14T00:40:35Z

inmem_transport.go

 			if timeout > 0 {
-				timeoutCh = time.After(timeout)
+				timer = time.NewTimer(timeout)


Inside a loop -> won't use defer here

linxGnu · 2022-03-14T00:44:56Z

replication.go

@@ -402,9 +401,9 @@ func (r *Raft) heartbeat(s *followerReplication, stopCh chan struct{}) {
 			r.logger.Error("failed to heartbeat to", "peer", peer.Address, "error", err)
 			r.observe(FailedHeartbeatObservation{PeerID: peer.ID, LastContact: s.LastContact()})
 			failures++
-			select {
-			case <-time.After(backoff(failureWait, failures, maxFailureScale)):
-			case <-stopCh:


this is a bug when received from stopCh but does not return

linxGnu · 2022-03-14T00:45:10Z

replication.go

-		case <-time.After(backoff(failureWait, s.failures, maxFailureScale)):
-		case <-r.shutdownCh:
+		if isTerminated(r.shutdownCh, backoff(failureWait, s.failures, maxFailureScale)) {
+			return


this is a bug when received from shutdownCh but does not return

Concerns were addressed

stale · 2022-09-21T04:45:30Z

Hey there,
We wanted to check in on this request since it has been inactive for at least 90 days.
Have you reviewed the latest godocs?
If you think this is still an important issue in the latest version of the Raft library or
its documentation please feel let us know and we'll keep it open for investigation.
If there is still no activity on this request in 30 days, we will go ahead and close it.
Thank you!

banks · 2024-01-15T20:24:21Z

Hi 👋

Just cleaning up some stale issues and PRs. We want to thank you again @linxGnu for the time you spent on this PR and responding to feedback.

At this point it's not very clear to us that these changes fix any real issues in this library - not ones that our usage has ever seen as a problem anyway. So on balance the cost and risk of checking, testing and merging the PR and releasing it into our products seems to outweigh any benefit we might get from it.

It always feels bad saying no to a contribution but we'd rather not leave this open forever if we're not likely to merge it!

Thanks again!

linxGnu added 2 commits January 7, 2022 13:55

Deprecate time.After to prevent leak

8b159c4

Adding Benchmark

ca11d3e

Update test

8050442

dnephin suggested changes Jan 13, 2022

View reviewed changes

linxGnu added 3 commits January 15, 2022 15:26

Revert various file changes

26fe50a

Update

5811c98

Update

070379f

kisunji self-assigned this Jan 25, 2022

kisunji previously requested changes Jan 26, 2022

View reviewed changes

shoenig mentioned this pull request Feb 2, 2022

Prevent leaks from time.After hashicorp/nomad#11982

Closed

Address comments

bfd6a4f

linxGnu commented Mar 14, 2022

View reviewed changes

stale bot added the waiting-reply label Sep 21, 2022

banks closed this Jan 15, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Tuning performance and prevent leak due to time.After #484

Tuning performance and prevent leak due to time.After #484

linxGnu commented Jan 7, 2022 •

edited

Loading

hashicorp-cla commented Jan 7, 2022 •

edited

Loading

dnephin left a comment •

edited

Loading

dnephin Jan 13, 2022

linxGnu Jan 13, 2022 •

edited

Loading

kisunji Jan 26, 2022

dnephin Jan 13, 2022

linxGnu Jan 13, 2022

dnephin Jan 13, 2022

dnephin Jan 13, 2022

linxGnu Jan 13, 2022

dnephin Jan 13, 2022

linxGnu Jan 13, 2022

dnephin Jan 13, 2022

linxGnu Jan 13, 2022

linxGnu commented Jan 15, 2022

kisunji Jan 26, 2022

linxGnu Jan 26, 2022 •

edited

Loading

kisunji Jan 26, 2022

linxGnu Jan 26, 2022

kisunji Jan 26, 2022 •

edited

Loading

kisunji Jan 26, 2022

linxGnu commented Jan 26, 2022 •

edited

Loading

linxGnu commented Feb 7, 2022 •

edited

Loading

kisunji commented Feb 22, 2022 •

edited

Loading

linxGnu commented Mar 14, 2022

linxGnu Mar 14, 2022

linxGnu Mar 14, 2022 •

edited

Loading

linxGnu Mar 14, 2022 •

edited

Loading

stale bot commented Sep 21, 2022

banks commented Jan 15, 2024

-	timer := newTimer(timeout)
-	defer timer.Stop()
+	var timer <-chan time.Time
+	if timeout > 0 {
+		t := time.Newtimer(timeout)
+		defer t.Stop()
+		timer = t.C
+	}

		if e := c.GetLog(uint64(i), &out); e != nil {
		t.Fatalf("err: %v", e)

Tuning performance and prevent leak due to time.After #484

Tuning performance and prevent leak due to time.After #484

Conversation

linxGnu commented Jan 7, 2022 • edited Loading

hashicorp-cla commented Jan 7, 2022 • edited Loading

dnephin left a comment • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

linxGnu Jan 13, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

linxGnu commented Jan 15, 2022

Choose a reason for hiding this comment

linxGnu Jan 26, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

kisunji Jan 26, 2022 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

linxGnu commented Jan 26, 2022 • edited Loading

linxGnu commented Feb 7, 2022 • edited Loading

kisunji commented Feb 22, 2022 • edited Loading

linxGnu commented Mar 14, 2022

Choose a reason for hiding this comment

linxGnu Mar 14, 2022 • edited Loading

Choose a reason for hiding this comment

linxGnu Mar 14, 2022 • edited Loading

Choose a reason for hiding this comment

stale bot commented Sep 21, 2022

banks commented Jan 15, 2024

linxGnu commented Jan 7, 2022 •

edited

Loading

hashicorp-cla commented Jan 7, 2022 •

edited

Loading

dnephin left a comment •

edited

Loading

linxGnu Jan 13, 2022 •

edited

Loading

linxGnu Jan 26, 2022 •

edited

Loading

kisunji Jan 26, 2022 •

edited

Loading

linxGnu commented Jan 26, 2022 •

edited

Loading

linxGnu commented Feb 7, 2022 •

edited

Loading

kisunji commented Feb 22, 2022 •

edited

Loading

linxGnu Mar 14, 2022 •

edited

Loading

linxGnu Mar 14, 2022 •

edited

Loading