Consider run time performance #40

thockin · 2021-03-26T21:04:27Z

If we want to push this to 1.0 (#38), we really need to make some intentional decisions about performance. The API as it stands was designed largely without considering performance, and (surprise!) it shows.

glogr_benchmark_test.go:

package bench

import (
	"flag"
	"os"
	"testing"

	"github.com/go-logr/glogr"
	"github.com/go-logr/logr"
	"github.com/golang/glog"
)

func init() {
	flag.Set("v", "1")
	flag.Set("logtostderr", "true")
	os.Stderr, _ = os.Open("/dev/null")
}

func BenchmarkGlogInfoOneArg(b *testing.B) {
	for i := 0; i < b.N; i++ {
		glog.Infof("this is a %s", "string")
	}
}

func BenchmarkGlogrInfoOneArg(b *testing.B) {
	var log logr.Logger = glogr.New()

	for i := 0; i < b.N; i++ {
		log.Info("this is", "a", "string")
	}
}

func BenchmarkGlogInfoSeveralArgs(b *testing.B) {
	for i := 0; i < b.N; i++ {
		glog.Infof("multi: bool %t, string %s, int %d, float %f, struct %v",
			true, "str", 42, 3.14, struct{ X, Y int }{93, 76})
	}
}

func BenchmarkGlogrInfoSeveralArgs(b *testing.B) {
	var log logr.Logger = glogr.New()

	for i := 0; i < b.N; i++ {
		log.Info("multi",
			"bool", true, "string", "str", "int", 42,
			"float", 3.14, "struct", struct{ X, Y int }{93, 76})
	}
}

func BenchmarkGlogInfoV0(b *testing.B) {
	for i := 0; i < b.N; i++ {
		glog.V(0).Infof("multi: bool %t, string %s, int %d, float %f, struct %v",
			true, "str", 42, 3.14, struct{ X, Y int }{93, 76})
	}
}

func BenchmarkGlogrInfoV0(b *testing.B) {
	var log logr.Logger = glogr.New()

	for i := 0; i < b.N; i++ {
		log.V(0).Info("multi",
			"bool", true, "string", "str", "int", 42,
			"float", 3.14, "struct", struct{ X, Y int }{93, 76})
	}
}

func BenchmarkGlogInfoV9(b *testing.B) {
	for i := 0; i < b.N; i++ {
		glog.V(9).Infof("multi: bool %t, string %s, int %d, float %f, struct %v",
			true, "str", 42, 3.14, struct{ X, Y int }{93, 76})
	}
}

func BenchmarkGlogrInfoV9(b *testing.B) {
	var log logr.Logger = glogr.New()

	for i := 0; i < b.N; i++ {
		log.V(9).Info("multi",
			"bool", true, "string", "str", "int", 42,
			"float", 3.14, "struct", struct{ X, Y int }{93, 76})
	}
}

Running this:

goos: linux
goarch: amd64
pkg: github.com/go-logr/glogr/bench
cpu: Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz
BenchmarkGlogInfoOneArg-6         	  713682	      1675 ns/op
BenchmarkGlogrInfoOneArg-6        	  251154	      4761 ns/op
BenchmarkGlogInfoSeveralArgs-6    	  527288	      2285 ns/op
BenchmarkGlogrInfoSeveralArgs-6   	  167934	      7114 ns/op
BenchmarkGlogInfoV0-6             	  529798	      2286 ns/op
BenchmarkGlogrInfoV0-6            	  164413	      7293 ns/op
BenchmarkGlogInfoV9-6             	46018132	        25.70 ns/op
BenchmarkGlogrInfoV9-6            	 8412883	       142.2 ns/op

So it's notably slower. All of the variadic args escape to the heap, including the string keys (which regular glog does not suffer). But doesn't account for enough.

V() calls that are not taken also expand all their variadic args used.

Some of this is attributable to glogr being very dumb and wasteful (clone() on every call) but it's not clear how much. Before we call it 1.0 we need to do some homework.

The text was updated successfully, but these errors were encountered:

wojas · 2021-04-01T10:36:38Z

As another datapoint, there are some some benchmarks in https://github.com/wojas/genericr and the results there don't look too bad:

goos: darwin
goarch: amd64
pkg: github.com/wojas/genericr
cpu: Intel(R) Xeon(R) W-2150B CPU @ 3.00GHz
BenchmarkLogger_basic-20                        	37195687	        31.29 ns/op	       0 B/op	       0 allocs/op
BenchmarkLogger_basic_with_caller-20            	 1640784	       716.2 ns/op	     216 B/op	       2 allocs/op
BenchmarkLogger_2vars-20                        	19593140	        61.64 ns/op	      64 B/op	       1 allocs/op
BenchmarkLogger_clone-20                        	25827868	        43.97 ns/op	       0 B/op	       0 allocs/op
BenchmarkLogger_complicated-20                  	 3593434	       331.9 ns/op	     432 B/op	       6 allocs/op
BenchmarkLogger_complicated_precalculated-20    	 8682727	       139.2 ns/op	     160 B/op	       2 allocs/op
BenchmarkLogger_2vars_tostring-20               	  975183	      1276 ns/op	     600 B/op	      13 allocs/op
PASS
ok  	github.com/wojas/genericr	10.014s

Note that this benchmarks the logr interface and genericr without actual logging to a backend, so just the overhead introduced by the abstraction.

These do not include more complicated objects that could escape to the heap. I will add those when I have time.

thockin · 2021-04-01T16:48:37Z

I am sure that some of the overhead can be mitigated by good intra-impl optimization and some will be endemic to the API style. Concretely, high V() level calls which are almost always skipped still build the alice of variadic args. Russ Cox suggested that maybe making the return of V() be a concrete type could elide that. E.g. totally not compiled or tried, something like: `V()` returns a `logr.VLogger`, which is a struct carrying `enabled bool` and a `logr.Logger`, and implements logr.Logger itself. All of the standard methods first check `enabled`. That doesn't seem like a win (because the variadic args are still passed) but he suggested the inliner can do better and probably not do the variadics until after the `enabled` check. I have not yet made time to prove this out. Of course that WOULD be a breaking API change for implementations (but not for callers). Russ had some other ideas which I will condense here when I get some time.

…

On Thu, Apr 1, 2021 at 3:36 AM wojas ***@***.***> wrote: As another datapoint, there are some some benchmarks in https://github.com/wojas/genericr and the results there don't look too bad: goos: darwin goarch: amd64 pkg: github.com/wojas/genericr cpu: Intel(R) Xeon(R) W-2150B CPU @ 3.00GHz BenchmarkLogger_basic-20 37195687 31.29 ns/op 0 B/op 0 allocs/op BenchmarkLogger_basic_with_caller-20 1640784 716.2 ns/op 216 B/op 2 allocs/op BenchmarkLogger_2vars-20 19593140 61.64 ns/op 64 B/op 1 allocs/op BenchmarkLogger_clone-20 25827868 43.97 ns/op 0 B/op 0 allocs/op BenchmarkLogger_complicated-20 3593434 331.9 ns/op 432 B/op 6 allocs/op BenchmarkLogger_complicated_precalculated-20 8682727 139.2 ns/op 160 B/op 2 allocs/op BenchmarkLogger_2vars_tostring-20 975183 1276 ns/op 600 B/op 13 allocs/op PASS ok github.com/wojas/genericr 10.014s Note that this benchmarks the logr interface and genericr without actual logging to a backend, so just the overhead introduced by the abstraction. These do not include more complicated objects that could escape to the heap. I will add those when I have time. — You are receiving this because you were assigned. Reply to this email directly, view it on GitHub <#40 (comment)>, or unsubscribe <https://github.com/notifications/unsubscribe-auth/ABKWAVF2Q62IYXLUAIEVFQ3TGREEJANCNFSM4Z4C345A> .

thockin · 2021-04-02T06:18:23Z

I spent some time playing with it tonight. Part of the problem is glogr making copies. I implemented as Russ suggested and the performance does seem better - all of the overhead is in the variadic args.

The problem is that it is a significant change in the API - logr.Logger becomes a concrete type and implementations must provide a slightly different API. It would be breaking change here, but it would enable even more optimization later.

I made a big mess, so I will have to clean it up before I can share it properly :)

wojas · 2021-04-22T08:08:17Z

So just to wrap my head around this, the main goal here is to make log calls as cheap as possible if the actual logging at that level is disabled. I think that's a goal worthy of some breakage before a 1.0.

The concrete struct result for V() does sound like something that would avoid the variadic call thanks to mid stack inlining added in Go 1.12, but it would be good to verify that this actually works as desired.

PR #42 is a bigger breaking change, completely replacing the interface by a concrete struct, but it may have other performance benefits. From a first look it is similar to what genericr is doing under the hood.

thockin · 2021-04-22T16:27:00Z

the main goal here is to make log calls as cheap as possible if the actual logging at that level is disabled.

Yes. Other wins are possible, but this is a big one.

I think that's a goal worthy of some breakage before a 1.0.

I agree :)

it would be good to verify that this actually works as desired.

Without #42:

BenchmarkInfoOneArg-6        	 3013077	       400.9 ns/op
BenchmarkInfoSeveralArgs-6   	 1162088	      1039 ns/op
BenchmarkV0Info-6            	  754347	      1418 ns/op
BenchmarkV9Info-6            	 4605950	       267.8 ns/op
BenchmarkError-6             	  935028	      1135 ns/op
BenchmarkWithValues-6        	 5127631	       232.5 ns/op
BenchmarkWithName-6          	12850569	       106.5 ns/op

With #42:

BenchmarkInfoOneArg-6        	 2465788	       475.7 ns/op
BenchmarkInfoSeveralArgs-6   	  893026	      1226 ns/op
BenchmarkV0Info-6            	  817473	      1250 ns/op
BenchmarkV9Info-6            	 8595180	       155.1 ns/op
BenchmarkError-6             	 1000000	      1371 ns/op
BenchmarkWithValues-6        	 3902214	       292.0 ns/op
BenchmarkWithName-6          	11271037	       106.6 ns/op

So, weirdly, everything got slower except calls through V(). I guess that has to do with the extra variadic-slice pack/unpack.

Changing Info/Error to pass the slice (instead of the slice...) to the LogSink doesn't make perf better.

I'll have to make time to disassemble it all and see what I can find.

thockin · 2021-04-22T16:39:25Z

Switching benchmark to Discard():

Before:

BenchmarkInfoOneArg-6        	309439507	         3.628 ns/op
BenchmarkInfoSeveralArgs-6   	103684642	        11.17 ns/op
BenchmarkV0Info-6            	10411196	       115.8 ns/op
BenchmarkV9Info-6            	 9963486	       118.2 ns/op
BenchmarkError-6             	82650194	        12.14 ns/op
BenchmarkWithValues-6        	220570576	         5.585 ns/op
BenchmarkWithName-6          	723292029	         1.441 ns/op

After:

BenchmarkInfoOneArg-6        	28276839	        46.68 ns/op
BenchmarkInfoSeveralArgs-6   	 9974653	       114.4 ns/op
BenchmarkV0Info-6            	11343361	       118.7 ns/op
BenchmarkV9Info-6            	10160247	       117.6 ns/op
BenchmarkError-6             	10033260	       113.5 ns/op
BenchmarkWithValues-6        	21537408	        59.31 ns/op
BenchmarkWithName-6          	502713001	         2.189 ns/op

thockin · 2021-04-22T17:33:22Z

Hmm, I think a big part of this is the benchmark benefitting from optimizations that probably are not realistic in the wild. I'll look into it more later

thockin · 2021-04-22T18:19:27Z

Forcing noinline in the benchmark makes it more representative.

Before:

$ GO111MODULE=off go test -bench=. ./benchmark/
goos: linux
goarch: amd64
pkg: github.com/go-logr/logr/benchmark
cpu: Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz
BenchmarkDiscardInfoOneArg-6        	25409384	        54.15 ns/op
BenchmarkDiscardInfoSeveralArgs-6   	 8511532	       146.8 ns/op
BenchmarkDiscardV0Info-6            	 8260953	       152.7 ns/op
BenchmarkDiscardV9Info-6            	 8753974	       147.9 ns/op
BenchmarkDiscardError-6             	 8291468	       146.7 ns/op
BenchmarkDiscardWithValues-6        	18875984	        67.67 ns/op
BenchmarkDiscardWithName-6          	528824234	         2.077 ns/op
BenchmarkFuncrInfoOneArg-6          	 2606715	       476.2 ns/op
BenchmarkFuncrInfoSeveralArgs-6     	  823752	      1302 ns/op
BenchmarkFuncrV0Info-6              	  915470	      1422 ns/op
BenchmarkFuncrV9Info-6              	 4300730	       277.6 ns/op
BenchmarkFuncrError-6               	  837588	      1369 ns/op
BenchmarkFuncrWithValues-6          	 4193628	       294.4 ns/op
BenchmarkFuncrWithName-6            	10532635	       106.4 ns/op

After:

$ GO111MODULE=off go test -bench=. ./benchmark/
goos: linux
goarch: amd64
pkg: github.com/go-logr/logr/benchmark
cpu: Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz
BenchmarkDiscardInfoOneArg-6        	23285353	        52.58 ns/op
BenchmarkDiscardInfoSeveralArgs-6   	 8481985	       146.9 ns/op
BenchmarkDiscardV0Info-6            	 7851495	       144.8 ns/op
BenchmarkDiscardV9Info-6            	 8797778	       144.8 ns/op
BenchmarkDiscardError-6             	 7889066	       148.1 ns/op
BenchmarkDiscardWithValues-6        	19125424	        68.10 ns/op
BenchmarkDiscardWithName-6          	500619270	         2.165 ns/op
BenchmarkFuncrInfoOneArg-6          	 2841159	       490.9 ns/op
BenchmarkFuncrInfoSeveralArgs-6     	  887854	      1266 ns/op
BenchmarkFuncrV0Info-6              	  885891	      1221 ns/op
BenchmarkFuncrV9Info-6              	 7902538	       145.6 ns/op
BenchmarkFuncrError-6               	  852745	      1401 ns/op
BenchmarkFuncrWithValues-6          	 4401705	       295.1 ns/op
BenchmarkFuncrWithName-6            	11857100	       105.0 ns/op

Notably BenchmarkFuncrV9Info was cut almost in half, and there's room for
more optimization.

I will make a new push with this change.

thockin self-assigned this Mar 26, 2021

This was referenced Apr 2, 2021

Add a simple benchmark #41

Merged

Rearch for performance: BREAKING CHANGES #42

Merged

thockin closed this as completed May 28, 2021

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Consider run time performance #40

Consider run time performance #40

thockin commented Mar 26, 2021

wojas commented Apr 1, 2021

thockin commented Apr 1, 2021 via email

thockin commented Apr 2, 2021

wojas commented Apr 22, 2021

thockin commented Apr 22, 2021

thockin commented Apr 22, 2021

thockin commented Apr 22, 2021

thockin commented Apr 22, 2021

Consider run time performance #40

Consider run time performance #40

Comments

thockin commented Mar 26, 2021

wojas commented Apr 1, 2021

thockin commented Apr 1, 2021 via email

thockin commented Apr 2, 2021

wojas commented Apr 22, 2021

thockin commented Apr 22, 2021

thockin commented Apr 22, 2021

thockin commented Apr 22, 2021

thockin commented Apr 22, 2021