Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Consider run time performance #40

Closed
thockin opened this issue Mar 26, 2021 · 8 comments
Closed

Consider run time performance #40

thockin opened this issue Mar 26, 2021 · 8 comments
Assignees

Comments

@thockin
Copy link
Contributor

thockin commented Mar 26, 2021

If we want to push this to 1.0 (#38), we really need to make some intentional decisions about performance. The API as it stands was designed largely without considering performance, and (surprise!) it shows.

glogr_benchmark_test.go:

package bench

import (
	"flag"
	"os"
	"testing"

	"github.com/go-logr/glogr"
	"github.com/go-logr/logr"
	"github.com/golang/glog"
)

func init() {
	flag.Set("v", "1")
	flag.Set("logtostderr", "true")
	os.Stderr, _ = os.Open("/dev/null")
}

func BenchmarkGlogInfoOneArg(b *testing.B) {
	for i := 0; i < b.N; i++ {
		glog.Infof("this is a %s", "string")
	}
}

func BenchmarkGlogrInfoOneArg(b *testing.B) {
	var log logr.Logger = glogr.New()

	for i := 0; i < b.N; i++ {
		log.Info("this is", "a", "string")
	}
}

func BenchmarkGlogInfoSeveralArgs(b *testing.B) {
	for i := 0; i < b.N; i++ {
		glog.Infof("multi: bool %t, string %s, int %d, float %f, struct %v",
			true, "str", 42, 3.14, struct{ X, Y int }{93, 76})
	}
}

func BenchmarkGlogrInfoSeveralArgs(b *testing.B) {
	var log logr.Logger = glogr.New()

	for i := 0; i < b.N; i++ {
		log.Info("multi",
			"bool", true, "string", "str", "int", 42,
			"float", 3.14, "struct", struct{ X, Y int }{93, 76})
	}
}

func BenchmarkGlogInfoV0(b *testing.B) {
	for i := 0; i < b.N; i++ {
		glog.V(0).Infof("multi: bool %t, string %s, int %d, float %f, struct %v",
			true, "str", 42, 3.14, struct{ X, Y int }{93, 76})
	}
}

func BenchmarkGlogrInfoV0(b *testing.B) {
	var log logr.Logger = glogr.New()

	for i := 0; i < b.N; i++ {
		log.V(0).Info("multi",
			"bool", true, "string", "str", "int", 42,
			"float", 3.14, "struct", struct{ X, Y int }{93, 76})
	}
}

func BenchmarkGlogInfoV9(b *testing.B) {
	for i := 0; i < b.N; i++ {
		glog.V(9).Infof("multi: bool %t, string %s, int %d, float %f, struct %v",
			true, "str", 42, 3.14, struct{ X, Y int }{93, 76})
	}
}

func BenchmarkGlogrInfoV9(b *testing.B) {
	var log logr.Logger = glogr.New()

	for i := 0; i < b.N; i++ {
		log.V(9).Info("multi",
			"bool", true, "string", "str", "int", 42,
			"float", 3.14, "struct", struct{ X, Y int }{93, 76})
	}
}

Running this:

goos: linux
goarch: amd64
pkg: github.com/go-logr/glogr/bench
cpu: Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz
BenchmarkGlogInfoOneArg-6         	  713682	      1675 ns/op
BenchmarkGlogrInfoOneArg-6        	  251154	      4761 ns/op
BenchmarkGlogInfoSeveralArgs-6    	  527288	      2285 ns/op
BenchmarkGlogrInfoSeveralArgs-6   	  167934	      7114 ns/op
BenchmarkGlogInfoV0-6             	  529798	      2286 ns/op
BenchmarkGlogrInfoV0-6            	  164413	      7293 ns/op
BenchmarkGlogInfoV9-6             	46018132	        25.70 ns/op
BenchmarkGlogrInfoV9-6            	 8412883	       142.2 ns/op

So it's notably slower. All of the variadic args escape to the heap, including the string keys (which regular glog does not suffer). But doesn't account for enough.

V() calls that are not taken also expand all their variadic args used.

Some of this is attributable to glogr being very dumb and wasteful (clone() on every call) but it's not clear how much. Before we call it 1.0 we need to do some homework.

@thockin thockin self-assigned this Mar 26, 2021
@wojas
Copy link
Contributor

wojas commented Apr 1, 2021

As another datapoint, there are some some benchmarks in https://github.com/wojas/genericr and the results there don't look too bad:

goos: darwin
goarch: amd64
pkg: github.com/wojas/genericr
cpu: Intel(R) Xeon(R) W-2150B CPU @ 3.00GHz
BenchmarkLogger_basic-20                        	37195687	        31.29 ns/op	       0 B/op	       0 allocs/op
BenchmarkLogger_basic_with_caller-20            	 1640784	       716.2 ns/op	     216 B/op	       2 allocs/op
BenchmarkLogger_2vars-20                        	19593140	        61.64 ns/op	      64 B/op	       1 allocs/op
BenchmarkLogger_clone-20                        	25827868	        43.97 ns/op	       0 B/op	       0 allocs/op
BenchmarkLogger_complicated-20                  	 3593434	       331.9 ns/op	     432 B/op	       6 allocs/op
BenchmarkLogger_complicated_precalculated-20    	 8682727	       139.2 ns/op	     160 B/op	       2 allocs/op
BenchmarkLogger_2vars_tostring-20               	  975183	      1276 ns/op	     600 B/op	      13 allocs/op
PASS
ok  	github.com/wojas/genericr	10.014s

Note that this benchmarks the logr interface and genericr without actual logging to a backend, so just the overhead introduced by the abstraction.

These do not include more complicated objects that could escape to the heap. I will add those when I have time.

@thockin
Copy link
Contributor Author

thockin commented Apr 1, 2021 via email

@thockin
Copy link
Contributor Author

thockin commented Apr 2, 2021

I spent some time playing with it tonight. Part of the problem is glogr making copies. I implemented as Russ suggested and the performance does seem better - all of the overhead is in the variadic args.

The problem is that it is a significant change in the API - logr.Logger becomes a concrete type and implementations must provide a slightly different API. It would be breaking change here, but it would enable even more optimization later.

I made a big mess, so I will have to clean it up before I can share it properly :)

@wojas
Copy link
Contributor

wojas commented Apr 22, 2021

So just to wrap my head around this, the main goal here is to make log calls as cheap as possible if the actual logging at that level is disabled. I think that's a goal worthy of some breakage before a 1.0.

The concrete struct result for V() does sound like something that would avoid the variadic call thanks to mid stack inlining added in Go 1.12, but it would be good to verify that this actually works as desired.

PR #42 is a bigger breaking change, completely replacing the interface by a concrete struct, but it may have other performance benefits. From a first look it is similar to what genericr is doing under the hood.

@thockin
Copy link
Contributor Author

thockin commented Apr 22, 2021

the main goal here is to make log calls as cheap as possible if the actual logging at that level is disabled.

Yes. Other wins are possible, but this is a big one.

I think that's a goal worthy of some breakage before a 1.0.

I agree :)

it would be good to verify that this actually works as desired.

Without #42:

BenchmarkInfoOneArg-6        	 3013077	       400.9 ns/op
BenchmarkInfoSeveralArgs-6   	 1162088	      1039 ns/op
BenchmarkV0Info-6            	  754347	      1418 ns/op
BenchmarkV9Info-6            	 4605950	       267.8 ns/op
BenchmarkError-6             	  935028	      1135 ns/op
BenchmarkWithValues-6        	 5127631	       232.5 ns/op
BenchmarkWithName-6          	12850569	       106.5 ns/op

With #42:

BenchmarkInfoOneArg-6        	 2465788	       475.7 ns/op
BenchmarkInfoSeveralArgs-6   	  893026	      1226 ns/op
BenchmarkV0Info-6            	  817473	      1250 ns/op
BenchmarkV9Info-6            	 8595180	       155.1 ns/op
BenchmarkError-6             	 1000000	      1371 ns/op
BenchmarkWithValues-6        	 3902214	       292.0 ns/op
BenchmarkWithName-6          	11271037	       106.6 ns/op

So, weirdly, everything got slower except calls through V(). I guess that has to do with the extra variadic-slice pack/unpack.

Changing Info/Error to pass the slice (instead of the slice...) to the LogSink doesn't make perf better.

I'll have to make time to disassemble it all and see what I can find.

@thockin
Copy link
Contributor Author

thockin commented Apr 22, 2021

Switching benchmark to Discard():

Before:

BenchmarkInfoOneArg-6        	309439507	         3.628 ns/op
BenchmarkInfoSeveralArgs-6   	103684642	        11.17 ns/op
BenchmarkV0Info-6            	10411196	       115.8 ns/op
BenchmarkV9Info-6            	 9963486	       118.2 ns/op
BenchmarkError-6             	82650194	        12.14 ns/op
BenchmarkWithValues-6        	220570576	         5.585 ns/op
BenchmarkWithName-6          	723292029	         1.441 ns/op

After:

BenchmarkInfoOneArg-6        	28276839	        46.68 ns/op
BenchmarkInfoSeveralArgs-6   	 9974653	       114.4 ns/op
BenchmarkV0Info-6            	11343361	       118.7 ns/op
BenchmarkV9Info-6            	10160247	       117.6 ns/op
BenchmarkError-6             	10033260	       113.5 ns/op
BenchmarkWithValues-6        	21537408	        59.31 ns/op
BenchmarkWithName-6          	502713001	         2.189 ns/op

@thockin
Copy link
Contributor Author

thockin commented Apr 22, 2021

Hmm, I think a big part of this is the benchmark benefitting from optimizations that probably are not realistic in the wild. I'll look into it more later

@thockin
Copy link
Contributor Author

thockin commented Apr 22, 2021

Forcing noinline in the benchmark makes it more representative.

Before:

$ GO111MODULE=off go test -bench=. ./benchmark/
goos: linux
goarch: amd64
pkg: github.com/go-logr/logr/benchmark
cpu: Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz
BenchmarkDiscardInfoOneArg-6        	25409384	        54.15 ns/op
BenchmarkDiscardInfoSeveralArgs-6   	 8511532	       146.8 ns/op
BenchmarkDiscardV0Info-6            	 8260953	       152.7 ns/op
BenchmarkDiscardV9Info-6            	 8753974	       147.9 ns/op
BenchmarkDiscardError-6             	 8291468	       146.7 ns/op
BenchmarkDiscardWithValues-6        	18875984	        67.67 ns/op
BenchmarkDiscardWithName-6          	528824234	         2.077 ns/op
BenchmarkFuncrInfoOneArg-6          	 2606715	       476.2 ns/op
BenchmarkFuncrInfoSeveralArgs-6     	  823752	      1302 ns/op
BenchmarkFuncrV0Info-6              	  915470	      1422 ns/op
BenchmarkFuncrV9Info-6              	 4300730	       277.6 ns/op
BenchmarkFuncrError-6               	  837588	      1369 ns/op
BenchmarkFuncrWithValues-6          	 4193628	       294.4 ns/op
BenchmarkFuncrWithName-6            	10532635	       106.4 ns/op

After:

$ GO111MODULE=off go test -bench=. ./benchmark/
goos: linux
goarch: amd64
pkg: github.com/go-logr/logr/benchmark
cpu: Intel(R) Xeon(R) W-2135 CPU @ 3.70GHz
BenchmarkDiscardInfoOneArg-6        	23285353	        52.58 ns/op
BenchmarkDiscardInfoSeveralArgs-6   	 8481985	       146.9 ns/op
BenchmarkDiscardV0Info-6            	 7851495	       144.8 ns/op
BenchmarkDiscardV9Info-6            	 8797778	       144.8 ns/op
BenchmarkDiscardError-6             	 7889066	       148.1 ns/op
BenchmarkDiscardWithValues-6        	19125424	        68.10 ns/op
BenchmarkDiscardWithName-6          	500619270	         2.165 ns/op
BenchmarkFuncrInfoOneArg-6          	 2841159	       490.9 ns/op
BenchmarkFuncrInfoSeveralArgs-6     	  887854	      1266 ns/op
BenchmarkFuncrV0Info-6              	  885891	      1221 ns/op
BenchmarkFuncrV9Info-6              	 7902538	       145.6 ns/op
BenchmarkFuncrError-6               	  852745	      1401 ns/op
BenchmarkFuncrWithValues-6          	 4401705	       295.1 ns/op
BenchmarkFuncrWithName-6            	11857100	       105.0 ns/op

Notably BenchmarkFuncrV9Info was cut almost in half, and there's room for
more optimization.

I will make a new push with this change.

@thockin thockin closed this as completed May 28, 2021
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants