Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

look into garble functions showing up in "perf report" #475

Open
mvdan opened this issue Jan 22, 2022 · 12 comments
Open

look into garble functions showing up in "perf report" #475

mvdan opened this issue Jan 22, 2022 · 12 comments
Labels
enhancement New feature or request

Comments

@mvdan
Copy link
Member

mvdan commented Jan 22, 2022

I just captured a Linux perf CPU profile from the build in #472 with an empty cache:

$ rm -rf /tmp/gocache; GOCACHE=/tmp/gocache GOMAXPROCS=1 GOGARBLE=* perf record -g garble build testdata/bench/main.go
$ perf report

If we just look at the main.main funcs, we see:

+   69.97%     0.00%  compile  compile  [.] main.main                                                                                                                                              
+    8.17%     0.00%  garble   garble   [.] main.main                                                                                                                                              
+    0.59%     0.00%  asm      asm      [.] main.main                                                                                                                                              
     0.25%     0.00%  link     link     [.] main.main                                                                                                                                              
     0.23%     0.00%  go       go       [.] main.main                                                                                                                                              
     0.02%     0.00%  cgo      cgo      [.] main.main                                                                                                                                             

The first column is cumulative CPU cost as a percentage. It seems like compile takes ~70% of the CPU, garble takes ~8%, ~1% goes to the other toolchain binaries, and presumably the other ~20% is other overhead like blocking I/O or syscalls.

This seems like a good start; garble itself is not being horribly wasteful with CPU.

Then, zooming into garble with d to skip the CPU cost from compile and link, we see:

+    8.75%     0.00%  garble   [.] runtime.goexit.abi0                                                                                                                                             
+    8.35%     0.00%  garble   [.] runtime.main                                                                                                                                                    
+    8.17%     0.00%  garble   [.] main.main                                                                                                                                                       
+    8.17%     0.00%  garble   [.] main.main1                                                                                                                                                      
+    8.17%     0.00%  garble   [.] main.mainErr                                                                                                                                                    
+    7.92%     0.00%  garble   [.] main.transformCompile                                                                                                                                           
+    2.58%     0.02%  garble   [.] main.(*transformer).typecheck                                                                                                                                   
+    2.56%     0.01%  garble   [.] main.printFile                                                                                                                                                  
+    2.46%     2.46%  garble   [.] go/printer.walkBinary                                                                                                                                           
+    2.41%     0.00%  garble   [.] go/types.(*Config).Check                                                                                                                                        
+    2.41%     0.00%  garble   [.] go/types.(*Checker).checkFiles                                                                                                                                  
+    2.28%     0.72%  garble   [.] runtime.mallocgc                                                                                                                                                
+    1.83%     0.01%  garble   [.] runtime.systemstack.abi0                                                                                                                                        
+    1.68%     0.01%  garble   [.] go/printer.(*Config).fprint                                                                                                                                     
+    1.53%     0.01%  garble   [.] go/printer.(*printer).printNode                                                                                                                                 
+    1.53%     0.00%  garble   [.] go/printer.(*printer).file                                                                                                                                      
+    1.52%     0.00%  garble   [.] go/printer.(*printer).declList                                                                                                                                  
+    1.51%     0.00%  garble   [.] go/printer.(*printer).decl                                                                                                                                      
+    1.46%     0.00%  garble   [.] go/types.(*Checker).rawExpr                                                                                                                                     
+    1.42%     0.06%  garble   [.] golang.org/x/tools/go/ast/astutil.(*application).apply                                                                                                          
+    1.42%     0.00%  garble   [.] main.(*transformer).transformGo                                                                                                                                 
+    1.42%     0.00%  garble   [.] golang.org/x/tools/go/ast/astutil.Apply                                                                                                                         
+    1.42%     0.02%  garble   [.] golang.org/x/tools/go/ast/astutil.(*application).applyList                                                                                                      
+    1.37%     0.00%  garble   [.] go/parser.ParseFile                                                                                                                                             
+    1.37%     0.03%  garble   [.] go/types.(*Checker).exprInternal                                                                                                                                
+    1.35%     0.00%  garble   [.] go/parser.(*parser).parseFile                                                                                                                                   
+    1.30%     0.72%  garble   [.] runtime.scanobject                                                                                                                                              
+    1.29%     0.01%  garble   [.] go/types.(*Checker).processDelayed                                                                                                                              
+    1.28%     0.00%  garble   [.] go/types.(*Checker).funcDecl.func1                                                                                                                              
+    1.28%     0.00%  garble   [.] go/types.(*Checker).funcBody                                                                                                                                    
+    1.26%     0.01%  garble   [.] go/types.(*Checker).stmtList    
[...]
+    0.91%     0.91%  garble   [.] crypto/sha256.block.abi0       
+    0.76%     0.76%  garble   [.] go/ast.(*BinaryExpr).Pos                                                                                                                                                                                                                                                                         

Some interesting bits:

  • 2.5% goes to typechecking alone. stop loading type information of obfuscated dependencies #456 should help.
  • We generate a significant amount of garbage, with mallocgc at 2.3% and scanobject at 1.3%. We should try to do better, though I imagine most of the allocs happen in go/parser and go/types.
  • printFile alone is responsible for 2.6%, most of which comes from go/printer.WalkBinary. That looks suspicious.
  • 1% is to be blamed on sha256. Should we reuse some of the hashes, or perhaps look at more efficient cryptographic hashes?
  • 0.8% also goes to BinaryExpr, this time for looking up its position. Do we have some massive binary expressions?

This issue is to post this information and act as a reminder to look into these potential optimizations.

@capnspacehook
Copy link
Contributor

A simple suggestion would be to cache hashes of names so we don't waste cpu cycles rehashing the same inputs, but that may or may not help. The extra allocs might hurt perf, will have to test

@mvdan
Copy link
Member Author

mvdan commented Jan 22, 2022

There are multiple things we could try there:

  • use a different hash func, perhaps one that doesn't output as many bits
  • memoize hashes, either all of them or just the ones likely to get reused (e.g. short names)
  • avoid hashing names we don't need to (e.g. when building binaries, obfuscating local variables is unnecessary, if you keep in mind the potential name clashes with code inserted by literal obfuscation)

@mvdan
Copy link
Member Author

mvdan commented Jan 22, 2022

Note to self: it would also be useful to obtain a flamegraph representation of where garble is spending its time, to then look at what parts of a build could be parallelized. I imagine a good candidate would be obfuscating the Go files in transformCompile.

@pagran
Copy link
Member

pagran commented Jan 22, 2022

Of the cryptographic hashes sha1 usually shows good speeds (although it is obsolete)
Bench: https://github.com/SimonWaldherr/golang-benchmarks#hash

But we can use non-cryptographic hashes, such as fnv (example implementation: https://github.com/segmentio/fasthash) 64 bits should be enough, right?

@mvdan
Copy link
Member Author

mvdan commented Jan 22, 2022

I'd prefer to stick to decent cryptographic hashes; if the hashes are non-cryptographic, then presumably they are easier to bruteforce or break.

One candidate could be blake2, which overall seems slightly better and faster than sha2. I reckon that switch might be worthwhile even if we implement some of the other improvements; they could all help.

@capnspacehook
Copy link
Contributor

I would suggest blake3, as it's much faster than blake2 from what I've seen, but there's not an official/reviewed implementation in Go AFAIK

@mvdan
Copy link
Member Author

mvdan commented Jan 24, 2022

I've found why binary expressions were taking up so much CPU time. I initially thought that the go/printer or go/ast packages were too inefficient, but it turns out they are okay. They are simply dealing with, let's say, a horribly long binary expression:

https://github.com/golang/go/blob/0ef6dd74409506eb084bd8d2fe61e0e70ed9e5a4/src/crypto/elliptic/p256_asm_table.go

With some napkin math, we can see ~1400 lines with 8 + binary expressions per line. That's a binary expression with beyond 10k elements. Which is made worse by how binary expressions are essentially linked lists; walking the entire expression jumps through 10k+ pointers.

I'll be sending a patch upstream to remedy this somewhat. This is a regression from 1.17 via https://go-review.googlesource.com/c/go/+/339591, and I think this was an unintended effect.

@mvdan
Copy link
Member Author

mvdan commented Jan 24, 2022

I would suggest blake3, as it's much faster than blake2 from what I've seen, but there's not an official/reviewed implementation in Go AFAIK

I slightly lean towards blake2, given that it's had more time to prove itself, and there's an official Go implementation that has likely been well reviewed and optimized. There are a few blake3 implementations out there, but it's hard to judge them.

@mvdan
Copy link
Member Author

mvdan commented Jan 24, 2022

It's also a good idea to subscribe to golang/go#36632. My guess is that within a couple of years, one of those implementations will be promoted to x/crypto.

mvdan added a commit to mvdan/garble-fork that referenced this issue Jan 27, 2022
If package P1 imports package P2, P1 needs to know which names from P2
weren't obfuscated. For instance, if P2 declares T2 and does
"reflect.TypeOf(T2{...})", then P2 won't obfuscate the name T2, and
neither should P1.

This information should flow from P2 to P1, as P2 builds before
P1. We do this via obfuscatedTypesPackage; P1 loads the type information
of the obfuscated version of P2, and does a lookup for T2. If T2 exists,
then it wasn't obfuscated.

This mechanism has served us well, but it has downsides:

1) It wastes CPU; we load the type information for the entire package.

2) It's complex; for instance, we need KnownObjectFiles as an extra.

3) It makes our code harder to understand, as we load both the original
   and obfuscated type informaiton.

Instead, we now have each package record what names were not obfuscated
as part of its cachedOuput file. Much like KnownObjectFiles, the map
records incrementally through the import graph, to avoid having to load
cachedOutput files for indirect dependencies.

We shouldn't need to worry about those maps getting large;
we only skip obfuscating declared names in a few uncommon scenarios,
such as the use of reflection or cgo's "//export".

Since go/types is relatively allocation-heavy, and the export files
contain a lot of data, we get a nice speed-up:

	name      old time/op         new time/op         delta
	Build-16          11.5s ± 2%          11.1s ± 3%  -3.77%  (p=0.008 n=5+5)

	name      old bin-B           new bin-B           delta
	Build-16          5.15M ± 0%          5.15M ± 0%    ~     (all equal)

	name      old cached-time/op  new cached-time/op  delta
	Build-16          375ms ± 3%          341ms ± 6%  -8.96%  (p=0.008 n=5+5)

	name      old sys-time/op     new sys-time/op     delta
	Build-16          283ms ±17%          289ms ±13%    ~     (p=0.841 n=5+5)

	name      old user-time/op    new user-time/op    delta
	Build-16          687ms ± 6%          664ms ± 7%    ~     (p=0.548 n=5+5)

Fixes burrowers#456.
Updates burrowers#475.
mvdan added a commit to mvdan/garble-fork that referenced this issue Jan 27, 2022
If package P1 imports package P2, P1 needs to know which names from P2
weren't obfuscated. For instance, if P2 declares T2 and does
"reflect.TypeOf(T2{...})", then P2 won't obfuscate the name T2, and
neither should P1.

This information should flow from P2 to P1, as P2 builds before
P1. We do this via obfuscatedTypesPackage; P1 loads the type information
of the obfuscated version of P2, and does a lookup for T2. If T2 exists,
then it wasn't obfuscated.

This mechanism has served us well, but it has downsides:

1) It wastes CPU; we load the type information for the entire package.

2) It's complex; for instance, we need KnownObjectFiles as an extra.

3) It makes our code harder to understand, as we load both the original
   and obfuscated type informaiton.

Instead, we now have each package record what names were not obfuscated
as part of its cachedOuput file. Much like KnownObjectFiles, the map
records incrementally through the import graph, to avoid having to load
cachedOutput files for indirect dependencies.

We shouldn't need to worry about those maps getting large;
we only skip obfuscating declared names in a few uncommon scenarios,
such as the use of reflection or cgo's "//export".

Since go/types is relatively allocation-heavy, and the export files
contain a lot of data, we get a nice speed-up:

	name      old time/op         new time/op         delta
	Build-16          11.5s ± 2%          11.1s ± 3%  -3.77%  (p=0.008 n=5+5)

	name      old bin-B           new bin-B           delta
	Build-16          5.15M ± 0%          5.15M ± 0%    ~     (all equal)

	name      old cached-time/op  new cached-time/op  delta
	Build-16          375ms ± 3%          341ms ± 6%  -8.96%  (p=0.008 n=5+5)

	name      old sys-time/op     new sys-time/op     delta
	Build-16          283ms ±17%          289ms ±13%    ~     (p=0.841 n=5+5)

	name      old user-time/op    new user-time/op    delta
	Build-16          687ms ± 6%          664ms ± 7%    ~     (p=0.548 n=5+5)

Fixes burrowers#456.
Updates burrowers#475.
mvdan added a commit to mvdan/garble-fork that referenced this issue Jan 27, 2022
If package P1 imports package P2, P1 needs to know which names from P2
weren't obfuscated. For instance, if P2 declares T2 and does
"reflect.TypeOf(T2{...})", then P2 won't obfuscate the name T2, and
neither should P1.

This information should flow from P2 to P1, as P2 builds before
P1. We do this via obfuscatedTypesPackage; P1 loads the type information
of the obfuscated version of P2, and does a lookup for T2. If T2 exists,
then it wasn't obfuscated.

This mechanism has served us well, but it has downsides:

1) It wastes CPU; we load the type information for the entire package.

2) It's complex; for instance, we need KnownObjectFiles as an extra.

3) It makes our code harder to understand, as we load both the original
   and obfuscated type informaiton.

Instead, we now have each package record what names were not obfuscated
as part of its cachedOuput file. Much like KnownObjectFiles, the map
records incrementally through the import graph, to avoid having to load
cachedOutput files for indirect dependencies.

We shouldn't need to worry about those maps getting large;
we only skip obfuscating declared names in a few uncommon scenarios,
such as the use of reflection or cgo's "//export".

Since go/types is relatively allocation-heavy, and the export files
contain a lot of data, we get a nice speed-up:

	name      old time/op         new time/op         delta
	Build-16          11.5s ± 2%          11.1s ± 3%  -3.77%  (p=0.008 n=5+5)

	name      old bin-B           new bin-B           delta
	Build-16          5.15M ± 0%          5.15M ± 0%    ~     (all equal)

	name      old cached-time/op  new cached-time/op  delta
	Build-16          375ms ± 3%          341ms ± 6%  -8.96%  (p=0.008 n=5+5)

	name      old sys-time/op     new sys-time/op     delta
	Build-16          283ms ±17%          289ms ±13%    ~     (p=0.841 n=5+5)

	name      old user-time/op    new user-time/op    delta
	Build-16          687ms ± 6%          664ms ± 7%    ~     (p=0.548 n=5+5)

Fixes burrowers#456.
Updates burrowers#475.
mvdan added a commit to mvdan/garble-fork that referenced this issue Jan 28, 2022
If package P1 imports package P2, P1 needs to know which names from P2
weren't obfuscated. For instance, if P2 declares T2 and does
"reflect.TypeOf(T2{...})", then P2 won't obfuscate the name T2, and
neither should P1.

This information should flow from P2 to P1, as P2 builds before
P1. We do this via obfuscatedTypesPackage; P1 loads the type information
of the obfuscated version of P2, and does a lookup for T2. If T2 exists,
then it wasn't obfuscated.

This mechanism has served us well, but it has downsides:

1) It wastes CPU; we load the type information for the entire package.

2) It's complex; for instance, we need KnownObjectFiles as an extra.

3) It makes our code harder to understand, as we load both the original
   and obfuscated type informaiton.

Instead, we now have each package record what names were not obfuscated
as part of its cachedOuput file. Much like KnownObjectFiles, the map
records incrementally through the import graph, to avoid having to load
cachedOutput files for indirect dependencies.

We shouldn't need to worry about those maps getting large;
we only skip obfuscating declared names in a few uncommon scenarios,
such as the use of reflection or cgo's "//export".

Since go/types is relatively allocation-heavy, and the export files
contain a lot of data, we get a nice speed-up:

	name      old time/op         new time/op         delta
	Build-16          11.5s ± 2%          11.1s ± 3%  -3.77%  (p=0.008 n=5+5)

	name      old bin-B           new bin-B           delta
	Build-16          5.15M ± 0%          5.15M ± 0%    ~     (all equal)

	name      old cached-time/op  new cached-time/op  delta
	Build-16          375ms ± 3%          341ms ± 6%  -8.96%  (p=0.008 n=5+5)

	name      old sys-time/op     new sys-time/op     delta
	Build-16          283ms ±17%          289ms ±13%    ~     (p=0.841 n=5+5)

	name      old user-time/op    new user-time/op    delta
	Build-16          687ms ± 6%          664ms ± 7%    ~     (p=0.548 n=5+5)

Fixes burrowers#456.
Updates burrowers#475.
lu4p pushed a commit that referenced this issue Jan 31, 2022
If package P1 imports package P2, P1 needs to know which names from P2
weren't obfuscated. For instance, if P2 declares T2 and does
"reflect.TypeOf(T2{...})", then P2 won't obfuscate the name T2, and
neither should P1.

This information should flow from P2 to P1, as P2 builds before
P1. We do this via obfuscatedTypesPackage; P1 loads the type information
of the obfuscated version of P2, and does a lookup for T2. If T2 exists,
then it wasn't obfuscated.

This mechanism has served us well, but it has downsides:

1) It wastes CPU; we load the type information for the entire package.

2) It's complex; for instance, we need KnownObjectFiles as an extra.

3) It makes our code harder to understand, as we load both the original
   and obfuscated type informaiton.

Instead, we now have each package record what names were not obfuscated
as part of its cachedOuput file. Much like KnownObjectFiles, the map
records incrementally through the import graph, to avoid having to load
cachedOutput files for indirect dependencies.

We shouldn't need to worry about those maps getting large;
we only skip obfuscating declared names in a few uncommon scenarios,
such as the use of reflection or cgo's "//export".

Since go/types is relatively allocation-heavy, and the export files
contain a lot of data, we get a nice speed-up:

	name      old time/op         new time/op         delta
	Build-16          11.5s ± 2%          11.1s ± 3%  -3.77%  (p=0.008 n=5+5)

	name      old bin-B           new bin-B           delta
	Build-16          5.15M ± 0%          5.15M ± 0%    ~     (all equal)

	name      old cached-time/op  new cached-time/op  delta
	Build-16          375ms ± 3%          341ms ± 6%  -8.96%  (p=0.008 n=5+5)

	name      old sys-time/op     new sys-time/op     delta
	Build-16          283ms ±17%          289ms ±13%    ~     (p=0.841 n=5+5)

	name      old user-time/op    new user-time/op    delta
	Build-16          687ms ± 6%          664ms ± 7%    ~     (p=0.548 n=5+5)

Fixes #456.
Updates #475.
@mvdan
Copy link
Member Author

mvdan commented Feb 8, 2022

The crypto/elliptic issue is now fixed with https://go-review.googlesource.com/c/go/+/380475.

@mvdan
Copy link
Member Author

mvdan commented Feb 12, 2022

Some updated numbers:

  • main.(*transformer).typecheck remains at 2.8%; unlikely that we can bring it down significantly
  • main.printFile remains at 2.7%; we can probably reduce the amount of work there
  • mallocgc remains at 2.3%; we can probably reduce some amount of garbage
  • Nothing related to binary expressions shows up near or above 1% anymore, as expected
  • crypto/sha256.block.abi0 is now at 0.2%; unclear if the change is just noise.

So it seems to me like the only two places we need to pay attention to in the very short term is printFile and unnecessary garbage allocation. I'll do the former; any help with the latter is appreciated.

@mvdan
Copy link
Member Author

mvdan commented Jun 15, 2022

go/printer shows up as 3.3% total CPU cost as of today, so I took a look and sent a few CLs upstream to make it ~15% faster. Reviews or thoughts welcome: https://go-review.googlesource.com/c/go/+/412557

Pretty sure I could squeeze another 10-20% out of it once these CLs are merged. I don't want to go any further for now, before reviews.

@lu4p lu4p added the enhancement New feature or request label Oct 28, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Development

No branches or pull requests

4 participants