Profile Horizon #4022 (Closed)

2opremio opened this issue Oct 21, 2021 · 14 comments

2opremio (Contributor) commented Oct 21, 2021

I don't think we have profiled the time (CPU and wallclock) and memory performance of Horizon in a while.

We should try doing that and improve any low-hanging fruit we find, particularly in Horizon's ingestion pipeline.

For starters, this is a CPU profile I took from Horizon's staging pubnet deployment:
pprof.stellar-horizon.samples.cpu.001.pb.gz

2opremio (Contributor Author) commented:

Unsurprisingly, Horizon spends most of its CPU time decoding base64 and XDR, and most of that time is spent making allocations:

[Screenshot 2021-10-21 at 18 24 11]

2opremio (Contributor Author) commented Oct 21, 2021

I think we can really improve decoding speed by:

  1. Generating custom decoders as opposed to using reflection
  2. Using memory pools for ledgers, transactions and operations (see the sketch after this list)
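
For the memory-pool part, here is a minimal sketch of the idea using sync.Pool. The Envelope type and decodeOne function are hypothetical stand-ins, not the actual stellar/go types or API:

```go
package main

import (
	"fmt"
	"sync"
)

// Envelope is a hypothetical stand-in for a decoded XDR value
// (e.g. a transaction envelope); the real types live in the xdr package.
type Envelope struct {
	Ops []int32
}

// envelopePool recycles Envelope values so the ingestion pipeline does not
// allocate a fresh one for every transaction it decodes.
var envelopePool = sync.Pool{
	New: func() interface{} { return new(Envelope) },
}

// decodeOne is a hypothetical decode step: take a pooled value, reset it,
// and fill it from the raw bytes.
func decodeOne(raw []byte) *Envelope {
	e := envelopePool.Get().(*Envelope)
	e.Ops = e.Ops[:0] // reset while keeping the underlying capacity
	_ = raw           // ... XDR decoding into e would happen here ...
	return e
}

func main() {
	e := decodeOne([]byte{0x00})
	fmt.Printf("%+v\n", e)
	envelopePool.Put(e) // return it once the processors are done with it
}
```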

2opremio (Contributor Author) commented Nov 2, 2021

I took a quick look at decoding object allocation and performance ... and I found that most (if not all) of the intermediate buffers used by the XDR decoding library end up allocated on the heap.

Here are the top-most object allocating functions of Horizon:

[Screenshot 2021-11-02 at 17 18 22]

If we ignore the calls from the reflect package, we can see that calls from xdr.(*Decoder) account for ~15% of the objects allocated.

Let's take a peek at DecodeInt():


```
File: stellar-horizon
Type: alloc_objects
Time: Oct 22, 2021 at 1:30pm (CEST)
Active filters:
   focus=github\.com/stellar/go-xdr/xdr3\.\(\*Decoder\)\.DecodeInt
Showing nodes accounting for 8585334, 5.06% of 169782541 total
----------------------------------------------------------+-------------
      flat  flat%   sum%        cum   cum%   calls calls% + context
----------------------------------------------------------+-------------
                                           7110754 82.82% |   github.com/stellar/go-xdr/xdr3.(*Decoder).decodeUnion github.com/stellar/go-xdr@v0.0.0-20201028102745-f80a23dac78a/xdr3/decode.go:530
                                            753674  8.78% |   github.com/stellar/go-xdr/xdr3.(*Decoder).DecodeBool github.com/stellar/go-xdr@v0.0.0-20201028102745-f80a23dac78a/xdr3/decode.go:175
                                            720906  8.40% |   github.com/stellar/go-xdr/xdr3.(*Decoder).decode github.com/stellar/go-xdr@v0.0.0-20201028102745-f80a23dac78a/xdr3/decode.go:801
   8585334  5.06%  5.06%    8585334  5.06%                | github.com/stellar/go-xdr/xdr3.(*Decoder).DecodeInt github.com/stellar/go-xdr@v0.0.0-20201028102745-f80a23dac78a/xdr3/decode.go:105
----------------------------------------------------------+-------------
```

All allocations come from line 105, which corresponds to the `var buf [4]byte` array declaration:

```go
func (d *Decoder) DecodeInt() (int32, int, error) {
	var buf [4]byte
	n, err := io.ReadFull(d.r, buf[:])
```

One would think that buf would be kept on the stack, but unfortunately that's not the case, due to the call to ReadFull using a slice created from buf. Here is the escape analysis output of Go 1.17.2:

```
sweepey:xdr3 fons$ pwd
/Users/fons/go/pkg/mod/github.com/stellar/go-xdr@v0.0.0-20200331223602-71a1e6d555f2/xdr3
sweepey:xdr3 fons$ go build -gcflags '-m -m -l' . |& grep decode.go
[...]
./decode.go:105:6: buf escapes to heap:
./decode.go:105:6:   flow: {heap} = &buf:
./decode.go:105:6:     from buf (address-of) at ./decode.go:106:32
./decode.go:105:6:     from buf[:] (slice) at ./decode.go:106:32
./decode.go:105:6:     from io.ReadFull(d.r, buf[:]) (call parameter) at ./decode.go:106:23
[...]
```

The same applies to the other decoding functions.

The problem is that the reader passed to io.ReadFull could keep a copy of buf around, so Go decides to keep it on the heap. (Actually, I think Go simply lets the variable escape to the heap because of the call, regardless of what io.ReadFull actually does with it.)

This means that, in order to dramatically reduce allocations, we should change the Decoder's design so that it reads from a raw buffer (avoiding the external IO calls that cause stack escapes). A rough sketch of the idea follows.
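
As a minimal sketch of that design (the bufDecoder type below is hypothetical, not the go-xdr API): reading from an in-memory slice means no local buffer is ever handed to an io.Reader, so fixed-size scratch space can stay on the stack:

```go
package main

import (
	"encoding/binary"
	"errors"
	"fmt"
)

// bufDecoder is a hypothetical decoder that reads directly from an
// in-memory byte slice instead of an io.Reader. Because no local buffer is
// ever passed to an external call, fixed-size scratch space can stay on the
// stack and the per-field heap allocations disappear.
type bufDecoder struct {
	buf []byte
	pos int
}

var errShortBuffer = errors.New("xdr: unexpected end of buffer")

// DecodeInt reads a big-endian 4-byte XDR int without allocating.
func (d *bufDecoder) DecodeInt() (int32, error) {
	if d.pos+4 > len(d.buf) {
		return 0, errShortBuffer
	}
	v := int32(binary.BigEndian.Uint32(d.buf[d.pos:]))
	d.pos += 4
	return v, nil
}

func main() {
	d := &bufDecoder{buf: []byte{0x00, 0x00, 0x00, 0x2a}}
	v, err := d.DecodeInt()
	fmt.Println(v, err) // 42 <nil>
}
```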

leighmcculloch (Member) commented:

This is a really interesting finding. 👏🏻 Combined with that finding, we could also rewrite the xdrgen-generated Go code to use the same pattern that stellar/xdrgen#65 uses for encoding, and apply it to decoding so that no reflection is necessary.
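
As an illustration of what such generated, reflection-free decoding could look like (the Price type and DecodeFrom signature here are hypothetical, not the actual xdrgen output):

```go
package main

import (
	"bytes"
	"encoding/binary"
	"fmt"
	"io"
)

// Price stands in for an xdrgen-generated struct; the real generated types
// and method names in stellar/go may differ.
type Price struct {
	N int32
	D int32
}

// DecodeFrom sketches what generated decoding could look like: the field
// layout is known at generation time, so each field is read directly,
// with no reflection involved.
func (p *Price) DecodeFrom(r io.Reader) (int, error) {
	var buf [8]byte
	n, err := io.ReadFull(r, buf[:])
	if err != nil {
		return n, err
	}
	p.N = int32(binary.BigEndian.Uint32(buf[0:4]))
	p.D = int32(binary.BigEndian.Uint32(buf[4:8]))
	return n, nil
}

func main() {
	var p Price
	_, err := p.DecodeFrom(bytes.NewReader([]byte{0, 0, 0, 1, 0, 0, 0, 2}))
	fmt.Println(p, err) // {1 2} <nil>
}
```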

2opremio added a commit to 2opremio/go-xdr that referenced this issue Nov 2, 2021
I reduced decoding allocations by:

1. Adding a scratch buffer to the Encoder/Decoder. The scratch
   buffer is used as a temporary buffer. Before, temporary
   buffers were allocated (on the heap, since they escaped the stack)
   for basic type encoding/decoding (e.g. `DecodeInt()`, `EncodeInt()`).

2. Making `DecodeFixedOpaque()` decode in place instead of
   allocating the result. Apart from reducing allocations,
   this removes the need for copying (as shown by `decodeFixedArray()`).

The pre-existing (and admittedly very limited) benchmarks show a sizeable improvement:

Before:

```
goos: darwin
goarch: amd64
pkg: github.com/stellar/go-xdr/xdr3
cpu: Intel(R) Core(TM) i7-1068NG7 CPU @ 2.30GHz
BenchmarkUnmarshal
BenchmarkUnmarshal-8   	    3000	      1039 ns/op	  15.40 MB/s	     152 B/op	      11 allocs/op
BenchmarkMarshal
BenchmarkMarshal-8     	    3000	       806.7 ns/op	  19.83 MB/s	     104 B/op	      10 allocs/op
PASS
```

After:
```
goos: darwin
goarch: amd64
pkg: github.com/stellar/go-xdr/xdr3
cpu: Intel(R) Core(TM) i7-1068NG7 CPU @ 2.30GHz
BenchmarkUnmarshal
BenchmarkUnmarshal-8   	    3000	       679.6 ns/op	  23.54 MB/s	     144 B/op	       8 allocs/op
BenchmarkMarshal
BenchmarkMarshal-8     	    3000	       609.1 ns/op	  26.27 MB/s	     104 B/op	       8 allocs/op
PASS
```

Decoding and encoding both go down to 8 allocations per operation, from 11 (decoding) and 10 (encoding).

More context at stellar/go#4022 (comment)

2opremio (Contributor Author) commented Nov 2, 2021

The problems I identified above are fixed by stellar/go-xdr#16

2opremio added a commit to 2opremio/go-xdr that referenced this issue Nov 2, 2021
I reduced the allocation count by:

1. Adding a scratch buffer to the Encoder/Decoder. The scratch
   buffer is used as a temporary buffer. Before, temporary
   buffers were allocated (on the heap, since they escaped the stack)
   for basic type encoding/decoding (e.g. `DecodeInt()`, `EncodeInt()`).

2. Making `DecodeFixedOpaque()` decode in place instead of
   allocating the result. Apart from reducing allocations,
   this removes the need for copying (as shown by `decodeFixedArray()`).

The pre-existing (and admittedly very limited) benchmarks show a sizeable improvement:

Before:

```
goos: darwin
goarch: amd64
pkg: github.com/stellar/go-xdr/xdr3
cpu: Intel(R) Core(TM) i7-1068NG7 CPU @ 2.30GHz
BenchmarkUnmarshal
BenchmarkUnmarshal-8   	 1445005	       853.5 ns/op	  18.75 MB/s	     152 B/op	      11 allocs/op
BenchmarkMarshal
BenchmarkMarshal-8     	 1591068	       643.0 ns/op	  24.88 MB/s	     104 B/op	      10 allocs/op
PASS
```

After:

```
goos: darwin
goarch: amd64
pkg: github.com/stellar/go-xdr/xdr3
cpu: Intel(R) Core(TM) i7-1068NG7 CPU @ 2.30GHz
BenchmarkUnmarshal
BenchmarkUnmarshal-8   	 1753954	       661.4 ns/op	  24.19 MB/s	     144 B/op	       8 allocs/op
BenchmarkMarshal
BenchmarkMarshal-8     	 2103310	       537.1 ns/op	  29.79 MB/s	     104 B/op	       8 allocs/op
PASS
```

Decoding and encoding both go down to 8 allocations per operation, from 11 (decoding) and 10 (encoding).

More context at stellar/go#4022 (comment)
2opremio self-assigned this Nov 9, 2021

2opremio (Contributor Author) commented:

I think that after all the recent PRs we can consider XDR encoding and decoding sufficiently optimized for now.

If we decide to do more work in that regard, I think we should try to further reduce decoding allocations, which will be hard but not impossible. For instance, we could try to intern the Assets, much like string interning (see the sketch below).
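
As a rough sketch of what interning could look like (the Asset struct here is a simplified, comparable stand-in for xdr.Asset, which would need a canonical comparable key in practice):

```go
package main

import (
	"fmt"
	"sync"
)

// Asset is a simplified, comparable stand-in for xdr.Asset. Interning
// returns one shared pointer per distinct asset, so repeated occurrences
// across transactions and ledgers don't each hold their own allocation.
type Asset struct {
	Code   string
	Issuer string
}

var (
	internMu sync.Mutex
	interned = map[Asset]*Asset{}
)

// internAsset returns the canonical *Asset for the given value.
func internAsset(a Asset) *Asset {
	internMu.Lock()
	defer internMu.Unlock()
	if p, ok := interned[a]; ok {
		return p
	}
	p := &a
	interned[a] = p
	return p
}

func main() {
	// Issuer values are placeholders, not real Stellar account IDs.
	x := internAsset(Asset{Code: "USDC", Issuer: "GEXAMPLEISSUER"})
	y := internAsset(Asset{Code: "USDC", Issuer: "GEXAMPLEISSUER"})
	fmt.Println(x == y) // true: both point at the same interned value
}
```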

2opremio (Contributor Author) commented Nov 16, 2021

Now, with all the codec improvements in place, I have profiled a staging captive core ingestion instance (for 240 seconds).

It seems that ingestion is dominated by the state verifier:

[Screenshot 2021-11-16 at 13 14 45]

2opremio (Contributor Author) commented:

While CPU execution is dominated by the state verifier, Go's profiler only measures CPU time, not wallclock time.

Go also provides execution tracing, which AFAIU can be used to infer IO wait time.

Finally, https://github.com/felixge/fgprof combines both, providing wallclock profiling.

I am going to evaluate both in staging and see what I get. See #4079.
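
For reference, wiring fgprof in looks roughly like this (a minimal sketch following the fgprof README; the port and the net/http/pprof import are illustrative):

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // standard CPU/heap pprof endpoints

	"github.com/felixge/fgprof"
)

func main() {
	// Expose the wallclock (on-CPU + off-CPU) profile next to the
	// standard pprof endpoints.
	http.DefaultServeMux.Handle("/debug/fgprof", fgprof.Handler())
	go func() {
		log.Println(http.ListenAndServe(":6060", nil))
	}()

	// ... application code would run here ...
	select {}
}
```

The profile can then be fetched with something like `go tool pprof --http=:6061 http://localhost:6060/debug/fgprof?seconds=10`.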

2opremio (Contributor Author) commented:

Here is the fgprof time profile: pprof.samples.time.001.pb.gz

Unfortunately, although it looks very different at the top level:
[Screenshot 2021-11-16 at 18 34 16]

When focusing on RunAllProcessorsOnLedger, it looks very similar to what it looked like before:

[Screenshot 2021-11-16 at 18 35 09]

So I am not sure fgprof is working well. It seems that, for some reason, it's missing DB queries.

bartekn (Contributor) commented Nov 16, 2021

I added metrics to specific batch inserts in #4080 and it turns out that Exec takes 34.91/37.35 = 93% of all the time spent in TransactionProcessor (sample of 280 pubnet ledgers). So there's definitely something about the query itself (or some preprocessing of the query in Exec).

2opremio (Contributor Author) commented:

I think it would also be useful to add tracing to DB queries in general.

This golang issue includes some alternatives: golang/go#18080
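
One of the simpler alternatives is wrapping the query/exec calls with timing. A minimal sketch, assuming a hypothetical wrapper type (the DSN, table-agnostic log format, and missing driver import are illustrative):

```go
package main

import (
	"context"
	"database/sql"
	"log"
	"time"
)

// tracedDB is a thin wrapper that logs how long each query takes.
// A production version would emit spans or metrics instead of log lines.
type tracedDB struct {
	*sql.DB
}

func (db tracedDB) QueryContext(ctx context.Context, query string, args ...interface{}) (*sql.Rows, error) {
	start := time.Now()
	rows, err := db.DB.QueryContext(ctx, query, args...)
	log.Printf("query=%q took=%s err=%v", query, time.Since(start), err)
	return rows, err
}

func (db tracedDB) ExecContext(ctx context.Context, query string, args ...interface{}) (sql.Result, error) {
	start := time.Now()
	res, err := db.DB.ExecContext(ctx, query, args...)
	log.Printf("exec=%q took=%s err=%v", query, time.Since(start), err)
	return res, err
}

func main() {
	// Example wiring; the DSN is a placeholder and a real driver import
	// (e.g. github.com/lib/pq) would be needed for sql.Open to succeed.
	raw, err := sql.Open("postgres", "postgres://localhost/horizon?sslmode=disable")
	if err != nil {
		log.Fatal(err)
	}
	db := tracedDB{raw}
	_, _ = db.ExecContext(context.Background(), "SELECT 1")
}
```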

bartekn (Contributor) commented Nov 19, 2021

It looks like all the XDR improvements and #4087 made TransactionProcessor fast again. Should we close this or move it to the backlog?

2opremio (Contributor Author) commented:

Move it to the backlog. I think we will need to keep working on this.

Also (without downplaying the improvements), the transaction processor is still the bottleneck, right?

I still think there is a lot to be gained by improving the performance of the batch insert builder in general (a sketch of the batching idea is below).
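
To illustrate one direction, here is a sketch of building a single multi-row INSERT instead of executing one statement per row. The table, columns, and buildBatchInsert function are illustrative, not Horizon's actual batch insert builder:

```go
package main

import (
	"fmt"
	"strings"
)

// buildBatchInsert builds one multi-row INSERT statement instead of one
// statement per row, reducing round-trips and per-Exec overhead.
func buildBatchInsert(rows [][]interface{}) (string, []interface{}) {
	var (
		sb   strings.Builder
		args []interface{}
	)
	sb.WriteString("INSERT INTO history_operations (id, type) VALUES ")
	for i, row := range rows {
		if i > 0 {
			sb.WriteString(", ")
		}
		// Number the placeholders sequentially across all rows.
		fmt.Fprintf(&sb, "($%d, $%d)", len(args)+1, len(args)+2)
		args = append(args, row...)
	}
	return sb.String(), args
}

func main() {
	query, args := buildBatchInsert([][]interface{}{{1, 0}, {2, 1}})
	fmt.Println(query)
	fmt.Println(args)
	// The query and args can then be passed to a single db.ExecContext call.
}
```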

bartekn (Contributor) commented Dec 20, 2021

What do you think about closing this issue and creating specific issues connected to performance improvements? This one is very general and I guess could live in the backlog forever.
