Replace scratch buffers with pools #65
Adds pools for small buffers used for reading CBOR headers and for larger buffers used when reading strings. This trades off some CPU for less pressure on the garbage collector. The benchmarks show a notable increase in CPU, but allocations are amortized to near zero in many cases. Internalising the management of scratch buffers simplifies the code and allows removal/deprecation of duplicate implementations for several functions. Users will need to re-generate marshaling methods to benefit from the removal of scratch buffers from those methods.

Benchstat comparison with master:

```
name               old time/op    new time/op    delta
Marshaling-8          564ns ± 1%    1123ns ± 3%    +99.16%  (p=0.000 n=9+10)
Unmarshaling-8       2.75µs ± 5%    3.53µs ± 4%    +28.63%  (p=0.000 n=10+10)
LinkScan-8            744ns ± 0%    1694ns ± 1%   +127.69%  (p=0.000 n=10+9)
Deferred-8           1.68µs ± 1%    3.90µs ± 0%   +131.76%  (p=0.000 n=10+9)
MapMarshaling-8       317ns ± 0%     667ns ± 2%   +110.55%  (p=0.000 n=9+10)
MapUnmarshaling-8    2.71µs ±10%    3.33µs ± 3%    +23.26%  (p=0.000 n=10+10)

name               old alloc/op   new alloc/op   delta
Marshaling-8           160B ± 0%        0B       -100.00%  (p=0.000 n=10+10)
Unmarshaling-8       3.44kB ± 0%    1.96kB ± 0%   -42.94%  (p=0.000 n=10+10)
LinkScan-8             112B ± 0%        0B       -100.00%  (p=0.000 n=10+10)
Deferred-8            88.0B ± 0%    72.0B ± 0%    -18.18%  (p=0.000 n=10+10)
MapMarshaling-8       48.0B ± 0%     2.0B ± 0%    -95.83%  (p=0.000 n=10+10)
MapUnmarshaling-8    2.53kB ± 0%    1.54kB ± 0%   -39.15%  (p=0.000 n=10+10)

name               old allocs/op  new allocs/op  delta
Marshaling-8           10.0 ± 0%      0.0        -100.00%  (p=0.000 n=10+10)
Unmarshaling-8         43.0 ± 0%     21.0 ± 0%    -51.16%  (p=0.000 n=10+10)
LinkScan-8             1.00 ± 0%     0.00        -100.00%  (p=0.000 n=10+10)
Deferred-8             3.00 ± 0%     2.00 ± 0%    -33.33%  (p=0.000 n=10+10)
MapMarshaling-8        5.00 ± 0%     2.00 ± 0%    -60.00%  (p=0.000 n=10+10)
MapUnmarshaling-8      56.0 ± 0%     27.0 ± 0%    -51.79%  (p=0.000 n=10+10)
```
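For readers unfamiliar with the approach, a minimal sketch of the kind of pooling described above. All names and sizes here are hypothetical, not the actual cbor-gen internals:

```go
package cbg

import "sync"

// Hypothetical sketch of the two pools described above: a small fixed-size
// scratch buffer for CBOR headers and a larger, growable buffer for string
// reads. Names and sizes are illustrative only.
var (
	headerBufPool = sync.Pool{New: func() interface{} {
		b := make([]byte, 9) // initial byte plus up to 8 bytes of extra length
		return &b
	}}
	stringBufPool = sync.Pool{New: func() interface{} {
		b := make([]byte, 0, 512)
		return &b
	}}
)

// maxPooledStringBuf caps what goes back into the pool so that one very
// large string read does not pin a huge buffer indefinitely.
const maxPooledStringBuf = 1 << 20

func putStringBuf(b *[]byte) {
	if cap(*b) > maxPooledStringBuf {
		return // let the GC reclaim oversized buffers
	}
	*b = (*b)[:0]
	stringBufPool.Put(b)
}
```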
```
@@ -336,6 +293,12 @@ func CborReadHeaderBuf(br io.Reader, scratch []byte) (byte, uint64, error) {
		return 0, 0, err
	}

	defer func() {
```
Note that the defer in `CborReadHeaderBuf` was missed in #64 - it will need to be fixed if this PR is not accepted.
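As an aside, this is the pattern the defer exists for: the scratch buffer is borrowed from a pool and the defer guarantees it is returned on every path, including early error returns. A sketch, continuing the hypothetical `headerBufPool` above (assumes `"fmt"` and `"io"` imports; not the exact code in this PR):

```go
// Sketch only: a header reader that borrows its scratch buffer from the
// hypothetical headerBufPool and releases it with defer, so every return
// path (including errors) puts the buffer back.
func readHeaderPooled(br io.Reader) (byte, uint64, error) {
	scratch := headerBufPool.Get().(*[]byte)
	defer headerBufPool.Put(scratch)

	buf := *scratch
	if _, err := io.ReadFull(br, buf[:1]); err != nil {
		return 0, 0, err
	}
	maj := buf[0] >> 5     // CBOR major type lives in the top three bits
	extra := buf[0] & 0x1f // low five bits: small value or length-of-length
	if extra < 24 {
		// Values below 24 are encoded directly in the initial byte.
		return maj, uint64(extra), nil
	}
	// Reading the 1/2/4/8 extra bytes into buf is omitted from this sketch.
	return 0, 0, fmt.Errorf("multi-byte header not handled in this sketch")
}
```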
Note that allocating buffers like
Given that there are so few allocations anyways, I'm somewhat skeptical of this. Have you run any pprof profiles with
Some perhaps. But escape analysis shows that, in general, these scratch buffers escape:
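(The escape-analysis output referred to above was not preserved in this copy of the thread. It would have come from the compiler's `-m` diagnostics, along the lines of the command below; the actual output is not reproduced here.)

```
go build -gcflags='-m' ./... 2>&1 | grep 'escapes to heap'
```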
Yes. The general approach in this kind of optimization is to profile step by step, analyzing the top allocating functions, optimizing and then repeating. I'm following this up today with profiling clients of this package to see the effects against real usage.
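(For reference, also not from the original thread: one way to collect the kind of allocation profile being discussed, straight from the package's benchmarks, using standard `go test` and `pprof` options.)

```
go test -run=NONE -bench=. -memprofile=mem.out
go tool pprof -alloc_objects -top mem.out
```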
I compared this branch against the current master for a couple of downstream packages. That is, I updated each package to use the latest cbor-gen (485c475), re-ran the generation and benchmarked, then repeated for this branch.

Note for go-hamt-ipld
go-amt-ipld
Note:
More comparisons between current master and this pool change.

lotus chain/state package
lotus chain/types package
I'll take a further look to see if I can reduce the slowdown. It may be that the addition of the header buffer pool is not worth the CPU overhead, but the string buffer pool may well be.
I'd really start by profiling the target program to see if this is a source of allocation issues.
Yes. This is what I already have from a full-history lily node. I've rebuilt with a version of

CBOR unmarshalling is right up there, so I believe optimization is worth investing in.

(note: this node is simply syncing the chain, not running any lily-specific work)
After running lily with the modified
Awesome! @whyrusleeping or @Kubuxu, can you remember why we didn't do this before?
I was optimizing just for minimizing alloc count and CPU cost. @iand what happens if we leave the buffers to get passed through all the
Lily profile after 12 hours with the pool changes
Added SVG versions of the alloc_objects profiles, which are surprisingly readable:

lily master:

lily with pools:
This is what I tried originally. I had the generated files include a pool that the marshaling functions would use to create a buffer to pass to the cbor-gen library code. I didn't profile it much because I felt it was quite messy. For example, you have to deal with the client generating more than one file in the same package, which means either holding some state so you only generate one pool, or generating a separate pool for each file. Then you have to make the generation functions aware of the scope so they know which pool to use. I had that working, but then realised I could push the pools into the library and remove a load of duplicated code at the same time, leaving a cleaner CBOR API to work with. So I didn't spend much time profiling the first version. I'll explore it again though.
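To make the trade-off concrete, a rough sketch of that rejected shape as hypothetical generated output (the type and pool names are invented for illustration; `CborReadHeaderBuf` is the real library helper shown in the diff above):

```go
// Hypothetical generated file: the pool lives in the client package and each
// generated method borrows a buffer to pass into the *Buf helpers.
package types

import (
	"io"
	"sync"

	cbg "github.com/whyrusleeping/cbor-gen"
)

// SimpleType is a stand-in for whatever struct the code was generated for.
type SimpleType struct{}

var cborScratchPool = sync.Pool{New: func() interface{} {
	b := make([]byte, 9)
	return &b
}}

func (t *SimpleType) UnmarshalCBOR(r io.Reader) error {
	scratch := cborScratchPool.Get().(*[]byte)
	defer cborScratchPool.Put(scratch)

	// The caller-owned scratch buffer is threaded through every helper call
	// instead of each helper allocating its own.
	maj, extra, err := cbg.CborReadHeaderBuf(r, *scratch)
	if err != nil {
		return err
	}
	_, _ = maj, extra // decoding of the actual fields omitted in this sketch
	return nil
}
```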
Closing in favour of #67