Replace scratch buffers with pools #65
Adds pools for small buffers used for reading CBOR headers and for larger buffers used when reading strings. This trades off some CPU for less pressure on the garbage collector. The benchmarks show a notable increase in CPU, but allocations are amortized to near zero in many cases. Internalising the management of scratch buffers simplifies the code and allows removal/deprecation of duplicate implementations for several functions. Users will need to re-generate marshaling methods to benefit from the removal of scratch buffers from those methods.

Benchstat comparison with master:

```
name               old time/op    new time/op    delta
Marshaling-8          564ns ± 1%    1123ns ± 3%    +99.16%  (p=0.000 n=9+10)
Unmarshaling-8       2.75µs ± 5%    3.53µs ± 4%    +28.63%  (p=0.000 n=10+10)
LinkScan-8            744ns ± 0%    1694ns ± 1%   +127.69%  (p=0.000 n=10+9)
Deferred-8           1.68µs ± 1%    3.90µs ± 0%   +131.76%  (p=0.000 n=10+9)
MapMarshaling-8       317ns ± 0%     667ns ± 2%   +110.55%  (p=0.000 n=9+10)
MapUnmarshaling-8    2.71µs ±10%    3.33µs ± 3%    +23.26%  (p=0.000 n=10+10)

name               old alloc/op   new alloc/op   delta
Marshaling-8           160B ± 0%        0B       -100.00%  (p=0.000 n=10+10)
Unmarshaling-8       3.44kB ± 0%    1.96kB ± 0%   -42.94%  (p=0.000 n=10+10)
LinkScan-8             112B ± 0%        0B       -100.00%  (p=0.000 n=10+10)
Deferred-8            88.0B ± 0%    72.0B ± 0%    -18.18%  (p=0.000 n=10+10)
MapMarshaling-8       48.0B ± 0%     2.0B ± 0%    -95.83%  (p=0.000 n=10+10)
MapUnmarshaling-8    2.53kB ± 0%    1.54kB ± 0%   -39.15%  (p=0.000 n=10+10)

name               old allocs/op  new allocs/op  delta
Marshaling-8           10.0 ± 0%      0.0        -100.00%  (p=0.000 n=10+10)
Unmarshaling-8         43.0 ± 0%     21.0 ± 0%    -51.16%  (p=0.000 n=10+10)
LinkScan-8             1.00 ± 0%     0.00        -100.00%  (p=0.000 n=10+10)
Deferred-8             3.00 ± 0%     2.00 ± 0%    -33.33%  (p=0.000 n=10+10)
MapMarshaling-8        5.00 ± 0%     2.00 ± 0%    -60.00%  (p=0.000 n=10+10)
MapUnmarshaling-8      56.0 ± 0%     27.0 ± 0%    -51.79%  (p=0.000 n=10+10)
```
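For readers unfamiliar with the approach, a minimal sketch of the kind of pooling described above. All names and sizes here are hypothetical, not the actual cbor-gen internals:

```go
package cbg

import "sync"

// Hypothetical sketch of the two pools described above: a small fixed-size
// scratch buffer for CBOR headers and a larger, growable buffer for string
// reads. Names and sizes are illustrative only.
var (
	headerBufPool = sync.Pool{New: func() interface{} {
		b := make([]byte, 9) // initial byte plus up to 8 bytes of extra length
		return &b
	}}
	stringBufPool = sync.Pool{New: func() interface{} {
		b := make([]byte, 0, 512)
		return &b
	}}
)

// maxPooledStringBuf caps what goes back into the pool so that one very
// large string read does not pin a huge buffer indefinitely.
const maxPooledStringBuf = 1 << 20

func putStringBuf(b *[]byte) {
	if cap(*b) > maxPooledStringBuf {
		return // let the GC reclaim oversized buffers
	}
	*b = (*b)[:0]
	stringBufPool.Put(b)
}
```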
```
@@ -336,6 +293,12 @@ func CborReadHeaderBuf(br io.Reader, scratch []byte) (byte, uint64, error) {
		return 0, 0, err
	}

	defer func() {
```
Note that the defer in `CborReadHeaderBuf` was missed in #64 - it will need to be fixed if this PR is not accepted.
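As an aside, this is the pattern the defer exists for: the scratch buffer is borrowed from a pool and the defer guarantees it is returned on every path, including early error returns. A sketch, continuing the hypothetical `headerBufPool` above (assumes `"fmt"` and `"io"` imports; not the exact code in this PR):

```go
// Sketch only: a header reader that borrows its scratch buffer from the
// hypothetical headerBufPool and releases it with defer, so every return
// path (including errors) puts the buffer back.
func readHeaderPooled(br io.Reader) (byte, uint64, error) {
	scratch := headerBufPool.Get().(*[]byte)
	defer headerBufPool.Put(scratch)

	buf := *scratch
	if _, err := io.ReadFull(br, buf[:1]); err != nil {
		return 0, 0, err
	}
	maj := buf[0] >> 5     // CBOR major type lives in the top three bits
	extra := buf[0] & 0x1f // low five bits: small value or length-of-length
	if extra < 24 {
		// Values below 24 are encoded directly in the initial byte.
		return maj, uint64(extra), nil
	}
	// Reading the 1/2/4/8 extra bytes into buf is omitted from this sketch.
	return 0, 0, fmt.Errorf("multi-byte header not handled in this sketch")
}
```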
Note that allocating buffers like
Given that there are so few allocations anyways, I'm somewhat skeptical of this. Have you run any pprof profiles with
Some perhaps. But escape analysis shows that, in general, these scratch buffers escape:
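(The escape-analysis output referred to above was not preserved in this copy of the thread. It would have come from the compiler's `-m` diagnostics, along the lines of the command below; the actual output is not reproduced here.)

```
go build -gcflags='-m' ./... 2>&1 | grep 'escapes to heap'
```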
Yes. The general approach in this kind of optimization is to profile step by step, analyzing the top allocating functions, optimizing and then repeating. I'm following this up today with profiling clients of this package to see the effects against real usage.
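(For reference, also not from the original thread: one way to collect the kind of allocation profile being discussed, straight from the package's benchmarks, using standard `go test` and `pprof` options.)

```
go test -run=NONE -bench=. -memprofile=mem.out
go tool pprof -alloc_objects -top mem.out
```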
I compared this branch against the current master for a couple of downstream packages. That is, I updated each package to use the latest cbor-gen (485c475), re-ran the generation and benchmarked, then repeated for this branch.

Note for go-hamt-ipld
go-amt-ipld
Note:
More comparisons between current master and this pool change.

lotus chain/state package
lotus chain/types package
I'll take a further look to see if I can reduce the slowdown. It may be that the addition of the header buffer pool is not worth the CPU overhead, but the string buffer pool may well be.
I'd really start by profiling the target program to see if this is a source of allocation issues.
Yes. This is what I already have from a full-history lily node. I've rebuilt with a version of

CBOR unmarshalling is right up there, so I believe optimization is worth investing in.

(note: this node is simply syncing the chain, not running any lily-specific work)
After running lily with the modified
Awesome! @whyrusleeping or @Kubuxu, can you remember why we didn't do this before?
I was optimizing just for minimizing alloc count and CPU cost. @iand what happens if we leave the buffers to get passed through all the
Lily profile after 12 hours with the pool changes
Added SVG versions of the alloc_objects profiles, which are surprisingly readable:

lily master:

lily with pools:
This is what I tried originally. I had the generated files include a pool that the marshaling functions would use to create a buffer to pass to the cbor-gen library code. I didn't profile it much because I felt it was quite messy. For example, you have to deal with the client generating more than one file in the same package, which means either holding some state so you only generate one pool, or generating a separate pool for each file. Then you have to make the generation functions aware of the scope so they know which pool to use. I had that working, but then realised I could push the pools into the library and remove a load of duplicated code at the same time, leaving a cleaner CBOR API to work with. So I didn't spend much time profiling the first version. I'll explore it again though.
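To make the trade-off concrete, a rough sketch of that rejected shape as hypothetical generated output (the type and pool names are invented for illustration; `CborReadHeaderBuf` is the real library helper shown in the diff above):

```go
// Hypothetical generated file: the pool lives in the client package and each
// generated method borrows a buffer to pass into the *Buf helpers.
package types

import (
	"io"
	"sync"

	cbg "github.com/whyrusleeping/cbor-gen"
)

// SimpleType is a stand-in for whatever struct the code was generated for.
type SimpleType struct{}

var cborScratchPool = sync.Pool{New: func() interface{} {
	b := make([]byte, 9)
	return &b
}}

func (t *SimpleType) UnmarshalCBOR(r io.Reader) error {
	scratch := cborScratchPool.Get().(*[]byte)
	defer cborScratchPool.Put(scratch)

	// The caller-owned scratch buffer is threaded through every helper call
	// instead of each helper allocating its own.
	maj, extra, err := cbg.CborReadHeaderBuf(r, *scratch)
	if err != nil {
		return err
	}
	_, _ = maj, extra // decoding of the actual fields omitted in this sketch
	return nil
}
```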
Closing in favour of #67