
Add CborReader and CborWriter #67

Merged
5 commits merged into whyrusleeping:master on Mar 23, 2022

Conversation

@iand (Contributor) commented Mar 22, 2022

This gives a better tradeoff between CPU and allocations than #65, so I am closing that PR in favour of this one.

Adds CborReader and CborWriter types that contain a small internal buffer for optimizing CBOR header reads and writes. In contrast to #65, no pool is used for header buffers; instead the reader or writer reuses a single scratch buffer across multiple operations. A package-level pool is used for the 8k buffers needed when reading strings. The benchmarks show an increase in CPU time, but allocations are substantially reduced.
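
For illustration, the writer side of this pattern looks roughly like the sketch below. This is not the actual cbor-gen API; the package, type, and function names are invented for the sketch, but it shows the idea of a writer owning a small fixed-size scratch array so encoding a CBOR header needs no per-call allocation.

package cborsketch // illustrative package, not part of cbor-gen

import (
	"encoding/binary"
	"io"
)

// scratchWriter sketches the CborWriter idea: the writer owns a small
// fixed-size buffer that is reused for every header it encodes, so headers
// need neither a fresh allocation nor a pool round-trip.
type scratchWriter struct {
	w       io.Writer
	scratch [9]byte // 1 byte of major type/additional info + up to 8 bytes of argument
}

// writeHeader encodes a CBOR header (major type in the top 3 bits, argument
// in the low bits or the following bytes) into the reused scratch buffer and
// emits it with a single Write.
func (sw *scratchWriter) writeHeader(major byte, extra uint64) error {
	buf := sw.scratch[:]
	var n int
	switch {
	case extra < 24:
		buf[0] = major<<5 | byte(extra)
		n = 1
	case extra < 1<<8:
		buf[0] = major<<5 | 24
		buf[1] = byte(extra)
		n = 2
	case extra < 1<<16:
		buf[0] = major<<5 | 25
		binary.BigEndian.PutUint16(buf[1:], uint16(extra))
		n = 3
	case extra < 1<<32:
		buf[0] = major<<5 | 26
		binary.BigEndian.PutUint32(buf[1:], uint32(extra))
		n = 5
	default:
		buf[0] = major<<5 | 27
		binary.BigEndian.PutUint64(buf[1:], extra)
		n = 9
	}
	_, err := sw.w.Write(buf[:n])
	return err
}

A CborReader can use the same idea in reverse, reading header bytes into its own scratch array instead of allocating one per call.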

Benchstat comparison with master:

name               old time/op    new time/op    delta
Marshaling-8          564ns ± 1%     840ns ± 0%  +49.06%  (p=0.000 n=9+9)
Unmarshaling-8       2.75µs ± 5%    3.07µs ± 9%  +11.71%  (p=0.000 n=10+10)
LinkScan-8            744ns ± 0%     932ns ± 1%  +25.26%  (p=0.000 n=10+10)
Deferred-8           1.68µs ± 1%    1.89µs ± 2%  +12.57%  (p=0.000 n=10+10)
MapMarshaling-8       317ns ± 0%     441ns ± 0%  +39.21%  (p=0.000 n=9+9)
MapUnmarshaling-8    2.71µs ±10%    3.07µs ±12%  +13.50%  (p=0.000 n=10+10)

name               old alloc/op   new alloc/op   delta
Marshaling-8           160B ± 0%       64B ± 0%  -60.00%  (p=0.000 n=10+10)
Unmarshaling-8       3.44kB ± 0%    2.03kB ± 0%  -41.08%  (p=0.000 n=10+10)
LinkScan-8             112B ± 0%      112B ± 0%     ~     (all equal)
Deferred-8            88.0B ± 0%     88.0B ± 0%     ~     (all equal)
MapMarshaling-8       48.0B ± 0%     64.0B ± 0%  +33.33%  (p=0.000 n=10+10)
MapUnmarshaling-8    2.53kB ± 0%    1.60kB ± 0%  -36.63%  (p=0.000 n=10+10)

name               old allocs/op  new allocs/op  delta
Marshaling-8           10.0 ± 0%       2.0 ± 0%  -80.00%  (p=0.000 n=10+10)
Unmarshaling-8         43.0 ± 0%      23.0 ± 0%  -46.51%  (p=0.000 n=10+10)
LinkScan-8             1.00 ± 0%      1.00 ± 0%     ~     (all equal)
Deferred-8             3.00 ± 0%      3.00 ± 0%     ~     (all equal)
MapMarshaling-8        5.00 ± 0%      4.00 ± 0%  -20.00%  (p=0.000 n=10+10)
MapUnmarshaling-8      56.0 ± 0%      29.0 ± 0%  -48.21%  (p=0.000 n=10+10)

There are probably some additional opportunities for refactoring around the readers/writers, but I'm leaving this change focused on the initial performance improvements.

Adds pools for small buffers used for reading CBOR headers and for larger buffers used when reading strings. This trades some CPU for less pressure on the garbage collector. The benchmarks show a notable increase in CPU time, but allocations are amortized to near zero in many cases.

Internalising the management of scratch buffers simplifies the code and
allows removal/deprecation of duplicate implementations for several functions.

Users will need to re-generate marshaling methods to benefit from the removal
of scratch buffers from those methods.
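
As a rough sketch of the string-buffer pooling described above (the 8192-byte size comes from the description; the function and variable names are illustrative, not the actual cbor-gen internals):

package cborsketch // illustrative, not the actual cbor-gen code

import (
	"io"
	"sync"
)

// stringBufPool sketches the package-level pool of 8k buffers.
var stringBufPool = sync.Pool{
	New: func() any { return make([]byte, 8192) },
}

// readString reads a CBOR text payload of known length, borrowing a pooled
// buffer for small payloads so the common case allocates nothing beyond the
// resulting string. Larger payloads fall back to a one-off allocation.
func readString(r io.Reader, length uint64) (string, error) {
	buf := stringBufPool.Get().([]byte)
	defer stringBufPool.Put(buf) // string() below copies, so returning the buffer is safe

	if length <= uint64(len(buf)) {
		if _, err := io.ReadFull(r, buf[:length]); err != nil {
			return "", err
		}
		return string(buf[:length]), nil
	}

	out := make([]byte, length)
	if _, err := io.ReadFull(r, out); err != nil {
		return "", err
	}
	return string(out), nil
}

This internal management of buffers is also why regenerating the marshaling methods matters: the generated code no longer needs to thread its own scratch buffers through every call.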

Benchstat comparison with master:

name               old time/op    new time/op    delta
Marshaling-8          564ns ± 1%    1123ns ± 3%   +99.16%  (p=0.000 n=9+10)
Unmarshaling-8       2.75µs ± 5%    3.53µs ± 4%   +28.63%  (p=0.000 n=10+10)
LinkScan-8            744ns ± 0%    1694ns ± 1%  +127.69%  (p=0.000 n=10+9)
Deferred-8           1.68µs ± 1%    3.90µs ± 0%  +131.76%  (p=0.000 n=10+9)
MapMarshaling-8       317ns ± 0%     667ns ± 2%  +110.55%  (p=0.000 n=9+10)
MapUnmarshaling-8    2.71µs ±10%    3.33µs ± 3%   +23.26%  (p=0.000 n=10+10)

name               old alloc/op   new alloc/op   delta
Marshaling-8           160B ± 0%        0B       -100.00%  (p=0.000 n=10+10)
Unmarshaling-8       3.44kB ± 0%    1.96kB ± 0%   -42.94%  (p=0.000 n=10+10)
LinkScan-8             112B ± 0%        0B       -100.00%  (p=0.000 n=10+10)
Deferred-8            88.0B ± 0%     72.0B ± 0%   -18.18%  (p=0.000 n=10+10)
MapMarshaling-8       48.0B ± 0%      2.0B ± 0%   -95.83%  (p=0.000 n=10+10)
MapUnmarshaling-8    2.53kB ± 0%    1.54kB ± 0%   -39.15%  (p=0.000 n=10+10)

name               old allocs/op  new allocs/op  delta
Marshaling-8           10.0 ± 0%       0.0       -100.00%  (p=0.000 n=10+10)
Unmarshaling-8         43.0 ± 0%      21.0 ± 0%   -51.16%  (p=0.000 n=10+10)
LinkScan-8             1.00 ± 0%      0.00       -100.00%  (p=0.000 n=10+10)
Deferred-8             3.00 ± 0%      2.00 ± 0%   -33.33%  (p=0.000 n=10+10)
MapMarshaling-8        5.00 ± 0%      2.00 ± 0%   -60.00%  (p=0.000 n=10+10)
MapUnmarshaling-8      56.0 ± 0%      27.0 ± 0%   -51.79%  (p=0.000 n=10+10)
@iand (Author) commented Mar 22, 2022

@Stebalien @whyrusleeping

@iand changed the title from "Chore/pool shared" to "Add CborReader and CborWriter" on Mar 22, 2022
utils.go, comment on lines +586 to +588:
for i := range buf {
buf[i] = 0
}
Collaborator:

But we don't really need this, do we?

Collaborator:

(although I guess it likely doesn't matter)

@iand (Author):

It's good practice to clear the buffer before returning it to the pool, although we're currently careful not to read past bounds, so it's not strictly necessary.
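
For reference, the clear-before-return pattern being discussed looks roughly like this (a sketch continuing the illustrative package above, so the sync import is assumed; it is not the exact code at these lines):

var pooledBufs = sync.Pool{
	New: func() any { return make([]byte, 8192) },
}

// putBuf zeroes the buffer before handing it back, so bytes left over from a
// previous read can never leak into a later caller that under-fills it.
func putBuf(buf []byte) {
	for i := range buf {
		buf[i] = 0
	}
	pooledBufs.Put(buf)
}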

This gives a small improvement in some cases, but an 8% improvement in map marshaling speed.

It doesn't look like this escapes, but removing it saves ~5% for LinkScan, Deferred, and MapMarshaling.
utils.go (outdated):
@@ -614,12 +632,15 @@ func bufToCid(buf []byte) (cid.Cid, error) {
var byteArrZero = []byte{0}

func WriteCid(w io.Writer, c cid.Cid) error {
if cw, ok := w.(*CborWriter); ok {
w = cw // take advantage of cbor writer scratch buffer
Collaborator:

This doesn't do anything.

@iand (Author):

Yeah, I should call NewCborWriter here
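
Presumably the fix looks something like the sketch below: wrap plain writers so the header writes go through the CborWriter scratch buffer. This assumes NewCborWriter, WriteMajorTypeHeader, MajTag, and MajByteString behave as their names suggest; it is not the exact code that was pushed.

func WriteCid(w io.Writer, c cid.Cid) error {
	cw, ok := w.(*CborWriter)
	if !ok {
		// Wrap plain writers so the header writes below use the
		// CborWriter's scratch buffer instead of allocating per call.
		cw = NewCborWriter(w)
	}

	// DAG-CBOR encodes a CID as tag 42 wrapping a byte string of a zero
	// multibase prefix followed by the CID bytes.
	if err := WriteMajorTypeHeader(cw, MajTag, 42); err != nil {
		return err
	}
	data := c.Bytes()
	if err := WriteMajorTypeHeader(cw, MajByteString, uint64(len(data)+1)); err != nil {
		return err
	}
	if _, err := cw.Write(byteArrZero); err != nil {
		return err
	}
	_, err := cw.Write(data)
	return err
}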

@iand commented Mar 23, 2022

The updated comparison with master shows a much better time delta for the Marshaling benchmark:

name               old time/op    new time/op    delta
Marshaling-8          564ns ± 1%     634ns ± 1%  +12.48%  (p=0.000 n=9+10)
Unmarshaling-8       2.75µs ± 5%    3.21µs ±16%  +16.98%  (p=0.000 n=10+10)
LinkScan-8            744ns ± 0%     948ns ± 3%  +27.32%  (p=0.000 n=10+10)
Deferred-8           1.68µs ± 1%    1.89µs ± 2%  +12.68%  (p=0.000 n=10+9)
MapMarshaling-8       317ns ± 0%     438ns ± 0%  +38.38%  (p=0.000 n=9+10)
MapUnmarshaling-8    2.71µs ±10%    3.18µs ±15%  +17.66%  (p=0.000 n=10+9)

name               old alloc/op   new alloc/op   delta
Marshaling-8           160B ± 0%       64B ± 0%  -60.00%  (p=0.000 n=10+10)
Unmarshaling-8       3.44kB ± 0%    2.03kB ± 0%  -41.08%  (p=0.000 n=10+9)
LinkScan-8             112B ± 0%      112B ± 0%     ~     (all equal)
Deferred-8            88.0B ± 0%     88.0B ± 0%     ~     (all equal)
MapMarshaling-8       48.0B ± 0%     64.0B ± 0%  +33.33%  (p=0.000 n=10+10)
MapUnmarshaling-8    2.53kB ± 0%    1.60kB ± 0%  -36.63%  (p=0.000 n=10+8)

name               old allocs/op  new allocs/op  delta
Marshaling-8           10.0 ± 0%       2.0 ± 0%  -80.00%  (p=0.000 n=10+10)
Unmarshaling-8         43.0 ± 0%      23.0 ± 0%  -46.51%  (p=0.000 n=10+10)
LinkScan-8             1.00 ± 0%      1.00 ± 0%     ~     (all equal)
Deferred-8             3.00 ± 0%      3.00 ± 0%     ~     (all equal)
MapMarshaling-8        5.00 ± 0%      4.00 ± 0%  -20.00%  (p=0.000 n=10+10)
MapUnmarshaling-8      56.0 ± 0%      29.0 ± 0%  -48.21%  (p=0.000 n=10+10)

@iand commented Mar 23, 2022

Results from go-hamt-ipld, compared against master of cbor-gen. The Find benchmark exercises unmarshaling pretty well.

name                       old time/op       new time/op       delta
SerializeNode-8                 9.88µs ±13%      10.92µs ± 5%  +10.54%  (p=0.009 n=10+10)
GetNode-8                       17.8µs ± 2%       14.4µs ± 5%  -19.13%  (p=0.000 n=10+10)
Find/n=1k/bitwidth=5-8          24.0µs ± 2%       23.0µs ± 2%   -3.98%  (p=0.000 n=10+10)
Find/n=5k/bitwidth=5-8          28.3µs ± 3%       27.7µs ± 5%     ~     (p=0.065 n=9+10)
Find/n=10k/bitwidth=5-8         23.1µs ± 3%       21.7µs ± 2%   -6.07%  (p=0.000 n=10+10)
Find/n=50k/bitwidth=5-8         38.4µs ± 7%       36.1µs ± 6%   -5.83%  (p=0.015 n=10+10)
Find/n=100k/bitwidth=5-8        41.0µs ± 8%       41.2µs ± 6%     ~     (p=1.000 n=10+10)
Find/n=500k/bitwidth=5-8        30.2µs ± 3%       29.3µs ± 4%   -3.29%  (p=0.003 n=10+10)
Find/n=1000k/bitwidth=5-8       36.7µs ± 2%       35.3µs ± 1%   -3.91%  (p=0.000 n=9+8)

name                       old alloc/op      new alloc/op      delta
SerializeNode-8                 14.9kB ± 3%       16.3kB ± 3%   +9.57%  (p=0.000 n=10+10)
GetNode-8                       20.7kB ± 0%       18.7kB ± 0%   -9.61%  (p=0.000 n=10+10)
Find/n=1k/bitwidth=5-8          35.0kB ± 2%       34.7kB ± 1%     ~     (p=0.156 n=10+9)
Find/n=5k/bitwidth=5-8          35.6kB ± 4%       35.4kB ± 6%     ~     (p=0.631 n=10+10)
Find/n=10k/bitwidth=5-8         23.3kB ± 2%       22.5kB ± 2%   -3.20%  (p=0.000 n=10+10)
Find/n=50k/bitwidth=5-8         51.2kB ± 0%       50.3kB ± 1%   -1.73%  (p=0.000 n=10+10)
Find/n=100k/bitwidth=5-8        55.0kB ± 1%       54.1kB ± 1%   -1.69%  (p=0.000 n=10+10)
Find/n=500k/bitwidth=5-8        33.5kB ± 0%       32.7kB ± 0%   -2.33%  (p=0.000 n=10+10)
Find/n=1000k/bitwidth=5-8       46.3kB ± 0%       45.3kB ± 0%   -2.06%  (p=0.000 n=10+10)

name                       old allocs/op     new allocs/op     delta
SerializeNode-8                    111 ± 2%          107 ± 5%   -3.86%  (p=0.001 n=10+10)
GetNode-8                          519 ± 0%          264 ± 0%  -49.13%  (p=0.000 n=10+10)
Find/n=1k/bitwidth=5-8             478 ± 2%          398 ± 1%  -16.63%  (p=0.000 n=10+10)
Find/n=10k/bitwidth=5-8            413 ± 1%          330 ± 1%  -20.25%  (p=0.000 n=10+10)
Find/n=50k/bitwidth=5-8            728 ± 0%          600 ± 0%  -17.50%  (p=0.000 n=10+9)
Find/n=100k/bitwidth=5-8           784 ± 0%          647 ± 0%  -17.45%  (p=0.000 n=10+10)
Find/n=500k/bitwidth=5-8           600 ± 0%          479 ± 0%  -20.17%  (p=0.000 n=10+10)
Find/n=1000k/bitwidth=5-8          743 ± 0%          602 ± 0%  -19.04%  (p=0.000 n=9+10)

@Stebalien (Collaborator) left a comment:

LGTM.

@Stebalien (Collaborator):

Also, I'm seeing different perf numbers when comparing to master:

name               old time/op    new time/op    delta
Marshaling-8          497ns ± 2%     501ns ± 1%     ~     (p=0.310 n=5+5)
Unmarshaling-8       3.55µs ±15%    2.88µs ± 4%  -18.95%  (p=0.008 n=5+5)
LinkScan-8            547ns ± 2%     729ns ± 1%  +33.13%  (p=0.008 n=5+5)
Deferred-8           1.30µs ± 2%    1.49µs ± 2%  +14.80%  (p=0.008 n=5+5)
MapMarshaling-8       281ns ± 2%     348ns ± 2%  +23.88%  (p=0.008 n=5+5)
MapUnmarshaling-8    3.10µs ±13%    2.51µs ± 4%  -19.03%  (p=0.008 n=5+5)

What version of Go are you using? (I'm using 1.18, on an Intel i7 laptop.)

@Stebalien merged commit 98fa825 into whyrusleeping:master on Mar 23, 2022
@iand commented Mar 24, 2022

> What version of Go are you using? (I'm using 1.18, on an Intel i7 laptop.)

1.18, Linux, Intel i7 desktop
