
Add CborReader and CborWriter #67

Merged
5 commits merged into whyrusleeping:master on Mar 23, 2022

Conversation

@iand (Contributor) commented Mar 22, 2022

This gives a better tradeoff between CPU and allocations than #65, so I am closing that PR in favour of this one.

Adds CborReader and CborWriter types that contain a small internal buffer for optimizing CBOR header reads and writes. In contrast to #65, no pool is used for header buffers; instead the reader or writer reuses a single scratch buffer across multiple operations. A package-level pool is used for the 8k buffers needed when reading strings. The benchmarks show an increase in CPU time, but allocations are substantially reduced.
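
For illustration, the writer side of this pattern looks roughly like the sketch below. This is not the actual cbor-gen API; the package, type, and function names are invented for the sketch, but it shows the idea of a writer owning a small fixed-size scratch array so encoding a CBOR header needs no per-call allocation.

package cborsketch // illustrative package, not part of cbor-gen

import (
	"encoding/binary"
	"io"
)

// scratchWriter sketches the CborWriter idea: the writer owns a small
// fixed-size buffer that is reused for every header it encodes, so headers
// need neither a fresh allocation nor a pool round-trip.
type scratchWriter struct {
	w       io.Writer
	scratch [9]byte // 1 byte of major type/additional info + up to 8 bytes of argument
}

// writeHeader encodes a CBOR header (major type in the top 3 bits, argument
// in the low bits or the following bytes) into the reused scratch buffer and
// emits it with a single Write.
func (sw *scratchWriter) writeHeader(major byte, extra uint64) error {
	buf := sw.scratch[:]
	var n int
	switch {
	case extra < 24:
		buf[0] = major<<5 | byte(extra)
		n = 1
	case extra < 1<<8:
		buf[0] = major<<5 | 24
		buf[1] = byte(extra)
		n = 2
	case extra < 1<<16:
		buf[0] = major<<5 | 25
		binary.BigEndian.PutUint16(buf[1:], uint16(extra))
		n = 3
	case extra < 1<<32:
		buf[0] = major<<5 | 26
		binary.BigEndian.PutUint32(buf[1:], uint32(extra))
		n = 5
	default:
		buf[0] = major<<5 | 27
		binary.BigEndian.PutUint64(buf[1:], extra)
		n = 9
	}
	_, err := sw.w.Write(buf[:n])
	return err
}

A CborReader can use the same idea in reverse, reading header bytes into its own scratch array instead of allocating one per call.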

Benchstat comparison with master:

name               old time/op    new time/op    delta
Marshaling-8          564ns ± 1%     840ns ± 0%  +49.06%  (p=0.000 n=9+9)
Unmarshaling-8       2.75µs ± 5%    3.07µs ± 9%  +11.71%  (p=0.000 n=10+10)
LinkScan-8            744ns ± 0%     932ns ± 1%  +25.26%  (p=0.000 n=10+10)
Deferred-8           1.68µs ± 1%    1.89µs ± 2%  +12.57%  (p=0.000 n=10+10)
MapMarshaling-8       317ns ± 0%     441ns ± 0%  +39.21%  (p=0.000 n=9+9)
MapUnmarshaling-8    2.71µs ±10%    3.07µs ±12%  +13.50%  (p=0.000 n=10+10)

name               old alloc/op   new alloc/op   delta
Marshaling-8           160B ± 0%       64B ± 0%  -60.00%  (p=0.000 n=10+10)
Unmarshaling-8       3.44kB ± 0%    2.03kB ± 0%  -41.08%  (p=0.000 n=10+10)
LinkScan-8             112B ± 0%      112B ± 0%     ~     (all equal)
Deferred-8            88.0B ± 0%     88.0B ± 0%     ~     (all equal)
MapMarshaling-8       48.0B ± 0%     64.0B ± 0%  +33.33%  (p=0.000 n=10+10)
MapUnmarshaling-8    2.53kB ± 0%    1.60kB ± 0%  -36.63%  (p=0.000 n=10+10)

name               old allocs/op  new allocs/op  delta
Marshaling-8           10.0 ± 0%       2.0 ± 0%  -80.00%  (p=0.000 n=10+10)
Unmarshaling-8         43.0 ± 0%      23.0 ± 0%  -46.51%  (p=0.000 n=10+10)
LinkScan-8             1.00 ± 0%      1.00 ± 0%     ~     (all equal)
Deferred-8             3.00 ± 0%      3.00 ± 0%     ~     (all equal)
MapMarshaling-8        5.00 ± 0%      4.00 ± 0%  -20.00%  (p=0.000 n=10+10)
MapUnmarshaling-8      56.0 ± 0%      29.0 ± 0%  -48.21%  (p=0.000 n=10+10)

There are probably some additional opportunities for refactoring around the readers/writers, but I'm leaving this change focused on the initial performance improvements.

Adds pools for small buffers used for reading CBOR headers and for larger buffers used when reading strings. This trades some CPU for less pressure on the garbage collector. The benchmarks show a notable increase in CPU time, but allocations are amortized to near zero in many cases.

Internalising the management of scratch buffers simplifies the code and
allows removal/deprecation of duplicate implementations for several functions.

Users will need to re-generate marshaling methods to benefit from the removal
of scratch buffers from those methods.
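
As a rough sketch of the string-buffer pooling described above (the 8192-byte size comes from the description; the function and variable names are illustrative, not the actual cbor-gen internals):

package cborsketch // illustrative, not the actual cbor-gen code

import (
	"io"
	"sync"
)

// stringBufPool sketches the package-level pool of 8k buffers.
var stringBufPool = sync.Pool{
	New: func() any { return make([]byte, 8192) },
}

// readString reads a CBOR text payload of known length, borrowing a pooled
// buffer for small payloads so the common case allocates nothing beyond the
// resulting string. Larger payloads fall back to a one-off allocation.
func readString(r io.Reader, length uint64) (string, error) {
	buf := stringBufPool.Get().([]byte)
	defer stringBufPool.Put(buf) // string() below copies, so returning the buffer is safe

	if length <= uint64(len(buf)) {
		if _, err := io.ReadFull(r, buf[:length]); err != nil {
			return "", err
		}
		return string(buf[:length]), nil
	}

	out := make([]byte, length)
	if _, err := io.ReadFull(r, out); err != nil {
		return "", err
	}
	return string(out), nil
}

This internal management of buffers is also why regenerating the marshaling methods matters: the generated code no longer needs to thread its own scratch buffers through every call.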

Benchstat comparison with master:

name               old time/op    new time/op    delta
Marshaling-8          564ns ± 1%    1123ns ± 3%   +99.16%  (p=0.000 n=9+10)
Unmarshaling-8       2.75µs ± 5%    3.53µs ± 4%   +28.63%  (p=0.000 n=10+10)
LinkScan-8            744ns ± 0%    1694ns ± 1%  +127.69%  (p=0.000 n=10+9)
Deferred-8           1.68µs ± 1%    3.90µs ± 0%  +131.76%  (p=0.000 n=10+9)
MapMarshaling-8       317ns ± 0%     667ns ± 2%  +110.55%  (p=0.000 n=9+10)
MapUnmarshaling-8    2.71µs ±10%    3.33µs ± 3%   +23.26%  (p=0.000 n=10+10)

name               old alloc/op   new alloc/op   delta
Marshaling-8           160B ± 0%        0B       -100.00%  (p=0.000 n=10+10)
Unmarshaling-8       3.44kB ± 0%    1.96kB ± 0%   -42.94%  (p=0.000 n=10+10)
LinkScan-8             112B ± 0%        0B       -100.00%  (p=0.000 n=10+10)
Deferred-8            88.0B ± 0%     72.0B ± 0%   -18.18%  (p=0.000 n=10+10)
MapMarshaling-8       48.0B ± 0%      2.0B ± 0%   -95.83%  (p=0.000 n=10+10)
MapUnmarshaling-8    2.53kB ± 0%    1.54kB ± 0%   -39.15%  (p=0.000 n=10+10)

name               old allocs/op  new allocs/op  delta
Marshaling-8           10.0 ± 0%       0.0       -100.00%  (p=0.000 n=10+10)
Unmarshaling-8         43.0 ± 0%      21.0 ± 0%   -51.16%  (p=0.000 n=10+10)
LinkScan-8             1.00 ± 0%      0.00       -100.00%  (p=0.000 n=10+10)
Deferred-8             3.00 ± 0%      2.00 ± 0%   -33.33%  (p=0.000 n=10+10)
MapMarshaling-8        5.00 ± 0%      2.00 ± 0%   -60.00%  (p=0.000 n=10+10)
MapUnmarshaling-8      56.0 ± 0%      27.0 ± 0%   -51.79%  (p=0.000 n=10+10)
@iand (Author) commented Mar 22, 2022

@Stebalien @whyrusleeping

@iand changed the title from "Chore/pool shared" to "Add CborReader and CborWriter" on Mar 22, 2022
utils.go, comment on lines +586 to +588:
for i := range buf {
buf[i] = 0
}
Collaborator:

But we don't really need this, do we?

Collaborator:

(although I guess it likely doesn't matter)

@iand (Author):

It's good practice to clear the buffer before returning it to the pool, although we're currently careful not to read past bounds, so it's not strictly necessary.
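
For reference, the clear-before-return pattern being discussed looks roughly like this (a sketch continuing the illustrative package above, so the sync import is assumed; it is not the exact code at these lines):

var pooledBufs = sync.Pool{
	New: func() any { return make([]byte, 8192) },
}

// putBuf zeroes the buffer before handing it back, so bytes left over from a
// previous read can never leak into a later caller that under-fills it.
func putBuf(buf []byte) {
	for i := range buf {
		buf[i] = 0
	}
	pooledBufs.Put(buf)
}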

This gives a small improvement in some cases, but an 8% improvement in map marshaling speed.

It doesn't look like this escapes, but removing it saves ~5% for LinkScan, Deferred, and MapMarshaling.
utils.go (outdated):
@@ -614,12 +632,15 @@ func bufToCid(buf []byte) (cid.Cid, error) {
var byteArrZero = []byte{0}

func WriteCid(w io.Writer, c cid.Cid) error {
if cw, ok := w.(*CborWriter); ok {
w = cw // take advantage of cbor writer scratch buffer
Collaborator:

This doesn't do anything.

@iand (Author):

Yeah, I should call NewCborWriter here
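
Presumably the fix looks something like the sketch below: wrap plain writers so the header writes go through the CborWriter scratch buffer. This assumes NewCborWriter, WriteMajorTypeHeader, MajTag, and MajByteString behave as their names suggest; it is not the exact code that was pushed.

func WriteCid(w io.Writer, c cid.Cid) error {
	cw, ok := w.(*CborWriter)
	if !ok {
		// Wrap plain writers so the header writes below use the
		// CborWriter's scratch buffer instead of allocating per call.
		cw = NewCborWriter(w)
	}

	// DAG-CBOR encodes a CID as tag 42 wrapping a byte string of a zero
	// multibase prefix followed by the CID bytes.
	if err := WriteMajorTypeHeader(cw, MajTag, 42); err != nil {
		return err
	}
	data := c.Bytes()
	if err := WriteMajorTypeHeader(cw, MajByteString, uint64(len(data)+1)); err != nil {
		return err
	}
	if _, err := cw.Write(byteArrZero); err != nil {
		return err
	}
	_, err := cw.Write(data)
	return err
}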

@iand commented Mar 23, 2022

The updated comparison with master shows a much better time delta for the Marshaling benchmark:

name               old time/op    new time/op    delta
Marshaling-8          564ns ± 1%     634ns ± 1%  +12.48%  (p=0.000 n=9+10)
Unmarshaling-8       2.75µs ± 5%    3.21µs ±16%  +16.98%  (p=0.000 n=10+10)
LinkScan-8            744ns ± 0%     948ns ± 3%  +27.32%  (p=0.000 n=10+10)
Deferred-8           1.68µs ± 1%    1.89µs ± 2%  +12.68%  (p=0.000 n=10+9)
MapMarshaling-8       317ns ± 0%     438ns ± 0%  +38.38%  (p=0.000 n=9+10)
MapUnmarshaling-8    2.71µs ±10%    3.18µs ±15%  +17.66%  (p=0.000 n=10+9)

name               old alloc/op   new alloc/op   delta
Marshaling-8           160B ± 0%       64B ± 0%  -60.00%  (p=0.000 n=10+10)
Unmarshaling-8       3.44kB ± 0%    2.03kB ± 0%  -41.08%  (p=0.000 n=10+9)
LinkScan-8             112B ± 0%      112B ± 0%     ~     (all equal)
Deferred-8            88.0B ± 0%     88.0B ± 0%     ~     (all equal)
MapMarshaling-8       48.0B ± 0%     64.0B ± 0%  +33.33%  (p=0.000 n=10+10)
MapUnmarshaling-8    2.53kB ± 0%    1.60kB ± 0%  -36.63%  (p=0.000 n=10+8)

name               old allocs/op  new allocs/op  delta
Marshaling-8           10.0 ± 0%       2.0 ± 0%  -80.00%  (p=0.000 n=10+10)
Unmarshaling-8         43.0 ± 0%      23.0 ± 0%  -46.51%  (p=0.000 n=10+10)
LinkScan-8             1.00 ± 0%      1.00 ± 0%     ~     (all equal)
Deferred-8             3.00 ± 0%      3.00 ± 0%     ~     (all equal)
MapMarshaling-8        5.00 ± 0%      4.00 ± 0%  -20.00%  (p=0.000 n=10+10)
MapUnmarshaling-8      56.0 ± 0%      29.0 ± 0%  -48.21%  (p=0.000 n=10+10)

@iand commented Mar 23, 2022

Results from go-hamt-ipld, compared against master of cbor-gen. The Find benchmark exercises unmarshaling pretty well.

name                       old time/op       new time/op       delta
SerializeNode-8                 9.88µs ±13%      10.92µs ± 5%  +10.54%  (p=0.009 n=10+10)
GetNode-8                       17.8µs ± 2%       14.4µs ± 5%  -19.13%  (p=0.000 n=10+10)
Find/n=1k/bitwidth=5-8          24.0µs ± 2%       23.0µs ± 2%   -3.98%  (p=0.000 n=10+10)
Find/n=5k/bitwidth=5-8          28.3µs ± 3%       27.7µs ± 5%     ~     (p=0.065 n=9+10)
Find/n=10k/bitwidth=5-8         23.1µs ± 3%       21.7µs ± 2%   -6.07%  (p=0.000 n=10+10)
Find/n=50k/bitwidth=5-8         38.4µs ± 7%       36.1µs ± 6%   -5.83%  (p=0.015 n=10+10)
Find/n=100k/bitwidth=5-8        41.0µs ± 8%       41.2µs ± 6%     ~     (p=1.000 n=10+10)
Find/n=500k/bitwidth=5-8        30.2µs ± 3%       29.3µs ± 4%   -3.29%  (p=0.003 n=10+10)
Find/n=1000k/bitwidth=5-8       36.7µs ± 2%       35.3µs ± 1%   -3.91%  (p=0.000 n=9+8)

name                       old alloc/op      new alloc/op      delta
SerializeNode-8                 14.9kB ± 3%       16.3kB ± 3%   +9.57%  (p=0.000 n=10+10)
GetNode-8                       20.7kB ± 0%       18.7kB ± 0%   -9.61%  (p=0.000 n=10+10)
Find/n=1k/bitwidth=5-8          35.0kB ± 2%       34.7kB ± 1%     ~     (p=0.156 n=10+9)
Find/n=5k/bitwidth=5-8          35.6kB ± 4%       35.4kB ± 6%     ~     (p=0.631 n=10+10)
Find/n=10k/bitwidth=5-8         23.3kB ± 2%       22.5kB ± 2%   -3.20%  (p=0.000 n=10+10)
Find/n=50k/bitwidth=5-8         51.2kB ± 0%       50.3kB ± 1%   -1.73%  (p=0.000 n=10+10)
Find/n=100k/bitwidth=5-8        55.0kB ± 1%       54.1kB ± 1%   -1.69%  (p=0.000 n=10+10)
Find/n=500k/bitwidth=5-8        33.5kB ± 0%       32.7kB ± 0%   -2.33%  (p=0.000 n=10+10)
Find/n=1000k/bitwidth=5-8       46.3kB ± 0%       45.3kB ± 0%   -2.06%  (p=0.000 n=10+10)

name                       old allocs/op     new allocs/op     delta
SerializeNode-8                    111 ± 2%          107 ± 5%   -3.86%  (p=0.001 n=10+10)
GetNode-8                          519 ± 0%          264 ± 0%  -49.13%  (p=0.000 n=10+10)
Find/n=1k/bitwidth=5-8             478 ± 2%          398 ± 1%  -16.63%  (p=0.000 n=10+10)
Find/n=10k/bitwidth=5-8            413 ± 1%          330 ± 1%  -20.25%  (p=0.000 n=10+10)
Find/n=50k/bitwidth=5-8            728 ± 0%          600 ± 0%  -17.50%  (p=0.000 n=10+9)
Find/n=100k/bitwidth=5-8           784 ± 0%          647 ± 0%  -17.45%  (p=0.000 n=10+10)
Find/n=500k/bitwidth=5-8           600 ± 0%          479 ± 0%  -20.17%  (p=0.000 n=10+10)
Find/n=1000k/bitwidth=5-8          743 ± 0%          602 ± 0%  -19.04%  (p=0.000 n=9+10)

@Stebalien (Collaborator) left a comment:

LGTM.

@Stebalien (Collaborator):

Also, I'm seeing different perf numbers when comparing to master:

name               old time/op    new time/op    delta
Marshaling-8          497ns ± 2%     501ns ± 1%     ~     (p=0.310 n=5+5)
Unmarshaling-8       3.55µs ±15%    2.88µs ± 4%  -18.95%  (p=0.008 n=5+5)
LinkScan-8            547ns ± 2%     729ns ± 1%  +33.13%  (p=0.008 n=5+5)
Deferred-8           1.30µs ± 2%    1.49µs ± 2%  +14.80%  (p=0.008 n=5+5)
MapMarshaling-8       281ns ± 2%     348ns ± 2%  +23.88%  (p=0.008 n=5+5)
MapUnmarshaling-8    3.10µs ±13%    2.51µs ± 4%  -19.03%  (p=0.008 n=5+5)

What version of Go are you using? (I'm using 1.18, on an Intel i7 laptop.)

@Stebalien merged commit 98fa825 into whyrusleeping:master on Mar 23, 2022
@iand commented Mar 24, 2022

> What version of Go are you using? (I'm using 1.18, on an Intel i7 laptop.)

1.18, Linux, Intel i7 desktop
