Encode structs directly to output buffer. #519

benluddy · 2024-04-12T19:12:15Z

Description

For variable-length structs (structs with omitempty fields), encoding to the unused capacity at the
end of the output buffer while counting nonempty items is cheaper than using a separate temporary
buffer (no pool interactions and better spatial locality). Copying the items can be avoided entirely
by reserving space in the output buffer for the head if the encoded length of the head can be
predicted before checking optional fields.

                                                                     │ before.txt  │              after.txt              │
                                                                     │   sec/op    │   sec/op     vs base                │
Marshal/Go_struct_to_CBOR_map                                          1.404µ ± 0%   1.408µ ± 1%        ~ (p=0.170 n=10)
Marshal/Go_struct_many_fields_all_omitempty_all_empty_to_CBOR_map      443.8n ± 0%   430.6n ± 0%   -2.99% (p=0.000 n=10)
Marshal/Go_struct_some_fields_all_omitempty_all_empty_to_CBOR_map      181.7n ± 0%   163.5n ± 0%  -10.04% (p=0.000 n=10)
Marshal/Go_struct_many_fields_all_omitempty_all_nonempty_to_CBOR_map   813.5n ± 0%   784.8n ± 0%   -3.53% (p=0.000 n=10)
Marshal/Go_struct_some_fields_all_omitempty_all_nonempty_to_CBOR_map   300.8n ± 0%   275.4n ± 0%   -8.43% (p=0.000 n=10)
Marshal/Go_struct_many_fields_one_omitempty_to_CBOR_map                763.8n ± 0%   727.7n ± 0%   -4.73% (p=0.000 n=10)
Marshal/Go_struct_some_fields_one_omitempty_to_CBOR_map                284.2n ± 0%   257.6n ± 0%   -9.36% (p=0.000 n=10)
Marshal/Go_struct_keyasint_to_CBOR_map                                 1.422µ ± 0%   1.414µ ± 1%   -0.56% (p=0.029 n=10)
Marshal/Go_struct_toarray_to_CBOR_array                                1.341µ ± 1%   1.338µ ± 1%        ~ (p=0.340 n=10)
MarshalCanonical/Go_struct_to_CBOR_map                                 386.4n ± 0%   392.4n ± 0%   +1.57% (p=0.000 n=10)
MarshalCanonical/Go_struct_to_CBOR_map_canonical                       386.9n ± 0%   384.8n ± 0%   -0.52% (p=0.001 n=10)
geomean                                                                560.5n        540.4n        -3.59%

                                                                     │ before.txt │              after.txt              │
                                                                     │    B/op    │    B/op     vs base                 │
Marshal/Go_struct_to_CBOR_map                                          208.0 ± 0%   208.0 ± 0%       ~ (p=1.000 n=10) ¹
Marshal/Go_struct_many_fields_all_omitempty_all_empty_to_CBOR_map      1.000 ± 0%   1.000 ± 0%       ~ (p=1.000 n=10) ¹
Marshal/Go_struct_some_fields_all_omitempty_all_empty_to_CBOR_map      1.000 ± 0%   1.000 ± 0%       ~ (p=1.000 n=10) ¹
Marshal/Go_struct_many_fields_all_omitempty_all_nonempty_to_CBOR_map   176.0 ± 0%   176.0 ± 0%       ~ (p=1.000 n=10) ¹
Marshal/Go_struct_some_fields_all_omitempty_all_nonempty_to_CBOR_map   48.00 ± 0%   48.00 ± 0%       ~ (p=1.000 n=10) ¹
Marshal/Go_struct_many_fields_one_omitempty_to_CBOR_map                160.0 ± 0%   160.0 ± 0%       ~ (p=1.000 n=10) ¹
Marshal/Go_struct_some_fields_one_omitempty_to_CBOR_map                48.00 ± 0%   48.00 ± 0%       ~ (p=1.000 n=10) ¹
Marshal/Go_struct_keyasint_to_CBOR_map                                 192.0 ± 0%   192.0 ± 0%       ~ (p=1.000 n=10) ¹
Marshal/Go_struct_toarray_to_CBOR_array                                192.0 ± 0%   192.0 ± 0%       ~ (p=1.000 n=10) ¹
MarshalCanonical/Go_struct_to_CBOR_map                                 64.00 ± 0%   64.00 ± 0%       ~ (p=1.000 n=10) ¹
MarshalCanonical/Go_struct_to_CBOR_map_canonical                       64.00 ± 0%   64.00 ± 0%       ~ (p=1.000 n=10) ¹
geomean                                                                46.18        46.18       +0.00%
¹ all samples are equal

                                                                     │ before.txt │              after.txt              │
                                                                     │ allocs/op  │ allocs/op   vs base                 │
Marshal/Go_struct_to_CBOR_map                                          1.000 ± 0%   1.000 ± 0%       ~ (p=1.000 n=10) ¹
Marshal/Go_struct_many_fields_all_omitempty_all_empty_to_CBOR_map      1.000 ± 0%   1.000 ± 0%       ~ (p=1.000 n=10) ¹
Marshal/Go_struct_some_fields_all_omitempty_all_empty_to_CBOR_map      1.000 ± 0%   1.000 ± 0%       ~ (p=1.000 n=10) ¹
Marshal/Go_struct_many_fields_all_omitempty_all_nonempty_to_CBOR_map   1.000 ± 0%   1.000 ± 0%       ~ (p=1.000 n=10) ¹
Marshal/Go_struct_some_fields_all_omitempty_all_nonempty_to_CBOR_map   1.000 ± 0%   1.000 ± 0%       ~ (p=1.000 n=10) ¹
Marshal/Go_struct_many_fields_one_omitempty_to_CBOR_map                1.000 ± 0%   1.000 ± 0%       ~ (p=1.000 n=10) ¹
Marshal/Go_struct_some_fields_one_omitempty_to_CBOR_map                1.000 ± 0%   1.000 ± 0%       ~ (p=1.000 n=10) ¹
Marshal/Go_struct_keyasint_to_CBOR_map                                 1.000 ± 0%   1.000 ± 0%       ~ (p=1.000 n=10) ¹
Marshal/Go_struct_toarray_to_CBOR_array                                1.000 ± 0%   1.000 ± 0%       ~ (p=1.000 n=10) ¹
MarshalCanonical/Go_struct_to_CBOR_map                                 1.000 ± 0%   1.000 ± 0%       ~ (p=1.000 n=10) ¹
MarshalCanonical/Go_struct_to_CBOR_map_canonical                       1.000 ± 0%   1.000 ± 0%       ~ (p=1.000 n=10) ¹
geomean                                                                1.000        1.000       +0.00%
¹ all samples are equal

PR Was Proposed and Welcomed in Currently Open Issue

This PR was proposed and welcomed by maintainer(s) in issue #___
Closes or Updates Issue #___

Checklist (for code PR only, ignore for docs PR)

Include unit tests that cover the new code
Pass all unit tests
Pass all lint checks in CI (goimports, gosec, staticcheck, etc.)
Sign each commit with your real name and email.
Last line of each commit message should be in this format:
Signed-off-by: Firstname Lastname firstname.lastname@example.com
Certify the Developer's Certificate of Origin 1.1
(see next section).

Certify the Developer's Certificate of Origin 1.1

By marking this item as completed, I certify
the Developer Certificate of Origin 1.1.

Developer Certificate of Origin
Version 1.1

Copyright (C) 2004, 2006 The Linux Foundation and its contributors.
660 York Street, Suite 102,
San Francisco, CA 94110 USA

Everyone is permitted to copy and distribute verbatim copies of this
license document, but changing it is not allowed.

Developer's Certificate of Origin 1.1

By making a contribution to this project, I certify that:

(a) The contribution was created in whole or in part by me and I
    have the right to submit it under the open source license
    indicated in the file; or

(b) The contribution is based upon previous work that, to the best
    of my knowledge, is covered under an appropriate open source
    license and I have the right under that license to submit that
    work with modifications, whether created in whole or in part
    by me, under the same open source license (unless I am
    permitted to submit under a different license), as indicated
    in the file; or

(c) The contribution was provided directly to me by some other
    person who certified (a), (b) or (c) and I have not modified
    it.

(d) I understand and agree that this project and the contribution
    are public and that a record of the contribution (including all
    personal information I submit with it, including my sign-off) is
    maintained indefinitely and may be redistributed consistent with
    this project or the open source license(s) involved.

fxamacker

@benluddy Thanks for opening this PR!

In addition to getting the non-empty field count in the first pass, what do you think about also collecting the non-empty fields ([]*reflect.Value) in the same pass, so we don't need to perform the same operations in both passes?

For example:

* in the first pass, get the non-empty field count `kvcount`, and also create and populate a slice of non-empty field values `fvs []*reflect.Value` on the stack.
* in the second pass, encode each field for which `fvs[i] != nil`.

Benchmarks show more improvement when updated with these changes.

Thoughts?

benluddy · 2024-04-23T01:59:09Z

Instead of making two passes, it now encodes the items to the output buffer while counting, encodes the head at the end, and uses excess capacity in the output buffer to swap the positions of the encoded head and the encoded items. This turned out to be faster. Then I realized that you'll usually have a variable-length struct whose head encodes to the same number of bytes regardless of the number of items (e.g. any struct with fewer than 24 fields). In that case, it reserves space in the output buffer for the head, encodes the items, then overwrites the head bytes in the output buffer at the end, once it knows the actual map size.
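
To illustrate the single-byte case described above, here is a rough sketch under stated assumptions; it is not the code in this PR. The `item` type and `encodeSmallMap` are hypothetical, and each item carries pre-encoded CBOR bytes so that only the head handling is shown. It relies on the CBOR rule that a map with fewer than 24 entries has a one-byte head (`0xa0 | count`).

```go
package main

import "fmt"

// item is a hypothetical stand-in for one struct field: its pre-encoded
// CBOR key/value bytes and whether omitempty causes it to be skipped.
type item struct {
	encoded []byte
	omit    bool
}

// encodeSmallMap appends a CBOR map to out. It assumes fewer than 24 items,
// so the head is always exactly one byte regardless of how many are omitted.
func encodeSmallMap(out []byte, items []item) []byte {
	if len(items) >= 24 {
		panic("sketch only handles heads that fit in one byte")
	}
	headPos := len(out)
	out = append(out, 0) // reserve one byte for the head

	n := 0
	for _, it := range items {
		if it.omit {
			continue
		}
		out = append(out, it.encoded...)
		n++
	}

	// The item count is now known: overwrite the reserved byte.
	out[headPos] = 0xa0 | byte(n) // major type 5 (map), count in low 5 bits
	return out
}

func main() {
	items := []item{
		{encoded: []byte{0x61, 'a', 0x01}},             // "a": 1
		{encoded: []byte{0x61, 'b', 0x02}, omit: true}, // "b": 2, omitted
		{encoded: []byte{0x61, 'c', 0x03}},             // "c": 3
	}
	fmt.Printf("% x\n", encodeSmallMap(nil, items)) // a2 61 61 01 61 63 03
}
```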

fxamacker

@benluddy Thanks for updating this PR! The one pass approach sounds good! 👍

I have some suggestions to simplify the code for your consideration:

* have a utility function to return length of encoded map head with element count
* only cache `maxHeadLen` in `encodingStructType`
* in `encodeStruct()`, we can:
  * reserve bytes of `maxHeadLen` before encoding elements
  * overwrite the reserved bytes with the real head after elements are encoded
  * if real head len < max head len, shift encoded elements to the left and truncate underlying buffer

Thoughts?
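
The suggestions above could look roughly like the following sketch. It is an illustration under assumptions, not the library's implementation: `item`, `mapHeadLen`, `putMapHead`, and `encodeMap` are hypothetical names, and items are pre-encoded so that only the reserve, overwrite, and shift steps are shown. Head lengths follow the CBOR encoding rules (1, 2, 3, 5, or 9 bytes depending on the count).

```go
package main

import (
	"encoding/binary"
	"fmt"
)

// item is a hypothetical stand-in for one struct field: its pre-encoded
// CBOR key/value bytes and whether omitempty causes it to be skipped.
type item struct {
	encoded []byte
	omit    bool
}

// mapHeadLen returns the length in bytes of the CBOR head of a map with n entries.
func mapHeadLen(n uint64) int {
	switch {
	case n < 24:
		return 1
	case n <= 0xff:
		return 2
	case n <= 0xffff:
		return 3
	case n <= 0xffffffff:
		return 5
	default:
		return 9
	}
}

// putMapHead writes the head for n entries into dst; len(dst) must equal mapHeadLen(n).
func putMapHead(dst []byte, n uint64) {
	const mapType = 0xa0 // CBOR major type 5 (map) in the top three bits
	switch len(dst) {
	case 1:
		dst[0] = mapType | byte(n)
	case 2:
		dst[0] = mapType | 24
		dst[1] = byte(n)
	case 3:
		dst[0] = mapType | 25
		binary.BigEndian.PutUint16(dst[1:], uint16(n))
	case 5:
		dst[0] = mapType | 26
		binary.BigEndian.PutUint32(dst[1:], uint32(n))
	default:
		dst[0] = mapType | 27
		binary.BigEndian.PutUint64(dst[1:], n)
	}
}

// encodeMap reserves the worst-case head, encodes the items, writes the real
// head, and shifts the items left if the real head is shorter than reserved.
func encodeMap(out []byte, items []item) []byte {
	maxHeadLen := mapHeadLen(uint64(len(items)))
	start := len(out)
	out = append(out, make([]byte, maxHeadLen)...)

	var n uint64
	for _, it := range items {
		if it.omit {
			continue
		}
		out = append(out, it.encoded...)
		n++
	}

	headLen := mapHeadLen(n)
	putMapHead(out[start:start+headLen], n)
	if gap := maxHeadLen - headLen; gap > 0 {
		// Overlapping copy: move the items left over the unused head bytes.
		copy(out[start+headLen:], out[start+maxHeadLen:])
		out = out[:len(out)-gap]
	}
	return out
}

func main() {
	items := make([]item, 24) // 24 fields: the worst-case head is 2 bytes
	for i := range items {
		// Key i and value i, each a one-byte CBOR unsigned integer.
		items[i] = item{encoded: []byte{byte(i), byte(i)}, omit: i >= 2}
	}
	fmt.Printf("% x\n", encodeMap(nil, items)) // a2 00 00 01 01
}
```

In this sketch the scratch space is bounded by `maxHeadLen`, at most 9 bytes, rather than by the encoded size of the items.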

benluddy · 2024-04-29T15:12:40Z

> @benluddy Thanks for updating this PR! The one pass approach sounds good! 👍
>
> I have some suggestions to simplify the code for your consideration:
>
> * have a utility function to return length of encoded map head with element count
> * only cache `maxHeadLen` in `encodingStructType`
> * in `encodeStruct()`, we can:
>   * reserve bytes of `maxHeadLen` before encoding elements
>   * overwrite the reserved bytes with the real head after elements are encoded
>   * if real head len < max head len, shift encoded elements to the left and truncate underlying buffer
>
> Thoughts?

Absolutely! For some reason I was afraid of overlapping copies, but they are clearly safe according to the spec (https://go.dev/ref/spec#Appending_and_copying_slices). I'll implement your suggestions and rerun the benchmarks. Thanks!
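
As a small illustration of the overlapping-copy guarantee mentioned above: the Go spec states that the source and destination of `copy` may overlap. The helper `shiftLeft` is hypothetical and only demonstrates the left shift within a single slice.

```go
package main

import "fmt"

// shiftLeft is a hypothetical helper: it removes gap bytes starting at index
// from by copying the tail of b left over them, then truncates b.
func shiftLeft(b []byte, from, gap int) []byte {
	n := copy(b[from:], b[from+gap:]) // source and destination overlap
	return b[:from+n]
}

func main() {
	b := []byte{0xa1, 0x00, 0x01, 0x02, 0x03}
	fmt.Printf("% x\n", shiftLeft(b, 1, 1)) // a1 01 02 03
}
```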

Signed-off-by: Ben Luddy <bluddy@redhat.com>

benluddy · 2024-04-29T16:01:23Z

@fxamacker I just pushed those changes. Benchmarks look good! As expected, it's a bit faster on the interesting cases by avoiding the extra copies, and the worst-case scratch buffer space needed is only a few bytes instead of being proportional to the encoded size of the entire map:

                                                                     │  prev.txt   │              next.txt              │
                                                                     │   sec/op    │   sec/op     vs base               │
Marshal/Go_struct_to_CBOR_map                                          1.403µ ± 1%   1.408µ ± 1%       ~ (p=0.672 n=10)
Marshal/Go_struct_many_fields_all_omitempty_all_empty_to_CBOR_map      425.5n ± 0%   430.6n ± 0%  +1.20% (p=0.000 n=10)
Marshal/Go_struct_some_fields_all_omitempty_all_empty_to_CBOR_map      163.5n ± 0%   163.5n ± 0%       ~ (p=0.283 n=10)
Marshal/Go_struct_many_fields_all_omitempty_all_nonempty_to_CBOR_map   800.4n ± 0%   784.8n ± 0%  -1.94% (p=0.000 n=10)
Marshal/Go_struct_some_fields_all_omitempty_all_nonempty_to_CBOR_map   278.7n ± 0%   275.4n ± 0%  -1.15% (p=0.000 n=10)
Marshal/Go_struct_many_fields_one_omitempty_to_CBOR_map                730.9n ± 0%   727.7n ± 0%  -0.43% (p=0.000 n=10)
Marshal/Go_struct_some_fields_one_omitempty_to_CBOR_map                259.3n ± 0%   257.6n ± 0%  -0.64% (p=0.000 n=10)
Marshal/Go_struct_keyasint_to_CBOR_map                                 1.414µ ± 1%   1.414µ ± 1%       ~ (p=0.445 n=10)
Marshal/Go_struct_toarray_to_CBOR_array                                1.352µ ± 1%   1.338µ ± 1%  -1.07% (p=0.007 n=10)
MarshalCanonical/Go_struct_to_CBOR_map                                 392.5n ± 0%   392.4n ± 0%       ~ (p=0.514 n=10)
MarshalCanonical/Go_struct_to_CBOR_map_canonical                       393.7n ± 0%   384.8n ± 0%  -2.25% (p=0.001 n=10)
geomean                                                                543.3n        540.4n       -0.54%

One benchmark case appeared to regress, but I'm convinced I missed an interfering background process during that run. Re-running that case gave results on par with the previous implementation:

                                                                     │    0.txt    │                1.txt                 │
                                                                     │   sec/op    │   sec/op     vs base                 │
Marshal/Go_struct_many_fields_all_omitempty_all_empty_to_CBOR_map      425.5n ± 0%   425.9n ± 0%  +0.09% (p=0.015 n=10)

fxamacker

Thanks @benluddy for updating this PR and sharing benchmarks! 👍 LGTM!

This commit removes encodeFixedLengthStruct() and reuses encodeStruct() to simplify the code. Previously, encodeStruct() used an extra buffer to encode elements in order to get the actual encoded element count. To avoid this overhead, encodeFixedLengthStruct() was created to encode fixed-length structs (structs without any "omitempty" fields), since the encoded element count is always known in that case. With PR #519, encodeStruct() no longer uses an extra buffer, so encodeFixedLengthStruct() isn't necessary.

fxamacker reviewed Apr 21, 2024

fxamacker added the performance label Apr 21, 2024

fxamacker added this to the v2.7.0 milestone Apr 22, 2024

benluddy force-pushed the struct-encode-directly branch 2 times, most recently from 2a344fa to 020398e Compare April 23, 2024 01:52

fxamacker reviewed Apr 28, 2024

benluddy force-pushed the struct-encode-directly branch from 020398e to d981dec Compare April 29, 2024 15:50

benluddy requested a review from fxamacker April 29, 2024 16:01

fxamacker approved these changes May 4, 2024

fxamacker merged commit 28a8572 into fxamacker:master May 4, 2024
17 checks passed

fxamacker mentioned this pull request May 5, 2024

Refactor to reuse functions and improve code coverage #531

Merged
