Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Shrink derive output size #2250

Closed

Conversation

nnethercote
Copy link

@nnethercote nnethercote commented Jul 18, 2022

This PR has a series of commits that shrink the size of the code generated for derive(Serialize, Deserialize), particularly for large structs.

Here are some instruction count results for some synthetic benchmarks.

Benchmark Profile Scenario % Change Significance Factor?
big300 check full -31.80% 159.00x
big300 debug full -30.15% 150.75x
big300 opt full -29.99% 149.96x
big200 check full -29.44% 147.19x
big200 debug full -26.94% 134.69x
big200 opt full -26.83% 134.15x
big100 check full -25.50% 127.48x
big100 debug full -21.02% 105.11x
big100 opt full -20.73% 103.64x
derive-serde check full -6.50% 32.50x
derive-serde opt full -3.97% 19.85x
derive-serde debug full -3.76% 18.82x

big{1,2,3}00 each contain a single struct with {100,200,300} u32 fields that is annotated with derive(Serialize, Deserialize).

derive-serde contains a variety of structs and enums. It is identical to https://github.com/rust-lang/rust/blob/master/src/test/ui/deriving/deriving-all-codegen.rs except with the builtin trait derives replaced with derive(Serialize, Deserialize).

serde/src/de/mod.rs Outdated Show resolved Hide resolved
@Mingun
Copy link
Contributor

Mingun commented Jul 18, 2022

What the table headers mean?

@nnethercote
Copy link
Author

The headings are those used in https://perf.rust-lang.org/compare.html. The important one is the "% Change".

For a struct like this:
```
struct Small {
    x: u32,
    y: u32,
    #[serde(skip_serializing_if = "Option::is_none")]
    maybe: Option<&'static str>,
}
```
it will change the field count computation from this:
```
false as usize + 1 + 1 + if Option::is_none(&self.maybe) { 0 } else { 1 }
```
to this:
```
2usize + if Option::is_none(&self.maybe) { 0 } else { 1 }
```
For fields with many fields, the difference will be bigger, avoiding long
sequences of `1 + 1 + 1 + ...`. This is a small compile-time win.
This turns two matches into one, for each field in a struct.
This turns two matches into one, for each field in a struct.
This turns two matches into one, for each field in a struct.
/// calling `Error::invalid_length`.
///
/// This method exists purely to reduce the size of the generated code,
/// reducing compile times. `SeqAccess` implementations should not override

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any reason to implement is specifically as a method rather than a free function? Using a free function gives the desired no-overriding guarantee, and method call syntax is not used in derived code anyway.

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A method was easier, being more similar to the original function. But it's not fundamental.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @AnthonyMikh here, that this would be better as free functions in serde::__private::de module

@nnethercote
Copy link
Author

Any thoughts on whether this change is a good one? There are 32 positive emoji reactions on the description. I'm happy to make adjustments (e.g. moving that method to a free function), but it's not yet clear to me if this change is wanted by the serde developers. Thanks!

@LukasKalbertodt
Copy link

LukasKalbertodt commented Jul 26, 2022

Summary: the wins in terms of size reduction in a "real world" project that I tested are unfortunately very small.


I was very happy to see this PR as I recently found out that generated serde code makes up a big part of the final binary of a project I'm working on as part of my day job. It's a web application with Rust backend that performs various tasks and uses serde derive for lots of different stuff. (It's even open source, if anyone's interested: link). So I thought it might be a valuable data point to see how this PR performs with a "production project".

I know that this PR is mainly motivated by compile times, but I thought checking it for binary output size is interesting too.

Unfortunately, the results were not as great as I has hoped for. I tested serde 1.0.139 vs. this branch (f8cc37c), using [patch]. I performed a release build with these adjustments:

[profile.release]
debug = 1
codegen-units = 1
lto = "thin"

[profile.release.package."*"]
debug = false

We need basic debug info to better debug errors in production. And it allows me to check per symbol size! Also, some custom --remap-path-prefix flags are set and the resulting binary is compressed with objcopy --compress-debug-sections.

But first, the final file sizes (the -stripped files had strip applied to them):

23602168  smaller-serde
23595960  with-1.0.139
14708592  smaller-serde-stripped
14720880  with-1.0.139-stripped

With this patch, the binary including debug info is even larger than with 1.0.139. That was unexpected! At least the stripped one shows a small win. I tried only checking the size of serde functions. I did that via this command chain:

llvm-nm-10 --print-size --size-sort --radix=d <binary> 
    | rg -F <filter>
    | cut -d ' ' -f 2 
    | awk '{s+=$1} END {print s}'

Here is the result matrix for the two variables <filter> and <binary>. Note that my llvm-nm-10 did not demangle the symbols, that's the reason for the .. there.

Filter: serde serde..de..Deserialize serde..ser..Serialize
with-1.0.139 809,524 525,434 56,330
smaller-serde 796,166 510,409 56,330

That's a roughly 13KiB reduction = 98.3% of the original size. A win is a win, but I had hoped for more and I feel like there are a lot of size optimization that could still be done.

I also looked at the biggest functions to see what changed there.

20 largest serde functions

with-1.0.139:

 6834 <core::marker::PhantomData<T> as serde::de::DeserializeSeed>::deserialize
 6936 <core::marker::PhantomData<T> as serde::de::DeserializeSeed>::deserialize
 7143 <tobira::search::confique_partial_meili_config::_::<impl serde::de::Deserialize for tobira::search::confique_partial_meili_config::PartialMeiliConfig>::deserialize::__Visitor as serde::de::Visitor>::visit_map
 7536 <core::marker::PhantomData<T> as serde::de::DeserializeSeed>::deserialize
 7595 <meilisearch_sdk::errors::_::<impl serde::de::Deserialize for meilisearch_sdk::errors::ErrorCode>::deserialize::__FieldVisitor as serde::de::Visitor>::visit_bytes
 7771 <tobira::config::theme::confique_partial_theme_config::_::<impl serde::de::Deserialize for tobira::config::theme::confique_partial_theme_config::PartialThemeConfig>::deserialize::__Visitor as serde::de::Visitor>::visit_map
 7977 <toml::de::MapVisitor as serde::de::Deserializer>::deserialize_any
 8026 <tobira::config::general::confique_partial_general_config::_::<impl serde::de::Deserialize for tobira::config::general::confique_partial_general_config::PartialGeneralConfig>::deserialize::__Visitor as serde::de::Visitor>::visit_map
10939 <serde::de::impls::OptionVisitor<T> as serde::de::Visitor>::visit_some
11060 <core::marker::PhantomData<T> as serde::de::DeserializeSeed>::deserialize
13154 <tobira::auth::confique_partial_auth_config::_::<impl serde::de::Deserialize for tobira::auth::confique_partial_auth_config::PartialAuthConfig>::deserialize::__Visitor as serde::de::Visitor>::visit_map
13461 <tobira::search::event::_::<impl serde::de::Deserialize for tobira::search::event::Event>::deserialize::__Visitor as serde::de::Visitor>::visit_map
15906 <toml::de::ValueDeserializer as serde::de::Deserializer>::deserialize_struct
17591 meilisearch_sdk::tasks::_::<impl serde::de::Deserialize for meilisearch_sdk::tasks::TaskType>::deserialize
19177 tobira::sync::harvest::response::_::<impl serde::de::Deserialize for tobira::sync::harvest::response::HarvestItem>::deserialize
20449 <toml::de::MapVisitor as serde::de::MapAccess>::next_value_seed
22081 meilisearch_sdk::tasks::_::<impl serde::de::Deserialize for meilisearch_sdk::tasks::Task>::deserialize
24697 <serde::__private::de::content::ContentDeserializer<E> as serde::de::Deserializer>::deserialize_any
26080 <juniper::http::GraphQLResponse<T> as serde::ser::Serialize>::serialize
95307 <tobira::config::confique_partial_config::_::<impl serde::de::Deserialize for tobira::config::confique_partial_config::PartialConfig>::deserialize::__Visitor as serde::de::Visitor>::visit_map

smaller-serde

 6605 <tobira::sync::confique_partial_sync_config::_::<impl serde::de::Deserialize for tobira::sync::confique_partial_sync_config::PartialSyncConfig>::deserialize::__Visitor as serde::de::Visitor>::visit_map
 6607 <core::marker::PhantomData<T> as serde::de::DeserializeSeed>::deserialize
 7143 <tobira::search::confique_partial_meili_config::_::<impl serde::de::Deserialize for tobira::search::confique_partial_meili_config::PartialMeiliConfig>::deserialize::__Visitor as serde::de::Visitor>::visit_map
 7212 <core::marker::PhantomData<T> as serde::de::DeserializeSeed>::deserialize
 7444 <core::marker::PhantomData<T> as serde::de::DeserializeSeed>::deserialize
 7595 <meilisearch_sdk::errors::_::<impl serde::de::Deserialize for meilisearch_sdk::errors::ErrorCode>::deserialize::__FieldVisitor as serde::de::Visitor>::visit_bytes
 7744 <tobira::config::theme::confique_partial_theme_config::_::<impl serde::de::Deserialize for tobira::config::theme::confique_partial_theme_config::PartialThemeConfig>::deserialize::__Visitor as serde::de::Visitor>::visit_map
 7937 <tobira::config::general::confique_partial_general_config::_::<impl serde::de::Deserialize for tobira::config::general::confique_partial_general_config::PartialGeneralConfig>::deserialize::__Visitor as serde::de::Visitor>::visit_map
 7977 <toml::de::MapVisitor as serde::de::Deserializer>::deserialize_any
10923 <serde::de::impls::OptionVisitor<T> as serde::de::Visitor>::visit_some
11031 <core::marker::PhantomData<T> as serde::de::DeserializeSeed>::deserialize
13239 <tobira::auth::confique_partial_auth_config::_::<impl serde::de::Deserialize for tobira::auth::confique_partial_auth_config::PartialAuthConfig>::deserialize::__Visitor as serde::de::Visitor>::visit_map
13324 <tobira::search::event::_::<impl serde::de::Deserialize for tobira::search::event::Event>::deserialize::__Visitor as serde::de::Visitor>::visit_map
15929 <toml::de::ValueDeserializer as serde::de::Deserializer>::deserialize_struct
17396 meilisearch_sdk::tasks::_::<impl serde::de::Deserialize for meilisearch_sdk::tasks::TaskType>::deserialize
20435 <toml::de::MapVisitor as serde::de::MapAccess>::next_value_seed
21953 meilisearch_sdk::tasks::_::<impl serde::de::Deserialize for meilisearch_sdk::tasks::Task>::deserialize
26080 <juniper::http::GraphQLResponse<T> as serde::ser::Serialize>::serialize
32282 tobira::sync::harvest::response::_::<impl serde::de::Deserialize for tobira::sync::harvest::response::HarvestItem>::deserialize
93552 <tobira::config::confique_partial_config::_::<impl serde::de::Deserialize for tobira::config::confique_partial_config::PartialConfig>::deserialize::__Visitor as serde::de::Visitor>::visit_map

A few things:

  • The tobira::sync::harvest.... function is significantly larger with this patch (19177 -> 32282). The struct definition is here.
  • I'm not sure why there are several PhantomData as DeserializeSeed impls in there.
  • Some functions stayed exactly the same size.
  • The largest function is quite a lot larger than everything else, so it might be worth checking that out specifically. But I don't have time to dig into that now. I can roughly say that it's a a struct with 9 fields of type Option<X> where X is different per field and is always another struct with several Option fields. If anyone is interested in digging into that, I'm happy to help (as the confique thingy is by me).

Ok, I hope this investigation is useful at all. As someone with 800KiB of serde functions in my release artifact, I'm very interested in improvements in this area ^_^

@dtolnay
Copy link
Member

dtolnay commented Jul 26, 2022

Thank you Lukas, your writeup is enormously helpful in evaluating this PR. One thing I would want to look into is whether any of the "small serde" redesigns (miniserde, deser, …) are viable for your use case, or even do at all better on code size.

In response to @nnethercote's last comment: it is not unwanted, but it is unsolicited, and an unsolicited PR (i.e. not in response to a "help wanted" label or an issue discussion which I participate in) is always less likely to appear at the top of my priorities. The way to get a PR reviewed is by being the most impactful thing for me to dedicate time to according to my opinion, and my sense has been this is not that, since there are multiple other serde PRs that have been queued longer and which I would rather land over this. Recently though my biggest chunk of time is going to a major serde_yaml revamp since that crate has been neglected much worse and longer than this crate.

You called out how impressive the emoji reactions to this PR are, but that is not a signal I look at in deciding where to invest time. It is predominantly a measure of social media reach over anything else, and prioritizing PRs from whoever has the highest twitter followers, rather than whichever is the most impactful contribution to the project, is not a productive direction I think.

@lnicola
Copy link

lnicola commented Jul 26, 2022

I made some tests with rust-analyzer (it mostly uses serde via lsp-types) and the results are a mixed bag, too:

# run cargo clean, then cargo build --release

# v1.0.139, incremental = true, opt-level = 3
   text	   data	    bss	    dec	    hex	filename
25289491	1451288	   3784	26744563	19816f3	target/release/rust-analyzer

# nn, incremental = true, opt-level = 3
   text	   data	    bss	    dec	    hex	filename
25296787	1448360	   3784	26748931	1982803	target/release/rust-analyzer

# v1.0.139, incremental = true, opt-level = "s"
   text	   data	    bss	    dec	    hex	filename
22104455	1663648	   3784	23771887	16abaef	target/release/rust-analyzer

# nn, incremental = true, opt-level = "s"
   text	   data	    bss	    dec	    hex	filename
22115079	1660808	   3784	23779671	16ad957	target/release/rust-analyzer

Notice how the text increase is somewhat offset by data, but only in the opt-level = 3 case. I haven't specifically checked the size of the serde functions, but they add up to 1 144 089 bytes in the third case above.


Still, I think this is worth pursuing. serde is in the top 10 of most downloaded crates, so any compile time or code size wins will add up across the ecosystem. I'm pretty sure that nobody has ever said "I love serde because it compiles so quickly", but I've seen a lot of people actively avoiding it, or other proc macro crates.

And I'd like to thank nnethercote for putting in the effort to optimize not only the Rust compiler, but also many of the popular third-party crate.

@nnethercote
Copy link
Author

@LukasKalbertodt, @lnicola: thanks for the data. Your results are expected, though I see now this wasn't clear from my original description. The change I made shrinks the size of the Rust code that the front end has to process, but the new functions are marked inline, so by the time inlining happens (either at the MIR level or at the LLVM IR level) the final code will be very similar to the current. Which means this PR will help compile times but have little effect on binary size. If you removed the inline annotations from the new functions things might be different, but ser/deser speed might also be slower.

For binary size, my measurements show that Deserialize::visit_map is easily the biggest function. E.g for the stress test of a struct with 100 u32 fields (which is big but within the realms of possibility) the generated visit_map is 2833 lines of code before this PR, and 2133 lines of code after this PR.

AFAICT, the generated visit_map is able to handle struct fields being present in any order, even though the generated Serialize::serialize always writes the fields in declaration order. If visit_map could assume the fields were in declaration order it could be shorter.

@adamreichold
Copy link

adamreichold commented Jul 27, 2022

AFAICT, the generated visit_map is able to handle struct fields being present in any order, even though the generated Serialize::serialize always writes the fields in declaration order. If visit_map could assume the fields were in declaration order it could be shorter.

AFAIK, the visitor is driven by the deserializer which can mean that it is driven by the order of the fields in the wire format. Hence it needs to be able to handle any order to be able to deserialize the output of systems other than Serde.

@Mingun
Copy link
Contributor

Mingun commented Jul 27, 2022

From the other hand a new struct attribute could be added, for example, #[serde(strict_order)] which would allow only the order that is defined in the struct

@bjorn3
Copy link

bjorn3 commented Jul 27, 2022

Something that did likely also help would be an attribute to only allow structs and enums with named fields and variants (for json, toml, ...) or structs and enums with index based fields and variants (for bincode, ...), while disallowing the other for the type it is applied to. In the common case where you only need to serialize/deserialize a single format, that would cut the size of the derived Deserialize impl in half I think.

Copy link
Contributor

@Mingun Mingun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@dtolnay, in light of recent speedup attempts of serde_derive shouldn't this PR be merged (after resolving conflicts)? At a minimum, it gives a more readable output of cargo expand and I do not see objective disadvantages.

/// calling `Error::invalid_length`.
///
/// This method exists purely to reduce the size of the generated code,
/// reducing compile times. `SeqAccess` implementations should not override
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with @AnthonyMikh here, that this would be better as free functions in serde::__private::de module

Copy link
Member

@dtolnay dtolnay left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am not on board with this approach; I am confident that this adds more bloat in the long term than it would remove in the short term, along with other downsides that have not been identified above.

I have planned to prototype a counterproposal but didn't get to it so far.

  • The new factoring here is overly specific to the current set of patterns used by our generated visit_seq and visit_map implementations, which are not good patterns and I have already intended to change them (clarified below).

  • I am not on board with adding new private functions, neither doc(hidden) on the public deserializer API traits as done here, nor in __private as brought up by some commenters above, for vanilla types that don't rely on the various buffering internals. This introduces friction for everyone trying to understand how serde's deserializer API is intended to be used. This PR makes it that much harder to use serde_derive-generated code as a starting point for adapting to fit someone's custom need. I do not agree that this "gives a more readable output of cargo expand". It gives output that is more compact, but wrong, in the sense that you are not allowed to use it. Your code will break if you use it. Previously cargo expand would give you code that you could more easily adapt and use. As a consequence of hygiene it must access things like Result and Ok through a private re-export in serde, but all the types and traits and functions used are public API available for the user in their handwritten impls.

  • I am also not on board with just making the functions added here public to avoid hidden functions. Related to the first bullet, these will become obsolete, and they would be permanent API bloat. It also exacerbates the issue that the API of serde being challenging is as much of an impediment as compile time instruction counts are.

Regarding the reliance on private functions, including for buffering: one of my projects is to make all private functions relied on by serde_derive's generated code into public API available to users. The only possible way this is viable is if they can be versioned independently of the serde crate. If someone is interested in making progress with this, I would recommend picking up rust-lang/rust#54363 (comment).

Regarding how to do visit_map better: among other things, it definitely should not be producing a separate __Field type for every data structure you use derive(Deserialize) on and instantiating the Deserializer's methods with it. There should be one single non-generic DeserializeSeed type that is used for deserializing all fields of all data structures, because none of that logic is different depending on whose fields they are.

  • This would reduce the number of instantiations of Deserializer methods by 50%, without requiring data formats to do unsafe manual erasing like Optimize compile time of Deserializer trait methods json#687.

  • It would also eliminate the code currently reproduced for visit_str/visit_bytes/visit_u64 of all those distinct __Field enums, which is additional savings on top of the deserializer instantiations.

  • Finally it's actually an opportunity to improve field deserialization performance too, since the one reusable DeserializeSeed would be able to implement smarter strategies that are not feasible today, such as optimizing for the common case where fields in the input appear in the same order as fields in the Rust data structure. This is an optimization that exists in Thrift that has been extremely significant. Instead of comparing incoming fields against data structure fields top to bottom every time, there is state to remember and resume search from the most recently matched field. For a large data structure with n fields this is (n/2)× speedup in the overwhelmingly common case (quadratic time becomes linear time) and only 2× slowdown in the uncommon worst case of fields encountered in exactly reverse order, and identical performance for randomly ordered fields.

Further in the direction of non-generic __Field is the idea that Deserialize impls should be done by handing off a kind of type-erased bytecode for the deserializer to evaluate. If serde were to end up in the standard library ever, this is the form I would be most comfortable with it taking. This is the ultimate non-generic DeserializeSeed, essentially. Instead of a __Field-specific seed that holds a static array of field names and can produce field indices from them during deserialization, which separate code still needs to match on to deal with field values, this could do more of the work non-generically by also representing a callback to run per field, location to write the result, what checks to perform (duplicate fields allowed or not), etc. In this approach no control flow needs to be compiled individually per derived Deserialize impl, and no Deserializer methods need to be instantiated, so Deserialize impls become free — basically a single static.

Thank you anyway for the PR and the attention to this issue.

@dtolnay dtolnay closed this Aug 6, 2023
@Mingun
Copy link
Contributor

Mingun commented Aug 9, 2023

There should be one single non-generic DeserializeSeed type that is used for deserializing all fields of all data structures, because none of that logic is different depending on whose fields they are.

Something like this?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Development

Successfully merging this pull request may close these issues.

9 participants