Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Multi-version snapshot support #9980

Merged
merged 17 commits into from
May 22, 2020

Conversation

svenski123
Copy link
Contributor

@svenski123 svenski123 commented May 11, 2020

Problem

#7442 AccountStorageEntry.count_and_status is serialized but not used when deserialized

Summary of Changes

  • New snapshot format 1.1.1 added which no longer serializes count and status fields, resulting in snapshot savings
  • solana-validator and solana-ledger-tool now take an optional option --snapshot-version <VERSION> which directs the respective tool to produce snapshots in the specified format (input snapshots can be in either 1.1.0 or 1.1.1
  • Both the new 1.1.1 version and the old 1.1.0 version are recognised / processed by snapshot_utils
  • Serialization code for banks, accounts, storage entries, etc. in runtime has been moved to and consolidated in serde_utils
  • Account snapshot serialization code now parameterized on serialisation context type (SerdeContext) to produce the different versions - here the the version specific types for serializing Account Storage Entry and AppendVec are specified.

fixes #7442

@mvines mvines added the CI Pull Request is ready to enter CI label May 11, 2020
@solana-grimes solana-grimes removed the CI Pull Request is ready to enter CI label May 11, 2020
@ryoqun ryoqun added the CI Pull Request is ready to enter CI label May 12, 2020
@solana-grimes solana-grimes removed the CI Pull Request is ready to enter CI label May 12, 2020
ledger-tool/src/main.rs Outdated Show resolved Hide resolved
@mvines
Copy link
Member

mvines commented May 12, 2020

I'd actually rather not fully support multiple snapshot versions (--snapshot-version argument), it's only during the transition period that more than one version is needed, and can be managed like this:

Step 1:

  • Both snapshot versions can be loaded, but only the newer snapshot version is written

Step 2:

  • Remove code for loading the old snapshot version

On the master branch, we can jump directly to Step 2. On the v1.1 branch, stop at Step 1. The v1.1 branch will live on testnet and then mainnet-beta through the end of June.

@mvines
Copy link
Member

mvines commented May 12, 2020

Serialization code for banks, accounts, storage entries, etc. in runtime has been moved to and consolidated in serde_utils

This looks like general cleanup that could be factored out into its own PR and landed independently. https://github.com/solana-labs/solana/blob/master/CONTRIBUTING.md#pull-requests sums it up nicely:

Small, frequent PRs are much preferred to large, infrequent ones. A large PR is difficult to review, can block others from making progress, and can quickly get its author into "rebase hell". A large PR oftentimes arises when one change requires another, which requires another, and then another.

Your work here looks very good overall though @svenski123, thanks!

@svenski123
Copy link
Contributor Author

@sakridge @mvines Thank you for your reviews and comments.

Apologies the PR was this long - I did want to do this is in a maintainable way and without cleaning up the serialisation I think the result would have been far more inscrutable and convoluted and generally a disservice to the code base.

@mvines
Copy link
Member

mvines commented May 13, 2020

@sakridge @mvines Thank you for your reviews and comments.

Apologies the PR was this long - I did want to do this is in a maintainable way and without cleaning up the serialisation I think the result would have been far more inscrutable and convoluted and generally a disservice to the code base.

No need to apologize! I love the intent. Mechanically I just far prefer lots of little PRs because they're individually much easier to review in one session before I get interrupted. But even without interruptions, a large PR is asking for a lot of brain power and most people will start paging down quickly after a couple screen fulls so you're less likely to get an honest review.

@svenski123 svenski123 force-pushed the multi-version-snapshot-master branch from c52f79e to 6ab9222 Compare May 13, 2020 12:37
@svenski123
Copy link
Contributor Author

Rebased the PR against master and applied the #2 output snapshot version approach.
Also rustfmt and clippy apeasement.

@mvines mvines added the CI Pull Request is ready to enter CI label May 14, 2020
@solana-grimes solana-grimes removed the CI Pull Request is ready to enter CI label May 14, 2020
@svenski123 svenski123 force-pushed the multi-version-snapshot-master branch from 960dcb9 to 1a31226 Compare May 15, 2020 18:06
@mvines mvines added the CI Pull Request is ready to enter CI label May 18, 2020
@solana-grimes solana-grimes removed the CI Pull Request is ready to enter CI label May 18, 2020
Copy link
Member

@mvines mvines left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This looks really good @svenski123, sorry for the slow review.

@ryoqun - can you please review too?

ledger/src/snapshot_utils.rs Outdated Show resolved Hide resolved
@codecov
Copy link

codecov bot commented May 18, 2020

Codecov Report

Merging #9980 into master will decrease coverage by 0.0%.
The diff coverage is 76.0%.

@@           Coverage Diff            @@
##           master   #9980     +/-   ##
========================================
- Coverage    81.5%   81.4%   -0.1%     
========================================
  Files         283     288      +5     
  Lines       66046   66211    +165     
========================================
+ Hits        53832   53938    +106     
- Misses      12214   12273     +59     

runtime/src/bank.rs Outdated Show resolved Hide resolved
runtime/src/serde_utils.rs Outdated Show resolved Hide resolved
runtime/src/serde_utils.rs Outdated Show resolved Hide resolved
runtime/src/serde_utils.rs Outdated Show resolved Hide resolved
// as compared to serializing directory to the stream,
// this effectively prepends a u64 containing the length
// of the byte vector to the serialized output
serializer.serialize_bytes(&buf)
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

My biggest favor and extra bonus review comment is that I really want to get rid of those unneeded bincode indirection.

Actually, serialize_bytes isn't needed to use. This just complicates code.See 7bf89d9. This clean-up commit is just easy but obviously introduces abi-incompat. So, I really wished it but I couldn't...

(@sakridge: Do you know whether there is subtle reason we forced to use serialize_bytes instead of 7bf89d9?)

Now time has come. :)

If this isn't too much, could you incorporate and correctly versionize this in this pr some way?

This removes these odd code or put versionized code-path at least:
https://github.com/solana-labs/solana/pull/9980/files#diff-5c68e62b5d516dfc53a73fcbe5e68829R277
https://github.com/solana-labs/solana/pull/9980/files#diff-2099c5256db4eb5975c8834af38f6456L600

Also, this hide abi type info my stall pr #8012..... So I want to fix this in that regard too.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As the intent of this was to allow multiple snapshot versions, I've taken out the nested byte array serialisations as part of v1.2.0.

Serde goes out of it's way to ensure you can only serialise one thing and one thing only with the serializer passed into a Serialize::serialize() implementation - so shoving everything into a byte array and serialising that is a crude but very effective workaround for serde's quite intentional limitation in this regard.

It turns out serialising a object as a serde tuple allows you write successive elements out with no prefixed length/size (at least when using the bincode serializer; try this with JSON or some other output format and your mileage may vary). So that's what v1.2.0 does...

Copy link
Member

@ryoqun ryoqun May 20, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

As the intent of this was to allow multiple snapshot versions, I've taken out the nested byte array serialisations as part of v1.2.0.

Yeah, I saw this, it's really cool. :)

Serde goes out of it's way to ensure...

Yeah, that's right. thanks for grokking my comment's intention and serde internals. That's serde's opinionated design decision. You may not like it but let's follow the convention for cleaner code.

Anyway, thanks for taking my big favor and it's really cool that we got one-step closer to the ideal situation!

I think the last thing is to further put the needless serde logic in SerdeContextV1_1_0 and make the common codepath and SerdeContextV1_2_0 really clean.
So, when we add 1_3_0, it'll become dumb easy to copy from ContextV1_2_0 and start coding without worrying about these chaos in the V1_1_0.
To that end, we rather don't want to define various methods like legacy_deserialize_byte_length for the common trait SerdeContext.
Ultimately, when we're 1_5_0 or like, we can safely remove old relics entirely easily.

Lastly, although those well-factored-out common code (like Serialize for SerializableAccountsDatabaseUnversioned) looks nice, I anticipate these code is very likely to be affected by newer snapshot changes. In other words, the common code and the actual trait implementation code is so intertwined, making code maintenance difficult a bit due to being tightly coupled.

I think manually writing serialize_tuple is too tedious for SerdeContextV1_2_0+ and i don't want to carry it over SerdeContextV1_3_0...

Well, hard to explain by words, so i've made a sample commit: 6d58b73

Feel free to pick this commit. :)
note: This commit done in hurry. no test updated...

Copy link
Member

@ryoqun ryoqun May 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@svenski123 Did you read this comment? Sorry I unresolved this thread and added a comment on a outdated diff yesterday. :)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok - I've gone through it all again as well that commit of yours - that's a big commit!

At outset with this PR I was aiming to implement two different but similar serialisation formats with one set of code, reasonably limited changes but structured enough so that I could reason about correctness despite unfamiliarity. To be fair this is not where we are now.

I intentionally avoided making whole new copies of the ser/des logic and the pre and post processing which is non-trivial and imposes a mainentance burden that is undesirable and unnecessary if the old format is retired quickly and there is no need to continue refactoring / developing the new format in the meantime.

If this will take longer to get out there (I suppose 1.1.x to 1.2.x is a hint) and the new format needs to be decoupled more completely from the older one then by all means separate copies is the way as they will diverge - but let's be clear on the requirement and the scope.

Some comments:

  • Using serialize_map / serialize_tuple / serialize_seq is actually quite beneficial as it avoids unnecessary copying of data; I've added some generic functions that serialize iterators directly out as seqs, tuples and maps which should hopefully help
  • the old snapshot code currently serializes pre-serialized byte sequences but then reads in and discards a u64 and then proceeds to deserialize the data directly rather than deserializing a byte vector and then deserializing that; the point here is the current serde usage is just as broken as serializing a tuple and deserializing indivudal elements and is just as likely (or unlikely) to break in the near term
  • I've put the 1.1.0 / legacy specific TypeContext stuff (formerly SerdeContext) stuff in a submodule with the intention that it be jettisoned as soon as it's no longer used
  • If having two forks of the ser/des is where we want to go with this - which is fine - then I'd suggest each version go in a separate submodule
  • The TypeContext was meant to encapsulate the differences between two ser/des formats - it's kind of ugly in C++, even uglier in Rust and the more the formats differ, the bigger and uglier TypeContext needs to get; and then soon enough it needs to be put down

I see there are some other comments - I'll review and respond there.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@svenski123

I intentionally avoided making whole new copies of the ser/des logic and the pre and post processing...
... let's be clear on the requirement and the scope.

Yeah, I'm sorry for confusing you for requirement and scope.

Considering you started off from the mere omission of count_and_status in AccountStorageEntry, what you original thought as the requirement and scope is completely understandable.

But, your awesome snapshot versioning ground work itself changed the game. That literally broadened the scope for me:
This work is awesome too good just for the single field omission. Let's make this work the foundation for future snapshot versioning.

Yeah, I didn't communicate it explicitly. Sorry about that.

As a rapidly-developed blockchain project, we wanted a non-breaking versioned snapshot, without worrying too much for temporal code duplication, sacrificing otherwise more well engineered / refactored code.

For context, we did an ad-hoc bank serialization transition like this in the past: https://github.com/solana-labs/solana/pull/9033/files#diff-030d72e3f57f46e47aff0909219268baR35
Yeah, these ad-hoc ness well tolerated here... ;) But, ideally, we wanted to have versioning in place.

So,

the new format needs to be decoupled more completely from the older one then by all means separate copies is the way as they will diverge...
...
If having two forks of the ser/des is where we want to go with this - which is fine - then I'd suggest each version go in a separate submodule

I'm afraid but this is true. This is the correct scope/requirement I have now.
Could you rework your PR accordingly each version for each submodule?

PS:

If this will take longer to get out there (I suppose 1.1.x to 1.2.x is a hint)

I think we can merge this pr this week or early next week. 1.2 branching is almost here. :)

Using serialize_map / serialize_tuple / serialize_seq is actually quite beneficial as it avoids unnecessary copying of data; I've added some generic functions that serialize iterators directly out as seqs, tuples and maps which should hopefully help

Yeah, I personaly like these gadgets. These will be useful for streaming serialization. But for us, we're selializing on-memory data. So, that's not case... foo_bar_map.serialize(serializer) is enough...

the old snapshot code currently serializes pre-serialized byte sequences but then reads in and discards a u64 and then proceeds to deserialize the data directly rather than deserializing a byte vector and then deserializing that; the point here is the current serde usage is just as broken as serializing a tuple and deserializing indivudal elements and is just as likely (or unlikely) to break in the near term

Yeah, I see. My argument was weak and rather hypothetic. I'm fine to keep your reworked serialization/desealization for old snapshot version in TypeContextLagacy if you're confident. :)

The TypeContext was meant to encapsulate the differences between two ser/des formats - it's kind of ugly in C++, even uglier in Rust and the more the formats differ, the bigger and uglier TypeContext needs to get; and then soon enough it needs to be put down

Yeah, that's the concern I'm worried with. Our development speed is quite fast and we want to add many changes to Bank/AccountsDB/etc..

@svenski123 svenski123 force-pushed the multi-version-snapshot-master branch from 1a31226 to 235ee72 Compare May 18, 2020 20:22
.unwrap_or_else(|| panic!("No bank_hashes entry for slot {}", self.slot)),
))?;

seq.end()
Copy link
Member

@ryoqun ryoqun May 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@svenski123
I appreciate what you've done for us. That's really impressive and I've learned advanced stuff for serde from you. For example, I've never come up with generic type parameterized versioned de-/seerialization scheme. I really like this part!

But then, we'd like more simplified approach to the solution.

I'm not sure NEWER/OLDER Future/Lagecy makes life easier for us. I think it will require us to rename this new Future to Legacy when we add another snapshot version? That's a bit tedious.

Instead, I think it's fine to embed explicit specific version numbers in constants. Even at a time, there might temporarily be 3 versions when actively developping, that's fine.

In addition, we also want to avoid to define like serialize_iter_as_seq/serialize_iter_as_map/serialize_iter_as_tuple by ourselves unless it's absolutely needed.

And as I commented yesterday here, I think there is more simplified approach.

For example, the code this comment is attached to is serializing AccountsDB. And the corresponding code in my suggested commit is like this.

I think my suggestion is simpler and more manageable in my opinion.

Or I might be missing some. Could you explain why you think this pr's current approach is expected to be simper and future-proof?

Specifically, could you explain how could we reach to this state or even simpler versioned code starting from the current implementation?

I mean, could you elaborate a bit on this?: #9980 (comment)

Once the legacy serialisation style is no longer needed, much of this code can and should be removed, as well as the generic type parameterized implementation if it proves to be more trouble than it's worth (I must admit it is quite awkward, especially in rust).

Also, how could we remove the Legacy serialization debt along the course from now, including like legacy_serialize_byte_length, which is in the common codepath? Maybe just remove after we remove the current Legacy? If so, why cannot we go one step further and remove them now?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@ryoqun Having pulled down your repo and looked at your commit in an editor (as opposed to github), your intent is much clearer to my now and much of my concern is proven unfounded. Obviously I have only myself to blame for this, so apologies. As I have written the comments below I will still share them in the hope that they may be useful, though please bear in mind that until close to the end I though you had a much less nuanced approach.
-Kris

@svenski123
I appreciate what you've done for us. That's really impressive and I've learned advanced stuff for serde from you. For example, I've never come up with generic type parameterized versioned de-/seerialization scheme. I really like this part!

It can help when you're faced with a large chunk of unfamiliar, complex code - cordon off / parameterize the piece you need to change, test it with the original parameterize to ensure the logic is maintained, then design a new type/piece to slot in knowing that a) the original behavior remains unaffected and b) any problems with the new behaviour are due solely to issues with the new type/piece. Sort of like scaffolding, keeps your house up while you're working on it, though longer term you shouldn't need it.

But then, we'd like more simplified approach to the solution.

I'm not sure NEWER/OLDER Future/Lagecy makes life easier for us. I think it will require us to rename this new Future to Legacy when we add another snapshot version? That's a bit tedious.

There were two considerations in my mind with this -

  1. the snapshot format (currently) encompasses a tar archive with a bank / bankrc serialised file, a serialised status cache file, the memory mapped accounts files, etc.; there can be changes in the overall snapshot format that don't affect bankrc serialisation and vice, so I don't think they should be coupled (well at this point at any rate)
  2. @mvines seemed quite clear to me on his desire for there to be maximum of two possible versions in use at any time - when there's to be a change in format, the next release will support reading the older version and the newer one, but will only output the newer version; and the next release will drop support for the older version entirely.

Along these lines then the "x.y.z" snapshot version number remains the concern of ledger/src/snapshot_utils.rs, and snapshot_utils then maps this to OLDER or NEWER as appropriate; (note the other serialisations (i.e. status cache, etc.) could follow this approach too).

If a new snapshot format is to be released, the "x.y.z" numbers are updated - one is mapped to NEWER and the previous one to OLDER. In the case of bankrc serialiation, if there's no change there then nothing needs to change in runtime, otherwise NEWER will need to be made OLDER and then a newer NEWER implemented. In the next release after a snapshot version change with a new NEWER, the OLDER is then dropped and is aliased to NEWER.

I was looking ideally for two unordered names but settled on NEWER and OLDER as there should never be any question which version is which and which one you want, as for an upgrade you always want the NEWER one. But maybe this might be more difficult in practice - if so, the convention is easily changed.

However having lots of different kinds of x_y_z constants all over the place that constantly need to be kept in agreement lest the world comes to an end / bad things happen / etc. is asking for trouble in the longer term.

Instead, I think it's fine to embed explicit specific version numbers in constants. Even at a time, there might temporarily be 3 versions when actively developping, that's fine.

The ser/des flavours in serde_snapshot can be identified be version number, that's also entirely reasonable however those versions do not have a bijective mapping with the snapshot versions, so they should be quite different to ensure there is no chance of confusion.

In addition, we also want to avoid to define like serialize_iter_as_seq/serialize_iter_as_map/serialize_iter_as_tuple by ourselves unless it's absolutely needed.

As I pointed, it is absolutely needed to implement the Serialize trait properly for larger data structures with for instance using serialize_map, serialize_seq, serialize_tuple, serialize_struct, etc. as you otherwise have to first copy your data into a data struct that does implement them - currently account storage entry lists are copied into lists and then into a hash map which is then dumped out - but after a ton of heap allocation, hash calcs, insertions all of which is unnecessary work that takes a lot of time.

Note serde provides collect_map() and collect_seq() - this was in fact the inspiration for the serialize_iter_as_xxx() functions above.

And as I commented yesterday here, I think there is more simplified approach.

For example, the code this comment is attached to is serializing AccountsDB. And the corresponding code in my suggested commit is like this.

I think my suggestion is simpler and more manageable in my opinion.

Or I might be missing some. Could you explain why you think this pr's current approach is expected to be simper and future-proof?

The big problem is:

let entries = serializable_db
    .account_storage_entries
    .iter()
    .map(|x| {
        (
            x.first().unwrap().slot,
            x.iter()
                .map(|x| Self::SerializableAccountStorageEntry::from(x.as_ref()))
                .collect::<Vec<_>>(),
        )
    })
    .collect::<HashMap<Slot, _>>();

where account_storage_entries is a list of lists of ASEs, it creates a new vector for each sublist and then shoves them into a hash map until when it's put into a tuple which is then serialized - there is no reason why the ASEs can't or shouldn't be written out to disk as they are processed.

With the serialize_iter_as_map() helper function you can write:

let entries = serialize_iter_as_map(
    self.account_storage_entries.iter().map(
        |x| (
            x.first().unwrap().slot,
            serialize_iter_as_seq(
                x.iter()
                    .map(|x| C::SerializableAccountStorageEntry::from(x.as_ref())),
            ),
        )
    )
);

and the later in the 4-tuple, include it as &entries - this won't create any extra Vecs or HashMaps, and won't do any serizalition until the tuple starts serializing and gets to that element.

(And yes, there's little practical overhead in creating a four element tuple itself at top level and serializing it as it's only four pointers but the data they point to must already exist - or the objects are Serializable which is what serialize_iter_as_xxx() returns, and serialize on the fly).

Specifically, could you explain how could we reach to this state or even simpler versioned code starting from the current implementation?

Having pulled your branch and looked at it in an editor, the intent is now clearer to me.
Github made everything look green and it looked like all the code had duplicated for each SerdeContext trait implementation (including the context_bankrc/accountsdb_to/from_stream() functions) - actually github elided these because they hadn't changed.

Please accept my apologies.

That bookmarklet script would been very useful yesterday!

I mean, could you elaborate a bit on this?: #9980 (comment)

Once the legacy serialisation style is no longer needed, much of this code can and should be removed, as well as the generic type parameterized implementation if it proves to be more trouble than it's worth (I must admit it is quite awkward, especially in rust).

My thoughts with that were to save that for a later PR for the release after the version change, which would remove the *V1_1_0 types, as well as the legacy_or_zero and other trait functions that dealt with the u64 length fields.

Also, how could we remove the Legacy serialization debt along the course from now, including like legacy_serialize_byte_length, which is in the common codepath? Maybe just remove after we remove the current Legacy? If so, why cannot we go one step further and remove them now?

Argument goes likes this, you have to remove the legacy stuff either now or later (after upgrade rolled out). If you wait till later, you just remove the legacy stuff and you're done. If you do it now however, you also have to duplicate some more code to so the legacy format still works. Later after the upgrade is rolled out, you still have some stuff left to delete (but it's easier).

Unless the presence of the legacy stuff is blocking some other development or something else, or if there's a risk that if it doesn't happen now it will be forgotton about, then the wait until later approach requires less total time / resource but still gets you to clean state eventually.

Copy link
Member

@ryoqun ryoqun May 21, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@svenski123

@ryoqun Having pulled down your repo and looked at your commit in an editor (as opposed to github), your intent is much clearer to my now and much of my concern is proven unfounded.

Nice! We're on the same page, then? I've read the rest of comment. I have no doubt that you're very knowledgeable and your code is well reasoned and thought one. I thought code speaks better than words... Seems like I'm too pragmatic coding-first guy. ;)

Then, I hope that this pr's direction is corrected by you. I'm afraid of wasting your time because some code will be reverted. But at least I learnt a lot from your code and comments!

Some extra remarks:

one is mapped to NEWER and the previous one to OLDER. ... In the next release after a snapshot version change with a new NEWER, the OLDER is then dropped and is aliased to NEWER.

Yeah, I could imagine that. That's neat but we generally prefer less diff/code shifting solution. Also, older X.Y.Z code will promptly removed; we like removing code. :)

those versions do not have a bijective mapping with the snapshot versions

I think it's fine for bijective mapping to exist between snapshot versions and particular versioned codepath for particular struct like AccountDB/StatusCache. The motto is let's be simple unless needed.

As I pointed, it is absolutely needed to implement the Serialize trait properly for larger data structures with for instance using serialize_map, serialize_seq, serialize_tuple, serialize_struct, etc. as you otherwise have to first copy your data into a data struct that does implement them...

Yeah, certainly heap copy should be avoided and your solution does it correct in that regard. But, this might not be a problem for now. If there is indeed problem, then we could take your code with comments justifying so. Otherwise, we prefer less code. To be honest, I just didn't notice the wasted memory allocation could happen. Thanks for pointing out!

I dunno rustc is clever enough to elide these copies. I wish so but I guess not. ;)

Argument goes likes this, you have to remove the legacy stuff either now or later (after upgrade rolled out)...

Here too, your reasoning is correct! But for us, the amount of code diff and the count of prs weights a bit more. Then, you might have guessed this now: Changing code and reviewing diff takes human time. Let's take a path of fewer of them. And our most precious resource is engineering time and we can't increase the throughput of it easily. So the only way for more productivity is to reduce the traffic (code changes) ;)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah the stuff you picked out into the versioned traits was a lot more targeted / limited than what it looked like at first in github's diff view - it makes sense.

The PR branch is merged with your stuff, rebased off of master and is close, but not quite, to where it needs to be. The older / newer enum is still there but fixing that up is cosmetic.

One aspect of serde that I've come to appreciate is it doesn't really matter what type you think your data is, what matters is what serde thinks it is - and it can only ever be one of the 29 serde types listed here https://serde.rs/data-model.html. That said, some of the serializers can't tell the difference between some of the type (i.e. bincode appears unable to distinguish between tuples and tuple structs for instance) whereas using an XML serializer (say) with serde should embed a lot more metadata in the serialized output. Actually this could make for an interesting test case....

Copy link
Member

@ryoqun ryoqun May 22, 2020

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@svenski123 Whoo! Thanking for quickly reworking on this.

The PR branch is merged with your stuff, rebased off of master and is close, but not quite, to where it needs to be. The older / newer enum is still there but fixing that up is cosmetic.

Yeah, those is cosmetics. We can fix it later. Let's land first.

Congratulation! I think this pr is ready for merge. I'm soo appreciating your hard work :)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

One aspect of serde that I've come to appreciate is it doesn't really matter what type you think your data is, what matters is what serde thinks it is - and it can only ever be one of the 29 serde types listed here

Hehe, that sounds interesting observation. Could you tell me what did enlighten this insight from this PR? Maybe some relaxed coercion from variants of rust types into one of serde types? Which one, specifically?

said, some of the serializers can't tell the difference between some of the type ...

Yeah, that's true. But because of the compromise, serde established the role of marketplace between application serializer logics and actual encoder logics in the rust ecosystem. This is much like Pandoc and I think they done decent job for it with fine balance for common denominator. For us as performance-oriented project, it's a bit bloated, though... ;) serde's primary target is web backends, after all.

@ryoqun
Copy link
Member

ryoqun commented May 21, 2020

@svenski123 Also, as a github protip, there is some bookmarklet to expand all diffs like this: https://gist.github.com/juanca/5fd799c5b094e3e4f8b709cd101d7403#file-github_load_all_diffs-js :)

I use it alot when reviewing this pr.. :)

@svenski123
Copy link
Contributor Author

@svenski123 Also, as a github protip, there is some bookmarklet to expand all diffs like this: https://gist.github.com/juanca/5fd799c5b094e3e4f8b709cd101d7403#file-github_load_all_diffs-js :)

I use it alot when reviewing this pr.. :)

That looks cool - is that supposed to go into a bookmark as the URL?

@ryoqun
Copy link
Member

ryoqun commented May 21, 2020

That looks cool - is that supposed to go into a bookmark as the URL?

@svenski123 Yeah, also you can directly input via developer inspector.

Despite clippy's suggestion, it is not currently possible to create type aliases
for traits and so everything within the 'Box<...>' cannot be type aliased.

This then leaves creating full blown traits, and either implementing
said traits by closure (somehow) or moving the closures into new structs
implementing said traits which seems a bit of a palaver.

Alternatively it is possible to define and use the type alias 'type ResultBox<T> = Result<Box<T>>'
which does seems rather pointless and not a great reduction in complexity but is enough to keep clippy happy.

In the end I simply went with squelching the clippy warning.
remove nested use of serialize_bytes
Remove construction of Vec and HashMap temporaries during serialization.
clean up leakage of implementation details in serde_snapshot.
commit 6d58b73
Author: Ryo Onodera <ryoqun@gmail.com>
Date:   Wed May 20 14:02:02 2020 +0900

    Confine snapshot 1.1 relic to versioned codepath
legacy accounts_db format now "owns" both leading u64s, legacy bank_rc format has none
@svenski123 svenski123 force-pushed the multi-version-snapshot-master branch from 9c12d4d to 5b11777 Compare May 21, 2020 21:31
Copy link
Member

@ryoqun ryoqun left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM! Maybe the largest contributor-originated pr ever for solana. :)

(I'll land this shortly after! I'll be afk for awhile)

@ryoqun ryoqun merged commit b7a32f0 into solana-labs:master May 22, 2020
@ryoqun
Copy link
Member

ryoqun commented May 27, 2020

@svenski123 Apparently, you did improve the snapshot serialization considerably (5-6x) as a side-effect of this pr!

As I pointed, it is absolutely needed to implement the Serialize trait properly for larger data structures with for instance using serialize_map, serialize_seq, serialize_tuple, serialize_struct, etc. as you otherwise have to first copy your data into a data struct that does implement them...

Yeah, certainly heap copy should be avoided and your solution does it correct in that regard. But, this might not be a problem for now.

Happily, I was wong. :)

Screenshot from 2020-05-28 08-09-10

This is the relevant chart for the snapshot relevant serialization time on our devnet, where there are huge number of AccountStorageEntries around 400k.

serialize_iter_as_seq/serialize_iter_as_map/serialize_iter_as_tuple is doing great job? :)

Anyway, super great job! And needless to say, I see no performance regression by the snapshot code-path being versioned with this pr. :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

AccountStorageEntry.count_and_status is serialized but not used when deserialized
5 participants