TODO: SimpleSerialize (SSZ) spec #2

hwwhww · 2018-09-22T06:33:19Z

Open an issue for following up our discussion on gitter.

Specification requirements

Design rationale
- Directly mapping to the specified type and speeding block processing up. (by Vitalik from https://www.youtube.com/watch?reload=9&v=cp0LxJiyV3I)
Encoding
Decoding
Types
- Integers
  - [not in the current implemenation] Specify the signed and unsigned integers support.
  - [not in the current implemenation] Static types?

Anything else? :)

Reference

Python implementation - Vitalik's latest implementation in beacon_chain repo
Rust implemenation - Lighthouse SSZ
Lighthouse Serialization Sandbox Output
Ethereum Yellow Paper - Appendix B: Recursive Length Prefix (RLP)
The EVM Jello Paper - Recursive Length Prefix (RLP)
simple_serialise - not possible to parse the structure without type information beacon_chain#94: simple_serialise - not possible to parse the structure without type information
ssz intX type is actually uint beacon_chain#112: ssz intX type is actually uint

cc @vbuterin @djrtwo @arnetheduck @mratsim @paulhauner @NatoliChris @PoSeyy

hwwhww · 2018-09-22T06:35:27Z

Repost the gitter discussion about integer types:

@djrtwo: What is the argument for not supporting any int size that satifies size % 8 == 0?
@mratsim: for compiled language default int types are all that can be represented as 2^p. int128 and int256 works but int40 or int24 don’t. That’s not super blocking as we can always represent them as an array of bytes or packed uint8 + uint16. i.e. in memory it’s just easier to use uint32 constrained to a certain range for uint24 for example. For serialization we would use an array of bytes.
@djrtwo: in memory, absolutely these things should be in 16/32/64 chunks. for the serialization, I see the value in tightly packing
@arnetheduck: when processing, it sometimes makes sense to use the smaller types in memory as well - compilers are actually pretty smart nowadays and in some static languages you can get away with 24-bit integers and the like. of course there are other considerations as well, alignment for the type, struct packing etc etc - all architecture-dependent
@hwwhww: Maybe use the integer type expression in solidity https://solidity.readthedocs.io/en/latest/types.html#integers
Integers int / uint: Signed and unsigned integers of various sizes. Keywords uint8 to uint256 in steps of 8 (unsigned of 8 up to 256 bits) and int8 to int256. But don’t use this: uint and int are aliases for uint256 and int256, respectively. <— that’s too confusing in our use cases.
@arnetheduck: yeah - having 8-bit increments can be useful sometimes - the corresponding feature in C/C++ are bit fields - https://en.cppreference.com/w/cpp/language/bit_field - optimizers that are smart can do interesting things with them because they know upper and lower bounds more precisely - also, if there are large amounts of data, more fits in cache and less needs to be moved around, in general. the downside tends to be in unaligned access - most platforms have the property that aligned reads are faster - on some, it's even an error to read unaligned data, and you might have to resort to reading a 24-bit value byte by byte
llvm likewise uses arbitrary-precision integers (http://llvm.org/docs/LangRef.html#integer-type) while wasm limits itself to i32 and i64

paulhauner · 2018-09-22T06:40:11Z

Good call @hwwhww!

Here's our README: https://github.com/sigp/lighthouse/tree/master/ssz

EDIT: Oh, I see it's already in your list :)

p-mc2 · 2018-09-24T00:55:48Z

This is great, I love having so much information all in one place. A few quick comments:

Some of the above on uint vs int and explicit types is covered here and will hopefully clear up this small(?) issue.

I'm writing a blog post on the basics of ssz and the rationale/motivation for introducing a new serialization format, I think the biggest issue with something like this is getting people to understand the rationale for a new format/understanding why it's necessary. Cheers.

p-mc2 · 2018-09-24T02:19:04Z

One thing I did want to question: in RLP the main motivations were simplicity and guaranteed absolute byte-perfect consistency. Obviously simplicity is carried over, it seems like guaranteed absolute byte-perfect consistency is also a part of simple serialize. Can someone confirm this is the case?

ChihChengLiang · 2018-09-26T07:31:09Z

I have some archaeological discovery that might shed some light on this issue.

One of our core principles in Ethereum is simplicity; the protocol should be as simple as possible, and the protocol should not contain any black boxes. Every single feature of every single sub-protocol should be precisely 100% documented on the whitepaper or wiki, and implemented using that as a specification (ie. test-driven development).

Buterin, V. 2014 RLP v.s. external technologies

mratsim · 2018-09-26T08:34:14Z

Here is Nim implementation, simple tests and the discussion for a common test format.

One thing that I'm not too sure of regarding current SSZ are serializing container types. For example:

# CrystallizedState
fields = {
    # List of validators
    'validators': [ValidatorRecord],
    # Last CrystallizedState recalculation
    'last_state_recalc': 'int64',
    # What active validators are part of the attester set
    # at what slot, and in what shard. Starts at slot
    # last_state_recalc - CYCLE_LENGTH
    'shard_and_committee_for_slots': [[ShardAndCommittee]],
    # The last justified slot
    'last_justified_slot': 'int64',
    # Number of consecutive justified slots ending at this one
    'justified_streak': 'int64',
    # The last finalized slot
    'last_finalized_slot': 'int64',
    # The current dynasty
    'current_dynasty': 'int64',
    # Records about the most recent crosslink `for each shard
    'crosslink_records': [CrosslinkRecord],
    # Used to select the committees for each shard
    'dynasty_seed': 'hash32',
    # Start of the current dynasty
    'dynasty_start': 'int64'
}

# ValidatorRecord
fields = {
    # The validator's public key
    'pubkey': 'int256',
    # What shard the validator's balance will be sent to 
    # after withdrawal
    'withdrawal_shard': 'int16',
    # And what address
    'withdrawal_address': 'address',
    # The validator's current RANDAO beacon commitment
    'randao_commitment': 'hash32',
    # Current balance
    'balance': 'int128',
    # Dynasty where the validator  is inducted
    'start_dynasty': 'int64',
    # Dynasty where the validator leaves
    'end_dynasty': 'int64'
}

# ShardsAndCommittee
fields = {
    # The shard ID
    'shard_id': 'int16',
    # Validator indices
    'committee': ['int24']
}

# CrosslinkRecord
fields = {
    # What dynasty the crosslink was submitted in
    'dynasty': 'int64',
    # What slot
    'slot': 'int64',
    # The block hash
    'hash': 'hash32'
}

mratsim · 2018-09-29T08:10:31Z

Some thoughts in the past 2 days:

Type prefixes

Replying to @AlexeyAkhunov ethereum/beacon_chain#94 and ethereum/eth2.0-pm#8

As mentioned by @raulk in ethereum/eth2.0-pm#8 call, Protocol Labs (libp2p) will probably develop specialized Wireshard dissectors for Eth2.0 which should help traffic analysis a lot.

However, I do think type prefixes would be useful at the top level for BeaconBlock, ActiveState, CrystallizedState i.e. treat them as tagged unions/sum types, but probably unneeded for nested types like ShardAndCommittee, CrosslinkRecord, ValidatorRecord, AttestationRecord (unless you serialize and send them as standalone?)

For instance let's look at ActiveState and CrystallizedState

# ActiveState
fields = {
    # Attestations that have not yet been processed
    'pending_attestations': [AttestationRecord],
    # Most recent 2 * CYCLE_LENGTH block hashes, older to newer
    'recent_block_hashes': ['hash32']
}

# CrystallizedState
fields = {
    # List of validators
    'validators': [ValidatorRecord],
    # Last CrystallizedState recalculation
    'last_state_recalc': 'int64',
    # What active validators are part of the attester set
    # at what slot, and in what shard. Starts at slot
    # last_state_recalc - CYCLE_LENGTH
    'shard_and_committee_for_slots': [[ShardAndCommittee]],
    # The last justified slot
    'last_justified_slot': 'int64',
    # Number of consecutive justified slots ending at this one
    'justified_streak': 'int64',
    # The last finalized slot
    'last_finalized_slot': 'int64',
    # The current dynasty
    'current_dynasty': 'int64',
    # Records about the most recent crosslink `for each shard
    'crosslink_records': [CrosslinkRecord],
    # Used to select the committees for each shard
    'dynasty_seed': 'hash32',
    # Start of the current dynasty
    'dynasty_start': 'int64'
}

To be able to deserialize and distinguish between those efficiently you would need to send or agree on the types beforehand.

Also having type prefixes for top-level types would allow to concat multiple messages in a single one. I don't think we should use type prefixes for 'hash32' or 'address' unless they are serialized stand-alone that would add a lot of bytes especially for types that can appear in a list.

On schema version

From ethereum/beacon_chain#94 as well

I think we should have 2 bytes at the start to encode SSZ major.minor version

On length prefix vs length suffix

ethereum/beacon_chain#94 and ethereum/eth2.0-pm#8

The argument for moving at the end is because otherwise you would have to iterate on the data structure once to get the length then once again to get feed the data. ==> All languages even Haskell have a vector implementation that tracks length so you wouldn't need the first iteration to get length.
Furthermore putting length at the end means you need specific token to indicate end of data like ‘\0’ in cstring. This is causing a lot of issues in C code also this prevents reading by chunk into a buffer, you would have to read token by token (or read by chunk and backtrack).

raulk · 2018-09-29T09:58:12Z

As mentioned by @raulk in ethereum/eth2.0-pm#8 call, Protocol Labs (libp2p) will probably develop specialized Wireshard dissectors for Eth2.0 which should help traffic analysis a lot.

A community member has made progress with developing Wireshark dissectors for some basic libp2p protocols, like SecIO (security), Multistream (protocol selection), Yamux (multiplexing), etc. However, the challenge is accessing the session secrets to further decrypt the payload. We've briefly discussed that in this issue: libp2p/specs#46.

Wireshark uses an "onion" approach to dissect frames. It calls dissectors iteratively to decapsulate the different layers in the message. At one point the dissection is going to encounter an application specific protocol in the payload (eth2.0 protocol), and to decrypt it, a Wireshark dissector should be registered for it. It's best if those dissectors are developed and maintained by the Ethereum 2.0 community, as they are application-specific and they'll need to evolve in lockstep with the protocol.

EDIT: the existing dissectors are likely outdated and not exhaustively tested.

EDIT 2: Contributions super welcome. Let’s buidl this tooling together.

arnetheduck · 2018-10-05T13:30:21Z

@mratsim

All languages even Haskell have a vector implementation that tracks length so you wouldn't need the first iteration to get length.

the argument is that you don't keep all data in memory at all - ie not even in a vector. Consider a linked list - there are two common implementations - one that has O(n) size and one with O(1) - the latter stores size in a central location but this breaks several nice properties of the linked lists.. likewise, if you have the data in a database, you may want to stream it instead of putting it in a vector first - this will make the max-memory requirement of your application go down. Both these things can be solved with a two-pass approach which is sometimes appropriate

Furthermore putting length at the end means you need specific token to indicate end of data like ‘\0’ in cstring. This is causing a lot of issues in C code also this prevents reading by chunk into a buffer, you would have to read token by token (or read by chunk and backtrack).

This depends - if you know the size of the full buffer (as is common when working with a datagram-style protocol, or if like above, you're storing the total size somewhere), you can work backwards from there to figure out the positions of the fields (just like when it's at the front, and you're implicitly assuming the buffer starts at "0").

arnetheduck · 2018-10-05T15:36:06Z

Based on my understanding, one of the priorities of the format was efficient in-memory access (along with a willingness to pay a price in terms of space efficiency)

I just went through the exercise of hardening our implementation against overruns and misaligned pointers (status-im/nimbus-eth2#7 - hopefully caught all of them 😄) - the following things are noteworthy, when writing cross-platform code:

every field (of size >1) much be alignment-checked, including list items - this causes some performance degradation and prevents us from using mmap and the like to directly map the data, when loading from a trusted source
while it's possible to carefully nurse specific structs to minimize misalignment issues by manually ordering fields in a particular way, it remains impossible to do so generally:
- anything following a variable-length field can be misaligned
- anything that has a mix of type lengths will need to manually add padding/dummy fields

A possible solution to this issue is to introduce a rule to the serializer that ensures that padding is mandatory and deterministic:

0-pad any data up to the size-aligned boundary of the next field (4 or 8 bytes for variable-length fields - the size must be aligned, but also the items themselves)
introduce field-reorder or possibly a check, such fields in the serialized format remain ordered by size, descending - this minimizes the amount of padding necessary, but also makes it harder to do upgrades / add fields

For a good example of a well-aligned format, check out https://google.github.io/flatbuffers/flatbuffers_internals.html

It is worth to note that modern X86 processors have mostly solved unaligned access and the performance penalty is small - nonetheless, for several cases of optimized access, even X86 requires aligned data as the optimization level of the compiler gets cranked up - sse4.2 for example, as seen here: http://pzemtsov.github.io/2016/11/06/bug-story-alignment-on-x86.html

recmo · 2018-10-18T13:37:26Z

I was reading through the current spec. Couple of thoughts:

The terminology hash96/97 for BLS aggregate public keys is confusing. It is not a hash function in the normal cryptographical sense. (Though admittedly it shares some properties). Why not bytes96 instead?
In fact, why not have bytesN for fixed size byte strings much like the fixed size integers?
List/Vector explanation assumes elements have a fixed length (len(list) * sizeof(element)), but the rest of the spec and reference implementation support variable sized elements. It is thus not clear whether variable sized list items are supported. This potentially makes a O(n) vs O(1) difference for random indexing performance, so clarity would be appreciated.
I'm not entirely clear on the intended scope of the spec: network format, on disk storage, in memory representation (i.e. mmap-able), ABI format, VM input format, etc. All would imply different trade offs in design.
Unlike RLP, this format is not self-describing. A schema needs to be provided to disambiguate the serialized data. Where in RLP the valid bytestrings are in one-to-one correspondence with semantic inputs, here they are not: a valid SSZ bytestring can have multiple interpretations depending on which schema is used. This means that a hash of SSZ data is not unique. This is can be solved perfectly by hashing in a serialization of the schema like in EIP-712. Alternatively, given the scope, it might be sufficient to manually disambiguate where necessary.
As far as I can see there is only one unique encoding for every input, so no issues there. Just don't forget the assert in the bool decoding :)
Fun facts: For every bytestring there exists a schema that will produce a valid decoding (proof: ['uint8'] * len(s). For a given schema there always exists an invalid bytestring (proof: take a valid bytestring and append a byte).

@arnetheduck Here's another nice example of a well-designed serialization format meant for fast mmaping: https://capnproto.org/index.html I do think that takes alignment into consideration will necessarily not be 'simple'.

arnetheduck · 2018-10-18T13:44:38Z

@recmo
it does deal with alignment: https://capnproto.org/encoding.html
also, flatbuffers has lots of valuable discussion on the topic: https://google.github.io/flatbuffers/flatbuffers_internals.html

recmo · 2018-10-29T07:12:57Z

@arnetheduck I know it handles alignment, that's why i brought it up. :)

The point I failed to make was that handling alignment and creating a "Simple" format are somewhat conflicting goals.

djrtwo · 2018-11-05T13:35:38Z

Let's take discussion of alignment and it's relevance to another issue if necessary

build opimistic sync file

ignore venv folder

…diasg First pass review

hwwhww added the help wanted label Sep 22, 2018

NatoliChris mentioned this issue Oct 2, 2018

[WIP] Initial SimpleSerialize (SSZ) spec #18

Merged

djrtwo closed this as completed Nov 5, 2018

paulhauner mentioned this issue Feb 14, 2019

Finalization rule 3 is unreachable #611

Closed

JSON mentioned this issue May 5, 2019

Doc standardization #1039

Merged

oskarth mentioned this issue Jan 14, 2020

Data representation oskarth/notes-old#6

Open

8 tasks

protolambda mentioned this issue May 20, 2020

Snappy frames delimiter wanted #1819

Closed

djrtwo mentioned this issue Jun 15, 2020

Incorrect usage of shard_transition.shard_data_roots in process_crosslink_for_shard #1881

Closed

djrtwo pushed a commit that referenced this issue Jan 25, 2022

Merge pull request #2 from ethereum/build-opt-sync

b50bea8

build opimistic sync file

hwwhww pushed a commit that referenced this issue Jan 3, 2023

Merge pull request #2 from parithosh/remove-whitespace

fbc21d4

ignore venv folder

dankrad added a commit that referenced this issue Mar 4, 2023

Fix commitment type description #2

889be90

adiasg pushed a commit to adiasg/eth2.0-specs that referenced this issue Jan 11, 2024

Merge pull request ethereum#2 from saltiniroberto/confirmation_rule_a…

d53a44a

…diasg First pass review

jtraglia added a commit that referenced this issue Nov 22, 2024

Merge pull request #2 from etan-status/exclude-empty-requests

bfe166c

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

TODO: SimpleSerialize (SSZ) spec #2

TODO: SimpleSerialize (SSZ) spec #2

hwwhww commented Sep 22, 2018 •

edited

Loading

hwwhww commented Sep 22, 2018

paulhauner commented Sep 22, 2018 •

edited

Loading

p-mc2 commented Sep 24, 2018

p-mc2 commented Sep 24, 2018

ChihChengLiang commented Sep 26, 2018

mratsim commented Sep 26, 2018

mratsim commented Sep 29, 2018 •

edited

Loading

raulk commented Sep 29, 2018 •

edited

Loading

arnetheduck commented Oct 5, 2018

arnetheduck commented Oct 5, 2018

recmo commented Oct 18, 2018 •

edited

Loading

arnetheduck commented Oct 18, 2018

recmo commented Oct 29, 2018 •

edited

Loading

djrtwo commented Nov 5, 2018

TODO: SimpleSerialize (SSZ) spec #2

TODO: SimpleSerialize (SSZ) spec #2

Comments

hwwhww commented Sep 22, 2018 • edited Loading

Specification requirements

Reference

hwwhww commented Sep 22, 2018

paulhauner commented Sep 22, 2018 • edited Loading

p-mc2 commented Sep 24, 2018

p-mc2 commented Sep 24, 2018

ChihChengLiang commented Sep 26, 2018

mratsim commented Sep 26, 2018

mratsim commented Sep 29, 2018 • edited Loading

Type prefixes

On schema version

On length prefix vs length suffix

raulk commented Sep 29, 2018 • edited Loading

arnetheduck commented Oct 5, 2018

arnetheduck commented Oct 5, 2018

recmo commented Oct 18, 2018 • edited Loading

arnetheduck commented Oct 18, 2018

recmo commented Oct 29, 2018 • edited Loading

djrtwo commented Nov 5, 2018

hwwhww commented Sep 22, 2018 •

edited

Loading

paulhauner commented Sep 22, 2018 •

edited

Loading

mratsim commented Sep 29, 2018 •

edited

Loading

raulk commented Sep 29, 2018 •

edited

Loading

recmo commented Oct 18, 2018 •

edited

Loading

recmo commented Oct 29, 2018 •

edited

Loading