Add missing fields to the light sync state #7225
Conversation
Ok(sc_chain_spec::LightSyncState {
    header
    finalized_block_header: finalized_header,
    babe_finalized_block1_slot_number: 0,
Need a way to get this as well.
Not sure this is the way to go. I'll try to provide context on what the problem is and how we should go about fixing it.
Both BABE and GRANDPA maintain some data structures relating to the management of epochs (i.e. periods in which a given set of authorities is active). There's no need to go into details on the logic behind those data structures, but the key point is that they are populated as part of processing blocks. The data structures are:
- BABE - EpochChanges
- GRANDPA - AuthoritySet
In the scope of the light client, where we skip the initial sync and start from a block defined in the chain spec, this means that these data structures will not be populated with any data. As such, both BabeBlockImport and GrandpaBlockImport will fail to validate subsequent blocks (e.g. the errors you were seeing regarding BABE not finding epoch data). The solution is that we need to persist the data that exists in EpochChanges and AuthoritySet when we are creating the "light sync state", which we can then use when restoring from that state.
IMO as an initial implementation we can start by just dumping the data structures completely and persisting the whole thing (they are scale-encodable as we are already persisting them to the database). Once that is done we can prune these data structures to remove any unnecessary data (so that we don't persist everything). Both the data structures are easily available as part of service initialization: https://github.com/paritytech/substrate/blob/master/bin/node/cli/src/service.rs#L120 and https://github.com/paritytech/substrate/blob/master/bin/node/cli/src/service.rs#L112.
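As a rough illustration of that "dump everything" shape, here is a sketch of roughly what the sync state could contain once both structures are included. The field names follow the snippets elsewhere in this PR; the consensus types are simplified stand-ins (not the real sc_chain_spec, BABE, or GRANDPA definitions), and the block-weight type is an assumption.

// Sketch only: stand-in types to show the shape of the "dump everything"
// approach. In the real code these would be BABE's EpochChanges and GRANDPA's
// AuthoritySet, which are already SCALE-encodable because the node persists
// them to the database.
#[derive(Clone)]
struct BabeEpochChanges(Vec<u8>);    // stand-in for BABE's EpochChanges
#[derive(Clone)]
struct GrandpaAuthoritySet(Vec<u8>); // stand-in for GRANDPA's AuthoritySet

#[derive(Clone)]
struct LightSyncState<Header> {
    /// Header of the finalized block the light client starts from.
    finalized_block_header: Header,
    /// BABE epoch data needed to verify slot claims in later blocks.
    babe_epoch_changes: BabeEpochChanges,
    /// BABE block weight of the finalized block (assumed u32 here).
    babe_finalized_block_weight: u32,
    /// GRANDPA authority set needed to verify finality justifications.
    grandpa_authority_set: GrandpaAuthoritySet,
}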
We could persist all the data that is needed in separate fields (as seems to be the intention in this PR so far), but we'd still need to create some
@andresilva done. The sync state increases the size of the chain spec by about 700kb, but that's not too bad given that it's 3.8mb to begin with.
@andresilva How should we be getting the babe slot number for the first block? I thought that you'd be able to get this from the epoch changes, but you can't because it gets pruned.
Yeah, we'll have at least two lists of authority sets with public keys in there (BABE and GRANDPA). We might be able to prune some of the data you're seeing though (but let's worry about this later).
We can get it from BABE's runtime module https://github.com/paritytech/substrate/blob/master/frame/babe/src/lib.rs#L169. But do we need it? I know that Pierre mentioned this in the data we need to persist, but if we are persisting the latest epoch data I'm not sure we need this. At least the client code in substrate does not use it in any way; I believe this would only be needed in the
@andresilva we're trying to support substrate-lite here, but @tomaka if we don't use it in substrate, why is it needed in substrate-lite?
Also @tomaka am I right in thinking that
I use it in substrate-lite to determine at which slot an epoch starts and ends.
Indeed. If a block uses the last slot of an epoch, we know that the next block will always be in a different epoch. However, since not all slots necessarily have a block, it is possible for a block to not use the last slot of an epoch and yet have the next block be in a different epoch anyway.
Yep: https://substrate.dev/rustdocs/v2.0.0-rc6/sc_consensus_babe/struct.Epoch.html#structfield.start_slot
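For illustration, the arithmetic substrate-lite relies on here, as a small sketch (names are made up; the real values come from BABE's epoch data such as Epoch::start_slot and the epoch duration):

// Last slot of an epoch, given its start slot and its length in slots.
fn last_slot_of_epoch(start_slot: u64, epoch_duration: u64) -> u64 {
    start_slot + epoch_duration - 1
}

// A block authored in `slot` belongs to a later epoch than the one starting at
// `start_slot`, even if no block was authored in that epoch's last slot.
fn is_past_epoch(slot: u64, start_slot: u64, epoch_duration: u64) -> bool {
    slot > last_slot_of_epoch(start_slot, epoch_duration)
}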
This is becoming off-topic, but I would need to learn a bit more about the long-term plan concerning BABE slots. I know we're supposed to have some small NTP-like protocol, or something like that, but I don't know the details. Using the slot of block 1 also assumes that the number of slots per epoch never changes. This is true at the moment, but maybe we'll add this possibility in the future?
Anyway, my point is that this PR should do the thing that makes sense, not the thing that brings compatibility with substrate-lite. It's substrate-lite that has to adjust. But it's unclear to me what makes sense.
Ok, I'm making progress getting the sync state to load in substrate (not lite). Now I'm getting:
@andresilva Should we be saving babe block weights into the sync state as well?
@andresilva on
Ok, so I've updated
It is still panicking on a second chain spec on branch
This PR is ready and I'd like to get it merged soon. The two things to check in reviews are:
just a minor nit but overall lgtm.
@@ -53,7 +53,7 @@ impl<H, N> Clone for SharedAuthoritySet<H, N> {

 impl<H, N> SharedAuthoritySet<H, N> {
 	/// Acquire a reference to the inner read-write lock.
-	pub(crate) fn inner(&self) -> &RwLock<AuthoritySet<H, N>> {
+	pub fn inner(&self) -> &RwLock<AuthoritySet<H, N>> {
Do you think we can instead add a method clone_inner() or something like that, which clones the inner AuthoritySet and returns it? It is safer not to expose the lock outside this crate.
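A minimal sketch of the suggested clone_inner(), assuming the shared set wraps the authority set in an Arc<RwLock<...>> (parking_lot) behind a field called inner; the types below are simplified stand-ins, not the actual sc-finality-grandpa code:

use std::sync::Arc;
use parking_lot::RwLock;

// Simplified stand-in for GRANDPA's AuthoritySet, which is Clone.
#[derive(Clone)]
pub struct AuthoritySet<H, N> {
    pub pending_changes: Vec<(H, N)>, // placeholder contents
}

pub struct SharedAuthoritySet<H, N> {
    inner: Arc<RwLock<AuthoritySet<H, N>>>,
}

impl<H: Clone, N: Clone> SharedAuthoritySet<H, N> {
    /// Clone the inner `AuthoritySet` and return it, so the lock itself
    /// never has to leave the crate.
    pub fn clone_inner(&self) -> AuthoritySet<H, N> {
        self.inner.read().clone()
    }
}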
👍
There can be multiple pending changes on the unfinalized chain (if GRANDPA has stalled). On Kusama we will be rotating the session every hour, which potentially leads to an authority set change in GRANDPA. Since the changes are only enacted on finality, if GRANDPA stops working for 6h we can end up with 6 pending authority set changes all on the same branch (which we need to finalize in order afterwards). Probably I didn't do a very good job at explaining to you how this works previously (which is why I need to write some docs).
But this PR is only about returning the state of the finalized chain, no? A data structure containing the state of the finalized chain doesn't have to contain more than one pending change.
@tomaka Well no, at the moment we just encode the whole
Ok, I'd like to get another approval and then I'll merge this.
To get something working I think it was easier to just persist BABE's
tldr: the actual sync state data will probably change after this PR. IMO this is fine, as we can get something working in the meantime without having to deal with all the details of what needs to be persisted.
Actually, I suppose that's fine.
bot merge
Trying merge.
If we want to make it easier to be consumed by
finalized_block_header: finalized_header,
babe_epoch_changes: self.shared_epoch_changes.lock().clone(),
babe_finalized_block_weight: finalized_block_weight,
grandpa_authority_set: self.shared_authority_set.clone_inner(),
Can this be pruned (the GRANDPA set or BABE epochs) concurrently? Btw, even if it's possible, the race seems highly improbable. (Probably when the code gets generic there will be a trait method like synch_state_at(block_hash) -> Option<S> for Substrate components that require sync state, and then this will not be a possibility.)
It should be fine since both BABE and GRANDPA will acquire a lock when writing to these data structures (e.g. if there's a block being imported).
So maybe we should lock both structures before getting the finalized block hash? (It would make things easier to read, at least.)
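For illustration, a toy sketch of that ordering (stand-in types and made-up names, not the actual builder code in this PR): take both consensus locks first, then read the finalized block hash, so the snapshot can't straddle a concurrent import that prunes either structure.

use std::sync::{Arc, Mutex};

// Toy stand-ins; the real structures are BABE's EpochChanges and GRANDPA's
// AuthoritySet, both of which get pruned as blocks are finalized.
#[derive(Clone)]
struct EpochChanges;
#[derive(Clone)]
struct AuthoritySet;

struct SyncStateSnapshot {
    finalized_hash: [u8; 32],
    babe_epoch_changes: EpochChanges,
    grandpa_authority_set: AuthoritySet,
}

struct SyncStateBuilder {
    shared_epoch_changes: Arc<Mutex<EpochChanges>>,
    shared_authority_set: Arc<Mutex<AuthoritySet>>,
    finalized_hash: Arc<Mutex<[u8; 32]>>,
}

impl SyncStateBuilder {
    // Lock both consensus structures before reading the finalized hash, so the
    // exported state is a consistent snapshot.
    fn snapshot(&self) -> SyncStateSnapshot {
        let epoch_changes = self.shared_epoch_changes.lock().unwrap();
        let authority_set = self.shared_authority_set.lock().unwrap();
        let finalized_hash = *self.finalized_hash.lock().unwrap();
        SyncStateSnapshot {
            finalized_hash,
            babe_epoch_changes: epoch_changes.clone(),
            grandpa_authority_set: authority_set.clone(),
        }
    }
}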
Part of #6804.
polkadot companion: paritytech/polkadot#1801