Skip to content
This repository has been archived by the owner on Nov 15, 2023. It is now read-only.

Dynamic authority sets in GRANDPA #1014

Merged
merged 40 commits into from
Nov 15, 2018
Merged

Dynamic authority sets in GRANDPA #1014

merged 40 commits into from
Nov 15, 2018

Conversation

rphmeier
Copy link
Contributor

@rphmeier rphmeier commented Oct 25, 2018

Closes #875

The rough idea behind changing authority sets in GRANDPA is that at some point, we obtain agreement for some maximum block height that the current set can finalize, and once a block with that height is finalized the next set will pick up finalization from there.

Technically speaking, this would be implemented as a voting rule which says, "if there is a signal for a change in N blocks in block B, only vote on chains with length NUM(B) + N if they contain B". This conditional-inclusion logic is complex to compute because it requires looking arbitrarily far back in the chain.

Instead, we keep track of a list of all signals we've seen so far, sorted ascending by the block number they would be applied at. We never vote on chains with number higher than the earliest handoff block number (this is num(signal) + N). When finalizing a block, we either apply or prune any signaled changes based on whether the signaling block is included in the newly-finalized chain.

Implementation details:

This PR adds a BlockImport wrapper which queries the runtime to see if the digest contains any signals of set changes. All blocks imported must go through this wrapper for correctness.

When voting on blocks, the voters won't vote beyond the height of the lowest pending change.
When finalizing blocks, we check to see if any changes are meant to be applied. If the change is applied, we restart the voter with the new authority set by exiting from the current voter early with an ExitOrError enum. The voter is wrapped in a future::loop_fn which starts a new voter on intentional exit, using the new authority set.

Remaining work for this PR:

  • Favor a chain API where exposed WASM APIs are used to handle GRANDPA digests and grandpa authorities
  • Restart voter upon finalizing change-block
  • Write and load changes to DB
  • Test changes
  • publish GRANDPA 0.3.0 with any changes and use crates.io version

@rphmeier rphmeier added the A3-in_progress Pull request is in progress. No review needed at this stage. label Oct 25, 2018
/// applied in the runtime after those N blocks have passed.
///
/// The consensus protocol will coordinate the handoff externally.
pub trait Api<B: BlockT> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There's a convention to use a CAPS_CASE version of the ApiTraitName as an ApiId ("API identifier") that allows for extensible runtime versioning for each API:

pub mod id {
	use super::ApiId;
	pub const BLOCK_BUILDER: ApiId = *b"blkbuild";
	pub const TAGGED_TRANSACTION_QUEUE: ApiId = *b"validatx";
	pub const METADATA: ApiId = *b"metadata";
}

decl_apis! {
	pub trait Core<Block: BlockT, AuthorityId> { ... }
	pub trait Metadata<Data> { ... }
	pub trait TaggedTransactionQueue<Block: BlockT> { ... }
	pub trait BlockBuilder<Block: BlockT> ExtraClientSide <OverlayedChanges> { ... }
}

A terse name like Api makes this impractical or error prone. I would suggest Grandpa/GrandpaInterface or some other such name to prevent clashes and confusion.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some documentation on the decl_apis! macro would be helpful; there is no way to know about this convention otherwise.

On the naming, I intended the name to be used like this:

extern crate grandpa_primitives as grandpa;

fn uses_api<T: grandpa::Api>(T) { }

@rphmeier rphmeier mentioned this pull request Nov 3, 2018
Copy link
Contributor

@andresilva andresilva left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Gave this a first look (along with grandpa and the rest of the substrate grandpa code) and left some minor comments.

pub(crate) struct Status<H, N> {
/// Whether internal changes were made.
pub(crate) changed: bool,
/// `Some` when underlying authority set has changed, containign the
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

containing

change.effective_number(),
change.canon_height.clone(),
))
.unwrap_or_else(|i| i);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible to signal multiple changes on the same block? Because in this case the last change to be signaled would take precedence.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

no, only once per block

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rphmeier are we ensuring that anywhere or is that just by convention? I couldn't find anything (in this PR) that rejects a change at a certain block if a change is already scheduled for that block. Maybe we should add such a restriction to prevent undefined behavior? (e.g. we created a change block at 10+30 and then one at 20+20 - which one is supposed to happen at block 40?)

}
// Ok(None) can be returned when `block` is after `limit`. That might cause issues.
// might be better to return the header itself in this (rare) case.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

When block is beyond limit can't we vote on the limit instead?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

that's not a chain containing block, so it wouldn't be correct under the contract of the function

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

and this would also only happen if honest authorities are not voting according to this same rule

/// Status of the set after changes were applied.
pub(crate) struct Status<H, N> {
/// Whether internal changes were made.
pub(crate) changed: bool,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can't we remove this and use the Option below to track whether changes ocurred? The only case where changed is true but new_set_block is None is when the block imported finalizes a change at a given height which is not part of the canon chain, the change is discarded but the authority set remains unchanged.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if internal changes are made (such as pruning a non-canonicalized pending transition), we have to write to disk for correctness.

{
self.call_api_at(
&BlockId::hash(header.parent_hash().clone()),
::fg_primitives::PENDING_CHANGE_CALL,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fg_primitives::PENDING_CHANGE_CALL,

@andresilva
Copy link
Contributor

Also, do we need to filter prevotes and precommits for blocks after the first pending change limit? Honest voters shouldn't be casting those votes/commits.

@rphmeier
Copy link
Contributor Author

Also, do we need to filter prevotes and precommits for blocks after the first pending change limit? Honest voters shouldn't be casting those votes/commits.

We could do this filtering but I am assuming that since honest voters won't cast such votes that it is unnecessary.

@@ -911,7 +916,7 @@ impl<B, E, Block, RA> CallApiAt<Block> for Client<B, E, Block, RA> where
B: backend::Backend<Block, Blake2Hasher>,
E: CallExecutor<Block, Blake2Hasher> + Clone + Send + Sync,
Block: BlockT<Hash=H256>,
RA: CoreAPI<Block>
RA: Send + Sync,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
RA: Send + Sync,
RA: CoreAPI<Block>,

This should not be changed!

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why? everything compiles without?

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yeah it compiles without it, but it enforces that the CoreAPI is implemented. The CoreAPI is required to be implemented by each runtime. It internally also uses a function from the CoreAPI, the initialise_block function. This function is currently not called by the corresponding trait function, but putting the CoreAPI as an requirement, makes sure the function is present.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

ok, i will re-introduce and add a comment to that effect.


/// Call the given api function with strong arguments at the given block
/// and returns the decoded result.
fn call_api_at_strong<In: Encode, Out: Decode>(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This trait should just be used by the runtime api stuff and that does not require this function. Please remove it.

/// applied in the runtime after those N blocks have passed.
///
/// The consensus protocol will coordinate the handoff externally.
pub trait GrandpaApi<B: BlockT> {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This declaration should be placed into the decl_runtime_apis! macro.
When using the macro, please rename B: BlockT to Block: BlockT. That is currently a restriction of the macro.

B: Backend<Block, Blake2Hasher> + 'static,
E: CallExecutor<Block, Blake2Hasher> + 'static + Clone + Send + Sync,
DigestFor<Block>: Encode,
RA: Send + Sync,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
RA: Send + Sync,
RA: GrandpaApi<Block>,

fn genesis_authorities(&self) -> Result<Vec<(AuthorityId, u64)>, ClientError> {
use runtime_primitives::traits::Zero;

self.call_api_at_strong(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be changed to self.runtime_api().grandpa_authorities(&BlockId::Number(NumberFor::<Block>::zero()).

fn scheduled_change(&self, header: &Block::Header)
-> Result<Option<ScheduledChange<NumberFor<Block>>>, ClientError>
{
self.call_api_at_strong(
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This should be changed as above.

Copy link
Contributor

@andresilva andresilva left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM

@andresilva
Copy link
Contributor

Merge conflict needs resolving.

Copy link
Contributor

@gnunicorn gnunicorn left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

just some minor wording things.

@@ -0,0 +1,21 @@
[package]
name = "substrate-fg-primitives"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can we please use substrate-finality-grandpa-primitives as the crate name? I know it is verbose, but fg isn't really a commonly used shorthand (other than sr- for e.g.) and it makes it harder to understand what the crate is about when included somewhere (extern crate substrate_finality_grandpa_primitves is a lot clearer than extern crate substrate_fg_primitives) . Also matching path and crate name is a helpful convention we use everywhere else...

use client::runtime_api::ApiId;

/// ApiId for the GrandpaApi trait.
pub const GRANDPA_API: ApiId = *b"fgrandpa";
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

any particular reason this has a prefix f the other defined names don't have?

Copy link
Contributor Author

@rphmeier rphmeier Nov 15, 2018

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

to make it 8 characters: "finality-grandpa"

change.effective_number(),
change.canon_height.clone(),
))
.unwrap_or_else(|i| i);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@rphmeier are we ensuring that anywhere or is that just by convention? I couldn't find anything (in this PR) that rejects a change at a certain block if a change is already scheduled for that block. Maybe we should add such a restriction to prevent undefined behavior? (e.g. we created a change block at 10+30 and then one at 20+20 - which one is supposed to happen at block 40?)

/// Apply or prune any pending transitions. Provide a closure that can be used to check for the
/// finalized block with given number.
///
/// When the set has changed, the return value will be `Ok(Some((H, N)))` which is the cnaonical
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

*canonical

};

if let Err(e) = write_result {
warn!(target: "finality", "Failed to write updated authority set to disk. Bailing.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

so we have the trace-target "finality" and "afg" in the same struct. Maybe we want to stick with one?


if let Err(e) = write_result {
warn!(target: "finality", "Failed to write updated authority set to disk. Bailing.");
warn!(target: "finality", "Node is in a potentially inconsistent state.");
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

how about including the error in this warn already?

@gnunicorn gnunicorn added A5-grumble and removed A0-please_review Pull request needs code review. labels Nov 15, 2018
@rphmeier rphmeier merged commit 23b075c into master Nov 15, 2018
lamafab pushed a commit to lamafab/substrate that referenced this pull request Jun 16, 2020
* Fix building master

* Fix westend chainspec
Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants