Conversation
```rust
#[derive(Parser, Debug)]
pub struct Options {
    /// The URL of the ipc agent's RPC endpoint.
```
Ah, okay, I figured later this would be the agent.

You know what would be super useful is to add a README.md to the topdown crate with a section about how to run this example, which should include instructions for starting both Lotus and the agent, or at least links to where this is explained, because I don't think readers will know, which limits the usefulness of this example.
Have you thought about this?
```rust
#[arg(long, short)]
pub subnet_id: String,

/// The subnet id expressed a string
```
Wrong docs. If this is between mainnet and testnet, maybe it could say which value corresponds to what.
☝️
```rust
}

let new_parent_views =
    get_new_parent_views(agent_proxy, starting_height, ending_height).await?;
```
Need to be somewhat careful here: if a node hasn't seen anything previously, then I think it will start from the genesis of Lotus and load everything into memory.

The IPC genesis of Fendermint should have a non-zero starting height (the height at which this subnet was created), and as we discussed earlier this routine must not be started while the node is syncing with others, so that new joiners don't fetch the entire history of the subnet.
There is an issue that tracks the node syncing with others.

Regarding the genesis, I think adding it to the config should do it.
```rust
let validator_set = agent_proxy
    .get_validator_set(h)
    .await
    .context("cannot fetch validator set")?;
```
It would be worth creating an issue to optimise this so that if there's no change then it only records the configuration ID. Like a method that says `get_validator_set_if_changed(h, current_id)` and returns `None` if it's the same as the `current_id`, so we don't store a copy of the entire power table at every height in memory if nothing happened. cc @adlrocha
I think we should handle all configuration-ID-related issues together.
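The optimisation suggested above could look roughly like this. Note that `get_validator_set_if_changed`, `ValidatorSet`, and `configuration_number` are hypothetical stand-ins for whatever the agent proxy actually exposes, so treat this as a sketch of the idea rather than the real API:

```rust
/// Illustrative stand-in for the power table returned by the agent.
#[derive(Clone, Debug, PartialEq)]
pub struct ValidatorSet {
    pub configuration_number: u64,
    pub validators: Vec<String>,
}

/// Hypothetical helper: fetch the validator set at `height`, but return
/// `None` when its configuration ID matches `current_id`, so the caller
/// only stores a full copy of the power table when it actually changed.
pub fn get_validator_set_if_changed<F>(fetch: F, height: u64, current_id: u64) -> Option<ValidatorSet>
where
    F: Fn(u64) -> ValidatorSet, // stand-in for the async RPC call to the agent
{
    let set = fetch(height);
    if set.configuration_number == current_id {
        None // unchanged: keep referencing the copy we already have
    } else {
        Some(set)
    }
}
```

The cache could then record only `(height, configuration_id)` for unchanged heights, and a full power table only when the ID moves.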
```rust
    return abort(Error::HeightThresholdNotReached);
}

let height = latest_height - self.config.chain_head_delay;
```
This no longer seems to be needed, because the syncer only adds data that is final (or should only do so anyway; as I said, you should add docstrings to communicate this).
Really nice catch. I'm doing the integration testing, and with all the updates, I think the initial logic might need to change a little.
Hm, I thought this would change.
```rust
let finality = self
    .last_committed_finality
    .read_clone()?
    .ok_or(StmError::Abort(Error::CommittedFinalityNotReady))?;
```
By the way I'm surprised that this is possible. There must always be something that we can consider final on the parent: genesis; either that of the entire parent chain, or where the subnet was conceived.
This is what we discussed turning into a panic in the new phased initialisation:

- Internally the `last_committed_finality` is a `TVar<Option<...>>` and starts empty, because we don't want to peek into the genesis file, and we might be restoring from a snapshot anyway that CometBFT needs to fetch for us.
- Before the genesis is applied or the snapshot is loaded, CometBFT will not give us blocks to process, so the `App` should not call this method.
- If it did, that would be a programming error, and we can just `panic!` if the `last_committed_finality` is `None`.
- During genesis or loading a snapshot, the interpreter should call `set_committed_finality` to cause this to become `Some`.
- After that it's safe to query and return `Stm<IPCParentFinality>`.
- On subsequent starts we can query the ledger to know where to resume from.

Notably we discussed that we won't query in a loop, because then there is a race condition between filling the cache and querying it.
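The phased lifecycle described above could be sketched as follows, using a plain `Mutex<Option<...>>` as a stand-in for the `TVar<Option<...>>` and simplified types throughout; this only illustrates the initialisation phases, not fendermint's actual provider:

```rust
use std::sync::Mutex;

#[derive(Clone, Debug, PartialEq)]
pub struct IPCParentFinality {
    pub height: u64,
    pub block_hash: Vec<u8>,
}

pub struct FinalityProvider {
    /// Starts as `None`: we don't peek into the genesis file, and a
    /// snapshot may still be on its way from CometBFT.
    last_committed: Mutex<Option<IPCParentFinality>>,
}

impl FinalityProvider {
    pub fn new() -> Self {
        Self { last_committed: Mutex::new(None) }
    }

    /// Called by the interpreter during genesis or snapshot restore.
    pub fn set_committed_finality(&self, finality: IPCParentFinality) {
        *self.last_committed.lock().unwrap() = Some(finality);
    }

    /// Safe to call only after initialisation; an earlier call is a
    /// programming error in the `App`, hence a panic, not an `Error`.
    pub fn last_committed_finality(&self) -> IPCParentFinality {
        self.last_committed
            .lock()
            .unwrap()
            .clone()
            .expect("last_committed_finality queried before genesis/snapshot was applied")
    }
}
```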
Squashed commits:

- prepare and process
- prepare and process
- lint and format
- format code
- use dev branch
- Update fendermint/app/config/default.toml (Co-authored-by: Akosh Farkash <aakoshh@gmail.com>)
- Update fendermint/vm/interpreter/src/chain.rs (Co-authored-by: Akosh Farkash <aakoshh@gmail.com>)
- review feedback
- lint
- fmt code
- fix test
- format code
- update default subnet id
- merge with parent branch
- format code

Co-authored-by: Akosh Farkash <aakoshh@gmail.com>
```toml
[ipc.config]
# The number of blocks behind chain head to propose parent finality.
chain_head_delay = 10
```
```diff
-chain_head_delay = 10
+chain_head_delay = 900
```
It's Filecoin, and this is the safe default.
```diff
@@ -21,6 +23,11 @@ cmd! {
     }
 }

+async fn create_parent_finality(settings: &Settings) -> anyhow::Result<InMemoryFinalityProvider> {
```
Just curious why this has to be `async`. Or indeed why it has to return a `Result`.
```rust
match atomically_or_err::<_, fendermint_vm_topdown::Error, _>(|| {
    finality_provider.next_proposal()
})
.await?
```
Based on this code it still doesn't look like we're on the same page. What I have been trying to say in my PR comments, Slack, and Zoom is that:

- `next_proposal` should not return an `Error`; it should be `fn next_proposal(&self) -> Stm<Option<Proposal>>`
- it should be called with `atomically`, not `atomically_or_err`
- if you use `?` in an interpreter and it fails, it will crash Fendermint, so only ever use it for fatal errors
- none of the error conditions I remember are fatal; some were temporary, some config based

Sorry if I seem stuck on this issue; it's not because I am trying to be super pedantic, but because we have to be so careful about what we treat as an error. In other cases it's easy to say just return `anyhow::Result`, who cares, `?` will make it disappear, but it's precisely that pattern that gets us into trouble here. @adlrocha asked the same thing first, if it's okay to return an error. It's okay if there is nothing else you could return. Like you can't reach the database, so what will you do? We must make sure that if we return errors it's because a) we want the call site to handle them, or b) we want the world to stop, basically.
```rust
        block_hash: proposal.block_hash,
    })))
    }
}
```
Funny how you wedged this in between the creation and the use of `ckpts`.
```rust
// Append at the end - if we run out of block space, these are going to be reproposed in the next block.
msgs.extend(ckpts);
Ok(msgs)
}

/// Perform finality checks on top-down transactions and availability checks on bottom-up transactions.
```

```diff
-    async fn process(&self, pool: Self::State, msgs: Vec<Self::Message>) -> anyhow::Result<bool> {
+    async fn process(&self, state: Self::State, msgs: Vec<Self::Message>) -> anyhow::Result<bool> {
```
```diff
-    async fn process(&self, state: Self::State, msgs: Vec<Self::Message>) -> anyhow::Result<bool> {
+    async fn process(&self, (pool, provider): Self::State, msgs: Vec<Self::Message>) -> anyhow::Result<bool> {
```

I think this works and would make the rest of the code more readable. `state.0.get_status` is a head scratcher.
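For reference, Rust accepts tuple patterns directly in function parameters, so the destructuring suggested above is ordinary syntax. A tiny demo with placeholder types:

```rust
struct Pool;
struct Provider;

// Destructuring the tuple state in the parameter list lets the body refer
// to `pool` and `provider` by name instead of `state.0` / `state.1`.
fn process((pool, provider): (Pool, Provider)) -> bool {
    let (_p, _q) = (pool, provider); // use the named parts directly
    true
}
```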
```rust
if atomically_or_err(|| state.1.check_proposal(&prop))
    .await
    .is_err()
```
`check_proposal` should return `Stm<bool>` 🙏
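A sketch of what the call path could look like once validity is an ordinary `bool` rather than a `Result`; `check_proposal` and the provider fields here are simplified stand-ins for the STM-based trait:

```rust
#[derive(Debug)]
pub struct IPCParentFinality {
    pub height: u64,
}

pub struct Provider {
    pub last_committed_height: u64,
}

impl Provider {
    /// Stand-in for `fn check_proposal(&self, ..) -> Stm<bool>`: an invalid
    /// proposal is just `false`, not an error to propagate with `?`.
    pub fn check_proposal(&self, proposal: &IPCParentFinality) -> bool {
        proposal.height > self.last_committed_height
    }
}
```

The interpreter then simply branches on the `bool` to accept or reject the proposal, with no error variant to crash on.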
```rust
pub trait ParentFinalityProvider: ParentViewProvider {
    /// Obtains the last committed finality
    fn last_committed_finality(&self) -> StmResult<IPCParentFinality, Error>;
    /// Latest proposal for parent finality
    fn next_proposal(&self) -> StmResult<Option<IPCParentFinality>, Error>;
    /// Check if the target proposal is valid
    fn check_proposal(&self, proposal: &IPCParentFinality) -> StmResult<(), Error>;
    /// Called when finality is committed
    fn on_finality_committed(&self, finality: &IPCParentFinality) -> StmResult<(), Error>;
}
```
```diff
 pub trait ParentFinalityProvider: ParentViewProvider {
     /// Obtains the last committed finality
-    fn last_committed_finality(&self) -> StmResult<IPCParentFinality, Error>;
+    fn last_committed_finality(&self) -> Stm<IPCParentFinality>;
     /// Latest proposal for parent finality
-    fn next_proposal(&self) -> StmResult<Option<IPCParentFinality>, Error>;
+    fn next_proposal(&self) -> Stm<Option<IPCParentFinality>>;
     /// Check if the target proposal is valid
-    fn check_proposal(&self, proposal: &IPCParentFinality) -> StmResult<(), Error>;
+    fn check_proposal(&self, proposal: &IPCParentFinality) -> Stm<bool>;
     /// Called when finality is committed
-    fn on_finality_committed(&self, finality: &IPCParentFinality) -> StmResult<(), Error>;
+    fn on_finality_committed(&self, finality: &IPCParentFinality) -> Stm<()>;
 }
```
```rust
fn latest_height(&self) -> StmResult<Option<BlockHeight>, Error>;
/// Get the block hash at height
fn block_hash(&self, height: BlockHeight) -> StmResult<Option<BlockHash>, Error>;
```
```diff
-fn latest_height(&self) -> StmResult<Option<BlockHeight>, Error>;
+fn latest_height(&self) -> Stm<Option<BlockHeight>>;
 /// Get the block hash at height
-fn block_hash(&self, height: BlockHeight) -> StmResult<Option<BlockHash>, Error>;
+fn block_hash(&self, height: BlockHeight) -> Stm<Option<BlockHash>>;
```
```rust
/// Get the validator set at height
fn validator_set(&self, height: BlockHeight) -> StmResult<Option<ValidatorSet>, Error>;
```
```diff
-/// Get the validator set at height
-fn validator_set(&self, height: BlockHeight) -> StmResult<Option<ValidatorSet>, Error>;
```
```rust
/// Get the validator set at height
fn validator_set(&self, height: BlockHeight) -> StmResult<Option<ValidatorSet>, Error>;
/// Get the top down messages at height
fn top_down_msgs(&self, height: BlockHeight) -> StmResult<Vec<CrossMsg>, Error>;
```
```diff
-fn top_down_msgs(&self, height: BlockHeight) -> StmResult<Vec<CrossMsg>, Error>;
+fn top_down_msgs(&self, height: BlockHeight) -> Stm<Option<Vec<CrossMsg>>>;
```
So the fixes to the comments in this PR are supposed to be in #236.
Constantly syncs with the parent using a polling mechanism. Integration with the ipc-agent is pending.
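The rough shape of such a polling loop, with the agent and the view store mocked out; names like `MockParent`, `delay`, and `poll` are illustrative only, and the `delay` field plays the role of the `chain_head_delay` config value so only presumed-final heights are recorded:

```rust
use std::collections::BTreeMap;

/// Stand-in for the parent chain as seen through the ipc-agent.
pub struct MockParent {
    pub chain_head: u64,
    pub blocks: BTreeMap<u64, Vec<u8>>, // height -> block hash
}

pub struct Syncer {
    pub last_recorded: u64,
    pub delay: u64, // e.g. chain_head_delay from the config
    pub views: BTreeMap<u64, Vec<u8>>,
}

impl Syncer {
    /// One polling step: record every height from the last recorded view
    /// up to `chain_head - delay`, so only data behind the delay window
    /// (and therefore presumed final) ever enters the cache.
    pub fn poll(&mut self, parent: &MockParent) {
        let target = parent.chain_head.saturating_sub(self.delay);
        while self.last_recorded < target {
            self.last_recorded += 1;
            if let Some(hash) = parent.blocks.get(&self.last_recorded) {
                self.views.insert(self.last_recorded, hash.clone());
            }
        }
    }
}
```

In the real syncer the loop would run on a timer and the starting height would come from the genesis or the last committed finality, per the discussion above.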