This repository has been archived by the owner on Jan 11, 2024. It is now read-only.

Top Down Checkpoint - Parent Sync #206

Closed
wants to merge 108 commits into from

Conversation

cryptoAtwill
Contributor

Constantly syncs with the parent using a polling mechanism. Integration with the ipc-agent is pending.
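The overall shape is roughly the loop below (a sketch only; AgentProxy, sync_new_parent_views, and the interval are illustrative placeholders, not names from this PR):

use std::time::Duration;

// Poll the parent chain on a fixed interval and fold any newly
// observed parent views into the local state.
// NOTE: the names below are illustrative placeholders.
async fn run_polling_sync(agent_proxy: AgentProxy, poll_interval: Duration) -> anyhow::Result<()> {
    let mut ticker = tokio::time::interval(poll_interval);
    loop {
        ticker.tick().await;
        if let Err(e) = sync_new_parent_views(&agent_proxy).await {
            // A failed poll is not fatal; log it and retry on the next tick.
            tracing::warn!("parent sync failed: {e}");
        }
    }
}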


#[derive(Parser, Debug)]
pub struct Options {
/// The URL of the ipc agent's RPC endpoint.
Contributor

Ah, okay, I figured this would later be the agent.

You know what would be super useful: a README.md in the topdown crate with a section on how to run this example. It should include instructions for starting both Lotus and the agent, or at least links to where this is explained. I don't think readers will know otherwise, which limits the usefulness of this example.

Contributor

Have you thought about this?

#[arg(long, short)]
pub subnet_id: String,

/// The subnet id expressed a string
Contributor

Wrong docs. If this chooses between mainnet and testnet, maybe it could say which value corresponds to which.

Contributor

☝️

}

let new_parent_views =
get_new_parent_views(agent_proxy, starting_height, ending_height).await?;
Contributor

We need to be somewhat careful here: if a node hasn't seen anything previously then, I think, it will start from the genesis of Lotus and load everything into memory.

The IPC genesis of Fendermint should have a non-zero starting height (the height at which this subnet was created), and as we discussed earlier, this routine must not be started while the node is syncing with others, so that new joiners don't fetch the entire history of the subnet.

Contributor Author

cryptoAtwill commented Sep 7, 2023

There is an issue that tracks the node syncing with others.
Regarding the genesis, I think adding it to the config should do it.
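Something like this in the config could do it (a sketch; the key name and value are hypothetical, not part of this PR):

[ipc.config]
# Parent height at which this subnet was created; the syncer starts
# polling from here instead of the parent chain's genesis.
# NOTE: key name and value are hypothetical.
subnet_creation_height = 123456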

Comment on lines +141 to +144
let validator_set = agent_proxy
.get_validator_set(h)
.await
.context("cannot fetch validator set")?;
Contributor

It would be worth creating an issue to optimise this so that if there's no change, we only record the configuration ID. Something like a method get_validator_set_if_changed(h, current_id) that returns None if the configuration at h is the same as current_id, so we don't store a copy of the entire power table in memory at every height where nothing happened. cc @adlrocha
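For illustration, such a method could look like this (a sketch; the ConfigurationId type and the configuration_id field are assumptions about the agent's response, not its actual API):

/// Hypothetical variant of get_validator_set that returns None when the
/// configuration at `height` is the same as `current_id`, so callers
/// don't cache an unchanged power table at every height.
async fn get_validator_set_if_changed(
    &self,
    height: BlockHeight,
    current_id: ConfigurationId, // assumed type
) -> anyhow::Result<Option<ValidatorSet>> {
    let set = self.get_validator_set(height).await?;
    if set.configuration_id == current_id {
        Ok(None)
    } else {
        Ok(Some(set))
    }
}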

Contributor Author

I think we should handle all configuration-ID-related issues together.

return abort(Error::HeightThresholdNotReached);
}

let height = latest_height - self.config.chain_head_delay;
Contributor

This no longer seems to be needed, because the syncer only adds data that is final (or should only do so anyway; as I said, you should add docstrings to communicate this).

Contributor Author

Really nice catch. I'm doing the integration testing, and with all the updates I think the initial logic might have to change a little.

Contributor

Hm, I thought this would change.

let finality = self
.last_committed_finality
.read_clone()?
.ok_or(StmError::Abort(Error::CommittedFinalityNotReady))?;
Contributor

By the way, I'm surprised that this is possible. There must always be something that we can consider final on the parent: genesis, either that of the entire parent chain or the height at which the subnet was conceived.

Contributor

aakoshh commented Sep 11, 2023

This is what we discussed turning into a panic in the new phased initialisation:

  1. Internally the last_committed_finality is a TVar<Option<...>> and starts empty, because we don't want to peek into the genesis file, and we might be restoring from a snapshot that CometBFT needs to fetch for us anyway.
  2. Before the genesis is applied or the snapshot is loaded, CometBFT will not give us blocks to process, so the App should not call this method.
  3. If it did, that would be a programming error, and we can just panic! if the last_committed_finality is None.
  4. During genesis or while loading a snapshot, the interpreter should call set_committed_finality to cause this to become Some.
  5. After that it's safe to query, and the method can return Stm<IPCParentFinality>.
  6. On subsequent starts we can query the ledger to know where to resume from.

Notably we discussed that we won't query in a loop because then there is a race condition between filling the cache and querying it.
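Under that scheme, point 3 could be as simple as the following (a sketch; the panic message is illustrative):

fn last_committed_finality(&self) -> Stm<IPCParentFinality> {
    let maybe_finality = self.last_committed_finality.read_clone()?;
    // Per points 2 and 3 above: CometBFT only feeds us blocks once genesis
    // or a snapshot has been applied, so None here is a programming error.
    Ok(maybe_finality.expect("last_committed_finality queried before genesis/snapshot was applied"))
}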

cryptoAtwill and others added 3 commits September 7, 2023 13:33
* prepare and process

* prepare and process

* lint and format

* format code

* use dev branch

* Update fendermint/app/config/default.toml

Co-authored-by: Akosh Farkash <aakoshh@gmail.com>

* Update fendermint/vm/interpreter/src/chain.rs

Co-authored-by: Akosh Farkash <aakoshh@gmail.com>

* review feedback

* lint

* fmt code

* fix test

* format code

* update default subnet id

* merge with parent branch

* format code

---------

Co-authored-by: Akosh Farkash <aakoshh@gmail.com>

[ipc.config]
# The number of blocks behind chain head to propose parent finality.
chain_head_delay = 10
Contributor

Suggested change
chain_head_delay = 10
chain_head_delay = 900

It's Filecoin, and this is the safe default.
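For reference, 900 epochs is Filecoin's conventional finality window (at 30-second epochs, roughly 7.5 hours), so with this default the proposals only reference parent blocks that are effectively irreversible.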

@@ -21,6 +23,11 @@ cmd! {
}
}

async fn create_parent_finality(settings: &Settings) -> anyhow::Result<InMemoryFinalityProvider> {
Contributor

aakoshh commented Sep 11, 2023

Just curious why this has to be async.
Or indeed why it has to return a Result.

Comment on lines +101 to +104
match atomically_or_err::<_, fendermint_vm_topdown::Error, _>(|| {
finality_provider.next_proposal()
})
.await?
Contributor

Based on this code it still doesn't look like we're on the same page.

What I have been trying to say in my PR comments, Slack, and Zoom is that

  • next_proposal should not return an Error; it should be fn next_proposal(&self) -> Stm<Option<Proposal>>.
  • it should be called with atomically, not atomically_or_err.
  • if you use ? in an interpreter and it fails, it will crash Fendermint, so only ever use it for fatal errors.
  • none of the error conditions I remember are fatal; some were temporary, some config based.

Sorry if I seem stuck on this issue; it's not because I am trying to be super pedantic, but because we have to be so careful about what we treat as an error. In other cases it's easy to say: just return anyhow::Result, who cares, ? will make it disappear, but it's precisely that pattern that gets us into trouble here. @adlrocha asked the same thing first: is it okay to return an error? It's okay if there is nothing else you could return. Like you can't reach the database, so what will you do? We must make sure that if we return errors it's because a) we want the call site to handle them, or b) we basically want the world to stop.
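Concretely, with fn next_proposal(&self) -> Stm<Option<IPCParentFinality>>, the call site quoted above could shrink to something like this (a sketch):

use async_stm::atomically;

// `atomically` retries the transaction until it commits; with no error
// type left in the signature, there is nothing for `?` to crash on.
let maybe_proposal = atomically(|| finality_provider.next_proposal()).await;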

block_hash: proposal.block_hash,
})))
}
}
Contributor

Funny how you wedged this in between the creation and the use of ckpts.

// Append at the end - if we run out of block space, these are going to be reproposed in the next block.
msgs.extend(ckpts);
Ok(msgs)
}

/// Perform finality checks on top-down transactions and availability checks on bottom-up transactions.
async fn process(&self, pool: Self::State, msgs: Vec<Self::Message>) -> anyhow::Result<bool> {
async fn process(&self, state: Self::State, msgs: Vec<Self::Message>) -> anyhow::Result<bool> {
Contributor

Suggested change
async fn process(&self, state: Self::State, msgs: Vec<Self::Message>) -> anyhow::Result<bool> {
async fn process(&self, (pool, provider): Self::State, msgs: Vec<Self::Message>) -> anyhow::Result<bool> {

I think this works and would make the rest of the code more readable; state.0.get_status is a head scratcher.

Comment on lines +150 to +152
if atomically_or_err(|| state.1.check_proposal(&prop))
.await
.is_err()
Contributor

check_proposal should return Stm<bool> 🙏
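The call site quoted above would then read something like this (a sketch; whether to return Ok(false) or simply skip the message is up to the interpreter):

// A plain boolean out of the STM transaction: an invalid proposal is
// rejected, not treated as a fatal error.
if !atomically(|| state.1.check_proposal(&prop)).await {
    return Ok(false);
}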

Comment on lines +65 to +74
pub trait ParentFinalityProvider: ParentViewProvider {
/// Obtains the last committed finality
fn last_committed_finality(&self) -> StmResult<IPCParentFinality, Error>;
/// Latest proposal for parent finality
fn next_proposal(&self) -> StmResult<Option<IPCParentFinality>, Error>;
/// Check if the target proposal is valid
fn check_proposal(&self, proposal: &IPCParentFinality) -> StmResult<(), Error>;
/// Called when finality is committed
fn on_finality_committed(&self, finality: &IPCParentFinality) -> StmResult<(), Error>;
}
Contributor

Suggested change
pub trait ParentFinalityProvider: ParentViewProvider {
/// Obtains the last committed finality
fn last_committed_finality(&self) -> StmResult<IPCParentFinality, Error>;
/// Latest proposal for parent finality
fn next_proposal(&self) -> StmResult<Option<IPCParentFinality>, Error>;
/// Check if the target proposal is valid
fn check_proposal(&self, proposal: &IPCParentFinality) -> StmResult<(), Error>;
/// Called when finality is committed
fn on_finality_committed(&self, finality: &IPCParentFinality) -> StmResult<(), Error>;
}
pub trait ParentFinalityProvider: ParentViewProvider {
/// Obtains the last committed finality
fn last_committed_finality(&self) -> Stm<IPCParentFinality>;
/// Latest proposal for parent finality
fn next_proposal(&self) -> Stm<Option<IPCParentFinality>>;
/// Check if the target proposal is valid
fn check_proposal(&self, proposal: &IPCParentFinality) -> Stm<bool>;
/// Called when finality is committed
fn on_finality_committed(&self, finality: &IPCParentFinality) -> Stm<()>;
}

Comment on lines +48 to +50
fn latest_height(&self) -> StmResult<Option<BlockHeight>, Error>;
/// Get the block hash at height
fn block_hash(&self, height: BlockHeight) -> StmResult<Option<BlockHash>, Error>;
Contributor

Suggested change
fn latest_height(&self) -> StmResult<Option<BlockHeight>, Error>;
/// Get the block hash at height
fn block_hash(&self, height: BlockHeight) -> StmResult<Option<BlockHash>, Error>;
fn latest_height(&self) -> Stm<Option<BlockHeight>>;
/// Get the block hash at height
fn block_hash(&self, height: BlockHeight) -> Stm<Option<BlockHash>>;

Comment on lines +51 to +52
/// Get the validator set at height
fn validator_set(&self, height: BlockHeight) -> StmResult<Option<ValidatorSet>, Error>;
Contributor

Suggested change
/// Get the validator set at height
fn validator_set(&self, height: BlockHeight) -> StmResult<Option<ValidatorSet>, Error>;

/// Get the validator set at height
fn validator_set(&self, height: BlockHeight) -> StmResult<Option<ValidatorSet>, Error>;
/// Get the top down messages at height
fn top_down_msgs(&self, height: BlockHeight) -> StmResult<Vec<CrossMsg>, Error>;
Contributor

Suggested change
fn top_down_msgs(&self, height: BlockHeight) -> StmResult<Vec<CrossMsg>, Error>;
fn top_down_msgs(&self, height: BlockHeight) -> Stm<Option<Vec<CrossMsg>>>;

Contributor

aakoshh commented Sep 12, 2023

So the fixes to the comments in this PR are supposed to be in #236.

@cryptoAtwill cryptoAtwill deleted the parent-sync branch September 27, 2023 01:34