Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Spurious checkpoint verifier JoinError panic on zebrad shutdown #1576

Closed
3 tasks
teor2345 opened this issue Jan 11, 2021 · 2 comments · Fixed by #1637
Closed
3 tasks

Spurious checkpoint verifier JoinError panic on zebrad shutdown #1576

teor2345 opened this issue Jan 11, 2021 · 2 comments · Fixed by #1637
Labels
A-rust Area: Updates to Rust code C-bug Category: This is a bug
Milestone

Comments

@teor2345
Copy link
Contributor

teor2345 commented Jan 11, 2021

Version

zebrad 1.0.0-alpha.0

Commit a418284 "Create the global span immediately after activating tracing" from PR #1568

Platform

Linux ... 5.4.83 #1-NixOS SMP Fri Dec 11 12:23:33 UTC 2020 x86_64 GNU/Linux

Description

zebrad can panic with a spurious commit_finalized_block should not panic: JoinError::Cancelled error in the checkpointer on shutdown.

I tried this:

Shutting down zebrad using Control-C (SIGINT)

I expected to see this happen:

zebrad exits without panicking

Instead, this happened:

zebrad panics with a spurious commit_finalized_block should not panic: JoinError::Cancelled in the CheckpointVerifier.

This invariant is stronger than required, because the syncer is shutting down.

The FinalizedState already ensures the necessary invariants:

  • commit continuous blocks to the finalized state
  • commit complete block data for each block to the finalized state (using a RocksDB transaction)

Solution

  • Ignore the checkpoint invariant during shutdown
  • Add an acceptance test that makes sure we can restart in the middle of a checkpoint:
  • Add an acceptance test that waits for 10 (?) seconds, kills zebrad, and then restarts it (rather than using debug_stop_at_height)

Restarting in the middle of a checkpoint:

  1. Add a height parameter to the restart_stop_at_height test
  2. Run the test for heights 0, zebra_consensus::MAX_CHECKPOINT_HEIGHT_GAP / 2, and zebra_consensus::MAX_CHECKPOINT_HEIGHT_GAP

That way, we'll test that Zebra can restart:

  • after genesis
  • after an incomplete checkpoint
  • after a full checkpoint
  • after being killed (rather than exiting normally)

Commands

zebrad start from a debug build, using the default config

Related Issues

We should re-test the fix for this issue after #1351, because it will change the interrupt handler so it works even when zebrad is busy.

Spurious MustUseOneshotSender panic on zebrad shutdown #1574 is a similar issue in the network code

Logs

Jan 11 18:24:16.025  INFO {zebrad="a4182846" net=Mainnet}:sync:extend_tips:checkpoint: zebra_consensus::checkpoint: verified checkpoint range block_count=1600 current_range=(Excluded(Height(4800)), Included(Height(6400)))
Jan 11 18:24:18.863  INFO {zebrad="a4182846" net=Mainnet}:sync: zebrad::components::sync: waiting for pending blocks tips.len=1 in_flight=2372 lookahead_limit=2000
Jan 11 18:24:18.866  INFO {zebrad="a4182846" net=Mainnet}:sync: zebrad::components::sync: extending tips tips.len=1 in_flight=2000 lookahead_limit=2000
^CJan 11 18:24:21.751  INFO {zebrad="a4182846" net=Mainnet}:sig{kind=SignalKind(2) name="SIGINT"}: zebrad::signal: received SIGINT, starting shutdown
The application panicked (crashed).
Message:  commit_finalized_block should not panic: JoinError::Cancelled
Location: /home/dev/zebra/zebra-consensus/src/checkpoint.rs:902

  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ SPANTRACE ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  
   0: zebrad::components::sync::extend_tips
      at zebrad/src/components/sync.rs:381
   1: zebrad::components::sync::sync
      at zebrad/src/components/sync.rs:150
   2: zebrad::application:: with zebrad="a4182846" net=Mainnet
      at zebrad/src/application.rs:240
@teor2345 teor2345 added C-bug Category: This is a bug A-rust Area: Updates to Rust code S-needs-triage Status: A bug report needs triage labels Jan 11, 2021
@mpguerra mpguerra removed the S-needs-triage Status: A bug report needs triage label Jan 13, 2021
@oxarbitrage
Copy link
Contributor

@teor2345 this is not a priority but if you get some time, can you provide some hint on how to implement the checkpoint rollback?

@teor2345
Copy link
Contributor Author

teor2345 commented Jan 27, 2021

can you provide some hint on how to implement the checkpoint rollback?

@oxarbitrage when I originally wrote the ticket, I thought we needed to preserve the "whole checkpoint" invariant.

But we're shutting down the syncer, and it's the only component that cares about checkpoints. So we only need to preserve the "continuous blocks" and "complete block" invariants, which are much weaker.

We've already implemented these invariants:

continuous blocks: make sure blocks are committed in order

The FinalizedState queues non-continuous blocks and does not write them to disk:

    pub fn queue_and_commit_finalized(&mut self, queued: QueuedFinalized) {
        let prev_hash = queued.0.block.header.previous_block_hash;
        let height = queued.0.height;
        self.queued_by_prev_hash.insert(prev_hash, queued);

        while let Some(queued_block) = self.queued_by_prev_hash.remove(&self.finalized_tip_hash()) {
            self.commit_finalized(queued_block);
            ...
        }
    ...
    }

https://github.com/ZcashFoundation/zebra/blob/main/zebra-state/src/service/finalized_state.rs#L111

We also assert if we try to commit any blocks to disk out of order:

        // Assert that callers (including unit tests) get the chain order correct
        if self.is_empty(hash_by_height) {
            assert_eq!(
                block::Hash([0; 32]),
                block.header.previous_block_hash,
                "the first block added to an empty state must be a genesis block"
            );
            assert_eq!(
                block::Height(0),
                height,
                "cannot commit genesis: invalid height"
            );
        } else {
            assert_eq!(
                self.finalized_tip_height()
                    .expect("state must have a genesis block committed")
                    + 1,
                Some(height),
                "committed block height must be 1 more than the finalized tip height"
            );

            assert_eq!(
                self.finalized_tip_hash(),
                block.header.previous_block_hash,
                "committed block must be a child of the finalized tip"
            );
        }

https://github.com/ZcashFoundation/zebra/blob/main/zebra-state/src/service/finalized_state.rs#L180

complete block: write the data for each block in a database transaction that covers all column families

        let batch = prepare_commit();

        let result = self.db.write(batch).map(|()| hash);

https://github.com/ZcashFoundation/zebra/blob/main/zebra-state/src/service/finalized_state.rs#L271

So we can just ignore this panic as well. If any blocks are incomplete, the database will roll them back.

I've edited the ticket, and added some tests to make sure zebrad behaves like we're expecting. These are important tests, because they make sure that we don't corrupt the database. But they're not urgent, this ticket can wait until a later sprint.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
A-rust Area: Updates to Rust code C-bug Category: This is a bug
Projects
None yet
Development

Successfully merging a pull request may close this issue.

3 participants