Spurious checkpoint verifier JoinError panic on zebrad shutdown #1576

teor2345 · 2021-01-11T08:36:08Z

Version

zebrad 1.0.0-alpha.0

Commit a418284 "Create the global span immediately after activating tracing" from PR #1568

Platform

Linux ... 5.4.83 #1-NixOS SMP Fri Dec 11 12:23:33 UTC 2020 x86_64 GNU/Linux

Description

zebrad can panic with a spurious commit_finalized_block should not panic: JoinError::Cancelled error in the checkpointer on shutdown.

I tried this:

Shutting down zebrad using Control-C (SIGINT)

I expected to see this happen:

zebrad exits without panicking

Instead, this happened:

zebrad panics with a spurious commit_finalized_block should not panic: JoinError::Cancelled in the CheckpointVerifier.

This invariant is stronger than required, because the syncer is shutting down.

The FinalizedState already ensures the necessary invariants:

commit continuous blocks to the finalized state
commit complete block data for each block to the finalized state (using a RocksDB transaction)

Solution

Ignore the checkpoint invariant during shutdown
Add an acceptance test that makes sure we can restart in the middle of a checkpoint:
Add an acceptance test that waits for 10 (?) seconds, kills zebrad, and then restarts it (rather than using debug_stop_at_height)

Restarting in the middle of a checkpoint:

Add a height parameter to the restart_stop_at_height test
Run the test for heights 0, zebra_consensus::MAX_CHECKPOINT_HEIGHT_GAP / 2, and zebra_consensus::MAX_CHECKPOINT_HEIGHT_GAP

That way, we'll test that Zebra can restart:

after genesis
after an incomplete checkpoint
after a full checkpoint
after being killed (rather than exiting normally)

Commands

zebrad start from a debug build, using the default config

Related Issues

We should re-test the fix for this issue after #1351, because it will change the interrupt handler so it works even when zebrad is busy.

Spurious MustUseOneshotSender panic on zebrad shutdown #1574 is a similar issue in the network code

Logs

Jan 11 18:24:16.025  INFO {zebrad="a4182846" net=Mainnet}:sync:extend_tips:checkpoint: zebra_consensus::checkpoint: verified checkpoint range block_count=1600 current_range=(Excluded(Height(4800)), Included(Height(6400)))
Jan 11 18:24:18.863  INFO {zebrad="a4182846" net=Mainnet}:sync: zebrad::components::sync: waiting for pending blocks tips.len=1 in_flight=2372 lookahead_limit=2000
Jan 11 18:24:18.866  INFO {zebrad="a4182846" net=Mainnet}:sync: zebrad::components::sync: extending tips tips.len=1 in_flight=2000 lookahead_limit=2000
^CJan 11 18:24:21.751  INFO {zebrad="a4182846" net=Mainnet}:sig{kind=SignalKind(2) name="SIGINT"}: zebrad::signal: received SIGINT, starting shutdown
The application panicked (crashed).
Message:  commit_finalized_block should not panic: JoinError::Cancelled
Location: /home/dev/zebra/zebra-consensus/src/checkpoint.rs:902

  ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━ SPANTRACE ━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━
  
   0: zebrad::components::sync::extend_tips
      at zebrad/src/components/sync.rs:381
   1: zebrad::components::sync::sync
      at zebrad/src/components/sync.rs:150
   2: zebrad::application:: with zebrad="a4182846" net=Mainnet
      at zebrad/src/application.rs:240

The text was updated successfully, but these errors were encountered:

oxarbitrage · 2021-01-26T20:22:10Z

@teor2345 this is not a priority but if you get some time, can you provide some hint on how to implement the checkpoint rollback?

teor2345 · 2021-01-27T00:51:17Z

can you provide some hint on how to implement the checkpoint rollback?

@oxarbitrage when I originally wrote the ticket, I thought we needed to preserve the "whole checkpoint" invariant.

But we're shutting down the syncer, and it's the only component that cares about checkpoints. So we only need to preserve the "continuous blocks" and "complete block" invariants, which are much weaker.

We've already implemented these invariants:

continuous blocks: make sure blocks are committed in order

The FinalizedState queues non-continuous blocks and does not write them to disk:

    pub fn queue_and_commit_finalized(&mut self, queued: QueuedFinalized) {
        let prev_hash = queued.0.block.header.previous_block_hash;
        let height = queued.0.height;
        self.queued_by_prev_hash.insert(prev_hash, queued);

        while let Some(queued_block) = self.queued_by_prev_hash.remove(&self.finalized_tip_hash()) {
            self.commit_finalized(queued_block);
            ...
        }
    ...
    }

https://github.com/ZcashFoundation/zebra/blob/main/zebra-state/src/service/finalized_state.rs#L111

We also assert if we try to commit any blocks to disk out of order:

        // Assert that callers (including unit tests) get the chain order correct
        if self.is_empty(hash_by_height) {
            assert_eq!(
                block::Hash([0; 32]),
                block.header.previous_block_hash,
                "the first block added to an empty state must be a genesis block"
            );
            assert_eq!(
                block::Height(0),
                height,
                "cannot commit genesis: invalid height"
            );
        } else {
            assert_eq!(
                self.finalized_tip_height()
                    .expect("state must have a genesis block committed")
                    + 1,
                Some(height),
                "committed block height must be 1 more than the finalized tip height"
            );

            assert_eq!(
                self.finalized_tip_hash(),
                block.header.previous_block_hash,
                "committed block must be a child of the finalized tip"
            );
        }

https://github.com/ZcashFoundation/zebra/blob/main/zebra-state/src/service/finalized_state.rs#L180

complete block: write the data for each block in a database transaction that covers all column families

        let batch = prepare_commit();

        let result = self.db.write(batch).map(|()| hash);

https://github.com/ZcashFoundation/zebra/blob/main/zebra-state/src/service/finalized_state.rs#L271

So we can just ignore this panic as well. If any blocks are incomplete, the database will roll them back.

I've edited the ticket, and added some tests to make sure zebrad behaves like we're expecting. These are important tests, because they make sure that we don't corrupt the database. But they're not urgent, this ticket can wait until a later sprint.

teor2345 added C-bug Category: This is a bug A-rust Area: Updates to Rust code S-needs-triage Status: A bug report needs triage labels Jan 11, 2021

mpguerra removed the S-needs-triage Status: A bug report needs triage label Jan 13, 2021

teor2345 mentioned this issue Jan 16, 2021

Spurious MustUseOneshotSender panic on zebrad shutdown #1574

Closed

oxarbitrage mentioned this issue Jan 29, 2021

Fix shutdown panics #1637

Merged

7 tasks

mpguerra added this to the 2021 Sprint 2 milestone Feb 1, 2021

teor2345 mentioned this issue Feb 3, 2021

Ignore some panics during shutdown using the panic handler #1677

Closed

3 tasks

teor2345 closed this as completed in #1637 Feb 3, 2021

teor2345 mentioned this issue May 3, 2021

Fatal panic: state service commit block failed: verified checkpoints must be committed transactionally: CommitFinalized(Closed) #2055

Closed

teor2345 mentioned this issue Nov 12, 2021

Design and implement graceful shutdown for Zebra #1678

Closed

3 tasks

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Spurious checkpoint verifier JoinError panic on zebrad shutdown #1576

Spurious checkpoint verifier JoinError panic on zebrad shutdown #1576

teor2345 commented Jan 11, 2021 •

edited

Loading

oxarbitrage commented Jan 26, 2021

teor2345 commented Jan 27, 2021 •

edited

Loading

Spurious checkpoint verifier JoinError panic on zebrad shutdown #1576

Spurious checkpoint verifier JoinError panic on zebrad shutdown #1576

Comments

teor2345 commented Jan 11, 2021 • edited Loading

oxarbitrage commented Jan 26, 2021

teor2345 commented Jan 27, 2021 • edited Loading

teor2345 commented Jan 11, 2021 •

edited

Loading

teor2345 commented Jan 27, 2021 •

edited

Loading