Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Security: Wait for RocksDB tasks to exit on shutdown #164

Merged
merged 5 commits into from
Jan 25, 2022
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
67 changes: 52 additions & 15 deletions zebra-state/src/service/finalized_state.rs
Original file line number Diff line number Diff line change
Expand Up @@ -500,16 +500,9 @@ impl FinalizedState {
"stopping at configured height, flushing database to disk"
);

// We'd like to drop the database here, because that closes the
// column families and the database. But Rust's ownership rules
// make that difficult, so we just flush and delete ephemeral data instead.

// TODO: remove these extra logs once bugs like #2905 are fixed
self.db.flush().expect("flush is successful");
tracing::info!("flushed database to disk");

self.delete_ephemeral();
self.shutdown();

// TODO: replace with a graceful shutdown (#1678)
Self::exit_process();
}

Expand All @@ -519,6 +512,7 @@ impl FinalizedState {
/// Exit the host process.
///
/// Designed for debugging and tests.
/// TODO: replace with a graceful shutdown (#1678)
fn exit_process() -> ! {
tracing::info!("exiting Zebra");

Expand Down Expand Up @@ -830,16 +824,59 @@ impl FinalizedState {

self.db.write(batch).unwrap();
}

/// Shut down the database, cleaning up background tasks and ephemeral data.
fn shutdown(&mut self) {
// Drop isn't guaranteed to run, such as when we panic, or if the tokio shutdown times out.
//
// Zebra's data should be fine if we don't clean up, because:
// - the database flushes regularly anyway
// - Zebra commits each block in a database transaction, any incomplete blocks get rolled back
// - ephemeral files are placed in the os temp dir and should be cleaned up automatically eventually
tracing::info!("flushing database to disk");
self.db.flush().expect("flush is successful");

// But we should call `cancel_all_background_work` before Zebra exits.
// If we don't, we see these kinds of errors:
// ```
// pthread lock: Invalid argument
// pure virtual method called
// terminate called without an active exception
// pthread destroy mutex: Device or resource busy
// Aborted (core dumped)
// ```
//
// The RocksDB wiki says:
// > Q: Is it safe to close RocksDB while another thread is issuing read, write or manual compaction requests?
// >
// > A: No. The users of RocksDB need to make sure all functions have finished before they close RocksDB.
// > You can speed up the waiting by calling CancelAllBackgroundWork().
//
// https://github.com/facebook/rocksdb/wiki/RocksDB-FAQ
tracing::info!("stopping background database tasks");
self.db.cancel_all_background_work(true);

// We'd like to drop the database before deleting its files,
// because that closes the column families and the database correctly.
// But Rust's ownership rules make that difficult,
// so we just flush and delete ephemeral data instead.
//
// The RocksDB wiki says:
// > rocksdb::DB instances need to be destroyed before your main function exits.
// > RocksDB instances usually depend on some internal static variables.
// > Users need to make sure rocksdb::DB instances are destroyed before those static variables.
//
// https://github.com/facebook/rocksdb/wiki/Known-Issues
//
// But our current code doesn't seem to cause any issues.
// We might want to explicitly drop the database as part of graceful shutdown (#1678).
self.delete_ephemeral();
}
}

// Drop isn't guaranteed to run, such as when we panic, or if someone stored
// their FinalizedState in a static, but it should be fine if we don't clean
// this up since the files are placed in the os temp dir and should be cleaned
// up automatically eventually.
impl Drop for FinalizedState {
fn drop(&mut self) {
self.db.cancel_all_background_work(false);
self.delete_ephemeral();
self.shutdown();
}
}

Expand Down
61 changes: 61 additions & 0 deletions zebra-test/scripts/shutdown-errors
Original file line number Diff line number Diff line change
@@ -0,0 +1,61 @@
#!/usr/bin/env bash

# test that Zebra can shut down without errors

# print every command run by this script
#set -x

# initial shutdown delay
# interesting delays are between 1-30 seconds on most machines
SHUTDOWN_DELAY=0

# each test increases the shutdown delay by this much
SHUTDOWN_DELAY_INCREMENT=1

# final shutdown delay
# the test stops after a successful run with a delay this long
SHUTDOWN_DELAY_LIMIT=30


cargo build --bin zebrad || exit $?

EXIT_STATUS=0
while [ $EXIT_STATUS -eq 0 ] && [ $SHUTDOWN_DELAY -le $SHUTDOWN_DELAY_LIMIT ]; do
# remove previously downloaded Zcash parameter files
#
# if you don't have these downloaded already, the killed downloads will be incomplete,
# which causes an error in Zebra
# rm -r ~/.zcash-params

echo
echo "Running Zebra for $SHUTDOWN_DELAY seconds"
echo

# shut down Zebra if this script exits while it's running,
# but ignore "no such job" errors if Zebra has already exited
trap "kill %?zebrad 2> /dev/null" EXIT

target/debug/zebrad start &
sleep $SHUTDOWN_DELAY

echo
echo "Killing Zebra after $SHUTDOWN_DELAY seconds"
echo

kill %?zebrad
wait %?zebrad
EXIT_STATUS=$?

# fix up the exit status caused by 'kill'
if [ $EXIT_STATUS -eq 143 ]; then
EXIT_STATUS=0
fi

echo
echo "Killing Zebra after $SHUTDOWN_DELAY seconds exited with $EXIT_STATUS"
echo

SHUTDOWN_DELAY=$[SHUTDOWN_DELAY + SHUTDOWN_DELAY_INCREMENT]
done

exit $EXIT_STATUS
13 changes: 10 additions & 3 deletions zebrad/src/components/tokio.rs
Original file line number Diff line number Diff line change
Expand Up @@ -3,9 +3,12 @@
use crate::prelude::*;
use abscissa_core::{Application, Component, FrameworkError, Shutdown};
use color_eyre::Report;
use std::future::Future;
use std::{future::Future, time::Duration};
use tokio::runtime::Runtime;

/// When Zebra is shutting down, wait this long for tokio tasks to finish.
const TOKIO_SHUTDOWN_TIMEOUT: Duration = std::time::Duration::from_secs(20);

/// An Abscissa component which owns a Tokio runtime.
///
/// The runtime is stored as an `Option` so that when it's time to enter an async
Expand Down Expand Up @@ -57,8 +60,12 @@ impl RuntimeRun for Runtime {
}
});

// Enforce shutdown by avoiding long blocking tasks
self.shutdown_timeout(std::time::Duration::from_secs(5));
// Don't wait for long blocking tasks before shutting down
tracing::info!(
?TOKIO_SHUTDOWN_TIMEOUT,
"waiting for async tokio tasks to shut down"
);
self.shutdown_timeout(TOKIO_SHUTDOWN_TIMEOUT);

match result {
Ok(()) => {
Expand Down