
feat: Dump state of every epoch to S3 #8661

Merged
20 commits merged into near:master on Mar 10, 2023

Conversation

@nikurt (Contributor) commented on Mar 1, 2023:

  • Start a thread per shard to do the dumping.
  • AWS credentials are provided as environment variables: `AWS_ACCESS_KEY_ID` and `AWS_SECRET_ACCESS_KEY`.
  • In `config.json`, specify both `config.state_sync.s3_bucket` and `config.state_sync.s3_region` to enable the new behavior (see the sketch below).
  • No changes to the behavior of the node if those options are not set in `config.json`.
  • State is persisted to RocksDB so that restarts of the node are handled well.
  • Some useful metrics are exported.
  • The node assumes it's the only node in this and all alternative universes that does the dumping.
    • Unclear how to use multiple nodes to complete the dump faster.
  • TODO: Speed this up by doing things in parallel: obtain parts, upload parts, set tags.
    • Do we even need tags?
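
A minimal sketch of the `config.json` section described above; the bucket name and region values are placeholders, not defaults from this PR:

```json
{
  "state_sync": {
    "s3_bucket": "my-state-parts-bucket",
    "s3_region": "eu-central-1"
  }
}
```

With both keys present and `AWS_ACCESS_KEY_ID` / `AWS_SECRET_ACCESS_KEY` exported, the node starts one dumping thread per shard; if either key is absent, the node behaves exactly as before.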

@nikurt (Contributor Author) commented on Mar 1, 2023:

Tags on state parts look like this:
[screenshot: S3 object tags on an uploaded state part]

@nikurt (Contributor Author) commented on Mar 1, 2023:

cc: @mm-near

@nikurt requested review from mm-near and mzhangmzz (March 2, 2023 14:19)
@nikurt marked this pull request as ready for review (March 2, 2023 14:19)
@nikurt requested a review from a team as a code owner (March 2, 2023 14:19)
Resolved review threads:
core/chain-configs/src/client_config.rs (outdated)
core/primitives/src/syncing.rs
core/primitives/src/syncing.rs
nearcore/src/state_sync.rs
// Create a connection to S3. Credentials are read from the environment.
let s3_bucket = config.client_config.state_sync_s3_bucket.clone();
let s3_region = config.client_config.state_sync_s3_region.clone();
let bucket = s3::Bucket::new(
    &s3_bucket,
    s3_region.parse().expect("Failed to parse the S3 region"),
    s3::creds::Credentials::default().expect("Failed to create S3 credentials"),
).expect("Failed to create the S3 bucket");
Contributor:

here we assume that credentials are magically set in the environment?

Contributor Author:

Yes, added a comment.
Added a suggestion to an error message.
Will also add to the documentation.
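
A minimal sketch of that assumption, using the credentials type bundled with the `rust-s3` crate (the error message text here is only a suggestion, not the one added in the PR):

```rust
use s3::creds::Credentials;

fn main() {
    // Reads AWS_ACCESS_KEY_ID and AWS_SECRET_ACCESS_KEY from the process
    // environment; fails if either variable is missing.
    let _credentials = Credentials::from_env().expect(
        "Failed to create S3 credentials: set the AWS_ACCESS_KEY_ID and \
         AWS_SECRET_ACCESS_KEY environment variables",
    );
}
```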

nearcore/src/state_sync.rs (outdated, resolved)
// The actual dumping of state to S3.
tracing::info!(target: "state_sync_dump", shard_id, ?epoch_id, epoch_height, %sync_hash, %state_root, parts_dumped, num_parts, "Creating parts and dumping them");
let mut res = None;
for part_id in parts_dumped..num_parts {
Contributor:

nit: I'd move this code (until line 242) into a separate method.

Contributor Author:

Moved as much as I can, because I can't move all of it.
The function would need to be async because it calls put_object, but passing chain to an async function is not an option, because Chain isn't Send. (See the sketch below.)
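
A self-contained sketch of the constraint described above; the `Chain` stand-in and helper names are illustrative, not nearcore's actual API, and the example assumes the `futures` crate for `block_on`. The non-`Send` value never crosses an `.await`; only the produced bytes do:

```rust
use std::rc::Rc;

// Stand-in for nearcore's `Chain`: the `Rc` field makes this type !Send,
// which is what forbids moving it into an async function that gets spawned.
struct Chain {
    _not_send: Rc<()>,
}

impl Chain {
    // Synchronous part creation: `Chain` stays on the current thread.
    fn obtain_state_part(&self, part_id: u64) -> Vec<u8> {
        format!("state part {part_id}").into_bytes()
    }
}

// Only the byte buffer (which is Send) reaches the async upload, e.g.
// rust-s3's `bucket.put_object(&key, &data).await`.
async fn upload_part(key: String, data: Vec<u8>) {
    println!("uploading {} bytes to {}", data.len(), key);
}

fn main() {
    let chain = Chain { _not_send: Rc::new(()) };
    let data = chain.obtain_state_part(0); // sync: no .await while `chain` is in use
    futures::executor::block_on(upload_part("part-0".to_string(), data));
}
```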

nearcore/src/state_sync.rs (outdated, resolved)
chain/chain/src/store.rs (resolved)
/// for example 'STATE_SYNC_DUMP:2' for shard_id=2.
fn state_sync_dump_progress_key(shard_id: ShardId) -> Vec<u8> {
    let mut key = b"STATE_SYNC_DUMP:".to_vec();
    key.extend(shard_id.to_le_bytes());
    key
}
Contributor:

I also wonder if this key shouldn't contain the epoch id information

Contributor Author:

Not including the epoch id in the key, because only one epoch dump per shard is possible at a time.
Adding the epoch id to the key could be useful for keeping a history of epochs dumped to external storage, but that doesn't seem necessary.
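
For illustration, assuming `ShardId` is a `u64` here: the key appends the shard id as 8 raw little-endian bytes, not as the ASCII digit the doc-comment example might suggest:

```rust
#[test]
fn progress_key_layout() {
    let key = state_sync_dump_progress_key(2);
    assert_eq!(key[..16], *b"STATE_SYNC_DUMP:");
    // The shard id is the raw little-endian encoding: 02 00 00 00 00 00 00 00.
    assert_eq!(key[16..], 2u64.to_le_bytes());
}
```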

@near-bulldozer near-bulldozer bot merged commit 787268b into near:master Mar 10, 2023
nikurt added a commit to nikurt/nearcore that referenced this pull request Mar 15, 2023
nikurt added a commit to nikurt/nearcore that referenced this pull request Mar 15, 2023
nikurt added a commit to nikurt/nearcore that referenced this pull request Mar 20, 2023
nikurt added a commit to nikurt/nearcore that referenced this pull request Mar 23, 2023
nikurt added a commit to nikurt/nearcore that referenced this pull request Apr 6, 2023
nikurt added a commit to nikurt/nearcore that referenced this pull request Apr 13, 2023
nikurt added a commit that referenced this pull request Apr 14, 2023