Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

replay: reload tower if set-identity during startup #35173

Merged
merged 3 commits into from
Feb 20, 2024

Conversation

AshwinSekar
Copy link
Contributor

Problem

if set-identity is called before the first loop of replay, we panic when pushing vote using a tower of the old identity

Summary of Changes

reload tower upon replay initialization if identity has changed

Fixes #35152

@AshwinSekar AshwinSekar added v1.18 PRs that should be backported to v1.18 v1.17 PRs that should be backported to v1.17 labels Feb 11, 2024
Copy link
Contributor

mergify bot commented Feb 11, 2024

Backports to the stable branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule.

Copy link
Contributor

mergify bot commented Feb 11, 2024

Backports to the beta branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule. Exceptions include CI/metrics changes, CLI improvements and documentation updates on a case by case basis.

Copy link

codecov bot commented Feb 11, 2024

Codecov Report

Attention: 16 lines in your changes are missing coverage. Please review.

Comparison is base (d87e7bc) 81.6% compared to head (25affc3) 81.6%.
Report is 1 commits behind head on master.

❗ Current head 25affc3 differs from pull request most recent head d63fd4f. Consider uploading reports for the commit d63fd4f to get more accurate results

Additional details and impacted files
@@           Coverage Diff            @@
##           master   #35173    +/-   ##
========================================
  Coverage    81.6%    81.6%            
========================================
  Files         833      833            
  Lines      224768   224898   +130     
========================================
+ Hits       183475   183599   +124     
- Misses      41293    41299     +6     

"Identity changed from {} to {}",
startup_identity, my_pubkey
);
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why do we need a warn! here but not in the loop?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

we do warn in loop see line 1011

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Ok, if you always warn you can put that in the common function I guess

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

would prefer not to, as the helper is just a generic load of the tower. it doesn't need to know anything about the previous pubkey.

)
} else {
error!("Failed to load tower for {}: {}", new_identity, err);
std::process::exit(1);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why exit instead of panic! ? Would returning an error be better?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's just rust style guidelines:
"For planned shutdowns, use exit(). Do note that a known error is considered a planned shutdown
For unplanned shutdowns (i.e. exceptional failures) consider panic!(), as you'll both benefit from being able to get a stack trace when this happens, and the failure case should be exceptional enough that it is effectively unaccounted for and stems from an unplanned scenario"
not being able to load tower due to corrupted disk is an expected failure condition, we should always shutdown if we do not have a working tower.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

however i haven't really had to debug these scenarios before, if you think the unwind from panic is helpful we can go with that.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the difference is small, but panic! is probably better style because we use it more:
https://stackoverflow.com/questions/39228685/when-is-stdprocessexit-o-k-to-use

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sounds good, i'll change to panic in a separate PR, don't want to backport changes to existing behavior.

@@ -578,6 +580,20 @@ impl ReplayStage {
let _exit = Finalizer::new(exit.clone());
let mut identity_keypair = cluster_info.keypair().clone();
let mut my_pubkey = identity_keypair.pubkey();
if my_pubkey != startup_identity {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

So if set-identity comes in after this line could we successfully update it in the loop?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah since my_pubkey is cached until the condition is checked in the loop we are fine if rpc executes between here and the loop

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

smaller change is to just load my_pubkey from tower.node_pubkey and let the existing set-identity handle things. But this might be better for quick turnaround

}
})
}

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Add a test maybe?

@AshwinSekar AshwinSekar force-pushed the set-identity branch 3 times, most recently from a7c3aac to 25affc3 Compare February 14, 2024 16:31
vote_account: &Pubkey,
bank_forks: &Arc<RwLock<BankForks>>,
) -> Tower {
Tower::restore(tower_storage, new_identity)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm suddenly confused, let's say some sloppy validator operators is playing with set-identity back and forth, is the following scenario possible:
using identity A, saved tower at time t0
set-identity to B, saving new tower at time t1
set-identity to A, loaded the old tower at time t0

Is it correct behavior to restore an old tower if you play with set-identity?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tower file is operator responsibility. This is possible now too. The purpose of this PR is to remove the validator from crashing when changing identity. In any case, my experience shows that changing the identity takes a few seconds. Also, yes, looks like fully correct.

@@ -8598,4 +8625,60 @@ pub(crate) mod tests {
assert_eq!(reset_fork, Some(4));
assert_eq!(failures, vec![HeaviestForkFailures::LockedOut(4),]);
}

#[test]
fn test_tower_reload_missing() {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: make tests name more descriptive that we are switching identity for tower here?

&bank_forks,
);
let expected_tower = Tower::new_for_tests(VOTE_THRESHOLD_DEPTH, VOTE_THRESHOLD_SIZE);
assert_eq!(tower.vote_state, expected_tower.vote_state);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hmm, do we test that the new tower has new identity somewhere?

@carllin
Copy link
Contributor

carllin commented Feb 19, 2024

I was wondering why the current set-identity had to be called inside the reset bank condition: #18182 (comment). Seems like it's to ensure the new identity doesn't attempt to make the blocks for the previous leader.

@@ -578,6 +580,20 @@ impl ReplayStage {
let _exit = Finalizer::new(exit.clone());
let mut identity_keypair = cluster_info.keypair().clone();
let mut my_pubkey = identity_keypair.pubkey();
if my_pubkey != startup_identity {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

smaller change is to just load my_pubkey from tower.node_pubkey and let the existing set-identity handle things. But this might be better for quick turnaround

@@ -546,6 +547,7 @@ impl ReplayStage {
popular_pruned_forks_receiver: PopularPrunedForksReceiver,
) -> Result<Self, String> {
let ReplayStageConfig {
startup_identity,
vote_account,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can we just check tower.node_pubkey

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice catch

@AshwinSekar
Copy link
Contributor Author

Updated to just use tower.node_pubkey instead of passing around the pubkey.

I still kept the initial check before the loop since there could be some weird stuff before we hit the check in the loop. since the tower doesn't know the full keypair we might end up sending out a vote/refresh signed with the new keypair for the previous identity's vote.

tower & poh_recorder should now be correct as soon as the loop starts.

@AshwinSekar AshwinSekar merged commit befe8b9 into solana-labs:master Feb 20, 2024
45 checks passed
@AshwinSekar AshwinSekar deleted the set-identity branch February 20, 2024 17:30
mergify bot pushed a commit that referenced this pull request Feb 20, 2024
* replay: reload tower if set-identity during startup

* pr feedback: add unit tests

* pr feedback: use tower.node_pubkey, more descriptive names

(cherry picked from commit befe8b9)
mergify bot pushed a commit that referenced this pull request Feb 20, 2024
* replay: reload tower if set-identity during startup

* pr feedback: add unit tests

* pr feedback: use tower.node_pubkey, more descriptive names

(cherry picked from commit befe8b9)
AshwinSekar added a commit that referenced this pull request Feb 20, 2024
#35173) (#35257)

replay: reload tower if set-identity during startup (#35173)

* replay: reload tower if set-identity during startup

* pr feedback: add unit tests

* pr feedback: use tower.node_pubkey, more descriptive names

(cherry picked from commit befe8b9)

Co-authored-by: Ashwin Sekar <ashwin@solana.com>
AshwinSekar added a commit that referenced this pull request Feb 20, 2024
#35173) (#35256)

replay: reload tower if set-identity during startup (#35173)

* replay: reload tower if set-identity during startup

* pr feedback: add unit tests

* pr feedback: use tower.node_pubkey, more descriptive names

(cherry picked from commit befe8b9)

Co-authored-by: Ashwin Sekar <ashwin@solana.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
v1.17 PRs that should be backported to v1.17 v1.18 PRs that should be backported to v1.18
Projects
None yet
Development

Successfully merging this pull request may close these issues.

set-identity causes undesirable behavior when used while validator is waiting for supermajority.
4 participants