replay: reload tower if set-identity during startup #35173
Backports to the stable branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule.
Backports to the beta branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule. Exceptions include CI/metrics changes, CLI improvements and documentation updates on a case by case basis.
Codecov Report
Additional details and impacted files
@@           Coverage Diff            @@
##           master   #35173    +/-   ##
========================================
  Coverage    81.6%    81.6%
========================================
  Files         833      833
  Lines      224768   224898    +130
========================================
+ Hits       183475   183599    +124
- Misses      41293    41299      +6
                "Identity changed from {} to {}",
                startup_identity, my_pubkey
            );
        }
Why do we need a warn! here but not in the loop?
We do warn in the loop; see line 1011.
Ok, if you always warn you can put that in the common function I guess
I would prefer not to, as the helper is just a generic load of the tower; it doesn't need to know anything about the previous pubkey.
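The split argued for in this thread can be sketched roughly as below: a loader that only knows the identity it is asked to load, and a caller that owns the comparison and the warning. This is a minimal sketch, not the code in core/src/replay_stage.rs. `Pubkey`, `Tower`, `TowerStorage`, `load_tower` and `reload_if_changed` are simplified, hypothetical stand-ins, and falling back to a fresh tower when none is saved is an assumption.

```rust
use std::collections::HashMap;

// Simplified stand-ins for illustration only; not the real Solana types.
type Pubkey = u64;

#[derive(Clone, Debug, PartialEq)]
struct Tower {
    node_pubkey: Pubkey,
}

struct TowerStorage {
    saved: HashMap<Pubkey, Tower>,
}

/// Generic load: only knows the identity it is asked for.
/// Assumption: a missing tower yields a fresh one for that identity.
fn load_tower(storage: &TowerStorage, identity: Pubkey) -> Tower {
    storage
        .saved
        .get(&identity)
        .cloned()
        .unwrap_or(Tower { node_pubkey: identity })
}

/// Caller-side logic: notices the change, warns, then delegates to the loader.
/// Returns the warning text (if any) so the behavior is easy to check.
fn reload_if_changed(
    storage: &TowerStorage,
    tower: &mut Tower,
    my_pubkey: Pubkey,
) -> Option<String> {
    if tower.node_pubkey == my_pubkey {
        return None;
    }
    let warning = format!(
        "Identity changed from {} to {}",
        tower.node_pubkey, my_pubkey
    );
    eprintln!("{}", warning);
    *tower = load_tower(storage, my_pubkey);
    Some(warning)
}
```

Because the previous pubkey appears only in `reload_if_changed`, the loader can be reused from any call site without threading the old identity through it.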
            )
        } else {
            error!("Failed to load tower for {}: {}", new_identity, err);
            std::process::exit(1);
Why exit instead of panic! ? Would returning an error be better?
I think it's just the Rust style guidelines:
"For planned shutdowns, use exit(). Do note that a known error is considered a planned shutdown.
For unplanned shutdowns (i.e. exceptional failures) consider panic!(), as you'll both benefit from being able to get a stack trace when this happens, and the failure case should be exceptional enough that it is effectively unaccounted for and stems from an unplanned scenario."
Not being able to load the tower due to a corrupted disk is an expected failure condition; we should always shut down if we do not have a working tower.
However, I haven't really had to debug these scenarios before; if you think the unwind from panic is helpful, we can go with that.
I think the difference is small, but panic! is probably better style because we use it more:
https://stackoverflow.com/questions/39228685/when-is-stdprocessexit-o-k-to-use
Sounds good, I'll change to panic in a separate PR; I don't want to backport changes to existing behavior.
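For reference, the options weighed in this thread (exit, panic!, or returning an error) can be compared in a small sketch. Everything here is hypothetical: `TowerError`, `restore`, `load_or_exit` and `load_or_panic` are illustrative names, and a tower is reduced to a single number parsed from a string. The behavioral difference is standard Rust: `std::process::exit` terminates immediately without running destructors, while `panic!` unwinds by default and can print a backtrace when `RUST_BACKTRACE` is set.

```rust
use std::fmt;

// Hypothetical error type; the real tower error type is not shown in this thread.
#[derive(Debug, PartialEq)]
enum TowerError {
    FileMissing,
    Corrupt(String),
}

impl fmt::Display for TowerError {
    fn fmt(&self, f: &mut fmt::Formatter<'_>) -> fmt::Result {
        match self {
            TowerError::FileMissing => write!(f, "tower file missing"),
            TowerError::Corrupt(why) => write!(f, "tower file corrupt: {}", why),
        }
    }
}

/// Option 3 from the thread: return the error and let the caller decide.
/// A "tower" is reduced to one number here, purely for illustration.
fn restore(raw: Option<&str>) -> Result<u64, TowerError> {
    match raw {
        None => Err(TowerError::FileMissing),
        Some(text) => text
            .trim()
            .parse::<u64>()
            .map_err(|e| TowerError::Corrupt(e.to_string())),
    }
}

/// Option 1: treat the failure as a planned shutdown.
/// No unwinding happens and destructors on the stack are not run.
fn load_or_exit(raw: Option<&str>) -> u64 {
    restore(raw).unwrap_or_else(|err| {
        eprintln!("Failed to load tower: {}", err);
        std::process::exit(1)
    })
}

/// Option 2: treat the failure as exceptional.
/// The stack unwinds by default and a backtrace is available on request.
fn load_or_panic(raw: Option<&str>) -> u64 {
    restore(raw).unwrap_or_else(|err| panic!("Failed to load tower: {}", err))
}
```

Only the `Result` and `panic!` variants are easy to exercise in a unit test, since `load_or_exit` would end the test process itself.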
core/src/replay_stage.rs
Outdated
@@ -578,6 +580,20 @@ impl ReplayStage {
        let _exit = Finalizer::new(exit.clone());
        let mut identity_keypair = cluster_info.keypair().clone();
        let mut my_pubkey = identity_keypair.pubkey();
        if my_pubkey != startup_identity {
So if set-identity comes in after this line could we successfully update it in the loop?
Yeah, since my_pubkey is cached until the condition is checked in the loop, we are fine if the RPC executes between here and the loop.
A smaller change is to just load my_pubkey from tower.node_pubkey and let the existing set-identity handle things. But this might be better for quick turnaround.
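The caching argument above can be illustrated with a small model: the pubkey is read once before the loop, and each iteration compares that cached value with the live identity, so a set-identity that lands in between is still noticed on the first iteration. This is a sketch under assumptions; `ClusterInfo`, `ReplayState`, `start` and `run_iteration` are simplified stand-ins, and the real loop performs the check under additional conditions not modeled here.

```rust
use std::sync::RwLock;

// Simplified stand-ins for illustration only.
type Pubkey = u64;

struct ClusterInfo {
    identity: RwLock<Pubkey>,
}

impl ClusterInfo {
    fn new(identity: Pubkey) -> Self {
        Self { identity: RwLock::new(identity) }
    }

    /// Live identity, as an RPC set-identity call would leave it.
    fn id(&self) -> Pubkey {
        *self.identity.read().unwrap()
    }

    fn set_identity(&self, new_identity: Pubkey) {
        *self.identity.write().unwrap() = new_identity;
    }
}

struct ReplayState {
    my_pubkey: Pubkey,         // cached copy, refreshed only inside the loop
    tower_node_pubkey: Pubkey, // identity the loaded tower belongs to
}

impl ReplayState {
    /// Reads and caches the identity once, before the loop starts.
    fn start(cluster_info: &ClusterInfo) -> Self {
        let my_pubkey = cluster_info.id();
        Self { my_pubkey, tower_node_pubkey: my_pubkey }
    }

    /// One loop iteration: returns true if an identity change was handled.
    fn run_iteration(&mut self, cluster_info: &ClusterInfo) -> bool {
        let current = cluster_info.id();
        if self.my_pubkey == current {
            return false;
        }
        self.my_pubkey = current;
        self.tower_node_pubkey = current; // stands in for reloading the tower
        true
    }
}
```

In this model a change made after `start` but before the first `run_iteration` is picked up exactly once, which is the property the reply relies on.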
            }
        })
    }
Add a test maybe?
a7c3aac to 25affc3
core/src/replay_stage.rs
Outdated
        vote_account: &Pubkey,
        bank_forks: &Arc<RwLock<BankForks>>,
    ) -> Tower {
        Tower::restore(tower_storage, new_identity)
I'm suddenly confused. Let's say a sloppy validator operator is playing with set-identity back and forth; is the following scenario possible?
Using identity A, the tower is saved at time t0.
set-identity to B, and a new tower is saved at time t1.
set-identity to A, and the old tower from time t0 is loaded.
Is it correct behavior to restore an old tower if you play with set-identity?
The tower file is the operator's responsibility. This is possible now too. The purpose of this PR is to stop the validator from crashing when changing identity. In any case, in my experience changing the identity takes a few seconds. Also, yes, this looks fully correct.
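The A, B, A scenario from the question can be replayed against a toy storage to make the outcome concrete. This sketch assumes towers are stored per identity, which is how the `Tower::restore(tower_storage, new_identity)` call in the diff reads; the `TowerStorage` type, its `store` and `load` methods, and the slot numbers are hypothetical.

```rust
use std::collections::HashMap;

// Simplified stand-ins for illustration only.
type Pubkey = &'static str;

#[derive(Clone, Debug, PartialEq)]
struct Tower {
    node_pubkey: Pubkey,
    last_voted_slot: u64,
}

/// Assumption: one saved tower per identity.
#[derive(Default)]
struct TowerStorage {
    files: HashMap<Pubkey, Tower>,
}

impl TowerStorage {
    fn store(&mut self, tower: &Tower) {
        self.files.insert(tower.node_pubkey, tower.clone());
    }

    fn load(&self, identity: Pubkey) -> Option<Tower> {
        self.files.get(&identity).cloned()
    }
}

/// Replays the scenario and returns the tower loaded on the final switch.
fn switch_back_and_forth() -> Tower {
    let mut storage = TowerStorage::default();
    // t0: running as A, tower saved after voting on slot 100.
    storage.store(&Tower { node_pubkey: "A", last_voted_slot: 100 });
    // t1: set-identity to B, whose tower is saved after voting on slot 150.
    storage.store(&Tower { node_pubkey: "B", last_voted_slot: 150 });
    // set-identity back to A: the loader finds A's file from t0.
    storage.load("A").expect("A's tower was saved at t0")
}
```

The tower returned for A still reflects slot 100, which is the old state the question describes and which the reply treats as the operator's responsibility.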
core/src/replay_stage.rs
Outdated
@@ -8598,4 +8625,60 @@ pub(crate) mod tests {
        assert_eq!(reset_fork, Some(4));
        assert_eq!(failures, vec![HeaviestForkFailures::LockedOut(4),]);
    }

    #[test]
    fn test_tower_reload_missing() {
nit: make the test name more descriptive, to show that we are switching the tower's identity here?
            &bank_forks,
        );
        let expected_tower = Tower::new_for_tests(VOTE_THRESHOLD_DEPTH, VOTE_THRESHOLD_SIZE);
        assert_eq!(tower.vote_state, expected_tower.vote_state);
Hmm, do we test that the new tower has new identity somewhere?
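One way to cover this would be to assert on the identity alongside the vote state. The sketch below shows the shape of such a check using stand-in types; `reload_missing_tower` and `check_reload_uses_new_identity` are hypothetical names, and the real test would go through the helper in replay_stage.rs and `Tower::new_for_tests` instead.

```rust
// Simplified stand-ins for illustration only.
type Pubkey = u64;

#[derive(Clone, Debug, Default, PartialEq)]
struct VoteState {
    votes: Vec<u64>,
}

#[derive(Clone, Debug, PartialEq)]
struct Tower {
    node_pubkey: Pubkey,
    vote_state: VoteState,
}

/// Hypothetical reload for the "tower file missing" case:
/// a fresh tower that belongs to the new identity.
fn reload_missing_tower(new_identity: Pubkey) -> Tower {
    Tower {
        node_pubkey: new_identity,
        vote_state: VoteState::default(),
    }
}

/// Body of a possible unit test: checks the identity as well as the vote state.
fn check_reload_uses_new_identity() {
    let old_identity: Pubkey = 1;
    let new_identity: Pubkey = 2;
    let tower = reload_missing_tower(new_identity);
    assert_eq!(tower.vote_state, VoteState::default());
    assert_eq!(tower.node_pubkey, new_identity);
    assert_ne!(tower.node_pubkey, old_identity);
}
```

Comparing only `vote_state`, as in the hunk above, would pass even if the reloaded tower kept the old `node_pubkey`, so the extra assertion is what pins down the identity switch.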
I was wondering why the current set-identity had to be called inside the reset bank condition: #18182 (comment). Seems like it's to ensure the new identity doesn't attempt to make the blocks for the previous leader.
core/src/replay_stage.rs
Outdated
@@ -546,6 +547,7 @@ impl ReplayStage {
        popular_pruned_forks_receiver: PopularPrunedForksReceiver,
    ) -> Result<Self, String> {
        let ReplayStageConfig {
            startup_identity,
            vote_account,
can we just check tower.node_pubkey
nice catch
25affc3 to d63fd4f
Updated to just use tower.node_pubkey. I still kept the initial check before the loop, since some weird stuff could happen before we hit the check in the loop: because the tower doesn't know the full keypair, we might end up sending out a vote/refresh signed with the new keypair for the previous identity's vote. tower and poh_recorder should now be correct as soon as the loop starts.
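The reason for keeping the pre-loop check can be sketched as follows: if the tower still belongs to the old identity while the keypair is already the new one, a vote would be signed by one identity on behalf of another. In this toy model the mismatch is reported as an error value so it can be inspected, whereas the PR description says the real code panics. `Keypair`, `Tower`, `IdentityMismatch`, `push_vote` and `reconcile_at_startup` are all hypothetical stand-ins.

```rust
// Simplified stand-ins for illustration only.
type Pubkey = u64;

struct Keypair {
    pubkey: Pubkey,
}

#[derive(Debug, PartialEq)]
struct Tower {
    node_pubkey: Pubkey,
}

#[derive(Debug, PartialEq)]
struct IdentityMismatch {
    tower: Pubkey,
    signer: Pubkey,
}

/// Models pushing a vote: refuses to sign when the tower and the signing
/// keypair disagree. (The real code panics in this situation.)
fn push_vote(tower: &Tower, keypair: &Keypair) -> Result<(), IdentityMismatch> {
    if tower.node_pubkey != keypair.pubkey {
        return Err(IdentityMismatch {
            tower: tower.node_pubkey,
            signer: keypair.pubkey,
        });
    }
    Ok(())
}

/// The check kept before the loop: make the tower match the current keypair
/// before any vote or refresh can be sent. Building a fresh tower here stands
/// in for reloading it from storage.
fn reconcile_at_startup(tower: Tower, keypair: &Keypair) -> Tower {
    if tower.node_pubkey == keypair.pubkey {
        tower
    } else {
        Tower { node_pubkey: keypair.pubkey }
    }
}
```

With the reconciliation applied first, `push_vote` never sees a tower and keypair that disagree, which matches the claim that both are correct as soon as the loop starts.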
* replay: reload tower if set-identity during startup
* pr feedback: add unit tests
* pr feedback: use tower.node_pubkey, more descriptive names
(cherry picked from commit befe8b9)
#35173) (#35257) replay: reload tower if set-identity during startup (#35173) * replay: reload tower if set-identity during startup * pr feedback: add unit tests * pr feedback: use tower.node_pubkey, more descriptive names (cherry picked from commit befe8b9) Co-authored-by: Ashwin Sekar <ashwin@solana.com>
#35173) (#35256) replay: reload tower if set-identity during startup (#35173) * replay: reload tower if set-identity during startup * pr feedback: add unit tests * pr feedback: use tower.node_pubkey, more descriptive names (cherry picked from commit befe8b9) Co-authored-by: Ashwin Sekar <ashwin@solana.com>
Problem
If set-identity is called before the first iteration of the replay loop, we panic when pushing a vote using a tower of the old identity.
Summary of Changes
Reload the tower upon replay initialization if the identity has changed.
Fixes #35152