replay: reload tower if set-identity during startup #35173

AshwinSekar · 2024-02-11T05:44:29Z

Problem

if set-identity is called before the first iteration of the replay loop, we panic when pushing a vote, because the tower still belongs to the old identity

Summary of Changes

reload tower upon replay initialization if identity has changed

mergify · 2024-02-11T05:45:48Z

Backports to the stable branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule.

mergify · 2024-02-11T05:45:49Z

Backports to the beta branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule. Exceptions include CI/metrics changes, CLI improvements and documentation updates on a case by case basis.

codecov · 2024-02-11T07:28:15Z

Codecov Report

Attention: 16 lines in your changes are missing coverage. Please review.

Comparison is base (d87e7bc) 81.6% compared to head (25affc3) 81.6%.
Report is 1 commits behind head on master.

❗ Current head 25affc3 differs from pull request most recent head d63fd4f. Consider uploading reports for the commit d63fd4f to get more accurate results

Additional details and impacted files

@@           Coverage Diff            @@
##           master   #35173    +/-   ##
========================================
  Coverage    81.6%    81.6%            
========================================
  Files         833      833            
  Lines      224768   224898   +130     
========================================
+ Hits       183475   183599   +124     
- Misses      41293    41299     +6

wen-coding · 2024-02-11T23:15:22Z

core/src/replay_stage.rs

+                    "Identity changed from {} to {}",
+                    startup_identity, my_pubkey
+                );
+            }


Why do we need a warn! here but not in the loop?

we do warn in the loop, see line 1011

Ok, if you always warn you can put that in the common function I guess

would prefer not to, as the helper is just a generic load of the tower. it doesn't need to know anything about the previous pubkey.

wen-coding · 2024-02-11T23:18:34Z

core/src/replay_stage.rs

+                    )
+                } else {
+                    error!("Failed to load tower for {}: {}", new_identity, err);
+                    std::process::exit(1);


Why exit instead of panic! ? Would returning an error be better?

I think it's just Rust style guidelines:
"For planned shutdowns, use exit(). Do note that a known error is considered a planned shutdown.
For unplanned shutdowns (i.e. exceptional failures) consider panic!(), as you'll both benefit from being able to get a stack trace when this happens, and the failure case should be exceptional enough that it is effectively unaccounted for and stems from an unplanned scenario."
not being able to load the tower due to a corrupted disk is an expected failure condition; we should always shut down if we do not have a working tower.

however i haven't really had to debug these scenarios before, if you think the unwind from panic is helpful we can go with that.

I think the difference is small, but panic! is probably better style because we use it more:
https://stackoverflow.com/questions/39228685/when-is-stdprocessexit-o-k-to-use

sounds good, i'll change to panic in a separate PR, don't want to backport changes to existing behavior.
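To make the exit-versus-panic trade-off above concrete, here is a minimal sketch. The `Tower` struct, `load_tower`, `load_or_exit`, and `load_or_panic` are hypothetical stand-ins invented for illustration and are not the validator's real API; only `std::process::exit` and `panic!` are real.

```rust
use std::process;

/// Hypothetical stand-in for a loaded tower (not the real `Tower` type).
#[derive(Debug, PartialEq)]
pub struct Tower {
    pub node_pubkey: String,
}

/// Hypothetical loader: treats empty stored bytes as a corrupted tower file.
pub fn load_tower(identity: &str, stored: &[u8]) -> Result<Tower, String> {
    if stored.is_empty() {
        return Err(format!("tower file for {} is empty or corrupted", identity));
    }
    Ok(Tower {
        node_pubkey: identity.to_string(),
    })
}

/// "Planned shutdown" style: log the known failure and exit with a
/// non-zero status. `process::exit` never returns, does not unwind,
/// and does not run destructors, so there is no stack trace.
pub fn load_or_exit(identity: &str, stored: &[u8]) -> Tower {
    load_tower(identity, stored).unwrap_or_else(|err| {
        eprintln!("Failed to load tower for {}: {}", identity, err);
        process::exit(1)
    })
}

/// "Unplanned shutdown" style: panic, which unwinds by default and can
/// print a backtrace when RUST_BACKTRACE is set.
pub fn load_or_panic(identity: &str, stored: &[u8]) -> Tower {
    load_tower(identity, stored)
        .unwrap_or_else(|err| panic!("Failed to load tower for {}: {}", identity, err))
}
```

Both helpers behave identically on success. They differ only on failure, which is why changing one to the other alters existing behavior and was deferred to a separate PR.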

wen-coding · 2024-02-11T23:20:57Z

core/src/replay_stage.rs

@@ -578,6 +580,20 @@ impl ReplayStage {
            let _exit = Finalizer::new(exit.clone());
            let mut identity_keypair = cluster_info.keypair().clone();
            let mut my_pubkey = identity_keypair.pubkey();
+            if my_pubkey != startup_identity {


So if set-identity comes in after this line could we successfully update it in the loop?

yeah, since my_pubkey is cached until the condition is checked in the loop, we are fine if the rpc executes between here and the loop

smaller change is to just load my_pubkey from tower.node_pubkey and let the existing set-identity handle things. But this might be better for quick turnaround
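The caching argument in this thread can be sketched as follows. `ClusterInfo` here is a hypothetical simplification that holds a string instead of a keypair, and `refresh_identity` is an invented helper name. The point is only that a change made after the pubkey is cached is still observed by the next check in the loop.

```rust
use std::sync::RwLock;

/// Hypothetical stand-in for the shared state that a set-identity
/// admin RPC mutates. The real type holds a keypair, not a string.
pub struct ClusterInfo {
    identity: RwLock<String>,
}

impl ClusterInfo {
    pub fn new(identity: &str) -> Self {
        ClusterInfo {
            identity: RwLock::new(identity.to_string()),
        }
    }

    /// Current identity as seen by any thread.
    pub fn pubkey(&self) -> String {
        self.identity.read().unwrap().clone()
    }

    /// What a set-identity call would do in this simplified model.
    pub fn set_identity(&self, new_identity: &str) {
        *self.identity.write().unwrap() = new_identity.to_string();
    }
}

/// Identity check performed on a loop iteration. Returns true when the
/// cached pubkey was stale and has been refreshed. In the real code this
/// is where the tower would be reloaded for the new identity.
pub fn refresh_identity(cluster_info: &ClusterInfo, my_pubkey: &mut String) -> bool {
    let current = cluster_info.pubkey();
    if *my_pubkey == current {
        return false;
    }
    *my_pubkey = current;
    true
}
```

Because `my_pubkey` only changes inside `refresh_identity`, a `set_identity` call that lands between the startup check and the loop leaves the cached value stale until the loop's own check picks it up.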

wen-coding · 2024-02-11T23:24:58Z

core/src/replay_stage.rs

+                }
+            })
+    }
+


Add a test maybe?

wen-coding · 2024-02-15T00:05:59Z

core/src/replay_stage.rs

+        vote_account: &Pubkey,
+        bank_forks: &Arc<RwLock<BankForks>>,
+    ) -> Tower {
+        Tower::restore(tower_storage, new_identity)


I'm suddenly confused. Let's say a sloppy validator operator is playing with set-identity back and forth; is the following scenario possible:
1. using identity A, saved tower at time t0
2. set-identity to B, saving new tower at time t1
3. set-identity to A, loading the old tower saved at time t0

Is it correct behavior to restore an old tower if you play with set-identity?

The tower file is the operator's responsibility. This is possible now too. The purpose of this PR is to stop the validator from crashing when changing identity. In any case, my experience shows that changing the identity takes a few seconds. Also, yes, this looks fully correct.
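The A → B → A scenario above can be illustrated with a small sketch. It assumes that saved towers are looked up per identity, which is what the thread implies. `TowerStorage` and `Tower` here are simplified hypothetical types, not the real ones.

```rust
use std::collections::HashMap;

/// Hypothetical simplified tower: an identity plus the last voted slot.
#[derive(Clone, Debug, PartialEq)]
pub struct Tower {
    pub node_pubkey: String,
    pub last_voted_slot: Option<u64>,
}

/// Hypothetical storage that keeps one saved tower per identity.
#[derive(Default)]
pub struct TowerStorage {
    saved: HashMap<String, Tower>,
}

impl TowerStorage {
    /// Save (or overwrite) the tower for the identity it belongs to.
    pub fn store(&mut self, tower: &Tower) {
        self.saved.insert(tower.node_pubkey.clone(), tower.clone());
    }

    /// Restore whatever was last saved for `identity`, however old it is.
    pub fn restore(&self, identity: &str) -> Option<Tower> {
        self.saved.get(identity).cloned()
    }
}
```

In this model, storing a tower for A at t0, then one for B at t1, and then restoring A returns the t0 tower unchanged. Nothing in the storage layer knows the tower is stale, which is why the tower file is treated as the operator's responsibility.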

wen-coding · 2024-02-15T00:06:29Z

core/src/replay_stage.rs

@@ -8598,4 +8625,60 @@ pub(crate) mod tests {
        assert_eq!(reset_fork, Some(4));
        assert_eq!(failures, vec![HeaviestForkFailures::LockedOut(4),]);
    }
+
+    #[test]
+    fn test_tower_reload_missing() {


nit: make the test name more descriptive, to show that we are switching identity for the tower here?

wen-coding · 2024-02-15T00:07:59Z

core/src/replay_stage.rs

+            &bank_forks,
+        );
+        let expected_tower = Tower::new_for_tests(VOTE_THRESHOLD_DEPTH, VOTE_THRESHOLD_SIZE);
+        assert_eq!(tower.vote_state, expected_tower.vote_state);


Hmm, do we test that the new tower has the new identity somewhere?

carllin · 2024-02-19T06:06:48Z

I was wondering why the current set-identity had to be called inside the reset bank condition: #18182 (comment). Seems like it's to ensure the new identity doesn't attempt to make the blocks for the previous leader.

carllin · 2024-02-19T06:21:49Z

core/src/replay_stage.rs

@@ -578,6 +580,20 @@ impl ReplayStage {
            let _exit = Finalizer::new(exit.clone());
            let mut identity_keypair = cluster_info.keypair().clone();
            let mut my_pubkey = identity_keypair.pubkey();
+            if my_pubkey != startup_identity {


smaller change is to just load my_pubkey from tower.node_pubkey and let the existing set-identity handle things. But this might be better for quick turnaround

carllin · 2024-02-19T06:22:03Z

core/src/replay_stage.rs

@@ -546,6 +547,7 @@ impl ReplayStage {
        popular_pruned_forks_receiver: PopularPrunedForksReceiver,
    ) -> Result<Self, String> {
        let ReplayStageConfig {
+            startup_identity,
            vote_account,


can we just check tower.node_pubkey?

AshwinSekar · 2024-02-20T04:32:18Z

Updated to just use tower.node_pubkey instead of passing around the pubkey.

I still kept the initial check before the loop, since there could be some weird stuff before we hit the check in the loop: because the tower doesn't know the full keypair, we might end up sending out a vote/refresh signed with the new keypair for the previous identity's vote.

tower & poh_recorder should now be correct as soon as the loop starts.
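A minimal sketch of the startup check described above, under stated assumptions: `Tower` is a simplified hypothetical struct, saved towers are modeled as a `HashMap` keyed by identity, and `reload_tower_if_identity_changed` is an invented name. Falling back to a fresh tower when none is saved mirrors what the `test_tower_reload_missing` test in this PR appears to expect. It is not a copy of the merged code.

```rust
use std::collections::HashMap;

/// Hypothetical simplified tower: an identity plus the last voted slot.
#[derive(Clone, Debug, PartialEq)]
pub struct Tower {
    pub node_pubkey: String,
    pub last_voted_slot: Option<u64>,
}

impl Tower {
    /// Fresh tower with no vote history for `node_pubkey`.
    pub fn new(node_pubkey: &str) -> Self {
        Tower {
            node_pubkey: node_pubkey.to_string(),
            last_voted_slot: None,
        }
    }
}

/// Check run once before the replay loop starts. If the tower was loaded
/// for a different identity than the current one, swap it for the tower
/// saved under the current identity, or a fresh one if none is saved.
pub fn reload_tower_if_identity_changed(
    tower: Tower,
    my_pubkey: &str,
    saved: &HashMap<String, Tower>,
) -> Tower {
    if tower.node_pubkey == my_pubkey {
        return tower;
    }
    eprintln!(
        "Identity changed during startup from {} to {}",
        tower.node_pubkey, my_pubkey
    );
    saved
        .get(my_pubkey)
        .cloned()
        .unwrap_or_else(|| Tower::new(my_pubkey))
}
```

Comparing against `tower.node_pubkey` means no separate startup identity has to be threaded through the configuration, and the returned tower always carries the current identity.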

* replay: reload tower if set-identity during startup
* pr feedback: add unit tests
* pr feedback: use tower.node_pubkey, more descriptive names

(cherry picked from commit befe8b9)

The same commit message, with the title "replay: reload tower if set-identity during startup (#35173)" and the trailer "Co-authored-by: Ashwin Sekar <ashwin@solana.com>", appears on the commits for #35257 and #35256.

AshwinSekar requested review from carllin and wen-coding February 11, 2024 05:45

AshwinSekar added the v1.18 (PRs that should be backported to v1.18) and v1.17 (PRs that should be backported to v1.17) labels Feb 11, 2024

wen-coding reviewed Feb 11, 2024

AshwinSekar force-pushed the set-identity branch 3 times, most recently from a7c3aac to 25affc3 Compare February 14, 2024 16:31

wen-coding reviewed Feb 15, 2024

carllin reviewed Feb 19, 2024

AshwinSekar added 3 commits February 20, 2024 04:03

replay: reload tower if set-identity during startup

c3ee683

pr feedback: add unit tests

3c404d9

pr feedback: use tower.node_pubkey, more descriptive names

d63fd4f

AshwinSekar force-pushed the set-identity branch from 25affc3 to d63fd4f Compare February 20, 2024 04:24

AshwinSekar requested a review from carllin February 20, 2024 04:32

carllin approved these changes Feb 20, 2024

AshwinSekar merged commit befe8b9 into solana-labs:master Feb 20, 2024
45 checks passed

AshwinSekar deleted the set-identity branch February 20, 2024 17:30

mergify bot mentioned this pull request Feb 20, 2024

v1.17: replay: reload tower if set-identity during startup (backport of #35173) #35256

Merged

AshwinSekar mentioned this pull request Feb 20, 2024

validator: ignore too old tower error #35229

Merged

mergify bot mentioned this pull request Feb 20, 2024

v1.18: replay: reload tower if set-identity during startup (backport of #35173) #35257

Merged

AshwinSekar mentioned this pull request Feb 20, 2024

replay: gracefully exit if tower load fails #35269

Merged

HaoranYi mentioned this pull request Apr 8, 2024

pr634 1.17 new filter anza-xyz/agave#657

Closed
