Shred Repair Request #34771

Merged · bw-solana merged 2 commits into solana-labs:master from shred_repair on Jan 17, 2024
Conversation

bw-solana (Contributor) commented Jan 12, 2024

Problem

It would be nice to be able to manually trigger a shred repair.

Summary of Changes

Adds an admin RPC/CLI interface for manually initiating shred repair. Inputs are the pubkey of the node to request repair from, the slot number, and the shred index.
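For illustration, the admin RPC surface might look roughly like the sketch below, assuming the jsonrpc_derive trait pattern used elsewhere in admin_rpc_service.rs. The method name and parameters are inferred from the repair-shred-from-peer CLI command shown later in this thread, not copied from the PR:

// Hypothetical sketch of the admin RPC method, modeled on the
// repair-shred-from-peer CLI. Signature details are assumptions,
// not the PR's exact code.
use jsonrpc_core::Result;
use jsonrpc_derive::rpc;
use solana_sdk::{clock::Slot, pubkey::Pubkey};

#[rpc]
pub trait AdminRpc {
    type Metadata;

    // Ask the node identified by `pubkey` to serve the shred at
    // (`slot`, `shred_index`) back to us via the repair protocol.
    #[rpc(meta, name = "repairShredFromPeer")]
    fn repair_shred_from_peer(
        &self,
        meta: Self::Metadata,
        pubkey: Pubkey,
        slot: Slot,
        shred_index: u64,
    ) -> Result<()>;
}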

Added a test to confirm that the serialization/translation from inputs, to packet, to repair request object is working.
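A rough sketch of that round trip is below, using the RepairProtocol::WindowIndex and RepairRequestHeader shapes visible in the log output later in this thread, and ServeRepair::repair_proto_to_bytes as it appears in the logging diff. The bincode::deserialize step is an assumption about the wire format; this is not the PR's actual test:

// Sketch of an inputs -> packet -> request round trip. Types mirror the
// WindowIndex request logged below; decoding the signed buffer with
// bincode is an assumption, not the PR's test code.
use {
    crate::repair::serve_repair::{RepairProtocol, RepairRequestHeader, ServeRepair},
    solana_sdk::{
        pubkey::Pubkey,
        signature::{Keypair, Signer},
        timing::timestamp,
    },
};

#[test]
fn test_repair_request_round_trip() {
    let keypair = Keypair::new();
    let header = RepairRequestHeader::new(
        keypair.pubkey(),     // sender
        Pubkey::new_unique(), // recipient: the peer we ask for repair
        timestamp(),
        42, // nonce
    );
    let request = RepairProtocol::WindowIndex {
        header,
        slot: 247_309_612,
        shred_index: 1,
    };
    // Serialize and sign the request the way repair_service does.
    let bytes = ServeRepair::repair_proto_to_bytes(&request, &keypair).unwrap();
    // Round-trip: the signed buffer should decode back to the same variant.
    let decoded: RepairProtocol = bincode::deserialize(&bytes).unwrap();
    assert!(matches!(
        decoded,
        RepairProtocol::WindowIndex { slot: 247_309_612, shred_index: 1, .. }
    ));
}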

Tested that this was working on testnet by spinning up a node, issuing a repair call to 9YVpEeZf8uBoUtzCFC6SSFDDqPt16uKFubNhLvGxeUDy, and logging the incoming repaired shred through the TVU.

bw-solana force-pushed the shred_repair branch 3 times, most recently from 61eacd0 to 5f7dc3d, on January 17, 2024 04:15
bw-solana marked this pull request as ready for review on January 17, 2024 04:15
@@ -1691,20 +1709,6 @@ pub fn main() {
} else {
(None, None)
};
admin_rpc_service::run(
Contributor commented:
Curious why the RPC service is initialized before Node::new. I wonder if we break anything by moving it after?

bw-solana (Contributor, Author) replied on Jan 17, 2024:

Yeah, I was wary of this. I haven't been able to spot any problems or notice any negative effects in my testing.

Here's what's now happening before Admin RPC service starts up that used to happen after:

  1. Initialize gossip host IP address (from command line param, entrypoints, or localhost)
  2. Initialize gossip address (from gossip host IP + command line port or available port 0/1)
  3. Initialize TPU address (from command line param)
  4. Initialize TPU forward address (from command line param)
  5. Initialize cluster entrypoints contact info vector
  6. Initialize node struct w/ contact and socket info

bw-solana (Contributor, Author) added:

Decided to move the repair socket to the post-init structure for consistency, which also lets us avoid changing the order here.
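For context, the deferred-wiring pattern this refers to looks roughly like the sketch below: the admin RPC service starts with an empty, shared slot that the startup path fills in after Node::new creates the sockets. Struct and field names here are illustrative, not the exact ones in admin_rpc_service.rs:

// Minimal sketch of post-init wiring: start a service with a shared,
// initially empty slot, then populate it once the sockets exist.
use std::{
    net::UdpSocket,
    sync::{Arc, RwLock},
};

#[derive(Default)]
struct PostInitState {
    // Populated after Node::new; None until then.
    repair_socket: Option<Arc<UdpSocket>>,
}

fn main() -> std::io::Result<()> {
    let post_init = Arc::new(RwLock::new(PostInitState::default()));

    // The admin RPC service would start first, holding a clone of the slot.
    let for_rpc = Arc::clone(&post_init);

    // Later, once the node's sockets exist, startup fills the slot in.
    let repair_socket = Arc::new(UdpSocket::bind("0.0.0.0:0")?);
    post_init.write().unwrap().repair_socket = Some(repair_socket);

    // A handler only has to error if called before wiring completes.
    match &for_rpc.read().unwrap().repair_socket {
        Some(socket) => println!("repair socket ready on {:?}", socket.local_addr()?),
        None => eprintln!("validator still starting up"),
    }
    Ok(())
}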

core/src/repair/repair_service.rs (conversation resolved)
codecov bot commented Jan 17, 2024

Codecov Report

Attention: 41 lines in your changes are missing coverage. Please review.

Comparison is base (32f260b) 81.8% compared to head (fca1ad8) 81.7%.
Report is 1 commit behind head on master.

Additional details and impacted files
@@            Coverage Diff            @@
##           master   #34771     +/-   ##
=========================================
- Coverage    81.8%    81.7%   -0.1%     
=========================================
  Files         825      825             
  Lines      223160   223249     +89     
=========================================
+ Hits       182553   182593     +40     
- Misses      40607    40656     +49     

bw-solana (Contributor, Author) commented:

Things appear to be working as expected after adding outstanding repair requests. I'm seeing the following logs:

[2024-01-17T17:58:11.094064030Z WARN  solana_core::repair::repair_service] BW: sending repair request to 72.20.18.57:8009/9YVpEeZf8uBoUtzCFC6SSFDDqPt16uKFubNhLvGxeUDy: WindowIndex { header: RepairRequestHeader { signature: 1111111111111111111111111111111111111111111111111111111111111111, sender: 6NRzoBpU181zULrTHEtRfvwvUHPTSd55gAo16AuRrdHj, recipient: 9YVpEeZf8uBoUtzCFC6SSFDDqPt16uKFubNhLvGxeUDy, timestamp: 1705514291094, nonce: 3896009303 }, slot: 247309612, shred_index: 1 }
[2024-01-17T17:58:11.094188016Z WARN  solana_core::repair::repair_service] BW: successfully sent repair request!
[2024-01-17T17:58:11.198185456Z WARN  solana_core::shred_fetch_stage] BW: received repair packet from 72.20.18.57 for 247309612/1
[2024-01-17T17:58:11.198355079Z WARN  solana_core::window_service] BW: inserting repair packet from 72.20.18.57 for 247309612/1

Running with the following command:
solana-validator --ledger validator-ledger/ repair-shred-from-peer --pubkey 9YVpEeZf8uBoUtzCFC6SSFDDqPt16uKFubNhLvGxeUDy --slot 247309612 --shred 1

Performed testing with the following temporary logging diff:

diff --git a/core/src/repair/repair_service.rs b/core/src/repair/repair_service.rs
index 509b1e1b6c..a1d6f39656 100644
--- a/core/src/repair/repair_service.rs
+++ b/core/src/repair/repair_service.rs
@@ -729,6 +729,7 @@ impl RepairService {
             slot,
             shred_index,
         };
+        warn!("BW: sending repair request to {address}/{pubkey}: {request_proto:?}");
         let packet_buf =
             ServeRepair::repair_proto_to_bytes(&request_proto, &identity_keypair).unwrap();
 
@@ -738,7 +739,7 @@ impl RepairService {
         // Send packet batch
         match batch_send(repair_socket, &reqs[..]) {
             Ok(()) => {
-                trace!("successfully sent repair request!");
+                warn!("BW: successfully sent repair request!");
             }
             Err(SendPktsError::IoError(err, _num_failed)) => {
                 error!("batch_send failed to send packet - error = {:?}", err);
diff --git a/core/src/shred_fetch_stage.rs b/core/src/shred_fetch_stage.rs
index fd72b8b8ee..b68853e1c6 100644
--- a/core/src/shred_fetch_stage.rs
+++ b/core/src/shred_fetch_stage.rs
@@ -120,6 +120,17 @@ impl ShredFetchStage {
                 {
                     packet.meta_mut().set_discard(true);
                 } else {
+                    if flags == PacketFlags::REPAIR {
+                        let expected_ip = std::net::IpAddr::V4(std::net::Ipv4Addr::new(72, 20, 18, 57));
+                        let actual_ip = packet.meta().addr;
+                        if expected_ip == actual_ip {
+                            use solana_ledger::shred::layout;
+                            let shred = layout::get_shred(packet).unwrap();
+                            let slot = layout::get_slot(shred).unwrap();
+                            let index = layout::get_index(shred).unwrap();
+                            warn!("BW: received repair packet from {actual_ip} for {slot}/{index}");
+                        }
+                    }
                     packet.meta_mut().flags.insert(flags);
                 }
             }
diff --git a/core/src/window_service.rs b/core/src/window_service.rs
index aa801b7ebd..4201077f59 100644
--- a/core/src/window_service.rs
+++ b/core/src/window_service.rs
@@ -265,6 +265,7 @@ fn prune_shreds_invalid_repair(
             )
                 .0;
             if !should_keep {
+                warn!("BW: pruning invalid repair shred for {}/{}", shred.slot(), shred.index());
                 removed.insert(i - 1);
             }
             should_keep
@@ -306,6 +307,15 @@ where
         let shred = shred::layout::get_shred(packet)?;
         let shred = Shred::new_from_serialized_shred(shred.to_vec()).ok()?;
         if packet.meta().repair() {
+            let expected_ip = std::net::IpAddr::V4(std::net::Ipv4Addr::new(72, 20, 18, 57));
+            let actual_ip = packet.meta().addr;
+            if expected_ip == actual_ip {
+                use solana_ledger::shred::layout;
+                let shred = layout::get_shred(packet).unwrap();
+                let slot = layout::get_slot(shred).unwrap();
+                let index = layout::get_index(shred).unwrap();
+                warn!("BW: inserting repair packet from {actual_ip} for {slot}/{index}");
+            }
             let repair_info = RepairMeta {
                 // If can't parse the nonce, dump the packet.
                 nonce: repair_response::nonce(packet)?,
diff --git a/ledger/src/shred.rs b/ledger/src/shred.rs
index 1ce6c7ccc1..cd3a7fd8bd 100644
--- a/ledger/src/shred.rs
+++ b/ledger/src/shred.rs
@@ -611,7 +611,7 @@ pub mod layout {
     }
 
     #[inline]
-    pub(super) fn get_index(shred: &[u8]) -> Option<u32> {
+    pub fn get_index(shred: &[u8]) -> Option<u32> {
         <[u8; 4]>::try_from(shred.get(OFFSET_OF_SHRED_INDEX..)?.get(..4)?)
             .map(u32::from_le_bytes)
             .ok()

AshwinSekar (Contributor) left a comment:

awesome, test results and code look good.

bw-solana merged commit 8fe4f5d into solana-labs:master on Jan 17, 2024
35 checks passed