Shred Repair Request #34771

Merged · bw-solana merged 2 commits into solana-labs:master from shred_repair on Jan 17, 2024
Conversation

bw-solana (Contributor) commented Jan 12, 2024

Problem

It would be nice to be able to manually trigger a shred repair.

Summary of Changes

Adds an admin RPC/CLI interface for manually initiating shred repair. Inputs are the pubkey of the node to request repair from, the slot number, and the shred index.
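For illustration, the admin RPC surface might look roughly like the sketch below, assuming the jsonrpc_derive trait pattern used elsewhere in admin_rpc_service.rs. The method name and parameters are inferred from the repair-shred-from-peer CLI command shown later in this thread, not copied from the PR:

// Hypothetical sketch of the admin RPC method, modeled on the
// repair-shred-from-peer CLI. Signature details are assumptions,
// not the PR's exact code.
use jsonrpc_core::Result;
use jsonrpc_derive::rpc;
use solana_sdk::{clock::Slot, pubkey::Pubkey};

#[rpc]
pub trait AdminRpc {
    type Metadata;

    // Ask the node identified by `pubkey` to serve the shred at
    // (`slot`, `shred_index`) back to us via the repair protocol.
    #[rpc(meta, name = "repairShredFromPeer")]
    fn repair_shred_from_peer(
        &self,
        meta: Self::Metadata,
        pubkey: Pubkey,
        slot: Slot,
        shred_index: u64,
    ) -> Result<()>;
}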

Added a test to confirm that the serialization/translation from inputs, to packet, to repair request object is working.
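A rough sketch of that round trip is below, using the RepairProtocol::WindowIndex and RepairRequestHeader shapes visible in the log output later in this thread, and ServeRepair::repair_proto_to_bytes as it appears in the logging diff. The bincode::deserialize step is an assumption about the wire format; this is not the PR's actual test:

// Sketch of an inputs -> packet -> request round trip. Types mirror the
// WindowIndex request logged below; decoding the signed buffer with
// bincode is an assumption, not the PR's test code.
use {
    crate::repair::serve_repair::{RepairProtocol, RepairRequestHeader, ServeRepair},
    solana_sdk::{
        pubkey::Pubkey,
        signature::{Keypair, Signer},
        timing::timestamp,
    },
};

#[test]
fn test_repair_request_round_trip() {
    let keypair = Keypair::new();
    let header = RepairRequestHeader::new(
        keypair.pubkey(),     // sender
        Pubkey::new_unique(), // recipient: the peer we ask for repair
        timestamp(),
        42, // nonce
    );
    let request = RepairProtocol::WindowIndex {
        header,
        slot: 247_309_612,
        shred_index: 1,
    };
    // Serialize and sign the request the way repair_service does.
    let bytes = ServeRepair::repair_proto_to_bytes(&request, &keypair).unwrap();
    // Round-trip: the signed buffer should decode back to the same variant.
    let decoded: RepairProtocol = bincode::deserialize(&bytes).unwrap();
    assert!(matches!(
        decoded,
        RepairProtocol::WindowIndex { slot: 247_309_612, shred_index: 1, .. }
    ));
}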

Tested that this was working on testnet by spinning up a node, issuing a repair call to 9YVpEeZf8uBoUtzCFC6SSFDDqPt16uKFubNhLvGxeUDy, and logging the incoming repaired shred through the TVU.

bw-solana force-pushed the shred_repair branch 3 times, most recently from 61eacd0 to 5f7dc3d, on January 17, 2024 04:15
bw-solana marked this pull request as ready for review on January 17, 2024 04:15
@@ -1691,20 +1709,6 @@ pub fn main() {
} else {
(None, None)
};
admin_rpc_service::run(
Contributor commented:
Curious why the RPC service is initialized before Node::new. I wonder if we break anything by moving it after?

bw-solana (Contributor, Author) replied on Jan 17, 2024:

Yeah, I was wary of this. I haven't been able to spot any problems or notice any negative effects in my testing.

Here's what's now happening before Admin RPC service starts up that used to happen after:

  1. Initialize gossip host IP address (from command line param, entrypoints, or localhost)
  2. Initialize gossip address (from gossip host IP + command line port or available port 0/1)
  3. Initialize TPU address (from command line param)
  4. Initialize TPU forward address (from command line param)
  5. Initialize cluster entrypoints contact info vector
  6. Initialize node struct w/ contact and socket info

bw-solana (Contributor, Author) added:

Decided to move the repair socket to the post-init structure for consistency, which also lets us avoid changing the order here.
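For context, the deferred-wiring pattern this refers to looks roughly like the sketch below: the admin RPC service starts with an empty, shared slot that the startup path fills in after Node::new creates the sockets. Struct and field names here are illustrative, not the exact ones in admin_rpc_service.rs:

// Minimal sketch of post-init wiring: start a service with a shared,
// initially empty slot, then populate it once the sockets exist.
use std::{
    net::UdpSocket,
    sync::{Arc, RwLock},
};

#[derive(Default)]
struct PostInitState {
    // Populated after Node::new; None until then.
    repair_socket: Option<Arc<UdpSocket>>,
}

fn main() -> std::io::Result<()> {
    let post_init = Arc::new(RwLock::new(PostInitState::default()));

    // The admin RPC service would start first, holding a clone of the slot.
    let for_rpc = Arc::clone(&post_init);

    // Later, once the node's sockets exist, startup fills the slot in.
    let repair_socket = Arc::new(UdpSocket::bind("0.0.0.0:0")?);
    post_init.write().unwrap().repair_socket = Some(repair_socket);

    // A handler only has to error if called before wiring completes.
    match &for_rpc.read().unwrap().repair_socket {
        Some(socket) => println!("repair socket ready on {:?}", socket.local_addr()?),
        None => eprintln!("validator still starting up"),
    }
    Ok(())
}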

core/src/repair/repair_service.rs (conversation resolved)
codecov bot commented Jan 17, 2024

Codecov Report

Attention: 41 lines in your changes are missing coverage. Please review.

Comparison is base (32f260b) 81.8% compared to head (fca1ad8) 81.7%.
Report is 1 commit behind head on master.

Additional details and impacted files
@@            Coverage Diff            @@
##           master   #34771     +/-   ##
=========================================
- Coverage    81.8%    81.7%   -0.1%     
=========================================
  Files         825      825             
  Lines      223160   223249     +89     
=========================================
+ Hits       182553   182593     +40     
- Misses      40607    40656     +49     

bw-solana (Contributor, Author) commented:

Things appear to be working as expected after adding outstanding repair requests. I'm seeing the following logs:

[2024-01-17T17:58:11.094064030Z WARN  solana_core::repair::repair_service] BW: sending repair request to 72.20.18.57:8009/9YVpEeZf8uBoUtzCFC6SSFDDqPt16uKFubNhLvGxeUDy: WindowIndex { header: RepairRequestHeader { signature: 1111111111111111111111111111111111111111111111111111111111111111, sender: 6NRzoBpU181zULrTHEtRfvwvUHPTSd55gAo16AuRrdHj, recipient: 9YVpEeZf8uBoUtzCFC6SSFDDqPt16uKFubNhLvGxeUDy, timestamp: 1705514291094, nonce: 3896009303 }, slot: 247309612, shred_index: 1 }
[2024-01-17T17:58:11.094188016Z WARN  solana_core::repair::repair_service] BW: successfully sent repair request!
[2024-01-17T17:58:11.198185456Z WARN  solana_core::shred_fetch_stage] BW: received repair packet from 72.20.18.57 for 247309612/1
[2024-01-17T17:58:11.198355079Z WARN  solana_core::window_service] BW: inserting repair packet from 72.20.18.57 for 247309612/1

Running with the following command:
solana-validator --ledger validator-ledger/ repair-shred-from-peer --pubkey 9YVpEeZf8uBoUtzCFC6SSFDDqPt16uKFubNhLvGxeUDy --slot 247309612 --shred 1

Performed testing with the following temporary logging diff:

diff --git a/core/src/repair/repair_service.rs b/core/src/repair/repair_service.rs
index 509b1e1b6c..a1d6f39656 100644
--- a/core/src/repair/repair_service.rs
+++ b/core/src/repair/repair_service.rs
@@ -729,6 +729,7 @@ impl RepairService {
             slot,
             shred_index,
         };
+        warn!("BW: sending repair request to {address}/{pubkey}: {request_proto:?}");
         let packet_buf =
             ServeRepair::repair_proto_to_bytes(&request_proto, &identity_keypair).unwrap();
 
@@ -738,7 +739,7 @@ impl RepairService {
         // Send packet batch
         match batch_send(repair_socket, &reqs[..]) {
             Ok(()) => {
-                trace!("successfully sent repair request!");
+                warn!("BW: successfully sent repair request!");
             }
             Err(SendPktsError::IoError(err, _num_failed)) => {
                 error!("batch_send failed to send packet - error = {:?}", err);
diff --git a/core/src/shred_fetch_stage.rs b/core/src/shred_fetch_stage.rs
index fd72b8b8ee..b68853e1c6 100644
--- a/core/src/shred_fetch_stage.rs
+++ b/core/src/shred_fetch_stage.rs
@@ -120,6 +120,17 @@ impl ShredFetchStage {
                 {
                     packet.meta_mut().set_discard(true);
                 } else {
+                    if flags == PacketFlags::REPAIR {
+                        let expected_ip = std::net::IpAddr::V4(std::net::Ipv4Addr::new(72, 20, 18, 57));
+                        let actual_ip = packet.meta().addr;
+                        if expected_ip == actual_ip {
+                            use solana_ledger::shred::layout;
+                            let shred = layout::get_shred(packet).unwrap();
+                            let slot = layout::get_slot(shred).unwrap();
+                            let index = layout::get_index(shred).unwrap();
+                            warn!("BW: received repair packet from {actual_ip} for {slot}/{index}");
+                        }
+                    }
                     packet.meta_mut().flags.insert(flags);
                 }
             }
diff --git a/core/src/window_service.rs b/core/src/window_service.rs
index aa801b7ebd..4201077f59 100644
--- a/core/src/window_service.rs
+++ b/core/src/window_service.rs
@@ -265,6 +265,7 @@ fn prune_shreds_invalid_repair(
             )
                 .0;
             if !should_keep {
+                warn!("BW: pruning invalid repair shred for {}/{}", shred.slot(), shred.index());
                 removed.insert(i - 1);
             }
             should_keep
@@ -306,6 +307,15 @@ where
         let shred = shred::layout::get_shred(packet)?;
         let shred = Shred::new_from_serialized_shred(shred.to_vec()).ok()?;
         if packet.meta().repair() {
+            let expected_ip = std::net::IpAddr::V4(std::net::Ipv4Addr::new(72, 20, 18, 57));
+            let actual_ip = packet.meta().addr;
+            if expected_ip == actual_ip {
+                use solana_ledger::shred::layout;
+                let shred = layout::get_shred(packet).unwrap();
+                let slot = layout::get_slot(shred).unwrap();
+                let index = layout::get_index(shred).unwrap();
+                warn!("BW: inserting repair packet from {actual_ip} for {slot}/{index}");
+            }
             let repair_info = RepairMeta {
                 // If can't parse the nonce, dump the packet.
                 nonce: repair_response::nonce(packet)?,
diff --git a/ledger/src/shred.rs b/ledger/src/shred.rs
index 1ce6c7ccc1..cd3a7fd8bd 100644
--- a/ledger/src/shred.rs
+++ b/ledger/src/shred.rs
@@ -611,7 +611,7 @@ pub mod layout {
     }
 
     #[inline]
-    pub(super) fn get_index(shred: &[u8]) -> Option<u32> {
+    pub fn get_index(shred: &[u8]) -> Option<u32> {
         <[u8; 4]>::try_from(shred.get(OFFSET_OF_SHRED_INDEX..)?.get(..4)?)
             .map(u32::from_le_bytes)
             .ok()

AshwinSekar (Contributor) left a comment:

awesome, test results and code look good.

bw-solana merged commit 8fe4f5d into solana-labs:master on Jan 17, 2024
35 checks passed