switch to ContactInfo propagation in PullRequests #2894
Closed
+3
−13
Problem
The `tvu_peers` set of unstaked nodes is unreliable and keeps changing, causing nodes to appear offline or out of gossip.

Cause
In v2.0.8, `self.nodes` in `crds` only stores `ContactInfo`; it does not store `LegacyContactInfo`. In contrast, v1.18.23 stores `LegacyContactInfo` in `crds.nodes`.

master:
agave/gossip/src/crds.rs
Lines 254 to 255 in 37671df
v1.18.23:
agave/gossip/src/crds.rs
Lines 246 to 247 in 8c42fa8
self.nodes aka crds.nodes:
agave/gossip/src/crds.rs
Line 74 in 37671df
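The indexing behavior described above can be sketched with a toy model. This is not the agave code: `ToyCrds`, the two-variant `CrdsData` enum, the string pubkeys, and `node_pubkeys` are all hypothetical stand-ins, and the sketch assumes every value still lands in the table while only `ContactInfo` entries are added to the `nodes` index, as described for v2.0.8.

```rust
use std::collections::BTreeSet;

/// Hypothetical, simplified stand-ins for the gossip value types.
/// The real types carry far more data; only the variant matters here.
#[derive(Clone, Debug, PartialEq)]
enum CrdsData {
    LegacyContactInfo { pubkey: String },
    ContactInfo { pubkey: String },
}

/// Toy table modeling the v2.0.8 behavior described in the text:
/// every value is stored (an assumption of this sketch), but only
/// `ContactInfo` entries are recorded in the `nodes` index.
#[derive(Default)]
struct ToyCrds {
    table: Vec<CrdsData>,
    nodes: BTreeSet<usize>, // indices into `table`
}

impl ToyCrds {
    fn insert(&mut self, value: CrdsData) {
        let index = self.table.len();
        // Only the new-style variant is indexed as a node.
        if let CrdsData::ContactInfo { .. } = value {
            self.nodes.insert(index);
        }
        self.table.push(value);
    }

    /// Pubkeys reachable through the `nodes` index; a stand-in for
    /// whatever peer set is later derived from it.
    fn node_pubkeys(&self) -> Vec<&str> {
        self.nodes
            .iter()
            .map(|&i| match &self.table[i] {
                CrdsData::LegacyContactInfo { pubkey }
                | CrdsData::ContactInfo { pubkey } => pubkey.as_str(),
            })
            .collect()
    }
}

fn main() {
    let mut crds = ToyCrds::default();

    // A pull request carrying the sender's LegacyContactInfo:
    // the value is stored, but the sender is not indexed as a node.
    crds.insert(CrdsData::LegacyContactInfo { pubkey: "unstaked-node".to_string() });
    println!("after LegacyContactInfo: {:?}", crds.node_pubkeys());

    // A pull request carrying ContactInfo instead: the sender is indexed.
    crds.insert(CrdsData::ContactInfo { pubkey: "unstaked-node".to_string() });
    println!("after ContactInfo: {:?}", crds.node_pubkeys());
}
```

In this model, a sender that only ever delivers `LegacyContactInfo` never shows up in the set derived from `nodes`, whichever path the value arrived by.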
In both versions, we send our `LegacyContactInfo` in every `PullRequest`. However, in v2.0.8, this `LegacyContactInfo` no longer gets stored in `crds.nodes` even though it is sent. So when unstaked nodes send pull requests, their `LegacyContactInfo` is not propagated.

It seems that unstaked nodes rely heavily on pull requests to propagate their `ContactInfo`. But with the update to v2.0.8, receiving nodes no longer register the `LegacyContactInfo` in their set of `tvu_peers`, because `crds.nodes` doesn't hold them anymore.

Summary of Changes
Since mainnet-beta (mb) has upgraded to v1.18.23 and can handle `ContactInfo` in a `PullRequest`, we can switch over to propagating `ContactInfo` in `PullRequest`. We can also remove the `#[ignore]` on the test that was previously flaky because of this same issue.

Data
@steviez applied this patch to `tw2NAkBpTwST2c4NDW9xFmWgqSL5ErPy5QYbfYa3KXi`, which was previously experiencing repair issues as mentioned above, and got the following result:

From Steve:
^ Shred percentages through turbine/recovery/repair over the last hour:
Vertical white line is where process came back online with patch
Orange = turbine
Blue = recovery
Purple = repair
Previously almost entirely repair; now almost no repair, all turbine + recovery (recovery is technically under turbine too)
This patch also causes `tw2NAkBpTwST2c4NDW9xFmWgqSL5ErPy5QYbfYa3KXi` to consistently and immediately appear in `tvu_peers`.

PR made with @steviez