prunes turbine QUIC connections #33663
Conversation
Force-pushed from 315c5b0 to a8e91ec.
Codecov Report

@@            Coverage Diff            @@
##           master   #33663     +/-   ##
=========================================
- Coverage    81.8%    81.8%     -0.1%
=========================================
  Files         807      807
  Lines      218024   218117       +93
=========================================
+ Hits       178438   178495       +57
- Misses      39586    39622       +36
Force-pushed from a8e91ec to 8f2d168.
The commit implements lazy eviction for turbine QUIC connections. The cache is allowed to grow to 2x capacity, at which point at least half of the entries with the lowest stake are evicted, resulting in amortized O(1) performance.
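For illustration, a minimal sketch of the lazy-eviction pattern described above. It assumes a plain HashMap keyed by socket address and an illustrative Entry type carrying a stake value; the real turbine cache stores per-connection channel senders and looks stakes up from the staked-nodes map, so treat the names below as placeholders rather than the actual implementation.

use std::{collections::HashMap, net::SocketAddr};

const CONNECTION_CACHE_CAPACITY: usize = 3072;

// Illustrative cache entry; the real code keeps a per-connection channel sender.
struct Entry {
    stake: u64,
}

// Lazy eviction: insertions go unchecked until the cache reaches 2x capacity;
// only then is the lower-stake half dropped in a single pass. Because a full
// prune happens at most once per O(capacity) insertions, the amortized cost
// per insertion is O(1).
fn maybe_prune(cache: &mut HashMap<SocketAddr, Entry>) {
    if cache.len() < CONNECTION_CACHE_CAPACITY.saturating_mul(2) {
        return;
    }
    let mut keys: Vec<(u64, SocketAddr)> = cache
        .iter()
        .map(|(&addr, entry)| (entry.stake, addr))
        .collect();
    keys.sort_unstable();
    let num_to_evict = cache.len() / 2;
    for (_stake, addr) in keys.into_iter().take(num_to_evict) {
        cache.remove(&addr);
    }
}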
Force-pushed from 8f2d168 to ffff0e9.
Looks great! Just a couple of questions
@@ -32,6 +37,7 @@ use {
const CLIENT_CHANNEL_BUFFER: usize = 1 << 14;
const ROUTER_CHANNEL_BUFFER: usize = 64;
const CONNECTION_CACHE_CAPACITY: usize = 3072;
Is this number chosen because it's roughly the size of mainnet currently? It seems reasonable, but I wanted to be sure.
yeah, basically.
I think it would be too slow to have to reconnect before sending the shred, so we effectively need to cache all the connections.
    );
    prune_cache_pending.store(false, Ordering::Relaxed);
}
router.write().await.retain(|_, sender| !sender.is_closed());
I've been trying to see when the receiver would be closed in order to prune these senders properly, and I want to make sure that this will work as intended.

The receiver side is plumbed down into send_datagram_task, which loops on receiver.recv(), so it would only error when it receives data and then tries to call connection.send_datagram, at which point it gets dropped. This would be bad because it won't get closed immediately.

On the other side, read_datagram_task will fail immediately on the next read after close, which will make both tasks terminate immediately since they're joined with try_join. This means that the receiver is dropped correctly, so that you can prune the senders on this side.

Do I have that right?
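A minimal sketch of the channel-closing behavior in question, assuming the AsyncSender/AsyncReceiver aliases map onto tokio's mpsc sender and receiver; the setup below is illustrative, not the actual turbine wiring.

use tokio::sync::mpsc;

#[tokio::main]
async fn main() {
    // One channel per connection, like the router entries in this PR.
    let (sender, receiver) = mpsc::channel::<u8>(64);

    // While the receiver is alive, the sender reports the channel as open.
    assert!(!sender.is_closed());

    // Dropping the receiver (which happens once the per-connection tasks exit,
    // e.g. after try_join returns an error) marks the channel closed on the
    // sender side, so retain(|_, sender| !sender.is_closed()) will prune it.
    drop(receiver);
    assert!(sender.is_closed());
}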
That is a good question, and it might have spotted a bug in this code.
I think I need a patch like below to stop waiting on the receiver if the connection is already closed.
I will do this in a separate PR because it is orthogonal to the cache pruning logic here.
https://github.com/solana-labs/solana/blob/c98c24bd6/turbine/src/quic_endpoint.rs#L343
diff --git a/turbine/src/quic_endpoint.rs b/turbine/src/quic_endpoint.rs
index 0f362fd1a3..a911aef5ad 100644
--- a/turbine/src/quic_endpoint.rs
+++ b/turbine/src/quic_endpoint.rs
@@ -389,10 +389,19 @@ async fn send_datagram_task(
     connection: Connection,
     mut receiver: AsyncReceiver<Bytes>,
 ) -> Result<(), Error> {
-    while let Some(bytes) = receiver.recv().await {
-        connection.send_datagram(bytes)?;
+    loop {
+        tokio::select! {
+            biased;
+
+            bytes = receiver.recv() => {
+                match bytes {
+                    None => return Ok(()),
+                    Some(bytes) => connection.send_datagram(bytes)?,
+                }
+            }
+            err = connection.closed() => return Err(Error::from(err)),
+        }
     }
-    Ok(())
 }
 
 async fn make_connection_task(
> I think I need a patch like below to stop waiting on receiver if the connection is already closed.

Yeah, that makes sense to me; it'll make the code clearer about the error condition. The implicit cancellation of this task through the error in the other task with try_join was a little tricky to understand.
Feel free to do it in a follow-up PR!
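A small self-contained sketch of the try_join cancellation behavior discussed above. It uses tokio::try_join! with made-up placeholder futures, not the actual read_datagram_task/send_datagram_task code.

use std::time::Duration;
use tokio::time::sleep;

#[tokio::main]
async fn main() {
    // Stand-in for read_datagram_task: fails shortly after the connection closes.
    let read_task = async {
        sleep(Duration::from_millis(10)).await;
        Err::<(), &str>("connection closed")
    };
    // Stand-in for send_datagram_task: would otherwise wait on the channel forever.
    let send_task = async {
        sleep(Duration::from_secs(3600)).await;
        Ok::<(), &str>(())
    };
    // try_join! returns as soon as either future yields an error and drops the
    // other one, so the "send" side is cancelled without ever erroring itself.
    let result = tokio::try_join!(read_task, send_task);
    assert_eq!(result, Err("connection closed"));
}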
prunes turbine QUIC connections (#33663) The commit implements lazy eviction for turbine QUIC connections. The cache is allowed to grow to 2 x capacity at which point at least half of the entries with lowest stake are evicted, resulting in an amortized O(1) performance. (cherry picked from commit e0b59a6) Co-authored-by: behzad nouri <behzadnouri@gmail.com>
Problem
The number of outstanding connections can grow without bound.
Summary of Changes
The commit implements lazy eviction for turbine QUIC connections. The cache is allowed to grow to 2x capacity, at which point at least half of the entries with the lowest stake are evicted, resulting in amortized O(1) performance.