Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

prunes turbine QUIC connections #33663

Merged

Conversation

behzadnouri
Copy link
Contributor

Problem

The number of outstanding connections can grow boundlessly.

Summary of Changes

The commit implements lazy eviction for turbine QUIC connections.
The cache is allowed to grow to 2 x capacity at which point at least
half of the entries with lowest stake are evicted, resulting in an
amortized O(1) performance.

@codecov
Copy link

codecov bot commented Oct 11, 2023

Codecov Report

Merging #33663 (ffff0e9) into master (ce8ad77) will decrease coverage by 0.1%.
The diff coverage is 59.5%.

@@            Coverage Diff            @@
##           master   #33663     +/-   ##
=========================================
- Coverage    81.8%    81.8%   -0.1%     
=========================================
  Files         807      807             
  Lines      218024   218117     +93     
=========================================
+ Hits       178438   178495     +57     
- Misses      39586    39622     +36     

The commit implements lazy eviction for turbine QUIC connections.
The cache is allowed to grow to 2 x capacity at which point at least
half of the entries with lowest stake are evicted, resulting in an
amortized O(1) performance.
Copy link
Contributor

@joncinque joncinque left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks great! Just a couple of questions

@@ -32,6 +37,7 @@ use {

const CLIENT_CHANNEL_BUFFER: usize = 1 << 14;
const ROUTER_CHANNEL_BUFFER: usize = 64;
const CONNECTION_CACHE_CAPACITY: usize = 3072;
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this number chosen because it's roughly the size of mainnet currently? It seems reasonable, but I wanted to be sure

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yeah, basically.
I think it would be too slow to have to reconnect before sending the shred, so we effectively need to cache all the connections.

);
prune_cache_pending.store(false, Ordering::Relaxed);
}
router.write().await.retain(|_, sender| !sender.is_closed());
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been trying to see when the receiver would be closed in order to prune these senders properly, and I want to make sure that this will work as intended.

The receiver side is plumbed down into send_datagram_task, which loops on receiver.recv(), so it would only error when it receives data and then tries to call connection.send_datagram, at which point it gets dropped. This would be bad because it won't get closed immediately.

On the other side, read_datagram_task will fail immediately on the next read after close, which will make both tasks terminate immediately since they're joined with try_join. This means that the receiver is dropped correctly, so that you can prune the senders on this side.

Do I have that right?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That is a good question and might have spotted a bug in this code.
I think I need a patch like below to stop waiting on receiver if the connection is already closed.
I will do this in a separate pr because it is orthogonal to the cache pruning logic here.

https://github.com/solana-labs/solana/blob/c98c24bd6/turbine/src/quic_endpoint.rs#L343

diff --git a/turbine/src/quic_endpoint.rs b/turbine/src/quic_endpoint.rs
index 0f362fd1a3..a911aef5ad 100644
--- a/turbine/src/quic_endpoint.rs
+++ b/turbine/src/quic_endpoint.rs
@@ -389,10 +389,19 @@ async fn send_datagram_task(
     connection: Connection,
     mut receiver: AsyncReceiver<Bytes>,
 ) -> Result<(), Error> {
-    while let Some(bytes) = receiver.recv().await {
-        connection.send_datagram(bytes)?;
+    loop {
+        tokio::select! {
+            biased;
+
+            bytes = receiver.recv() => {
+                match bytes {
+                    None => return Ok(()),
+                    Some(bytes) => connection.send_datagram(bytes)?,
+                }
+            }
+            err = connection.closed() => return Err(Error::from(err)),
+        }
     }
-    Ok(())
 }
 
 async fn make_connection_task(

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I need a patch like below to stop waiting on receiver if the connection is already closed.

Yeah that makes sense to me, it'll make the code clearer about the error condition. The implicit cancellation of this task through the error in the other task with try_join was a little tricky to understand.

Feel free to do it in a follow-up PR!

@behzadnouri behzadnouri merged commit e0b59a6 into solana-labs:master Oct 20, 2023
31 checks passed
@behzadnouri behzadnouri deleted the turbine-quic-cache-eviction branch October 20, 2023 21:52
@behzadnouri behzadnouri added the v1.17 PRs that should be backported to v1.17 label Oct 20, 2023
mergify bot pushed a commit that referenced this pull request Oct 20, 2023
The commit implements lazy eviction for turbine QUIC connections.
The cache is allowed to grow to 2 x capacity at which point at least
half of the entries with lowest stake are evicted, resulting in an
amortized O(1) performance.

(cherry picked from commit e0b59a6)
behzadnouri added a commit that referenced this pull request Oct 23, 2023
The commit implements lazy eviction for turbine QUIC connections.
The cache is allowed to grow to 2 x capacity at which point at least
half of the entries with lowest stake are evicted, resulting in an
amortized O(1) performance.

(cherry picked from commit e0b59a6)
mergify bot added a commit that referenced this pull request Oct 23, 2023
prunes turbine QUIC connections (#33663)

The commit implements lazy eviction for turbine QUIC connections.
The cache is allowed to grow to 2 x capacity at which point at least
half of the entries with lowest stake are evicted, resulting in an
amortized O(1) performance.

(cherry picked from commit e0b59a6)

Co-authored-by: behzad nouri <behzadnouri@gmail.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
v1.17 PRs that should be backported to v1.17
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants