Quic update identity #33865
Conversation
Force-pushed from 4f50ac5 to a77d78d.
Codecov Report

```
@@            Coverage Diff            @@
##            master   #33865    +/-  ##
=========================================
- Coverage     81.9%    81.9%    -0.1%
=========================================
  Files          819      819
  Lines       220013   220085     +72
=========================================
  Hits        180353   180353
- Misses       39660    39732     +72
```
```rust
let mut map = self.map.write().unwrap();
map.clear();
let mut connection_manager = self.connection_manager.write().unwrap();
connection_manager.update_key(key);
```
Doesn't this cause a deadlock somewhere? i.e., is there anywhere else in the code that locks `self.connection_manager` and then later locks `self.map`?
As far as I can see, the order is always `self.map` -> `self.connection_manager`, so this should not cause a deadlock.
Can we change the order of locking and release the first lock so we don't hold both at once? Something like:

```rust
connection_manager.write().unwrap().update_key(key);
self.map.write().unwrap().clear();
```

This might discard some OK connections created between the 1st and the 2nd lock, but I think that is better than the possibility of a deadlock in the future.
I just added a comment stating the required locking order. I think that is the better solution, since it ensures stricter correctness, and it's quite common to simply establish a locking order to prevent deadlocks rather than never holding multiple locks at all.
There is no way to enforce the comment.
Someone might change/add code later unaware of that comment there.
What is the problem with my suggestion above? #33865 (comment)
> This might discard some ok connections created between the 1st and the 2nd lock

There's also no way to enforce that only one of the locks can be held at a time (and indeed, in `create_connection_internal`, we need to hold both). It's not uncommon to just specify a locking order that future programmers must follow when dealing with multiple locks. Not a massive deal, but since we have two locks anyway, I would prefer to do the most strictly correct thing and clear the cache, then update the key while holding the cache lock.
quic-client/src/lib.rs (Outdated)
```diff
     ipaddr: Option<IpAddr>,
 ) -> Result<(), RcgenError> {
-    let (cert, priv_key) = new_self_signed_tls_certificate(keypair, ipaddr)?;
+    let (cert, priv_key) = if let Some(ip) = ipaddr {
+        let res = new_self_signed_tls_certificate(keypair, ip)?;
+        self.ip = ip;
+        res
+    } else {
+        new_self_signed_tls_certificate(keypair, self.ip)?
+    };
```
`EndpointKeyUpdater` in streamer is storing the `gossip_host` IP address. Can't we do a similar thing for connection-cache so that this method doesn't have to do the `Option<IpAddr>` hack?
I found it more convenient to just store the IP in `QuicConfig`, since that's where we have the socket anyway, and bubbling it up just to store it somewhere else seems like a little too much added complexity.
```diff
@@ -47,7 +48,7 @@ pub struct ConnectionCache<
 > {
     name: &'static str,
     map: Arc<RwLock<IndexMap<SocketAddr, /*ConnectionPool:*/ R>>>,
-    connection_manager: Arc<S>,
+    connection_manager: Arc<RwLock<S>>,
```
The `RwLock` here could cause deadlocks or lock contention. Is there an alternative way of avoiding this?
Agreed that this is not ideal. Since the map is always locked when connection_manager is locked (but not vice versa), it should be possible to use a single lock to protect both. However, while I have tried to remove this lock, I have not been able to make Rust happy (non-building code at https://github.com/ryleung-solana/solana/tree/quic-update-identity-test if you have suggestions).
quic-client/src/lib.rs (Outdated)
```rust
    res
} else {
    new_self_signed_tls_certificate(keypair, self.ip)?
};
self.client_certificate = Arc::new(QuicClientCertificate {
```
When does this call into `set_default_client_config`?
https://docs.rs/quinn/0.10.2/quinn/struct.Endpoint.html#method.set_default_client_config
This doesn't take effect until `set_default_client_config` is called.
See QuicLazyInitializedEndpoint::create_endpoint
Right, but that is only called once:
https://github.com/solana-labs/solana/blob/a96be5d2f/quic-client/src/nonblocking/quic_client.rs#L142-L143
and then cached in:
`endpoint: OnceCell<Arc<Endpoint>>,`
@ryleung-solana you marked this resolved, but I do not know what you have changed to resolve this.
In `QuicConfig::create_endpoint` we create a new `QuicLazyInitializedEndpoint` struct. This, combined with clearing the connection cache, means that the first time we get a new connection from the connection cache, we will call `set_default_client_config` on the `Endpoint`.
Force-pushed from 0e39688 to 703cab5.
Force-pushed from d31b3fa to 7d5b1cf.
lgtm, but can you also please ask @lijunwangs to take a look and confirm there is no perf concern? Maybe also run bench-tps with this change just to be safe.
This reverts commit 7ebb8da.
Force-pushed from af978b4 to bf8a2ca.
Ran bench-tps; perf characteristics look pretty much the same. First is with the change, second without.
```rust
    Ok(())
}

pub fn update_keypair(&self, keypair: &Keypair) -> Result<(), RcgenError> {
```
This update works only when a new connection pool is created. How do we handle existing cached connections?
See connection-cache/src/connection_cache.rs line 141
I understand that, but I think it won't fix the existing connections' endpoints. We would need to do similar things to what you did on the server side, like:

```rust
let (config, _) = configure_server(key, self.gossip_host)?;
self.endpoint.set_server_config(Some(config));
Ok(())
```
Connection refs are only held by callers to get_connection as long as they
are used to do a send task, afterwards they are dropped. The endpoint is
configured the first time a connection pool is used. Therefore by clearing
the map, the first time get_connection is called afterwards, a new
connection pool is created, and the first time a connection is obtained
from the connection pool the endpoint will be given the new configuration.
In quic-client/src/lib.rs:

```diff
@@ -143,7 +156,23 @@ impl QuicConfig {
     ipaddr: IpAddr,
 ) -> Result<(), RcgenError> {
     let (cert, priv_key) = new_self_signed_tls_certificate(keypair, ipaddr)?;
-    self.client_certificate = Arc::new(QuicClientCertificate {
+    self.addr = ipaddr;
+
+    let mut cert_guard = self.client_certificate.write().unwrap();
+
+    *cert_guard = Arc::new(QuicClientCertificate {
+        certificate: cert,
+        key: priv_key,
+    });
+    Ok(())
+}
+
+pub fn update_keypair(&self, keypair: &Keypair) -> Result<(), RcgenError> {
```
Right, I missed the part where you cleared the cache.
Looks good to me
Update the Quic transport layer keypair and identity when the Validator's identity keypair is updated
Problem
See #31557
Summary of Changes
Implements key hotswapping for the quic server and connection cache
Fixes #