Quic update identity #33865

Merged

Conversation

ryleung-solana
Contributor

Problem

See #31557

Summary of Changes

Implements key hotswapping for the quic server and connection cache

Fixes #

@ryleung-solana ryleung-solana changed the title from "Quic update identity new" to "Quic update identity" on Oct 26, 2023

codecov bot commented Nov 6, 2023

Codecov Report

Merging #33865 (ddd1efb) into master (b97b3dd) will decrease coverage by 0.1%.
Report is 38 commits behind head on master.
The diff coverage is 65.1%.

Additional details and impacted files
@@            Coverage Diff            @@
##           master   #33865     +/-   ##
=========================================
- Coverage    81.9%    81.9%   -0.1%     
=========================================
  Files         819      819             
  Lines      220013   220085     +72     
=========================================
  Hits       180353   180353             
- Misses      39660    39732     +72     

connection-cache/src/connection_cache.rs Outdated
Comment on lines 142 to 145
let mut map = self.map.write().unwrap();
map.clear();
let mut connection_manager = self.connection_manager.write().unwrap();
connection_manager.update_key(key);
Contributor

Doesn't this cause a dead-lock somewhere?
i.e. don't we have anywhere else in the code that locks self.connection_manager and then later locks self.map?

Contributor Author

As far as I can see, the order is always self.map -> self.connection_manager so this should not cause a deadlock.

Contributor

Can we change the order of locking and release the lock so we don't hold the lock on both?
something like:

connection_manager.write().unwrap().update_key(key);
self.map.write().unwrap().clear();

This might discard some ok connections created between the 1st and the 2nd lock but I think that is better than the possibility of a dead-lock in the future.
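A minimal sketch of the ordering suggested above, under stated assumptions: `Cache` and `Manager` are hypothetical stand-ins for the connection cache and connection manager, and a `String` plays the role of the keypair. Each write guard is a temporary that is dropped at the end of its statement, so the two locks are never held together.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};

// Hypothetical stand-in for the connection manager; `key` models the keypair.
struct Manager {
    key: String,
}

impl Manager {
    fn update_key(&mut self, key: &str) {
        self.key = key.to_string();
    }
}

// Hypothetical stand-in for the connection cache.
struct Cache {
    map: Arc<RwLock<HashMap<String, String>>>,
    connection_manager: Arc<RwLock<Manager>>,
}

impl Cache {
    fn update_key(&self, key: &str) {
        // The guard is a temporary: it is released at the end of this statement.
        self.connection_manager.write().unwrap().update_key(key);
        // A pool created between the two statements already uses the new key
        // but is discarded here anyway; that is the cost of this ordering.
        self.map.write().unwrap().clear();
    }
}

fn main() {
    let cache = Cache {
        map: Arc::new(RwLock::new(HashMap::new())),
        connection_manager: Arc::new(RwLock::new(Manager {
            key: "old-key".to_string(),
        })),
    };
    cache
        .map
        .write()
        .unwrap()
        .insert("127.0.0.1:8001".to_string(), "pool(old-key)".to_string());

    cache.update_key("new-key");

    assert!(cache.map.read().unwrap().is_empty());
    assert_eq!(cache.connection_manager.read().unwrap().key, "new-key");
}
```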

Contributor Author

I just added a comment stating the required locking order. I think that is the better solution, since it ensures stricter correctness, and it's quite common to simply establish a locking order to prevent deadlocks, rather than not lock multiple locks at all.

Contributor

There is no way to enforce the comment.
Someone might change/add code later unaware of that comment there.
What is the problem with my suggestion above? #33865 (comment)

Contributor Author

@ryleung-solana ryleung-solana Nov 15, 2023

> This might discard some ok connections created between the 1st and the 2nd lock

There's also no way to enforce that only 1 of the locks can be held at a time (and indeed, in create_connection_internal, we need to hold both). It's not uncommon to just specify a locking order that future programmers must follow when dealing with multiple locks. Not a massive deal, but since we have 2 locks anyway, I would prefer to do the most strictly correct thing and clear the cache then update the key while holding the cache lock.
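The fixed locking order described in this thread can be sketched roughly as follows. This is an illustration under assumptions, not the PR's code: `Cache`, `Manager`, and `create_connection` are hypothetical stand-ins, and a `String` models the keypair. Every path takes `map` first and `connection_manager` second, so two threads can never wait on each other in opposite order.

```rust
use std::collections::HashMap;
use std::sync::{Arc, RwLock};
use std::thread;

// Hypothetical stand-in for the connection manager; `key` models the keypair.
struct Manager {
    key: String,
}

// Hypothetical stand-in for the connection cache.
// Locking order: `map` first, then `connection_manager`.
struct Cache {
    map: RwLock<HashMap<String, String>>,
    connection_manager: RwLock<Manager>,
}

impl Cache {
    fn update_key(&self, key: &str) {
        // Clear the cache, then swap the key while still holding the map lock,
        // so no pool built with the old key can slip in between.
        let mut map = self.map.write().unwrap();
        map.clear();
        let mut manager = self.connection_manager.write().unwrap();
        manager.key = key.to_string();
    }

    fn create_connection(&self, addr: &str) -> String {
        // Same order as `update_key`: map, then manager.
        let mut map = self.map.write().unwrap();
        let manager = self.connection_manager.read().unwrap();
        let pool = map
            .entry(addr.to_string())
            .or_insert_with(|| format!("pool({})", manager.key))
            .clone();
        pool
    }
}

fn main() {
    let cache = Arc::new(Cache {
        map: RwLock::new(HashMap::new()),
        connection_manager: RwLock::new(Manager {
            key: "old-key".to_string(),
        }),
    });

    let writer = {
        let cache = Arc::clone(&cache);
        thread::spawn(move || {
            for i in 0..100 {
                cache.update_key(&format!("key-{}", i));
            }
        })
    };
    let reader = {
        let cache = Arc::clone(&cache);
        thread::spawn(move || {
            for _ in 0..100 {
                cache.create_connection("127.0.0.1:8001");
            }
        })
    };
    writer.join().unwrap();
    reader.join().unwrap();

    cache.update_key("final-key");
    assert_eq!(cache.create_connection("127.0.0.1:8001"), "pool(final-key)");
}
```

The compiler does not enforce the order; it remains a convention that every new code path has to follow, which is the concern raised in the review.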

quic-client/src/lib.rs Outdated
sdk/src/quic.rs Outdated
Comment on lines 145 to 153
     ipaddr: Option<IpAddr>,
 ) -> Result<(), RcgenError> {
-    let (cert, priv_key) = new_self_signed_tls_certificate(keypair, ipaddr)?;
+    let (cert, priv_key) = if let Some(ip) = ipaddr {
+        let res = new_self_signed_tls_certificate(keypair, ip)?;
+        self.ip = ip;
+        res
+    } else {
+        new_self_signed_tls_certificate(keypair, self.ip)?
+    };
Contributor

EndpointKeyUpdater in streamer is storing the gossip_host ip address.
Can't we do a similar thing for connection-cache so that this method doesn't have to do the Option<IpAddr> hack?

Contributor Author

I found it more convenient to just store the IP in QuicConfig, since that's where we deal with the socket anyway; bubbling it up just to store it somewhere else seems like a little too much added complexity.
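A rough sketch of the "store the IP in the config" pattern discussed here. All names are hypothetical: `Config` stands in for `QuicConfig`, `make_certificate` for the certificate helper, and `update_certificate` for the method under review. Error handling is omitted, whereas the code under review only stores the new address after certificate generation succeeds.

```rust
use std::net::{IpAddr, Ipv4Addr};

// Hypothetical stand-in for certificate generation; returns a label, not a certificate.
fn make_certificate(key: &str, ip: IpAddr) -> String {
    format!("cert(key={}, ip={})", key, ip)
}

// Hypothetical stand-in for the client config, which remembers the last IP it was given.
struct Config {
    ip: IpAddr,
    certificate: String,
}

impl Config {
    fn new(key: &str, ip: IpAddr) -> Self {
        Config {
            ip,
            certificate: make_certificate(key, ip),
        }
    }

    // `Some(ip)` replaces the stored address; `None` reuses the stored one.
    fn update_certificate(&mut self, key: &str, ipaddr: Option<IpAddr>) {
        if let Some(ip) = ipaddr {
            self.ip = ip;
        }
        self.certificate = make_certificate(key, self.ip);
    }
}

fn main() {
    let mut config = Config::new("key-1", IpAddr::V4(Ipv4Addr::new(10, 0, 0, 1)));

    // Key rotation without a new address keeps the stored IP.
    config.update_certificate("key-2", None);
    assert_eq!(config.certificate, "cert(key=key-2, ip=10.0.0.1)");

    // Supplying an address updates both the stored IP and the certificate.
    config.update_certificate("key-3", Some(IpAddr::V4(Ipv4Addr::LOCALHOST)));
    assert_eq!(config.certificate, "cert(key=key-3, ip=127.0.0.1)");
}
```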

streamer/src/quic.rs Outdated
@@ -47,7 +48,7 @@ pub struct ConnectionCache<
 > {
     name: &'static str,
     map: Arc<RwLock<IndexMap<SocketAddr, /*ConnectionPool:*/ R>>>,
-    connection_manager: Arc<S>,
+    connection_manager: Arc<RwLock<S>>,
Contributor

The RwLock here could cause deadlocks or lock-contention.
Is there an alternative way avoiding this?

Contributor Author

Agreed that this is not ideal. Since the map is always locked when connection_manager is locked (but not vice versa), it should be possible to just use 1 lock to protect both. However, while I have tried to remove this lock, I have not been able to make Rust happy (non-building code at https://github.com/ryleung-solana/solana/tree/quic-update-identity-test if you have suggestions).
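For illustration only, the "one lock protecting both" idea could look like the toy sketch below. `State`, `Cache`, `Manager`, and `get_or_create` are hypothetical names, and the sketch ignores the generic parameters of the real `ConnectionCache`, which may well be where the difficulty mentioned above comes from.

```rust
use std::collections::HashMap;
use std::sync::RwLock;

// Hypothetical stand-in for the connection manager; `key` models the keypair.
struct Manager {
    key: String,
}

// Both pieces of state live behind a single lock, so no ordering rule is needed.
struct State {
    map: HashMap<String, String>,
    manager: Manager,
}

struct Cache {
    state: RwLock<State>,
}

impl Cache {
    fn update_key(&self, key: &str) {
        let mut state = self.state.write().unwrap();
        state.map.clear();
        state.manager.key = key.to_string();
    }

    fn get_or_create(&self, addr: &str) -> String {
        let mut guard = self.state.write().unwrap();
        // Reborrow through the guard once so the two fields can be borrowed separately.
        let state = &mut *guard;
        let key = &state.manager.key;
        let pool = state
            .map
            .entry(addr.to_string())
            .or_insert_with(|| format!("pool({})", key))
            .clone();
        pool
    }
}

fn main() {
    let cache = Cache {
        state: RwLock::new(State {
            map: HashMap::new(),
            manager: Manager {
                key: "old-key".to_string(),
            },
        }),
    };
    assert_eq!(cache.get_or_create("127.0.0.1:8001"), "pool(old-key)");

    cache.update_key("new-key");
    assert_eq!(cache.get_or_create("127.0.0.1:8001"), "pool(new-key)");
}
```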

res
} else {
new_self_signed_tls_certificate(keypair, self.ip)?
};
self.client_certificate = Arc::new(QuicClientCertificate {
Contributor

When does this call into set_default_client_config?
https://docs.rs/quinn/0.10.2/quinn/struct.Endpoint.html#method.set_default_client_config

This doesn't take effect until set_default_client_config is called.

Contributor Author

See QuicLazyInitializedEndpoint::create_endpoint

Contributor

@ryleung-solana you marked this resolved, but I do not know what you have changed to resolve this.

Contributor Author

In QuicConfig::create_endpoint we create a new QuicLazyInitializedEndpoint struct. This, combined with clearing the connection cache, means that the first time we get a new connection from the connection cache, we will call set_default_client_config on the Endpoint.
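The lazy-initialization behaviour described here can be modelled with a small sketch. This is not the quinn API: `LazyEndpoint` and `Config` are hypothetical stand-ins for `QuicLazyInitializedEndpoint` and `QuicConfig`, a `String` models the client certificate, and `std::sync::OnceLock` models the one-time client configuration of the endpoint.

```rust
use std::sync::{Arc, OnceLock};

// Hypothetical stand-in for a lazily initialized endpoint.
struct LazyEndpoint {
    certificate: String,
    // Set on first use; models the one-time client configuration step.
    configured_with: OnceLock<String>,
}

impl LazyEndpoint {
    // The first call configures the endpoint; later calls reuse that configuration.
    fn get(&self) -> &str {
        self.configured_with.get_or_init(|| self.certificate.clone())
    }
}

// Hypothetical stand-in for the client config holding the current certificate.
struct Config {
    certificate: String,
}

impl Config {
    // Every call builds a fresh, not yet configured endpoint from the current certificate.
    fn create_endpoint(&self) -> Arc<LazyEndpoint> {
        Arc::new(LazyEndpoint {
            certificate: self.certificate.clone(),
            configured_with: OnceLock::new(),
        })
    }
}

fn main() {
    let mut config = Config {
        certificate: "cert-old".to_string(),
    };
    let old_endpoint = config.create_endpoint();
    assert_eq!(old_endpoint.get(), "cert-old");

    // Rotating the certificate does not touch endpoints that already exist...
    config.certificate = "cert-new".to_string();
    assert_eq!(old_endpoint.get(), "cert-old");

    // ...but an endpoint created afterwards is configured with it on first use.
    let new_endpoint = config.create_endpoint();
    assert!(new_endpoint.configured_with.get().is_none());
    assert_eq!(new_endpoint.get(), "cert-new");
}
```

In this model, clearing the cache is what forces callers onto a freshly created endpoint; without it, the old endpoint and its old configuration would keep being reused.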

Contributor

@behzadnouri behzadnouri left a comment


lgtm, but can you also please ask @lijunwangs to take a look and confirm there is no perf concern?
maybe also run the bench-tps with this change just to be safe.

core/src/validator.rs Outdated
@ryleung-solana
Contributor Author

> lgtm, but can you also please ask @lijunwangs to take a look and confirm there is no perf concern? maybe also run the bench-tps with this change just to be safe.

Ran bench-tps, perf characteristics look pretty much the same. First is with the change, second without.
Screenshot from 2023-12-02 03-36-52

Screenshot from 2023-12-02 05-00-08

validator/src/main.rs Outdated
Ok(())
}

pub fn update_keypair(&self, keypair: &Keypair) -> Result<(), RcgenError> {
Contributor

This update works only when a new connection pool is created. How do we handle existing cached connections?

Contributor Author

See connection-cache/src/connection_cache.rs line 141

Contributor

I understand that, but I think it won't fix the existing connections' endpoints. We would need to do something similar to what you did on the server side, like:

    let (config, _) = configure_server(key, self.gossip_host)?;
    self.endpoint.set_server_config(Some(config));
    Ok(())

@ryleung-solana
Contributor Author

ryleung-solana commented Dec 7, 2023 via email

@lijunwangs
Contributor

> Connection refs are only held by callers to get_connection as long as they are used to do a send task, afterwards they are dropped. The endpoint is configured the first time a connection pool is used. Therefore by clearing the map, the first time get_connection is called afterwards, a new connection pool is created, and the first time a connection is obtained from the connection pool the endpoint will be given the new configuration.

Right, I missed the part where you cleared the cache.
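A small sketch of why clearing the map is enough, assuming callers hold reference-counted handles only for the duration of a send. `Cache`, `Pool`, and `State` are hypothetical stand-ins, and a `String` models the keypair; the real connection types differ.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

// Hypothetical stand-in for a connection pool, tagged with the key it was built with.
struct Pool {
    key: String,
}

struct State {
    key: String,
    map: HashMap<String, Arc<Pool>>,
}

// Hypothetical stand-in for the connection cache.
struct Cache {
    state: Mutex<State>,
}

impl Cache {
    // Returns a shared handle; the caller drops it once its send is done.
    fn get_connection(&self, addr: &str) -> Arc<Pool> {
        let mut guard = self.state.lock().unwrap();
        let state = &mut *guard;
        let key = &state.key;
        let pool = state
            .map
            .entry(addr.to_string())
            .or_insert_with(|| Arc::new(Pool { key: key.clone() }));
        Arc::clone(&*pool)
    }

    fn update_key(&self, key: &str) {
        let mut state = self.state.lock().unwrap();
        state.map.clear();
        state.key = key.to_string();
    }
}

fn main() {
    let cache = Cache {
        state: Mutex::new(State {
            key: "old-key".to_string(),
            map: HashMap::new(),
        }),
    };

    // An in-flight send holds one reference; the cache holds the other.
    let in_flight = cache.get_connection("127.0.0.1:8001");
    assert_eq!(Arc::strong_count(&in_flight), 2);

    // Clearing the map drops the cache's reference; the in-flight handle
    // stays valid and still uses the old key until the caller drops it.
    cache.update_key("new-key");
    assert_eq!(Arc::strong_count(&in_flight), 1);
    assert_eq!(in_flight.key, "old-key");
    drop(in_flight);

    // The next lookup builds a fresh pool with the new key.
    assert_eq!(cache.get_connection("127.0.0.1:8001").key, "new-key");
}
```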

Contributor

@lijunwangs lijunwangs left a comment


Looks good to me

@ryleung-solana ryleung-solana merged commit 132c910 into solana-labs:master Dec 8, 2023
44 checks passed
lijunwangs pushed a commit to lijunwangs/solana that referenced this pull request Apr 2, 2024
Update the Quic transport layer keypair and identity when the Validator's identity keypair is updated
lijunwangs pushed a commit to lijunwangs/solana that referenced this pull request Apr 18, 2024
Update the Quic transport layer keypair and identity when the Validator's identity keypair is updated

Quic update identity (solana-labs#33865)

Update the Quic transport layer keypair and identity when the Validator's identity keypair is updated