-
Notifications
You must be signed in to change notification settings - Fork 239
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Remove ThinClient
from LocalCluster
#1300
Remove ThinClient
from LocalCluster
#1300
Conversation
13c4b28
to
3aa9a50
Compare
3aa9a50
to
dd7ad73
Compare
Codecov ReportAll modified and coverable lines are covered by tests ✅
Additional details and impacted files@@ Coverage Diff @@
## master #1300 +/- ##
========================================
Coverage 82.1% 82.1%
========================================
Files 899 899
Lines 237263 237334 +71
========================================
+ Hits 194810 194920 +110
+ Misses 42453 42414 -39 |
tpu-client/src/tpu_client.rs
Outdated
pub fn send_and_confirm_transaction_with_retries<T: Signers + ?Sized>( | ||
&self, | ||
keypairs: &T, | ||
transaction: &mut Transaction, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we really need a reference or can consume here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
think it needs a reference because it's retrying, and resigning the tx
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
got it
@@ -95,6 +105,66 @@ where | |||
self.invoke(self.tpu_client.try_send_transaction(transaction)) | |||
} | |||
|
|||
/// Serialize and send transaction to the current and upcoming leader TPUs according to fanout |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would use .
and s
for 3rd person verbs endings everywhere or nowhere (not sure how we do in another places):
/// Serialize and send transaction to the current and upcoming leader TPUs according to fanout
/// size.
/// Attempt to send and confirm tx `tries` times.
/// Wait for signature confirmation before returning.
/// Return the transaction signature.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good call on the tense! will fix. None of the other TPU client descriptions use a period so I left them out
tpu-client/src/tpu_client.rs
Outdated
let mut num_confirmed = 0; | ||
let mut wait_time = MAX_PROCESSING_AGE; | ||
// resend the same transaction until the transaction has no chance of succeeding | ||
let wire_transaction = |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
maybe move serialization outside of the loop?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
as is I can't move it out of the loop because at the end of the loop (if the transaction fails to land), the transaction is modified to contain the latest blockhash to send again. See: https://github.com/gregcusack/solana/blob/dd7ad734635ae427916706438c3c0013165b1afe/tpu-client/src/tpu_client.rs#L157-L159
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ah, you are right
tpu-client/src/tpu_client.rs
Outdated
tries: usize, | ||
pending_confirmations: usize, | ||
) -> TransportResult<Signature> { | ||
for x in 0..tries { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would rename x
to something more informative
tpu-client/src/tpu_client.rs
Outdated
for tpu_address in &leaders { | ||
let cache = self.tpu_client.get_connection_cache(); | ||
let conn = cache.get_connection(tpu_address); | ||
conn.send_data_async(wire_transaction.clone())?; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like clone
is not needed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think the clone
is needed here because send_data_async()
takes ownership over wire_transaction
. And we are looping over the leaders and sending wire_transaction
to the upcoming leaders. plus it won't compile without the clone.
tpu-client/src/tpu_client.rs
Outdated
bincode::serialize(&transaction).expect("transaction serialization failed"); | ||
|
||
while now.elapsed().as_secs() < wait_time as u64 { | ||
let leaders = self |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't fully understand why you want to use leaders and connection cache directly instead of using TpuClient::send_wire_transaction
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good question. this is part of the issue I don't fully understand. If send_wire_transaction()
uses the non blocking TPUClient. Down the stack it calls:
let conn = connection_cache.get_nonblocking_connection(addr);
conn.send_data(&wire_transaction).await
And this gives the error described above in the PR description.
So i changed tpu_client to call: cache.get_connection(tpu_address)
which calls connection.new_blocking_connection()
. And this blocking connection works. but the nonblocking does not
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It sort of worrying. Do you know if this new method is going to be used only for integration tests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
it is concerning for sure. Although the concern I have is not necessarily with send_and_confirm_transaction_with_retries()
but with try_send_transaction()
, send_wire_transaction()
, and the nonblocking tpu client. Since calling send_wire_transaction()
even on its own or with a couple retries fails.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this function is supposed to be used for integration tests, I would add some comments explaining the situation and proceed. If this PR is not blocking your work, we can try to look together into it next week.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ya its just used/been tested with integration tests. I just added a note for this function explaining.
tpu-client/src/tpu_client.rs
Outdated
); | ||
} | ||
} | ||
log::info!("{x} tries failed transfer"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
would import log
if needed instead
transaction: VersionedTransaction, | ||
) -> TransportResult<Signature> { | ||
let wire_transaction = | ||
bincode::serialize(&transaction).expect("serialize Transaction in send_batch"); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think the expect message should use "should" by convention.
M: ConnectionManager<ConnectionPool = P, NewConnectionConfig = C>, | ||
C: NewConnectionConfig, | ||
{ | ||
fn async_send_versioned_transaction( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like these 2 functions are not used anywhere in this PR. Could you comment on why they are included?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ya these methods are required/used by client.async_transfer()
. See: https://github.com/gregcusack/solana/blob/57cede46cdd58e5f67e38bcef5783c670f12171d/local-cluster/tests/local_cluster.rs#L2925 and https://github.com/gregcusack/solana/blob/57cede46cdd58e5f67e38bcef5783c670f12171d/local-cluster/tests/local_cluster.rs#L2936
async_transfer()
is a function in the AsyncClient
trait. And AsyncClient
is defined for ThinClient
but not for TpuClient
. So when we switched to using TpuClient
we need to implement AsyncClient
for TpuClient
Arc::new(RpcClient::new(rpc_url)), | ||
rpc_pubsub_url.as_str(), | ||
TpuClientConfig::default(), | ||
cache.clone(), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
is it necessary to clone here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
ya it is because cache
is of type &Arc<ConnectionCache>
and new_with_connection_cache()
requires ownership over the Arc
aka Arc<ConnectionCache>
. If I try to match instead on connection_cache
instead of &*connection_cache
, then I end up matching on the Arc
instead of the underlying connection_cache
57cede4
to
bb85f32
Compare
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd like to see if we can slim the number or complexity of pub fns being added
tpu-client/src/tpu_client.rs
Outdated
/// They both invoke the nonblocking TPUClient and both fail when calling "transfer_with_client()" multiple times | ||
/// I do not full understand WHY the nonblocking TPUClient fails in this specific case. But the method defined below | ||
/// does work although it has only been tested in LocalCluster integration tests | ||
pub fn send_and_confirm_transaction_with_retries<T: Signers + ?Sized>( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What is the minimal amount of logic in this function that is necessary to avoid the ConnectError(EndpointStopping)
error in local cluster?
Is it getting the ConnectionCache and using send_data_async()
directly that resolves the error?
I am wondering if we can put the retry/resign/confirm logic in a local-cluster function, instead of adding such a new, heavy api in TpuClient
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good question/point. I'll look into this. I had designed this to be similar to what ThinClient
had with its retries. But I guess there is a reason we are getting rid of ThinClient
lol.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
update: we do need some retry logic here. using the ConnectionCache
and send_data_async()
alone does not resolve the error. Going to try and put retry logic in a separate LocalCluster function
EDIT: we actually do NOT need retries here. This has been updated to just send transactions to the upcoming leaders in the schedule. The local cluster tests themselves will poll for the signature confirmation if they want to. TpuClient API is updated to be simple, take in tx, send to next leaders.
3d3f905
to
fab2cbe
Compare
fab2cbe
to
d7a650d
Compare
* setup tpu client methods required for localcluster to use TpuClient * add new_tpu_quic_client() for local cluster tests * update local-cluster src files to use TpuClient. tests next * finish removing thinclient from localcluster * address comments * add note for send_and_confirm_transaction_with_retries * remove retry logic from tpu-client. Send directly to upcoming leaders without retry.
A Followup PR to: #258.
This is the 10th PR on the way to remove ThinClient completely.
Problem
ThinClient
is deprecated. Replacing withTpuClient
Summary of Changes
Replace
ThinClient
withTpuClient
inLocalCluster
Specifically, we use a Quic TpuClient, not a UDP TpuClient
Notes
This works but I can't say for sure why it works.
I had initially replaced all of
ThinClient
withinLocalCluster
with the nonblocking version ofTpuClient
. I had made the switch such thattransfer_with_client()
callsTpuClient.try_send_transaction()
which callssend_wire_transaction_to_addr()
.send_wire_transaction_to_addr()
uses a non-blocking connection from the connection_cache to send the transaction.However, using the changes above, I consistently ran into an issue with any test that called
add_validator()
more than once. Under the hood,add_validator()
callstransfer_with_client()
.transfer_with_client()
transfers some amount of lamports to a destination account (aka the account associated with the validator we are trying to add). However in theLocalCluster
tests,transfer_with_client()
would consistently fail when adding the second validator withadd_validator()
. It would always fail with:transport custom error: "ConnectError(EndpointStopping)
.I noticed that
ThinClient
used a blocking connection when sending transactions withretry_transfer()
(which was used to send a transaction withintransfer_with_client()
. So, I switched over to using the blockingTpuClient
and that seems to work well. I had to make a few changes to expose the leader schedule, connection cache, and fanout slots from thenonblocking/tpu_client
up to the blockingtpu_client
.I am still working on investigating. But thinks its probably a good idea to get some eyes on this since it does work as expected now.