
Remove ThinClient from LocalCluster #1300

Merged
merged 7 commits into from
Jun 13, 2024

Conversation

gregcusack

A follow-up PR to #258.
This is the 10th PR on the way to removing ThinClient completely.

Problem

ThinClient is deprecated. It is being replaced with TpuClient.

Summary of Changes

Replace ThinClient with TpuClient in LocalCluster.
Specifically, we use a QUIC TpuClient, not a UDP TpuClient.

Notes

This works but I can't say for sure why it works.
I had initially replaced all of ThinClient within LocalCluster with the nonblocking version of TpuClient. I had made the switch such that transfer_with_client() calls TpuClient.try_send_transaction() which calls send_wire_transaction_to_addr(). send_wire_transaction_to_addr() uses a non-blocking connection from the connection_cache to send the transaction.

However, using the changes above, I consistently ran into an issue with any test that called add_validator() more than once. Under the hood, add_validator() calls transfer_with_client(), which transfers some amount of lamports to a destination account (i.e. the account associated with the validator we are trying to add). In the LocalCluster tests, transfer_with_client() would consistently fail when adding the second validator with add_validator(). It would always fail with: transport custom error: ConnectError(EndpointStopping).

I noticed that ThinClient used a blocking connection when sending transactions with retry_transfer() (which was used to send a transaction within transfer_with_client()). So I switched over to using the blocking TpuClient, and that seems to work well. I had to make a few changes to expose the leader schedule, connection cache, and fanout slots from the nonblocking/tpu_client up to the blocking tpu_client.

I am still investigating, but I think it's probably a good idea to get some eyes on this since it does work as expected now.
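The failure mode described in the notes can be modeled roughly as follows. This is a hypothetical, simplified sketch — none of these names are the real agave types — illustrating the shape of the problem under the assumption that the nonblocking path reuses a shared endpoint that can be torn down between calls, while the blocking path does not depend on that endpoint's lifecycle:

```rust
// Illustrative model of the two send paths: the nonblocking path (which
// failed with ConnectError(EndpointStopping) once a second validator was
// added) versus the blocking path this PR switches to. Hypothetical types.

#[derive(Debug, PartialEq)]
pub enum SendError {
    EndpointStopping,
}

pub struct ConnectionCache {
    // In the real crate this caches QUIC connections; here we only model
    // whether the shared nonblocking endpoint is still alive.
    pub nonblocking_endpoint_alive: bool,
}

impl ConnectionCache {
    // Nonblocking path: fails once the shared endpoint is stopping.
    pub fn send_nonblocking(&self, _wire_tx: &[u8]) -> Result<(), SendError> {
        if self.nonblocking_endpoint_alive {
            Ok(())
        } else {
            Err(SendError::EndpointStopping)
        }
    }

    // Blocking path: modeled as opening a fresh connection per send, so it
    // does not depend on the shared endpoint's lifecycle.
    pub fn send_blocking(&self, _wire_tx: &[u8]) -> Result<(), SendError> {
        Ok(())
    }
}
```

This only captures the symptom (the second send path keeps working where the first fails); the actual root cause in the QUIC endpoint lifecycle is what the PR author says is still under investigation.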

@gregcusack gregcusack force-pushed the tpu-client-for-local-cluster branch from 13c4b28 to 3aa9a50 Compare May 10, 2024 20:19
@gregcusack gregcusack force-pushed the tpu-client-for-local-cluster branch from 3aa9a50 to dd7ad73 Compare May 10, 2024 21:29
@codecov-commenter

codecov-commenter commented May 10, 2024

Codecov Report

All modified and coverable lines are covered by tests ✅

Project coverage is 82.1%. Comparing base (e621336) to head (bb85f32).
Report is 45 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff            @@
##           master    #1300    +/-   ##
========================================
  Coverage    82.1%    82.1%            
========================================
  Files         899      899            
  Lines      237263   237334    +71     
========================================
+ Hits       194810   194920   +110     
+ Misses      42453    42414    -39     

pub fn send_and_confirm_transaction_with_retries<T: Signers + ?Sized>(
&self,
keypairs: &T,
transaction: &mut Transaction,


Do we really need a reference or can consume here?


think it needs a reference because it's retrying, and resigning the tx


got it

@@ -95,6 +105,66 @@ where
self.invoke(self.tpu_client.try_send_transaction(transaction))
}

/// Serialize and send transaction to the current and upcoming leader TPUs according to fanout


I would use periods and third-person verb endings (-s) either everywhere or nowhere (not sure how we do it in other places):

/// Serialize and send transaction to the current and upcoming leader TPUs according to fanout
/// size.
/// Attempt to send and confirm tx `tries` times.
/// Wait for signature confirmation before returning.
/// Return the transaction signature.


Good call on the tense! Will fix. None of the other TPU client descriptions use a period, so I left them out.

let mut num_confirmed = 0;
let mut wait_time = MAX_PROCESSING_AGE;
// resend the same transaction until the transaction has no chance of succeeding
let wire_transaction =


maybe move serialization outside of the loop?


As is, I can't move it out of the loop: at the end of the loop (if the transaction fails to land), the transaction is modified to contain the latest blockhash and sent again. See: https://github.com/gregcusack/solana/blob/dd7ad734635ae427916706438c3c0013165b1afe/tpu-client/src/tpu_client.rs#L157-L159
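The point in this reply can be sketched with simplified stand-in types (not the real agave API): each failed attempt re-signs the transaction with a fresh blockhash, which changes the wire bytes, so serialization must happen inside the loop.

```rust
// Illustrative sketch of why serialization cannot be hoisted out of the
// retry loop: re-signing with a new blockhash changes the serialized bytes.

#[derive(Clone)]
pub struct Transaction {
    pub blockhash: u64,
    pub payload: Vec<u8>,
}

impl Transaction {
    pub fn sign(&mut self, latest_blockhash: u64) {
        self.blockhash = latest_blockhash;
    }

    pub fn serialize(&self) -> Vec<u8> {
        let mut bytes = self.blockhash.to_le_bytes().to_vec();
        bytes.extend_from_slice(&self.payload);
        bytes
    }
}

// Returns the wire bytes of each attempt; a real client would send them and
// poll for confirmation instead of collecting them.
pub fn retry_wire_bytes(tx: &mut Transaction, tries: usize, blockhashes: &[u64]) -> Vec<Vec<u8>> {
    let mut sent = Vec::new();
    for attempt in 0..tries {
        // Must re-serialize on every attempt: the blockhash changes below.
        sent.push(tx.serialize());
        // Simulate the transaction failing to land: re-sign with the latest blockhash.
        tx.sign(blockhashes[attempt]);
    }
    sent
}
```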


ah, you are right

tries: usize,
pending_confirmations: usize,
) -> TransportResult<Signature> {
for x in 0..tries {


I would rename x to something more informative

for tpu_address in &leaders {
let cache = self.tpu_client.get_connection_cache();
let conn = cache.get_connection(tpu_address);
conn.send_data_async(wire_transaction.clone())?;


It looks like the clone is not needed.


@gregcusack gregcusack May 14, 2024


I think the clone is needed here because send_data_async() takes ownership of wire_transaction, and we are looping over the leaders, sending wire_transaction to each upcoming leader. Plus, it won't compile without the clone.
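The ownership point in this reply can be sketched with hypothetical types (the names mirror the discussion but are not the real agave signatures): a send API that takes its buffer by value forces a per-iteration copy when the same bytes go to several leaders.

```rust
// Illustrative sketch: sending one buffer to multiple connections when the
// send method takes the buffer by value.

pub struct Conn {
    pub sent: Vec<Vec<u8>>,
}

impl Conn {
    // Takes ownership of the buffer, like the send_data_async() discussed above.
    pub fn send_data_async(&mut self, wire_transaction: Vec<u8>) {
        self.sent.push(wire_transaction);
    }
}

pub fn fan_out(conns: &mut [Conn], wire_transaction: &[u8]) {
    for conn in conns.iter_mut() {
        // Without a per-iteration copy this would not compile: ownership of
        // the buffer would move into the first call.
        conn.send_data_async(wire_transaction.to_vec());
    }
}
```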

bincode::serialize(&transaction).expect("transaction serialization failed");

while now.elapsed().as_secs() < wait_time as u64 {
let leaders = self


I don't fully understand why you want to use the leaders and the connection cache directly instead of using TpuClient::send_wire_transaction.


@gregcusack gregcusack May 14, 2024


Good question; this is part of the issue I don't fully understand. send_wire_transaction() uses the nonblocking TpuClient. Down the stack it calls:

let conn = connection_cache.get_nonblocking_connection(addr);
conn.send_data(&wire_transaction).await

And this gives the error described above in the PR description.
So I changed tpu_client to call cache.get_connection(tpu_address), which calls connection.new_blocking_connection(). This blocking connection works, but the nonblocking one does not.


@KirillLykov KirillLykov May 15, 2024


It is sort of worrying. Do you know if this new method is going to be used only for integration tests?



It is concerning, for sure. Although the concern I have is not necessarily with send_and_confirm_transaction_with_retries() but with try_send_transaction(), send_wire_transaction(), and the nonblocking TPU client, since calling send_wire_transaction() even on its own, or with a couple of retries, fails.


@KirillLykov KirillLykov May 17, 2024


If this function is supposed to be used for integration tests, I would add some comments explaining the situation and proceed. If this PR is not blocking your work, we can try to look together into it next week.



Yeah, it's only used with (and has only been tested in) integration tests. I just added a note to this function explaining that.

);
}
}
log::info!("{x} tries failed transfer");


I would import log if needed instead.

transaction: VersionedTransaction,
) -> TransportResult<Signature> {
let wire_transaction =
bincode::serialize(&transaction).expect("serialize Transaction in send_batch");


I think the expect message should use "should" by convention.

M: ConnectionManager<ConnectionPool = P, NewConnectionConfig = C>,
C: NewConnectionConfig,
{
fn async_send_versioned_transaction(


Looks like these 2 functions are not used anywhere in this PR. Could you comment on why they are included?


@gregcusack gregcusack May 14, 2024


Yeah, these methods are required/used by client.async_transfer(). See: https://github.com/gregcusack/solana/blob/57cede46cdd58e5f67e38bcef5783c670f12171d/local-cluster/tests/local_cluster.rs#L2925 and https://github.com/gregcusack/solana/blob/57cede46cdd58e5f67e38bcef5783c670f12171d/local-cluster/tests/local_cluster.rs#L2936

async_transfer() is a function in the AsyncClient trait, and AsyncClient is implemented for ThinClient but not for TpuClient. So when we switch to using TpuClient, we need to implement AsyncClient for TpuClient.
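The trait point in this reply can be sketched with simplified stand-ins (not the real agave trait or types): a trait previously implemented only for the old client must gain an impl for the new one.

```rust
// Illustrative sketch: implementing a hypothetical AsyncClient-style trait
// for a TpuClient stand-in so callers of async_transfer()-like methods keep
// working after the client swap.

pub struct VersionedTransaction;

#[derive(Debug, PartialEq)]
pub struct Signature(pub u64);

pub trait AsyncClient {
    // Fire-and-forget send used by the transfer helpers in the tests.
    fn async_send_versioned_transaction(
        &self,
        transaction: VersionedTransaction,
    ) -> Result<Signature, String>;
}

pub struct TpuClient;

impl AsyncClient for TpuClient {
    fn async_send_versioned_transaction(
        &self,
        _transaction: VersionedTransaction,
    ) -> Result<Signature, String> {
        // The real impl would serialize and send to the upcoming leader TPUs.
        Ok(Signature(0))
    }
}
```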

Arc::new(RpcClient::new(rpc_url)),
rpc_pubsub_url.as_str(),
TpuClientConfig::default(),
cache.clone(),


is it necessary to clone here?



Yeah, it is, because cache is of type &Arc&lt;ConnectionCache&gt; and new_with_connection_cache() requires ownership of the Arc, i.e. Arc&lt;ConnectionCache&gt;. If I try to match on connection_cache instead of &*connection_cache, then I end up matching on the Arc instead of the underlying ConnectionCache.
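The Arc point in this reply can be sketched with simplified stand-in types (not the real agave constructor): the callee wants an owned Arc while the caller holds a reference to one, so cache.clone() bridges the gap with a cheap refcount bump, not a deep copy of the cache.

```rust
// Illustrative sketch: passing an owned Arc handle from a borrowed one.

use std::sync::Arc;

pub struct ConnectionCache;

pub struct TpuClient {
    pub cache: Arc<ConnectionCache>,
}

// Stand-in for a constructor that requires ownership of the Arc.
pub fn new_with_connection_cache(cache: Arc<ConnectionCache>) -> TpuClient {
    TpuClient { cache }
}

pub fn build_from_borrowed(cache: &Arc<ConnectionCache>) -> TpuClient {
    // Clone the Arc handle; the underlying cache is shared, not copied.
    new_with_connection_cache(cache.clone())
}
```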

@gregcusack gregcusack force-pushed the tpu-client-for-local-cluster branch from 57cede4 to bb85f32 Compare May 14, 2024 18:32
@gregcusack gregcusack requested a review from KirillLykov May 14, 2024 20:46
KirillLykov previously approved these changes May 29, 2024

@CriesofCarrots CriesofCarrots left a comment


I'd like to see if we can slim down the number or complexity of the pub fns being added.

/// They both invoke the nonblocking TpuClient and both fail when calling transfer_with_client() multiple times.
/// I do not fully understand WHY the nonblocking TpuClient fails in this specific case, but the method defined below
/// does work, although it has only been tested in LocalCluster integration tests.
pub fn send_and_confirm_transaction_with_retries<T: Signers + ?Sized>(


What is the minimal amount of logic in this function that is necessary to avoid the ConnectError(EndpointStopping) error in local cluster?
Is it getting the ConnectionCache and using send_data_async() directly that resolves the error?

I am wondering if we can put the retry/resign/confirm logic in a local-cluster function, instead of adding such a new, heavy api in TpuClient.



Good question/point. I'll look into this. I had designed this to be similar to what ThinClient had with its retries. But I guess there is a reason we are getting rid of ThinClient, lol.


@gregcusack gregcusack Jun 10, 2024


Update: we do need some retry logic here. Using the ConnectionCache and send_data_async() alone does not resolve the error. Going to try to put the retry logic in a separate LocalCluster function.

EDIT: we actually do NOT need retries here. This has been updated to just send transactions to the upcoming leaders in the schedule. The local-cluster tests themselves will poll for signature confirmation if they want to. The TpuClient API is updated to be simple: take in a tx, send it to the next leaders.
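The simplified API shape described in this edit — serialize once and send to the next few leaders in the schedule, with no retry or confirmation inside the client — can be sketched as follows. The helper below is hypothetical, not the real agave code:

```rust
// Illustrative sketch: pick the current and upcoming leaders from a
// (wrapping) leader schedule according to a fanout count; callers poll for
// signature confirmation themselves.

pub fn upcoming_leaders(schedule: &[&str], current_slot: usize, fanout: usize) -> Vec<String> {
    (0..fanout)
        .map(|i| schedule[(current_slot + i) % schedule.len()].to_string())
        .collect()
}
```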

@gregcusack gregcusack force-pushed the tpu-client-for-local-cluster branch 3 times, most recently from 3d3f905 to fab2cbe Compare June 12, 2024 16:19
@gregcusack gregcusack force-pushed the tpu-client-for-local-cluster branch from fab2cbe to d7a650d Compare June 12, 2024 16:25
@gregcusack gregcusack merged commit 63fb9fe into anza-xyz:master Jun 13, 2024
40 checks passed
@gregcusack gregcusack deleted the tpu-client-for-local-cluster branch June 13, 2024 17:31
samkim-crypto pushed a commit to samkim-crypto/agave that referenced this pull request Jul 31, 2024
* setup tpu client methods required for localcluster to use TpuClient

* add new_tpu_quic_client() for local cluster tests

* update local-cluster src files to use TpuClient. tests next

* finish removing thinclient from localcluster

* address comments

* add note for send_and_confirm_transaction_with_retries

* remove retry logic from tpu-client. Send directly to upcoming leaders without retry.
5 participants