Adjust receive window to make them linear to the count of streams #33913
Conversation
Codecov Report
@@            Coverage Diff            @@
##           master   #33913   +/-   ##
=========================================
  Coverage     81.9%    81.9%
=========================================
  Files          809      809
  Lines       218062   218062
=========================================
+ Hits        178616   178660      +44
+ Misses       39446    39402      -44
In the bench-tps tests with the thin client, using the following command:

lijun@lijun-dev:~/solana$ ./cargo run --release --bin solana-bench-tps -- -u http://35.233.177.221:8899 --identity /home/lijun/.config/solana/id.json --tx_count 1000 --thread-batch-sleep-ms 0 -t 20 --duration 300 -n 35.233.177.221:8001 --read-client-keys /home/lijun/gce-keypairs.yaml

The cluster has 5 nodes: 4 running in us-west and one in eu-west. When the selected target node is local to the client, the results with and without the change are similar.

With the change:
[2023-10-30T18:37:32.379799783Z INFO solana_bench_tps::bench] Average TPS: 13596.006
Without:

There are material differences when the selected target is the only remote node in the cluster (eu-west).

With the change:
Without the fix:

The difference is roughly 10x.
@@ -26,12 +26,12 @@ pub const QUIC_CONNECTION_HANDSHAKE_TIMEOUT: Duration = Duration::from_secs(60);

 /// The receive window for QUIC connection from unstaked nodes is
 /// set to this ratio times [`solana_sdk::packet::PACKET_DATA_SIZE`]
-pub const QUIC_UNSTAKED_RECEIVE_WINDOW_RATIO: u64 = 1;
+pub const QUIC_UNSTAKED_RECEIVE_WINDOW_RATIO: u64 = 128;
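For illustration only (not the code from this PR), here is a minimal sketch of how a connection-level receive window that scales linearly with the allowed stream count could be applied through quinn's TransportConfig. The helper name is hypothetical, the constants mirror the values discussed in this thread, and a quinn ~0.10-style API is assumed:

```rust
use std::sync::Arc;

use quinn::{TransportConfig, VarInt};

// Values taken from the discussion in this PR; the helper itself is illustrative.
const PACKET_DATA_SIZE: u64 = 1232;
const QUIC_UNSTAKED_RECEIVE_WINDOW_RATIO: u64 = 128;
const QUIC_MAX_UNSTAKED_CONCURRENT_STREAMS: u64 = 128;

/// Build a transport config whose connection-level receive window scales
/// linearly with the number of concurrent streams a peer may open.
fn unstaked_transport_config() -> Arc<TransportConfig> {
    let mut config = TransportConfig::default();
    // Each stream may carry one packet's worth of data ...
    config.stream_receive_window(VarInt::from_u64(PACKET_DATA_SIZE).unwrap());
    // ... and the connection-level window covers all allowed streams at once,
    // so no stream is starved by connection-level flow control.
    config.receive_window(
        VarInt::from_u64(QUIC_UNSTAKED_RECEIVE_WINDOW_RATIO * PACKET_DATA_SIZE).unwrap(),
    );
    config.max_concurrent_uni_streams(
        VarInt::from_u64(QUIC_MAX_UNSTAKED_CONCURRENT_STREAMS).unwrap(),
    );
    Arc::new(config)
}
```

With the old ratio of 1, the connection-level window equaled a single packet, so up to 128 allowed streams had to share one packet's worth of credit; scaling the window with the stream count removes that contention.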
Not very familiar with how we use QUIC, so forgive my long and possibly out-of-scope questions. I'm trying to understand the effect these constants have.

This means an unstaked peer can send 128 * PACKET_DATA_SIZE bytes before being blocked, right? PACKET_DATA_SIZE will not include IP or QUIC header bytes, so this means it would be slightly less than 128 max-length transactions?

This is a big jump from 1 to 128 for unstaked peers. Does this 128x the amount of unstaked traffic allowed by the quic-server? Are there concerns about overwhelming the server?

When does this receive window get reset? i.e. the thin-client is spamming transactions all the time during bench-tps, so I'd expect it to nearly always have more than 128 txs in-flight. Would they get reset on acknowledgement?

What does that mean for my packets if, say, 1024 packets arrive very close in time to each other? Probably assume the sw quic server would not be fast enough to respond before all packets physically arrive. Would the first 128 be accepted and the rest just dropped?
Correct, the receive window is set linearly to the number of streams allowed. Each connection is allowed to consume stream count limit x PACKET_DATA_SIZE, which amounts to about 128 * 1232 ≈ 150KB at maximum for an unstaked connection. We allow 500 unstaked connections, which comes to roughly 75MB max across all unstaked connections. I think that is a reasonable number.

In any case, my thinking is that the number of streams and the receive window ratio should be set linearly to each other. Having a large number of streams with a small receive window can 1. reduce throughput because of fragmentation and 2. actually increase both CPU and memory load: for CPU, the client and server need to maintain state and keep polling these blocked streams, and for memory, we are buffering the data in the quic streamer anyway until the full stream data is received. So even in the worst-case scenario I do not think this increases the maximum memory consumption. The receive window is dynamically adjusted by the QUIC protocol layer as data is received and consumed -- it is a sliding window.
A good read explaining this mechanism is https://docs.google.com/document/d/1F2YfdDXKpy20WVKJueEf4abn_LVZHhMUMS5gX6Pgjl4/mobilebasic.
> When does this receive window get reset? i.e. the thin-client is spamming transactions all the time during bench-tps, so I'd expect it to nearly always have more than 128 txs in-flight.
Yes, if a client decides to do so -- we allow at most 128 concurrent streams. That is not a net new change, though.
> What does that mean for my packets if say 1024 packets arrive very close in time to each other? Probably assume the sw quic server would not be fast enough to respond before all packets physically arrive. Would the first 128 be accepted and the rest just dropped?
The QUIC protocol will issue a blocked message on the stream if the receive window is exceeded; the client will be blocked and will need to wait for a WINDOW_UPDATE notification. For a client that maliciously ignores the receive-window mechanism, the connection can be dropped by the server.
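To make that blocking behavior concrete, here is a rough client-side sketch, assuming a quinn ~0.10-style async API and an already-established connection (the function, error messages, and timeout value are illustrative, not the actual quic-client code): when flow-control credit is exhausted, write_all simply waits for the window to be extended, and a caller-imposed timeout is what turns a long stall into a visible send failure.

```rust
use std::time::Duration;

use quinn::{Connection, WriteError};
use tokio::time::timeout;

/// Send one serialized transaction on its own unidirectional stream.
/// `connection` is assumed to be an already-established quinn connection.
async fn send_transaction(connection: &Connection, tx_bytes: &[u8]) -> anyhow::Result<()> {
    let mut stream = connection.open_uni().await?;
    // If the receive window is exhausted, this write does not fail immediately;
    // it waits until the server extends the window. Wrapping it in a timeout
    // is what turns a long stall into an observable error on the client side.
    match timeout(Duration::from_secs(2), stream.write_all(tx_bytes)).await {
        Ok(Ok(())) => {}
        Ok(Err(WriteError::Stopped(code))) => {
            // The server explicitly stopped the stream.
            anyhow::bail!("stream stopped by peer: {code}");
        }
        Ok(Err(err)) => anyhow::bail!("write failed: {err}"),
        Err(_) => anyhow::bail!("write timed out waiting for flow-control credit"),
    }
    // Finish the stream so the server knows the full payload has arrived.
    stream.finish().await?;
    Ok(())
}
```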
It seems this will increase the receive window for all the streams of a given connection. So, just to understand: this can potentially increase the RAM usage of a given node, is that correct? If so, in the worst case, what's the increase in RAM usage?
In the worst-case scenario, in which malicious clients intentionally hold off finalizing their streams and the streamer fails to pull the data off the Quinn protocol layer, the memory consumed will be:

128 x 1232 x 500 / 1024^2 ≈ 75 MB for unstaked
512 x 1232 x 2000 / 1024^2 ≈ 1203 MB for staked
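The same back-of-envelope arithmetic, written out as a small Rust snippet (the stream and connection limits are the figures quoted in this thread):

```rust
// Back-of-envelope worst-case buffering, using the figures quoted above.
const PACKET_DATA_SIZE: u64 = 1232;

const UNSTAKED_STREAMS_PER_CONN: u64 = 128;
const MAX_UNSTAKED_CONNECTIONS: u64 = 500;

const STAKED_STREAMS_PER_CONN: u64 = 512;
const MAX_STAKED_CONNECTIONS: u64 = 2000;

fn main() {
    let unstaked_bytes = UNSTAKED_STREAMS_PER_CONN * PACKET_DATA_SIZE * MAX_UNSTAKED_CONNECTIONS;
    let staked_bytes = STAKED_STREAMS_PER_CONN * PACKET_DATA_SIZE * MAX_STAKED_CONNECTIONS;
    // Prints roughly 75 MiB for unstaked and 1203 MiB for staked.
    println!("unstaked: {} MiB", unstaked_bytes / (1024 * 1024));
    println!("staked:   {} MiB", staked_bytes / (1024 * 1024));
}
```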
But we are very unlikely to run into this situation, because we actively poll data from Quinn into the streamer's internal buffers. In that case, even if a client intentionally stops finalizing its data, it would not actually increase the net worst-case memory pressure, as the data would have been buffered in the streamer's temporary buffers anyway, which are bounded by stream_per_connection_count * connection_count.
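A rough sketch of what "actively polling data from Quinn into the streamer's internal buffers" looks like, again assuming a quinn ~0.10-style API with illustrative names: it is the act of reading from the RecvStream that hands bytes to the streamer and lets QUIC slide the flow-control window forward.

```rust
use quinn::{Connection, RecvStream};

/// Drain one incoming unidirectional stream into a local buffer.
/// Reading from the stream is what releases flow-control credit back
/// to the client, so the window keeps sliding as long as we keep polling.
async fn drain_stream(mut stream: RecvStream, max_len: usize) -> anyhow::Result<Vec<u8>> {
    let mut buf = Vec::with_capacity(max_len);
    // read_chunk(ordered = true) yields contiguous data as it arrives.
    while let Some(chunk) = stream.read_chunk(max_len, true).await? {
        buf.extend_from_slice(&chunk.bytes);
        if buf.len() >= max_len {
            break; // refuse to buffer more than one packet's worth per stream
        }
    }
    Ok(buf)
}

/// Accept loop: each accepted stream is drained concurrently.
async fn accept_streams(connection: Connection) -> anyhow::Result<()> {
    while let Ok(stream) = connection.accept_uni().await {
        tokio::spawn(async move {
            // 1232 mirrors PACKET_DATA_SIZE from the discussion above.
            let _ = drain_stream(stream, 1232).await;
        });
    }
    Ok(())
}
```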
@ryleung-solana @pgarg66 -- can you please help review this?
Backports to the stable branch are to be avoided unless absolutely necessary for fixing bugs, security issues, and perf regressions. Changes intended for backport should be structured such that a minimum effective diff can be committed separately from any refactoring, plumbing, cleanup, etc. that are not strictly necessary to achieve the goal. Any of the latter should go only into master and ride the normal stabilization schedule.
Problem
In cross-region testing, we found that transactions sent to remote regions sometimes fail with WriteError(Stopped). This is caused by the client being blocked from sending, after a stream is opened, by Quinn's connection-level flow control. Because the receive window is limited and shared among the many streams opened, some streams spend a long time waiting for window capacity to become available. The problem is not as obvious when latency is low, since contending streams usually release their share of the window fast enough before the timeout is reached, but on a wide-area network with longer latency it becomes more severe. Fundamentally, if we allow that many streams to be opened, we need to ensure they have enough receive-window capacity to be processed concurrently.
Summary of Changes
Make the connection receive window ratio scale with the stream count limit.
Without the change, with the following configuration -- 4 nodes in us-west and 1 node in eu-west, running bench-tps tests with the thin client and a duration of more than 300 seconds -- we can observe send-transaction failures in quic_client with WriteError(Stopped). With the fix, the problem is no longer observed.
Fixes #