Reduce Windows syscall overhead #882
Here's a profile I created from running the benchmark. I expanded mainly one of the two endpoints, so the overall utilizations are roughly double that.
The actual Quinn logic for packet ingestion accounts for only a small share. Crypto (ChaCha20) shows up separately and is about 6% - far from being a bottleneck. Hottest individual functions:
An interesting finding: I added some very basic stats to count the number of transmitted packets and their sizes. For the 128kB transmission benchmark on Linux I get the following values:
However, on Windows I get these:
It seems like on Windows roughly 50% more packets are sent, and their sizes are smaller. I'm not sure whether those are retransmits, or whether a smaller MTU is used. More stats would be helpful.
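For reference, "very basic stats" on the transmit path can be as simple as the sketch below. The `UdpTransmitStats` type and where it hooks in are hypothetical, not quinn's actual stats plumbing:

```rust
// A rough sketch of such counters; the type and the place it hooks in are
// hypothetical, not quinn's actual stats plumbing.
#[derive(Default, Debug)]
struct UdpTransmitStats {
    packets: u64,
    bytes: u64,
    max_size: usize,
}

impl UdpTransmitStats {
    /// Record one outgoing datagram.
    fn record(&mut self, datagram_len: usize) {
        self.packets += 1;
        self.bytes += datagram_len as u64;
        self.max_size = self.max_size.max(datagram_len);
    }

    /// Average datagram size; a drop here hints at a smaller effective MTU,
    /// while a higher packet count at equal payload hints at retransmits.
    fn avg_size(&self) -> u64 {
        if self.packets == 0 {
            0
        } else {
            self.bytes / self.packets
        }
    }
}
```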
Something is wrong here:
Sender:
Receiver:
The sender sent 570 packets. However, the receiver transmitted 1131 packets, containing 1128 ACK frames. That sounds like a bit too much.
In another run I observed that the high number of ACKs is also visible on the transmitter:
Receiver:
Transmitter:
However, I'm not sure why the receiver would send 1129 ACKs. It shouldn't need to ACK its own ACKs.
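That intuition matches the QUIC spec's rule that only ack-eliciting packets get acknowledged. A minimal sketch of the rule, with an illustrative `Frame` enum rather than quinn's actual types:

```rust
// Illustrative frame type, not quinn's; the rule itself is from the QUIC
// spec: packets containing only ACK, PADDING, or CONNECTION_CLOSE frames
// are not ack-eliciting and must not be acknowledged themselves.
enum Frame {
    Ack,
    Padding,
    ConnectionClose,
    Stream, // stands in for all ack-eliciting frame kinds
}

fn is_ack_eliciting(frames: &[Frame]) -> bool {
    frames
        .iter()
        .any(|f| !matches!(f, Frame::Ack | Frame::Padding | Frame::ConnectionClose))
}

// A receiver only schedules an ACK for ack-eliciting packets, which is why
// it should never end up ACKing the peer's ACK-only packets.
fn needs_ack(received_frames: &[Frame]) -> bool {
    is_ack_eliciting(received_frames)
}
```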
I found that part of the issue is due to flow control handling. #880 triggers the receiver not to respond immediately (see quinn/quinn-proto/src/connection/mod.rs, line 2789 in bd14aa1). This will lead the connection to wake up and start to create packets (see quinn/quinn-proto/src/connection/spaces.rs, line 115 in bd14aa1). However, no MAX_DATA frame will be written, due to the check at quinn/quinn-proto/src/connection/streams.rs, line 485 in bd14aa1. Therefore what is created ends up being an ACK-only packet. When applying #880 and commenting out that check, I get the following stats:
Stats before the change
Receiver:
Sender:
Stats after the change
Receiver:
Sender:
We can observe that the amount of packets sent from receiver to transmitter is significantly reduced.
Fix
I think the proper fix for this is to apply something similar to the change described above. An alternative fix could be to check after packet creation whether it ended up being an ACK-only packet.
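A minimal sketch of that alternative fix, assuming illustrative types (`Frame`, `BuiltPacket`) rather than quinn's real packet builder:

```rust
// Illustrative types, not quinn's actual packet builder.
enum Frame {
    Ack,
    Padding,
    MaxData,
    Stream,
}

struct BuiltPacket {
    frames: Vec<Frame>,
}

/// Returns false when the packet ended up carrying nothing but an ACK (plus
/// padding) even though no ACK was actually due, so the caller can discard
/// it instead of transmitting.
fn worth_transmitting(packet: &BuiltPacket, ack_explicitly_due: bool) -> bool {
    let ack_only = packet
        .frames
        .iter()
        .all(|f| matches!(f, Frame::Ack | Frame::Padding));
    !ack_only || ack_explicitly_due
}
```

The design question is whether an ACK was explicitly scheduled; if it was, the packet should still go out even when it carries nothing else.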
Great find! So this isn't actually Windows-specific, then?
Yes and no. There are some general bottlenecks regarding sending too many ACK frames. My guess is that this happens because sending UDP datagrams on Windows is super slow at the moment (it takes up more than 25% of the time according to the CPU profile above). As a result, the receiver will likely get all datagrams one by one, and never receive multiple in a batch. Then each datagram gets acknowledged individually.
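The batching effect described here can be sketched as follows; `AckState` is a hypothetical stand-in for quinn's pending-ACK bookkeeping:

```rust
// A sketch of why batched receives reduce ACK traffic.
#[derive(Default)]
struct AckState {
    pending: Vec<u64>, // packet numbers received but not yet acknowledged
}

impl AckState {
    fn on_packet(&mut self, packet_number: u64) {
        self.pending.push(packet_number);
    }

    /// Emits one ACK covering everything pending. Called once per batch
    /// instead of once per datagram, this is what cuts the packet count.
    fn flush(&mut self) -> Option<Vec<u64>> {
        if self.pending.is_empty() {
            return None;
        }
        Some(std::mem::take(&mut self.pending))
    }
}

fn process_batch(acks: &mut AckState, packet_numbers: &[u64]) {
    for &pn in packet_numbers {
        acks.on_packet(pn);
    }
    // If datagrams arrive one by one, this runs per packet and we send
    // roughly one ACK per packet; with real batches, one ACK covers them all.
    let _ack_frame = acks.flush();
}
```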
I did a bit more research and experimentation, and got the Windows version up to 120MB/s. The main issue is mio/tokio's use of IOCP: it will only ever enqueue a single write at once, and then report being blocked (not write ready) until that write returns, which requires a notification through the IO completion port. Thereby every single UDP datagram write requires a full roundtrip through the complete tokio and quinn event loops. There are two things that can be improved here. With this change I now also see a lot fewer ACK packets being transmitted by the receiving end of the connection, since it will finally get more than one datagram in one batch and can ACK them all at once. The profile changes with the latter are also interesting. The move towards AFD poll in tokio 0.3 / mio 0.7 might also help with this.
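A sketch of the "keep writing until WouldBlock" idea, using the modern tokio 1.x nonblocking API (which didn't exist in tokio 0.2; `flush_queue` and the queue shape are illustrative):

```rust
// Drain the transmit queue with nonblocking sends. Each successful
// try_send_to avoids a full IOCP/event-loop roundtrip; we only fall back
// to waiting for writability when the socket actually pushes back.
use std::collections::VecDeque;
use std::io;
use tokio::net::UdpSocket;

async fn flush_queue(
    socket: &UdpSocket,
    queue: &mut VecDeque<(Vec<u8>, std::net::SocketAddr)>,
) -> io::Result<()> {
    while let Some((buf, dest)) = queue.front() {
        match socket.try_send_to(buf, *dest) {
            Ok(_) => {
                queue.pop_front();
            }
            Err(e) if e.kind() == io::ErrorKind::WouldBlock => {
                // Only now do we take the expensive readiness roundtrip.
                socket.writable().await?;
            }
            Err(e) => return Err(e),
        }
    }
    Ok(())
}
```

The receive side benefits the same way: draining with `try_recv_from` until WouldBlock delivers multiple datagrams per wakeup, which is what enables the batched ACKs mentioned above.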
Yeah, let's not spend much time on things that are specific to tokio 0.2 / mio 0.6.
For which peer? With properly relaxed ACKs and flow control updates, that's expected for the sending side, since indeed the number of outgoing packets dwarfs the number of incoming ones.
You are right, that was about the sender. On the receiving endpoint side things are more balanced. Times spent in UDP transfers:
Sender:
Receiver:
On Windows, we're currently relying on naive send/recv using tokio's built-in functions. This is much less efficient than the batched syscalls we use on Unix platforms, severely compromising performance when CPU bound. msquic uses e.g. WSASendMsg and WSARecvMsg, so those are probably a good target. It's unclear how well these fit with tokio; some effort may be needed there. See also tokio-rs/tokio#2968.
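For contrast, the batched Unix path boils down to something like the following: one sendmmsg(2) call (Linux-only, via the `libc` crate) submitting a whole batch of datagrams in a single syscall. This is a minimal sketch with error handling elided, not quinn's actual implementation:

```rust
// Linux-only sketch: transmit a batch of datagrams on a *connected* UDP
// socket with a single sendmmsg(2) syscall (no per-message address is set).
use std::os::unix::io::RawFd;

fn send_batch(fd: RawFd, datagrams: &[&[u8]]) -> std::io::Result<usize> {
    // One iovec per datagram, each pointing at that datagram's bytes.
    let mut iovecs: Vec<libc::iovec> = datagrams
        .iter()
        .map(|d| libc::iovec {
            iov_base: d.as_ptr() as *mut _,
            iov_len: d.len(),
        })
        .collect();

    // One mmsghdr per datagram, wrapping the corresponding iovec.
    let mut msgs: Vec<libc::mmsghdr> = iovecs
        .iter_mut()
        .map(|iov| {
            let mut hdr: libc::mmsghdr = unsafe { std::mem::zeroed() };
            hdr.msg_hdr.msg_iov = iov;
            hdr.msg_hdr.msg_iovlen = 1;
            hdr
        })
        .collect();

    // The whole batch goes out in one syscall; the return value is the
    // number of messages actually sent.
    let sent = unsafe { libc::sendmmsg(fd, msgs.as_mut_ptr(), msgs.len() as u32, 0) };
    if sent < 0 {
        Err(std::io::Error::last_os_error())
    } else {
        Ok(sent as usize)
    }
}
```

The Windows equivalents would be built on WSASendMsg/WSARecvMsg as noted above, though Windows has no direct multi-message counterpart, which is part of why the syscall overhead question is harder there.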