Reduce Windows syscall overhead #882
Here's a profile I created from running the benchmark. I expanded mainly one of the two endpoints, so the overall utilizations are roughly double that.
The actual Quinn logic for packet ingestion accounts for only a small share. Crypto (ChaCha20) shows up separately and is about 6% - far from being a bottleneck. Hottest individual functions:
An interesting finding: I added some very basic stats to count the number of transmitted packets and their sizes. For the 128kB transmission benchmark on Linux I get the following values:
However, on Windows I get these:
It seems like on Windows roughly 50% more packets are sent, and their sizes are smaller. I'm not sure whether those are retransmits, or whether a smaller MTU is used. More stats would be helpful.
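For reference, "very basic stats" on the transmit path can be as simple as the sketch below. The `UdpTransmitStats` type and where it hooks in are hypothetical, not quinn's actual stats plumbing:

```rust
// A rough sketch of such counters; the type and the place it hooks in are
// hypothetical, not quinn's actual stats plumbing.
#[derive(Default, Debug)]
struct UdpTransmitStats {
    packets: u64,
    bytes: u64,
    max_size: usize,
}

impl UdpTransmitStats {
    /// Record one outgoing datagram.
    fn record(&mut self, datagram_len: usize) {
        self.packets += 1;
        self.bytes += datagram_len as u64;
        self.max_size = self.max_size.max(datagram_len);
    }

    /// Average datagram size; a drop here hints at a smaller effective MTU,
    /// while a higher packet count at equal payload hints at retransmits.
    fn avg_size(&self) -> u64 {
        if self.packets == 0 {
            0
        } else {
            self.bytes / self.packets
        }
    }
}
```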
Something is wrong here:
Sender:
Receiver:
The sender sent 570 packets. However, the receiver transmitted 1131 packets, containing 1128 ACK frames. That sounds like a bit too much.
In another run I observed that the high number of ACKs is also visible on the transmitter:
Receiver:
Transmitter:
However, I'm not sure why the receiver would send 1129 ACKs. It shouldn't need to ACK its own ACKs.
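That intuition matches the QUIC spec's rule that only ack-eliciting packets get acknowledged. A minimal sketch of the rule, with an illustrative `Frame` enum rather than quinn's actual types:

```rust
// Illustrative frame type, not quinn's; the rule itself is from the QUIC
// spec: packets containing only ACK, PADDING, or CONNECTION_CLOSE frames
// are not ack-eliciting and must not be acknowledged themselves.
enum Frame {
    Ack,
    Padding,
    ConnectionClose,
    Stream, // stands in for all ack-eliciting frame kinds
}

fn is_ack_eliciting(frames: &[Frame]) -> bool {
    frames
        .iter()
        .any(|f| !matches!(f, Frame::Ack | Frame::Padding | Frame::ConnectionClose))
}

// A receiver only schedules an ACK for ack-eliciting packets, which is why
// it should never end up ACKing the peer's ACK-only packets.
fn needs_ack(received_frames: &[Frame]) -> bool {
    is_ack_eliciting(received_frames)
}
```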
I found that part of the issue is due to flow control handling. #880 triggers the receiver not to respond immediately (see quinn/quinn-proto/src/connection/mod.rs, line 2789 in bd14aa1). This will lead the connection to wake up and start to create packets (see quinn/quinn-proto/src/connection/spaces.rs, line 115 in bd14aa1). However, no MAX_DATA frame will be written, due to the check at quinn/quinn-proto/src/connection/streams.rs, line 485 in bd14aa1. Therefore what is created ends up being an ACK-only packet. When applying #880 and commenting out that check, I get the following stats:
Stats before the change
Receiver:
Sender:
Stats after the change
Receiver:
Sender:
We can observe that the amount of packets sent from receiver to transmitter is significantly reduced.
Fix
I think the proper fix for this is to apply something similar to the change described above. An alternative fix could be to check after packet creation whether it ended up being an ACK-only packet.
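A minimal sketch of that alternative fix, assuming illustrative types (`Frame`, `BuiltPacket`) rather than quinn's real packet builder:

```rust
// Illustrative types, not quinn's actual packet builder.
enum Frame {
    Ack,
    Padding,
    MaxData,
    Stream,
}

struct BuiltPacket {
    frames: Vec<Frame>,
}

/// Returns false when the packet ended up carrying nothing but an ACK (plus
/// padding) even though no ACK was actually due, so the caller can discard
/// it instead of transmitting.
fn worth_transmitting(packet: &BuiltPacket, ack_explicitly_due: bool) -> bool {
    let ack_only = packet
        .frames
        .iter()
        .all(|f| matches!(f, Frame::Ack | Frame::Padding));
    !ack_only || ack_explicitly_due
}
```

The design question is whether an ACK was explicitly scheduled; if it was, the packet should still go out even when it carries nothing else.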
Great find! So this isn't actually Windows-specific, then?
Yes and no. There are some general bottlenecks regarding sending too many ACK frames. My guess is that this happens because sending UDP datagrams on Windows is super slow at the moment (it takes up more than 25% of the time according to the CPU profile above). As a result, the receiver will likely get all datagrams one by one, and never receive multiple in a batch. Then each datagram gets acknowledged individually.
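The batching effect described here can be sketched as follows; `AckState` is a hypothetical stand-in for quinn's pending-ACK bookkeeping:

```rust
// A sketch of why batched receives reduce ACK traffic.
#[derive(Default)]
struct AckState {
    pending: Vec<u64>, // packet numbers received but not yet acknowledged
}

impl AckState {
    fn on_packet(&mut self, packet_number: u64) {
        self.pending.push(packet_number);
    }

    /// Emits one ACK covering everything pending. Called once per batch
    /// instead of once per datagram, this is what cuts the packet count.
    fn flush(&mut self) -> Option<Vec<u64>> {
        if self.pending.is_empty() {
            return None;
        }
        Some(std::mem::take(&mut self.pending))
    }
}

fn process_batch(acks: &mut AckState, packet_numbers: &[u64]) {
    for &pn in packet_numbers {
        acks.on_packet(pn);
    }
    // If datagrams arrive one by one, this runs per packet and we send
    // roughly one ACK per packet; with real batches, one ACK covers them all.
    let _ack_frame = acks.flush();
}
```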
I did a bit more research and experimentation, and got the Windows version up to 120MB/s. The main issue is mio/tokio's use of IOCP: it will only ever enqueue a single write at once, and then report being blocked (not write ready) until that write returns, which requires a notification through the IO completion port. Thereby every single UDP datagram write requires a full roundtrip through the complete tokio and quinn event loops. There are two things that can be improved here. With this change I now also see a lot fewer ACK packets being transmitted by the receiving end of the connection, since it will finally get more than one datagram in one batch and can ACK them all at once. The profile changes with the latter are also interesting. The move towards AFD poll in tokio 0.3 / mio 0.7 might also help with this.
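A sketch of the "keep writing until WouldBlock" idea, using the modern tokio 1.x nonblocking API (which didn't exist in tokio 0.2; `flush_queue` and the queue shape are illustrative):

```rust
// Drain the transmit queue with nonblocking sends. Each successful
// try_send_to avoids a full IOCP/event-loop roundtrip; we only fall back
// to waiting for writability when the socket actually pushes back.
use std::collections::VecDeque;
use std::io;
use tokio::net::UdpSocket;

async fn flush_queue(
    socket: &UdpSocket,
    queue: &mut VecDeque<(Vec<u8>, std::net::SocketAddr)>,
) -> io::Result<()> {
    while let Some((buf, dest)) = queue.front() {
        match socket.try_send_to(buf, *dest) {
            Ok(_) => {
                queue.pop_front();
            }
            Err(e) if e.kind() == io::ErrorKind::WouldBlock => {
                // Only now do we take the expensive readiness roundtrip.
                socket.writable().await?;
            }
            Err(e) => return Err(e),
        }
    }
    Ok(())
}
```

The receive side benefits the same way: draining with `try_recv_from` until WouldBlock delivers multiple datagrams per wakeup, which is what enables the batched ACKs mentioned above.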
Yeah, let's not spend much time on things that are specific to tokio 0.2 / mio 0.6.
For which peer? With properly relaxed ACKs and flow control updates, that's expected for the sending side, since indeed the number of outgoing packets dwarfs the number of incoming ones.
You are right, that was about the sender. On the receiving endpoint side things are more balanced. Times spent in UDP transfers:
Sender:
Receiver:
On Windows, we're currently relying on naive send/recv using tokio's built-in functions. This is much less efficient than the batched syscalls we use on Unix platforms, severely compromising performance when CPU bound. msquic uses e.g. WSASendMsg and WSARecvMsg, so those are probably a good target. It's unclear how well these fit with tokio; some effort may be needed there. See also tokio-rs/tokio#2968.
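For contrast, the batched Unix path boils down to something like the following: one sendmmsg(2) call (Linux-only, via the `libc` crate) submitting a whole batch of datagrams in a single syscall. This is a minimal sketch with error handling elided, not quinn's actual implementation:

```rust
// Linux-only sketch: transmit a batch of datagrams on a *connected* UDP
// socket with a single sendmmsg(2) syscall (no per-message address is set).
use std::os::unix::io::RawFd;

fn send_batch(fd: RawFd, datagrams: &[&[u8]]) -> std::io::Result<usize> {
    // One iovec per datagram, each pointing at that datagram's bytes.
    let mut iovecs: Vec<libc::iovec> = datagrams
        .iter()
        .map(|d| libc::iovec {
            iov_base: d.as_ptr() as *mut _,
            iov_len: d.len(),
        })
        .collect();

    // One mmsghdr per datagram, wrapping the corresponding iovec.
    let mut msgs: Vec<libc::mmsghdr> = iovecs
        .iter_mut()
        .map(|iov| {
            let mut hdr: libc::mmsghdr = unsafe { std::mem::zeroed() };
            hdr.msg_hdr.msg_iov = iov;
            hdr.msg_hdr.msg_iovlen = 1;
            hdr
        })
        .collect();

    // The whole batch goes out in one syscall; the return value is the
    // number of messages actually sent.
    let sent = unsafe { libc::sendmmsg(fd, msgs.as_mut_ptr(), msgs.len() as u32, 0) };
    if sent < 0 {
        Err(std::io::Error::last_os_error())
    } else {
        Ok(sent as usize)
    }
}
```

The Windows equivalents would be built on WSASendMsg/WSARecvMsg as noted above, though Windows has no direct multi-message counterpart, which is part of why the syscall overhead question is harder there.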