Performance under tokio multi_thread vs current_thread runtime #1433
Glad to hear it!
Under these conditions, per-packet overhead is likely to dominate, so a more relevant metric might be packets per second per core. For example, you'll likely observe that the bulk benchmark in this repository demonstrates much higher bandwidth due to larger packets.
Your inference that this is due to reduced batching is likely correct. Quinn does not presently employ any logic analogous to Nagle's algorithm: as soon as the driver task is aware of data to be sent, it will be sent. If a system tends to perform many small writes, then in a multithreaded context, to a first approximation, the driver will be woken immediately after each write, whereas in a single-threaded context the driver must wait to be polled, which will tend to lead to greater batching. Supporting evidence for this hypothesis would be differences in packet size. You might also want to try the git HEAD (due to become 0.9 Real Soon Now™️), which has substantially reduced the frequency of ACKs; that's another reason you might see more packets sent regardless of application behavior.

Quinn could have mechanisms to better control this. For example, perhaps we could inject configurable latency into driver task wake-ups triggered by writes. However, expected performance is higher for single-threaded runtimes regardless, due to reduced contention, and such architectures scale better to multiple hosts. An ideal large-scale QUIC deployment involves a QUIC-aware load balancer and/or the preferred address mechanism balancing incoming connections across a large number of endpoints, which may be distributed across any number of hosts. A 4-CPU system might host 4 endpoints, for example (see the sketch below). That said, I understand that smaller deployments might not want to deal with that level of complexity. There is some low-hanging fruit for improving Quinn's ability to parallelize processing of many active connections on a single endpoint.
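A minimal sketch of that endpoint-per-core layout, assuming the 0.9-era `Endpoint::server`/`accept` API. Each endpoint gets its own port and its own single-threaded runtime; a QUIC-aware load balancer (not shown) would spread connections across the ports, and `make_server_config` is a placeholder for certificate/crypto setup:

```rust
use std::net::SocketAddr;

fn main() {
    let cores = 4; // e.g. one endpoint per CPU on a 4-CPU host
    let mut handles = Vec::new();
    for i in 0..cores {
        handles.push(std::thread::spawn(move || {
            // One single-threaded runtime per core: no work stealing, so a
            // connection's driver and its stream tasks stay on one thread.
            let rt = tokio::runtime::Builder::new_current_thread()
                .enable_all()
                .build()
                .unwrap();
            rt.block_on(async move {
                let addr: SocketAddr = format!("0.0.0.0:{}", 5000 + i).parse().unwrap();
                let endpoint = quinn::Endpoint::server(make_server_config(), addr).unwrap();
                while let Some(connecting) = endpoint.accept().await {
                    tokio::spawn(async move {
                        if let Ok(_connection) = connecting.await {
                            // ... per-connection logic ...
                        }
                    });
                }
            });
        }));
    }
    for h in handles {
        h.join().unwrap();
    }
}

// Placeholder: certificate/crypto setup elided.
fn make_server_config() -> quinn::ServerConfig {
    unimplemented!()
}
```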
Many different threads writing on the same connection is indeed a contention hazard. This architecture was motivated by simplicity and the expectation that highly scalable systems will tend to be endpoint-per-core anyway. I think more fine-grained …
Are those "useful packets" (carry more user-data like non-retransmitted stream frames) or "just more packets" (like stream frames which would eitherwise be coalesced into single packets get now transmitted separately, or just more ACK packets). Without that knowledge, it's really hard to say whether 3x is a good thing or a bad thing.
Is that the total consumed CPU? That's surprising! If you send more packets for the same amount of user data, it should actually require much more CPU, because the most costly operations are the networking syscalls. If your metric is the average load per core, however, it might be lower, since the multithreaded runtime is able to spread load a bit between cores (and "a bit" isn't actually much: it will mostly make some crypto operations run on a different thread than networking, and since networking dominates so heavily, it won't scale much beyond that).
Hey @shaun-cox, if your app is throughput-centric, is it possible for you to accumulate the small writes at the app layer before passing them to the Quinn API?
I don't suspect the presence of retransmitted frames, though I can double-check at some point when I get back to perf testing. We're on a pretty clean internal network with >=10GbE adapters/switches, and the offered load is itself paced, so I doubt there is any bursting of sufficient size to overflow a switch buffer. The only difference between the runs is the choice of single- or multi-threaded tokio runtime, so if that alone caused retransmitted frames, that would also be curious. I would suspect more ACK frames, though.
Yes, that is total CPU consumption.
Our app is a router. As such, it's driven by whatever is received from the network (small messages), which it sends somewhere else. We are very latency-sensitive, so adding any artificial delay in the hope of accumulating a larger batch is fundamentally at odds with our goals. Whatever is received on a given QUIC stream is generally routed to the same place(s), so to the extent we could keep these receives affinitized to the same core/queue, and not have them work-stolen by any available core, it would help (I think, based on prior experience) for the batch to form there before we lump-send everything available in that poll to the routed send queues. Something like the coalescing sketch below would fit.
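A minimal sketch of that kind of no-added-latency coalescing, assuming a `quinn::SendStream` fed by a tokio mpsc channel; the function name and channel setup are illustrative. `try_recv` drains only chunks that have already accumulated, so no artificial delay is introduced:

```rust
use tokio::sync::mpsc;

async fn coalescing_writer(
    mut rx: mpsc::Receiver<Vec<u8>>,
    mut stream: quinn::SendStream,
) -> Result<(), quinn::WriteError> {
    let mut buf: Vec<u8> = Vec::with_capacity(64 * 1024);
    // Wait only for the first chunk, then drain whatever else has already
    // arrived so it all goes out in a single write (one driver wake-up).
    while let Some(chunk) = rx.recv().await {
        buf.extend_from_slice(&chunk);
        while let Ok(chunk) = rx.try_recv() {
            buf.extend_from_slice(&chunk);
        }
        stream.write_all(&buf).await?;
        buf.clear();
    }
    Ok(())
}
```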
You might want to grab Connection Stats periodically and emit them as metrics or in some other way. That will tell you whether more frames are transmitted, and of which types they are; see the sketch below.
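A minimal sketch, assuming the `Connection::stats()` API available since quinn 0.8; the exact `ConnectionStats` field names (e.g. `frame_tx.acks`) may vary by version:

```rust
use std::time::Duration;

async fn emit_stats_periodically(connection: quinn::Connection) {
    let mut ticker = tokio::time::interval(Duration::from_secs(10));
    loop {
        ticker.tick().await;
        let stats = connection.stats();
        // ConnectionStats derives Debug; a coarse dump is enough to compare
        // ACK vs. STREAM frame counts between the two runtime flavors.
        println!(
            "frames tx: acks={} stream={} | full: {:?}",
            stats.frame_tx.acks, stats.frame_tx.stream, stats
        );
    }
}
```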
We've built a router of sorts using Quinn, and so far, have been thrilled with how easy it's been to get up and going with Quinn's nice API.
Now we're in the performance-measuring phase, and our metric of interest is "megabits per core": how much data can we move through our router per unit of CPU consumption.
Our use case is not a typical "bulk throughput" scenario where we have all the data up front, need to flow it all quickly, and measure goodput. Our scenario is small packets sent by clients at paced intervals (e.g. 160-byte chunks sent every 2.5ms, such as an audio codec's output). The router receives many streams of this nature and simply relays the chunks out on some other QUIC send streams.
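A minimal sketch of that per-client offered load, assuming a `quinn::SendStream` on an established client connection (names are illustrative):

```rust
use std::time::Duration;

async fn paced_sender(mut stream: quinn::SendStream) -> Result<(), quinn::WriteError> {
    let chunk = [0u8; 160]; // stand-in for one codec frame
    let mut ticker = tokio::time::interval(Duration::from_micros(2500)); // 2.5ms pacing
    loop {
        ticker.tick().await;
        stream.write_all(&chunk).await?;
    }
}
```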
The surprising anomaly we've discovered is that, for identical offered load to the router, if we switch from `#[tokio::main(flavor = "multi_thread", worker_threads = 4)]` to `#[tokio::main(flavor = "current_thread")]`, we see a reduction in CPU consumed of about 71% (1172 millicores to 335 millicores).

At first, we thought this was mainly due to the appearance of `Mutex::lock_contended` as the top-most hit showing in `perf`, but upon closer inspection, we've also noticed that when using the multi-threaded tokio runtime, Quinn ends up generating about 3x the number of outgoing UDP datagrams from the router.

We're currently able to achieve only about 466 Mbits/core in this single-threaded runtime scenario, which seems pretty low. We'd envisioned running pods with a 4-core CPU limit and getting higher throughput, but if we do, we only achieve 133 Mbits/core, which is far less efficient and more costly.

We're wondering if anyone can offer an explanation for the 3x increase in outgoing UDP datagrams just from switching runtimes. We're assuming it has to do with fewer opportunities for internal batching, i.e. fewer chances to realize that multiple stream frames or QUIC packets can be coalesced into the same UDP datagram?
We're also curious about the choice of having to acquire the `Mutex` protecting `state` in `quinn::ConnectionInner` for every `poll_read` and `poll_write` on any stream associated with a connection. Isn't this an immediate scalability killer if the tasks that read/write Quinn streams get spread out across the multi-threaded runtime's thread pool?

Thanks.