allocate SocketAsyncEngine less frequently to reduce the number of epoll_wait threads #2346
Conversation
…ll_wait threads and get Plaintext TechEmpower benchmark RPS +5%
For plaintext, this change is effectively just changing the system to use a single epoll thread, correct? Can you please sync with @VSadov and @tmds? They experimented separately with just setting it to 1, always. Example of why we didn't go ahead with it: It's possible your solution mitigates that, because it's using 1 in the cases of fewer connections but more when there are many, which is the case that regressed. It'd be good to drive it to ground, though.
@VSadov @stephentoub @adamsitnik @benaadams I'm working on a repo to try out some things with the Linux socket async implementation: https://github.com/tmds/Tmds.LinuxAsync. I want to see if I can fit some of the kestrel-linux-transport implementation into the SocketAsyncEngine. Part of that would mean epoll threads are doing batched reads and writes, which gives them something more useful to do. It's a WIP. The repo may make it simpler to do some experimentation yourself. Also, the ASP.NET Core Sockets transport does some rescheduling of Sockets stuff using IOQueue. I know it enhances performance, but I don't know how. Maybe it's a perf enhancement we can bring down to the Socket level. For the number of SocketAsyncEngines, we need to decide based on different benchmark results. A single plaintext benchmark isn't enough to decide (cf. dotnet/corefx#36693 (comment)).
This is correct.
This is not correct. The maximum number of engines is the same as before:

runtime/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncEngine.Unix.cs Lines 61 to 73 in 3086dcf
With my change we don't allocate a new `SocketAsyncEngine` until the existing engines are already tracking a large number of sockets. Some examples (illustrated by the sketch below):
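A hypothetical, simplified model of the allocation policy, with illustrative constants (this is not the actual runtime code; `EnginesFor` and the exact rounding are made up to show the shape of the rule):

```csharp
using System;

class EngineAllocationModel
{
    // Illustrative values: after this change the threshold equals EventBufferCount,
    // the batch size passed to epoll_wait (1024 in the snippets referenced below).
    const int MinHandlesForAdditionalEngine = 1024;
    const int MaxEngineCount = 6; // e.g. ProcessorCount / 2 on a 12-thread machine

    // A new engine (and its epoll_wait thread) appears only once the existing
    // engines are already tracking MinHandlesForAdditionalEngine sockets each.
    static int EnginesFor(int totalSockets) =>
        Math.Min(MaxEngineCount, totalSockets / MinHandlesForAdditionalEngine + 1);

    static void Main()
    {
        Console.WriteLine(EnginesFor(256));   // 1 -> a typical TechEmpower run
        Console.WriteLine(EnginesFor(2048));  // 3 -> engines grow with load
        Console.WriteLine(EnginesFor(65536)); // 6 -> capped, same as before
    }
}
```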
I've run the Json and FortunesRaw benchmarks that you have mentioned, using a machine with the following spec. And it looks like JSON got +11% RPS. All numbers can be found here. Currently, the citrine load machine is down (cc @sebastienros) so I can't run against real TechEmpower hardware; as soon as it's up again I am going to run all the benchmarks and share the updated results here.
It is correct. I was talking about the number of connections in that previous test (65K).
Rather than queuing each non-user-code callback to the ThreadPool, each thread has its own IOQueue which it puts all the callbacks on, and the IOQueue puts itself on the ThreadPool as a single item. This brings down the contention on the ThreadPool and operates in batches. However, it doesn't use this for user-code callbacks; as it's unknown whether they have a blocking call in them which would block the entire batch, they are queued to the ThreadPool directly.
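A minimal sketch of that batching pattern (not the actual Kestrel IOQueue source; `IOQueueSketch` is a made-up name): callbacks are queued locally, and the queue schedules *itself* to the ThreadPool as a single work item, so N callbacks cost one ThreadPool enqueue per batch.

```csharp
using System;
using System.Collections.Concurrent;
using System.Threading;

public sealed class IOQueueSketch : IThreadPoolWorkItem
{
    private readonly ConcurrentQueue<(Action<object> Callback, object State)> _work =
        new ConcurrentQueue<(Action<object>, object)>();
    private int _scheduled; // 0 = idle, 1 = already queued to the ThreadPool

    public void Schedule(Action<object> callback, object state)
    {
        _work.Enqueue((callback, state));

        // Only one ThreadPool work item is in flight, however many callbacks arrive.
        if (Interlocked.CompareExchange(ref _scheduled, 1, 0) == 0)
        {
            ThreadPool.UnsafeQueueUserWorkItem(this, preferLocal: false);
        }
    }

    void IThreadPoolWorkItem.Execute()
    {
        // Drain the whole batch on one ThreadPool thread.
        while (_work.TryDequeue(out var item))
        {
            item.Callback(item.State); // assumed short-lived and non-blocking
        }

        Volatile.Write(ref _scheduled, 0);

        // Re-schedule if new work raced in after the drain loop observed "empty".
        if (!_work.IsEmpty && Interlocked.CompareExchange(ref _scheduled, 1, 0) == 0)
        {
            ThreadPool.UnsafeQueueUserWorkItem(this, preferLocal: false);
        }
    }
}
```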
Can the change affect non-pipelined mode benchmarks?
Which threads are the source threads providing the work? Are they also ThreadPool threads? I expect that to be the case on Linux. But batching from ThreadPool to ThreadPool seems weird?
Yes.
I had picked a low value for `MinHandlesForAdditionalEngine`. I wonder why …
@benaadams is right that part of the reason for using IOQueues in Kestrel's Socket transport is to batch calls to ThreadPool.UnsafeQueueUserWorkItem (UQUWI) when we know that each callback is a short-lived, non-blocking continuation that moves data between System.Net.Sockets and System.IO.Pipelines.

Another reason IOQueues were added to the Kestrel Socket transport was to get better fairness on Windows. SocketConnection has a receive loop, so almost as soon as a call to ReceiveAsync completes, we schedule another call to ReceiveAsync for the same connection to an IOQueue. Before, when we scheduled the ReceiveAsync continuations directly to the ThreadPool, I noticed that more recently scheduled continuations would be executed before previously scheduled continuations (LIFO instead of FIFO). This was problematic, particularly for non-pipelined benchmarks like TechEmpower's JSON benchmark, because the next HTTP request would generally still be in flight when the next call to ReceiveAsync/WSARecv executed, leading to the IOCP async slow path (WSA_IO_PENDING).

By introducing the IOQueues (which use ConcurrentQueues underneath the covers), I was able to reduce contention by creating multiple queues segmented by connection (instead of dispatching everything to the global queue, as you would get if I called ThreadPool.UQUWI with preferLocal=false, which I don't think existed at the time I added the IOQueues anyway) while still maintaining fairness, so that by the time ReceiveAsync is called, the next request has already arrived. My tracing verified that this allowed WSARecv to return 0 for immediate success instead of indicating WSA_IO_PENDING a much greater proportion of the time. I also noticed that the standard deviation for latency reported by wrk dropped dramatically after introducing IOQueues, which makes sense given that it made the scheduling more fair, preventing connections from being starved of ThreadPool resources.

I tried to use the same approach on Linux thinking similar principles would apply, but dispatching to the IOQueue on Linux hurt performance (see aspnet/KestrelHttpServer#2578) compared to running the ReceiveAsync continuations inline, because we were effectively double dispatching on Linux when we started using IOQueues. I plan to try running the ReceiveAsync continuations inline on both Windows and Linux, write the data to the PipeWriter, then dispatch the next call to ReceiveAsync to an IOQueue. I think this could reduce latency while keeping fairness, potentially improving perf even on Linux. I just haven't gotten around to testing this approach yet.
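The plan in that last paragraph might look roughly like this (a sketch under my reading of the comment, reusing the `IOQueueSketch` type from above; `ReceiveLoopSketch`, `YieldToQueueAsync`, and the buffer size are made up, and the real code lives in Kestrel's SocketConnection):

```csharp
using System;
using System.IO.Pipelines;
using System.Net.Sockets;
using System.Threading.Tasks;

static class ReceiveLoopSketch
{
    // Hypothetical helper: complete a Task when the IOQueue gets to us.
    static Task YieldToQueueAsync(IOQueueSketch queue)
    {
        var tcs = new TaskCompletionSource<bool>(TaskCreationOptions.RunContinuationsAsynchronously);
        queue.Schedule(s => ((TaskCompletionSource<bool>)s).SetResult(true), tcs);
        return tcs.Task;
    }

    public static async Task ReceiveLoopAsync(Socket socket, PipeWriter output, IOQueueSketch ioQueue)
    {
        while (true)
        {
            // Run the ReceiveAsync continuation inline and write into the pipe...
            Memory<byte> buffer = output.GetMemory(2048);
            int read = await socket.ReceiveAsync(buffer, SocketFlags.None);
            if (read == 0) break;

            output.Advance(read);
            await output.FlushAsync();

            // ...then dispatch *the next* ReceiveAsync via the IOQueue. By the
            // time this connection's turn comes around again, the next request
            // has usually arrived, so the recv completes synchronously instead
            // of taking the async slow path (WSA_IO_PENDING on Windows).
            await YieldToQueueAsync(ioQueue);
        }
        output.Complete();
    }
}
```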
Yes, the source is the ThreadPool. While batching from the ThreadPool may sound weird, this gives us an easy way to coalesce reads and writes in a way that is fair and should give us better cache locality since on a given thread we're doing a bunch of I/O in a tight loop rather than going from receiving-to-parsing-to-running-app-code-to-encoding-to-writing-to-receiving-again-when-the-next-request-hasn't-arrived-yet.
The work was evenly distributed, but it was not required, as there was very little work. For async reads and writes, handling the events is just about scheduling work on the thread pool, which compared to the other things we do here is quite cheap (I've measured that). If the event loop were actually performing the reads and writes on its own thread, then even distribution would be very welcome. According to the profile data that I've captured, every event loop thread was on average active for 1/6 of the time and blocked for 5/6. With this change, we have fewer event loop threads and we can use the saved resources to do the actual work (JSON serialization, socket reads & writes, etc.). In the case of TechEmpower benchmarks with fewer than 1024 connections, it's a single thread. On average, it's busy for half of the time and blocked for the other half, so we could still handle more connections on a single thread.
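For readers less familiar with this layer, here is a compilable schematic of what each engine's event-loop thread does for async reads and writes, per the description above. `EpollWait` and the other members are stand-ins, not the real interop or API names from SocketAsyncEngine.Unix.cs:

```csharp
using System;
using System.Threading;

sealed class EventLoopSketch
{
    const int EventBufferCount = 1024; // max events drained per epoll_wait call

    struct SocketEvent { public object Context; }

    // Stand-in for the epoll_wait syscall wrapper; blocks until events arrive.
    int EpollWait(SocketEvent[] events) => throw new NotImplementedException();

    // Stand-in for SocketAsyncContext-style completion handling.
    static void HandleReadyEvents(object context) { /* complete/schedule ops */ }

    public void Run()
    {
        var events = new SocketEvent[EventBufferCount];
        while (true)
        {
            // Blocks; this is the ~5/6 "white space" time in the profiles.
            int count = EpollWait(events);

            for (int i = 0; i < count; i++)
            {
                // Cheap: just a ThreadPool dispatch. The actual read()/write()
                // happens later on a ThreadPool thread, not on this thread.
                ThreadPool.UnsafeQueueUserWorkItem(HandleReadyEvents, events[i].Context);
            }
        }
    }
}
```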
I currently can't answer this question because our benchmarking infra is temporarily producing broken trace files.
I did not; the profiles that I have captured told me that even with 1024 connections the event loop thread is still not busy 100% of the time (under a high, artificial TechEmpower load). Also, when I started working on this task I read some docs and part of The Linux Programming Interface book (to get a better understanding, as I am a Linux newbie). The author of the book ran an interesting experiment where he performed 100 000 monitoring calls with select, poll, and epoll while varying the number of watched file descriptors: epoll's cost stays nearly flat as the count grows, while select and poll degrade badly. Actually, the only thing that stopped me from increasing this value is the fact that there can be blocking synchronous reads and writes performed on this thread.
Yes, it's not the time spent in epoll_wait that would be the concern; it's, once you wake up from the wait, the delay in processing additional events while processing previous ones. If you get 1000 events to process on one thread, or 500 events on each of two threads, in theory you can process the 500 events on each thread concurrently in less time, assuming you have the system resources to actually do so in parallel. Regardless, if every benchmark you've run has shown this is an improvement, and you've run every benchmark in every configuration we care about, I'm happy with the change.
I'm wondering if PlainTextNonPipelined would behave better on the 14 core machine if it had an additional engine.
👍 this is about running the proper benchmarks. On the topic of proper benchmarks, we should have a benchmark that measures latency vs throughput.
There aren't any blocking operations on this thread afaik.
@halter73 thanks for explaining what the IOQueues are for. This was a mystery to me.
Is this fairness issue Windows specific? I guess on Linux most of this is implemented in the runtime and works FIFO?
Part of #14304 is a property to allow running continuations directly on the epoll thread.
Coalescing of writes is definitely something that happens thanks to the IOQueues. Reading the output data gets delayed to the IOQueue, which means more data can be available, which leads to fewer syscalls.
Mostly with the default number of connections (256), but I've tested some TechEmpower combinations (16 | 32 | 64 | 128 | 256 | 512) and bigger ones like 1000 or 2000, and the proposed change always gives better numbers. (As expected, the number of engines for a very large number of connections will be the same as before; it changed only for < 1024.)
We already measure latency. @sebastienros can we spot a latency change in the PowerBI reports?
You are right, the dispatch method can only signal a waiting thread to continue processing. For some reason, I thought that it could execute a blocking call.

runtime/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncContext.Unix.cs Lines 272 to 273 in 15b0e3d
So on the 6 core machine, engines went from 6 (ProcessorCount / 2) to 1. PlainTextNonPipelined had the best improvement on the 6 core (+12%), and the least improvement on the 14 core (+2%). It is the most Socket-heavy benchmark. Maybe it would perform better with 2 engines.
Aside: for Plaintext they do go up to 16k connections and publish the results in their data tables |
For the high connection counts, the number of engines will be the same as before.
Yes. From official TE, that is the most Socket-intensive one.
From the perspective of where CPU time is spent, the db benchmarks spend less of it in Sockets. So changes made to Sockets/Transport show up more in Json/PlainTextNonPipelined. That said, the db benchmarks (and the others) need to look good with these changes too.
@adamsitnik to elaborate on this: there is a lot of focus on how much RPS the TE benchmarks achieve. Also important, maybe even more so, is latency (what shows up in Ben's screenshots). And all these benchmarks are run at 100% CPU. That is not how CPU is loaded in production; behavior could be different at lower CPU, because the scheduler isn't confronted with an impossible job. That's why we need to have a good set of benchmarks (TE and 'others').
@tmds I agree that latency is very important. It's one of the metrics that we track (RPS, Latency, Memory, Startup Time). As soon as #31869 is fixed and the ASP.NET benchmarks work again, we should see the updated numbers for RPS and Latency in https://msit.powerbi.com/view?r=eyJrIjoiYTZjMTk3YjEtMzQ3Yi00NTI5LTg5ZDItNmUyMGRlOTkwMGRlIiwidCI6IjcyZjk4OGJmLTg2ZjEtNDFhZi05MWFiLTJkN2NkMDExZGI0NyIsImMiOjV9. I am also going to include these metrics in my next PRs.
I have been investigating the performance of sockets on Linux using the TechEmpower Plaintext benchmark and a machine from the ASP.NET lab with the following spec:
E5-1650 v3 @ 3.50 GHz w/ 15 MB cache, 6 cores / 12 threads, 32 GB of RAM
I've collected some PerfCollect traces and converted them to speedscope to be able to find out exactly what every thread was doing over time (the alternative was to somehow convert the PerfCollect trace file to an .etl file and use Concurrency Visualizer).
What I've noticed was that for the machine with 6 cores (12 logical threads with Hyper-Threading), there were 6 `epoll_wait` threads:

runtime/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncEngine.Unix.cs Lines 61 to 65 in 4f9ae42
The problem was that most of these threads were inactive most of the time, spending much more time waiting than handling new epoll events (white space in the chart below means a lack of CPU samples for a given thread at a given time; in this example it means a blocking call to epoll_wait).
`epoll_wait` was designed to be able to watch MANY file descriptors for changes. I decided to change the value of the `MinHandlesForAdditionalEngine` constant, which is used by the method that allocates a new `SocketAsyncEngine` when the number of sockets in the already created engines is high (each `SocketAsyncEngine` has its own `epoll_wait` thread):

runtime/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncEngine.Unix.cs Lines 140 to 147 in 4f9ae42
I've used the number which is passed to `epoll_wait` as the maximum number of new IO events:

runtime/src/libraries/System.Net.Sockets/src/System/Net/Sockets/SocketAsyncEngine.Unix.cs Lines 49 to 54 in 4f9ae42
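In other words, the change amounts to something like the fragment below. This is my paraphrase of the snippets referenced above, not the literal diff; treat the old value shown in the comment as approximate:

```csharp
internal static class SocketAsyncEngineConstantsSketch
{
    // The batch size passed to epoll_wait (1024 in the referenced snippet).
    private const int EventBufferCount = 1024;

    // Before (as I read the baseline; the exact old value may differ):
    // private const int MinHandlesForAdditionalEngine = 32;

    // After: reuse the epoll_wait batch size, so an additional engine (and its
    // epoll_wait thread) appears only once an engine already tracks as many
    // sockets as a single epoll_wait call can drain.
    private const int MinHandlesForAdditionalEngine = EventBufferCount;
}
```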
A single epoll thread is doing a much better job, and as you can see, even for such an artificial benchmark as TechEmpower Plaintext it's still waiting for IO events most of the time (so it can scale up to a bigger number of sockets).
Overall it gives around +5% Plaintext RPS using the Citrine machines (the ones that have the same hardware as the TechEmpower lab).
Before:
After: