
allocate SocketAsyncEngine less frequently to reduce the number of epoll_wait threads #2346

Merged
merged 2 commits into dotnet:master from fewerEpolls
Feb 5, 2020

Conversation

adamsitnik
Member

I have been investigating the performance of sockets on Linux using the TechEmpower Plaintext benchmark and a machine from the ASP.NET lab with the following spec:

E5-1650 v3 @ 3.50 GHz, 15 MB cache, 6 cores / 12 threads, 32 GB of RAM

I've collected some PerfCollect traces and converted them to speedscope to find out exactly what every thread was doing over time (the alternative was to somehow convert the PerfCollect trace file to an .etl file and use Concurrency Visualizer).

What I noticed was that for the machine with 6 cores (12 logical processors with hyper-threading), there were 6 epoll_wait threads:

private static readonly int s_engineCount =
#if DEBUG
2;
#else
Environment.ProcessorCount >= 6 ? Environment.ProcessorCount / 2 : 1;
#endif

The problem was that most of these threads were inactive most of the time, spending much more time waiting than handling new epoll events (white space in the chart below means a lack of CPU samples for a given thread at a given time; in this example it means a blocking call to epoll_wait).

image

epoll_wait was designed to be able to watch MANY file descriptors for changes.

I decided to change the value of the MinHandlesForAdditionalEngine constant, which is used by the method that allocates a new SocketAsyncEngine when the number of sockets in the already created engines is high (each SocketAsyncEngine has its own epoll_wait thread).

private bool HasLowNumberOfSockets
{
get
{
return IntPtr.Size == 4 ? _outstandingHandles.ToInt32() < MinHandlesForAdditionalEngine.ToInt32() :
_outstandingHandles.ToInt64() < MinHandlesForAdditionalEngine.ToInt64();
}
}
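
For context, here is a rough sketch of the allocation policy this property gates (simplified, not the exact runtime code; GetEngineForNewSocket and s_allocateFromEngine are illustrative names, while s_lock, s_currentEngines, s_engineCount, and HasLowNumberOfSockets are the members quoted elsewhere in this thread):

// Simplified sketch: sockets are assigned to the current engine until it no
// longer has a "low" number of sockets; allocation then advances round-robin
// to the next engine slot, creating a new SocketAsyncEngine (and thus a new
// epoll_wait thread) lazily, up to s_engineCount engines in total.
private static SocketAsyncEngine GetEngineForNewSocket()
{
    lock (s_lock)
    {
        SocketAsyncEngine engine = s_currentEngines[s_allocateFromEngine]
            ?? (s_currentEngines[s_allocateFromEngine] = new SocketAsyncEngine());

        if (!engine.HasLowNumberOfSockets)
        {
            // This engine is "full enough"; subsequent sockets go to the next slot.
            s_allocateFromEngine = (s_allocateFromEngine + 1) % s_engineCount;
        }

        return engine;
    }
}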

I've used the number which is passed to epoll_wait as the maximum number of new IO events:

private const int EventBufferCount =
#if DEBUG
32;
#else
1024;
#endif
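
Concretely, the change raises the threshold for allocating an additional engine from 32 handles to this epoll_wait batch size. As a sketch (the exact declaration in the runtime may carry extra details), the constant ends up looking roughly like:

// Previously (IntPtr)32; now tied to the maximum number of events a single
// epoll_wait call can return (1024 in release builds). Sketch only.
private static readonly IntPtr MinHandlesForAdditionalEngine = (IntPtr)EventBufferCount;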

A single epoll thread is doing a much better job, and as you can see, even for such an artificial benchmark as TechEmpower plaintext it's still waiting for IO events most of the time (so it can scale up to a bigger number of sockets).

image

Overall it gives something around +5% Plaintext RPS using the Citrine machines (the ones that have the same hardware as the TechEmpower lab).

Before:

RequestsPerSecond:           3,781,374
Max CPU (%):                 99
WorkingSet (MB):             81
Avg. Latency (ms):           1.19
Startup (ms):                495
First Request (ms):          35.36
Latency (ms):                0.11
Total Requests:              57,097,881
Duration: (ms)               15,100
Socket Errors:               0
Bad Responses:               0
Build Time (ms):             3,501
Published Size (KB):         25,588
SDK:                         5.0.100-alpha.1.20073.10
Runtime:                     5.0.0-alpha.1.20075.5

After

RequestsPerSecond:           3,987,891
Max CPU (%):                 99
WorkingSet (MB):             83
Avg. Latency (ms):           0.72
Startup (ms):                324
First Request (ms):          39.13
Latency (ms):                0.1
Duration: (ms)               15,100
Socket Errors:               0
Bad Responses:               0
Build Time (ms):             4,501
Published Size (KB):         128,309
SDK:                         5.0.100-alpha.1.20073.10
Runtime:                     5.0.0-alpha.1.20075.5
ASP.NET Core:                5.0.0-alpha.1.20076.1

…ll_wait threads and get Plaintext TechEmpower benchmark RPS +5%
@stephentoub
Member

stephentoub commented Jan 29, 2020

For plaintext, this change is effectively just changing the system to use a single epoll thread, correct?

Can you please sync with @VSadov and @tmds? They experimented separately with just setting it to 1, always. Example of why we didn't go ahead with it:
dotnet/corefx#36693 (comment)

It's possible your solution mitigates that, because it's using 1 in the cases of fewer connections but more when there are many, which is the case that regressed. It'd be good to drive it to ground, though.

@wfurt wfurt added area-System.Net.Sockets os-linux Linux OS (any supported distro) labels Jan 29, 2020
@tmds
Member

tmds commented Jan 30, 2020

@VSadov @stephentoub @adamsitnik @benaadams I'm working on a repo to try out some things with the Linux socket async implementation: https://github.com/tmds/Tmds.LinuxAsync.

I want to see if I can fit some of the kestrel-linux-transport implementation into the SocketAsyncEngine. Part of that would mean epoll threads are doing batched reads and writes, which gives them something more useful to do. It's a WIP. The repo may make it simpler to do some experimentation yourself.

Also, the ASP.NET Core Sockets transport does some rescheduling of socket work using IOQueue. I know it enhances performance, but I don't know how. Maybe it's a perf enhancement we can bring down to the Socket level.

For the number of SocketAsyncEngines, we need to decide based on different benchmark results. A single plaintext benchmark isn't enough to decide (cf. dotnet/corefx#36693 (comment)).

@adamsitnik
Member Author

It's possible your solution mitigates that, because it's using 1 in the cases of fewer connections

This is correct.

but more when there are many, which is the case that regressed.

This is not correct.

The maximum number of SocketAsyncEngine instances is constant.

private static readonly int s_engineCount =
#if DEBUG
2;
#else
Environment.ProcessorCount >= 6 ? Environment.ProcessorCount / 2 : 1;
#endif
#pragma warning restore CA1802
//
// The current engines. We replace an engine when it runs out of "handle" values.
// Must be accessed under s_lock.
//
private static readonly SocketAsyncEngine[] s_currentEngines = new SocketAsyncEngine[s_engineCount];

With my change we don't allocate a new SocketAsyncEngine until we have 1024 open connections. However, once we have reached the limit of SocketAsyncEngine instances, the round-robin works the same way as it used to.

Some examples:

CPUs | Connections | epoll threads before | epoll threads after
1    | n           | 1                    | 1
4    | n           | 1                    | 1
6    | 32          | 1                    | 1
6    | 33          | 2                    | 1
6    | 1000        | 3                    | 1
6    | 2000        | 3                    | 2
6    | 100000      | 3                    | 3
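
To make the table concrete, here is a rough model of the before/after thread counts (illustrative only; EstimateEpollThreads is not a runtime API, and it assumes connections fill engines round-robin up to the MinHandlesForAdditionalEngine threshold):

// Rough model of how many epoll_wait threads end up running for a given
// logical-processor count and connection count. maxEngines mirrors
// s_engineCount; a new engine is only created once the existing ones each
// hold minHandles sockets.
static int EstimateEpollThreads(int processorCount, int connections, int minHandles)
{
    int maxEngines = processorCount >= 6 ? processorCount / 2 : 1;
    int neededEngines = (connections + minHandles - 1) / minHandles; // ceiling division
    return Math.Max(1, Math.Min(maxEngines, neededEngines));
}

// EstimateEpollThreads(6, 1000, minHandles: 32)   == 3  (before the change)
// EstimateEpollThreads(6, 1000, minHandles: 1024) == 1  (after the change)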

I've run the Json and FortunesRaw benchmarks that you mentioned, using the machine with the following spec: E5-1650 v3 @ 3.50 GHz, 15 MB cache, 6 cores / 12 threads, 32 GB of RAM

And it looks like JSON got +11% RPS:

image

All numbers can be found here

Currently, the Citrine load machine is down (cc @sebastienros) so I can't run against real TechEmpower hardware; as soon as it's up again, I am going to run all the benchmarks and share the updated results here.

@stephentoub
Member

stephentoub commented Jan 30, 2020

This is not correct.

It is correct. I was talking about the number of connections in that previous test (65K).

@benaadams
Member

Also, ASP.NET Core Sockets transport does some rescheduling of Sockets stuff using IOQueue. I know it enhances performance, but I don't know how.

Rather than queuing each non-user-code callback to the ThreadPool, each thread has its own IOQueue, which it puts all the callbacks on, and the IOQueue puts itself on the ThreadPool as a single item. This brings down contention on the ThreadPool and processes the callbacks in batches.

However, it doesn't use this for user-code callbacks: it's unknown whether they contain a blocking call, which would block the entire batch, so they are queued to the ThreadPool directly.
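
A minimal sketch of that batching pattern (simplified from how Kestrel's IOQueue behaves; the class below is illustrative, not the actual transport code):

using System;
using System.Collections.Concurrent;
using System.Threading;

// Callbacks are collected in a ConcurrentQueue, and the queue schedules itself
// on the ThreadPool once per batch instead of once per callback.
internal sealed class IOQueue : IThreadPoolWorkItem
{
    private readonly ConcurrentQueue<Action> _workItems = new ConcurrentQueue<Action>();
    private int _doingWork; // 0 = idle, 1 = a batch is already scheduled or running

    public void Schedule(Action callback)
    {
        _workItems.Enqueue(callback);

        // Only enqueue the IOQueue itself if it isn't already queued or draining.
        if (Interlocked.CompareExchange(ref _doingWork, 1, 0) == 0)
        {
            ThreadPool.UnsafeQueueUserWorkItem(this, preferLocal: false);
        }
    }

    void IThreadPoolWorkItem.Execute()
    {
        while (true)
        {
            // Drain everything queued so far as one batch on this worker thread.
            while (_workItems.TryDequeue(out Action callback))
            {
                callback();
            }

            // Mark idle, then re-check for callbacks that raced in after the drain.
            Volatile.Write(ref _doingWork, 0);
            if (_workItems.IsEmpty || Interlocked.CompareExchange(ref _doingWork, 1, 0) != 0)
            {
                return;
            }
        }
    }
}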

@omariom
Contributor

omariom commented Jan 30, 2020

Can the change affect non-pipelined mode benchmarks?

@tmds
Member

tmds commented Jan 30, 2020

Rather than queuing each non-user-code callback to the ThreadPool, each thread has its own IOQueue, which it puts all the callbacks on, and the IOQueue puts itself on the ThreadPool as a single item. This brings down contention on the ThreadPool and processes the callbacks in batches.

Which threads are the source threads providing the work? Are they also ThreadPool threads? I expect that to be the case on Linux. But batching from ThreadPool to ThreadPool seems weird?

Can the change affect non-pipelined mode benchmarks?

Yes.

@adamsitnik
Member Author

Ok, I got the updated numbers (all results are available here)

image

The biggest winner is JSON (+8% or +11% depending on hardware), DbFortunes also got a nice boost (+7%), and PlainText improved as well (2-3%).

@wfurt PTAL and let me know if you have any questions regarding the change.

@adamsitnik adamsitnik requested a review from wfurt February 4, 2020 18:05
@tmds
Member

tmds commented Feb 4, 2020

I had picked a low value for MinHandlesForAdditionalEngine with the rationale that it would spread out sockets more evenly across engines.
A higher value makes sense to make each epoll thread more useful.

I wonder why PlainTextNonPipelined shows a much better improvement on the 6-core than on the 14-core machine. Any thoughts?
Have you tried varying MinHandlesForAdditionalEngine to find the optimum for some benchmarks?

@halter73
Member

halter73 commented Feb 5, 2020

@benaadams is right that part of the reason for using IOQueues in Kestrel's Socket transport is to batch calls to ThreadPool.UnsafeQueueUserWorkItem when we know that each callback is a short-lived, non-blocking continuation that moves data between System.Net.Sockets and System.IO.Pipelines.

Another reason IOQueues were added to the Kestrel Socket transport was to get better fairness on Windows. SocketConnection has a receive loop, so almost as soon as a call to ReceiveAsync completes, we schedule another call to ReceiveAsync for the same connection to an IOQueue. Before, when we scheduled the ReceiveAsync continuations directly to the ThreadPool, I noticed that more recently scheduled continuations would be executed before previously scheduled continuations (LIFO instead of FIFO). This was problematic, particularly for non-pipelined benchmarks like TechEmpower's JSON benchmark, because the next HTTP request would generally still be in flight when the next call to ReceiveAsync/WSARecv executed leading to the IOCP async slow path (WSA_IO_PENDING).

By introducing the IOQueues (which use ConcurrentQueues under the covers), I was able to reduce contention by creating multiple queues segmented by connection (instead of dispatching everything to the global queue, as you would get if I called ThreadPool.UnsafeQueueUserWorkItem with preferLocal=false, which I don't think existed at the time I added the IOQueues anyway), while still maintaining fairness so that by the time ReceiveAsync is called, the next request has already arrived.

My tracing verified that this allowed WSARecv to return 0 for immediate success instead of indicating WSA_IO_PENDING a much greater proportion of the time. I also noticed that the standard deviation of the latency reported by wrk dropped dramatically after introducing IOQueues, which makes sense given that it made the scheduling more fair, preventing connections from being starved of ThreadPool resources.
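
As an aside on the preferLocal flag mentioned above, a tiny standalone illustration of the overload being referred to (the class name, string, and Console output are just for demonstration):

using System;
using System.Threading;

class PreferLocalDemo
{
    static void Main()
    {
        // preferLocal: false -> the work item goes to the ThreadPool's global FIFO queue.
        // preferLocal: true  -> it goes to the scheduling worker thread's local queue,
        //                       which that worker dequeues in LIFO order (other workers
        //                       can still steal from it).
        ThreadPool.UnsafeQueueUserWorkItem(
            static msg => Console.WriteLine(msg),
            "dispatched via the global queue",
            preferLocal: false);

        Console.ReadLine(); // keep the process alive so the work item gets to run
    }
}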

I tried to use the same approach on Linux thinking similar principles would apply, but dispatching the IOQueue on Linux hurt performance (see aspnet/KestrelHttpServer#2578) compared to running the ReceiveAsync continuations inline because we were effectively double dispatching on Linux when we started using IOQueues.

I plan to try running the ReceiveAsync continuations inline on both Windows and Linux, write the data to the PipeWriter, then dispatch the next call to ReceiveAsync to an IOQueue. I think this could reduce latency while keeping fairness, potentially improving perf even on Linux. I just haven't gotten around to testing this approach yet.

What threads are the source threads providing the work? Are they also ThreadPool threads? I expect that to be the case on Linux? But batching from ThreadPool to ThreadPool seems weird?

Yes, the source is the ThreadPool. While batching from the ThreadPool may sound weird, this gives us an easy way to coalesce reads and writes in a way that is fair and should give us better cache locality since on a given thread we're doing a bunch of I/O in a tight loop rather than going from receiving-to-parsing-to-running-app-code-to-encoding-to-writing-to-receiving-again-when-the-next-request-hasn't-arrived-yet.

@adamsitnik
Member Author

I had picked a low value for MinHandlesForAdditionalEngine with the rationale that it would spread out sockets more evenly across engines.

The work was evenly distributed, but it didn't need to be, as there was very little of it. For async reads and writes, handling the events is just about scheduling work on the thread pool, which compared to the other things we do here is quite cheap (I've measured that). If the event loop were actually performing the reads and writes on its own thread, then even distribution would be very welcome.

According to the profile data that I've captured, every event loop thread was on average active for 1/6 of the time and blocked for 5/6.

image

With this change, we have fewer event loop threads and we can use the saved resources to do the actual work (JSON serialization, socket reads & writes, etc.).

In the case of the TechEmpower benchmarks with fewer than 1024 connections, it's a single thread. On average, it's busy for half of the time and blocked for the other half, so we could still handle more connections on a single thread.

image

I wonder why PlainTextNonPipelined shows a much better improvement on the 6 core than the 14 core machine. Any thoughts?

I currently can't answer this question because our benchmarking infra is temporarily producing broken trace files.

Have you tried varying the MinHandlesForAdditionalEngine and find out the optimum for some benchmarks?

I did not; the profiles that I captured told me that even with 1024 connections the event loop thread is still not busy 100% of the time (under a high, artificial TechEmpower load).

Also, when I started working on this task I read some docs and part of The Linux Programming Interface book (to get a better understanding, as I am a Linux newbie). The author of the book ran an interesting experiment where he performed 100,000 calls to epoll_wait monitoring N file descriptors, with 1 random fd changing between the calls. For 10 monitored file descriptors it took 0.41s, for a hundred 0.42s, for a thousand 0.53s, and for ten thousand only 0.66s.
So the performance of epoll hardly declines as the number of monitored file descriptors grows large.

Actually, the only thing that stopped me from increasing this value is the fact that blocking synchronous reads and writes can be performed on this thread.

@stephentoub
Member

So the performance of epoll hardly declines as the number of monitored file descriptors grows large.

Yes, it's not the time spent in epoll_wait that would be the concern; it's that once you wake up from the wait, there is a delay in processing additional events while processing the previous ones. If you get 1000 events to process on one thread, versus 500 events on each of two threads, in theory you can process the 500 events on each thread concurrently in less time, assuming you have the system resources to actually do so in parallel.

Regardless, if every benchmark you've run has shown this is an improvement, and you've run every benchmark in every configuration we care about, I'm happy with the change.

@tmds
Member

tmds commented Feb 5, 2020

I currently can't answer this question because our benchmarking infra is temporarily producing broken trace files.

I'm wondering if PlainTextNonPipelined would behave better on the 14-core machine if it had an additional engine.
With how many connections did you run the benchmark? And what does that mean for the number of engines?

Regardless, if every benchmark you've run has shown this is an improvement, and you've run every benchmark in every configuration we care about

👍 this is about running the proper benchmarks.

On the topic of proper benchmarks, we should have a benchmark that measures latency vs throughput.

@tmds
Member

tmds commented Feb 5, 2020

Actually the only thing that stopped me from increasing this value is the fact that there can be blocking synchronous reads and writes performed on this thread.

There aren't any blocking operations on this thread afaik.

@tmds
Member

tmds commented Feb 5, 2020

@halter73 thanks for explaining what the IOQueues are for. This was a mystery to me.

Another reason IOQueues were added to the Kestrel Socket transport was to get better fairness on Windows.

Is this fairness issue Windows specific? I guess on Linux most of this is implemented in the runtime and works FIFO?

I plan to try running the ReceiveAsync continuations inline on both Windows and Linux, write the data to the PipeWriter, then dispatch the next call to ReceiveAsync to an IOQueue.

Part of #14304 is a property to allow running continuations directly on the epoll thread.
Why would you dispatch the ReceiveAsync to the IOQueue?

While batching from the ThreadPool may sound weird, this gives us an easy way to coalesce reads and writes in a way that is fair

Coalescing of writes is definitely something that happens thanks to the IOQueues. Reading the output data gets delayed until the IOQueue runs, which means more data can be available, which leads to fewer syscalls.
I'm not sure if there is coalescing of reads somehow happening.

@adamsitnik
Member Author

With how many connections did you run the benchmark? And what does that mean for the nr of engines?

Mostly with the default number of connections, 256, but I've tested some TechEmpower combinations (16 | 32 | 64 | 128 | 256 | 512) and bigger ones like 1000 or 2000, and the proposed change always gives better numbers (as expected, the number of engines for a very big number of connections will be the same as before; it changed only for < 1024).

we should have a benchmark that measures latency vs throughput.

We already measure latency. @sebastienros can we spot a latency change in the PowerBI reports?

There aren't any blocking operations on this thread afaik.

You are right; the dispatch method can only signal a waiting thread to continue processing. For some reason, I thought that it could execute a blocking call.

// Sync operation. Signal waiting thread to continue processing.
e.Set();

@adamsitnik adamsitnik merged commit 34dbb20 into dotnet:master Feb 5, 2020
@adamsitnik adamsitnik deleted the fewerEpolls branch February 5, 2020 19:46
@tmds
Member

tmds commented Feb 5, 2020

Mostly with the default number of connections: 256

So on the 6 core machine, engines went from 6 (ProcessorCount / 2) to 1.
And on the 14 core machine, they went from 8 (256/32) to 1.

PlainTextNonPipelined had the best improvement on the 6-core machine (+12%) and the least on the 14-core machine (+2%). It is the most socket-heavy benchmark. Maybe it would perform better with 2 engines.

@benaadams
Member

Mostly with the default number of connections, 256, but I've tested some TechEmpower combinations (16 | 32 | 64 | 128 | 256 | 512) and bigger ones like 1000 or 2000, and the proposed change always gives better numbers (as expected, the number of engines for a very big number of connections will be the same as before; it changed only for < 1024).

Aside: for Plaintext they do go up to 16k connections and publish the results in their data tables

image

@tmds
Member

tmds commented Feb 5, 2020

For the high connection counts, the number of engines will be the same as before.

MinHandlesForAdditionalEngine changed from 32 to 1024. Based on the benchmark results, I wonder how it would perform at 128 on the 14-core machine with 256 connections for PlainTextNonPipelined.

@benaadams
Member

I wonder how it would perform for 128 on the 14 core machine for PlainTextNonPipelined.

Json mostly serves as a proxy for that?

image

image

Also, the db loads use the sockets with two (or more) different workloads, e.g. http + db connection (selects + updates), so they are also interesting.

@tmds
Member

tmds commented Feb 6, 2020

Json mostly serves as a proxy for that?

Yes. From the official TE suite, that is the most socket-intensive one.

Also the db loads will use the sockets with two (or more) different workloads; e.g. http + db connection (selects + updates); so are also interesting.

From the perspective of where CPU is spent, the db loads spend less time in Sockets. So changes made to Sockets/Transport show up more in Json/PlainTextNonPipelined. That said, the db loads (and the other benchmarks) need to look good for these changes too.

@tmds
Member

tmds commented Feb 6, 2020

On the topic of proper benchmarks, we should have a benchmark that measures latency vs throughput.

@adamsitnik to elaborate on this: there is a lot of focus on how many RPS the TE benchmarks achieve. Also important, maybe even more so, is latency (what shows up in Ben's screenshots). And all these benchmarks are run at 100% CPU. That is not how CPU is loaded in production. Behavior could be different at lower CPU because the scheduler isn't confronted with an impossible job. That's why we need a good set of benchmarks (TE and 'others').

@adamsitnik
Member Author

@tmds I agree that latency is very important. It's one of the metrics that we track (RPS, Latency, Memory, Startup Time).

As soon as #31869 is fixed and ASP.NET benchmarks work again we should see the updated numbers for RPS and Latency in https://msit.powerbi.com/view?r=eyJrIjoiYTZjMTk3YjEtMzQ3Yi00NTI5LTg5ZDItNmUyMGRlOTkwMGRlIiwidCI6IjcyZjk4OGJmLTg2ZjEtNDFhZi05MWFiLTJkN2NkMDExZGI0NyIsImMiOjV9

I am also going to include these metrics in my next PRs.

adamsitnik added a commit that referenced this pull request Mar 20, 2020
adamsitnik added a commit that referenced this pull request Mar 20, 2020
@karelz karelz added this to the 5.0.0 milestone Aug 18, 2020
@ghost ghost locked as resolved and limited conversation to collaborators Dec 10, 2020