Prevent Thread Pool Starvation in Synchronous Methods #62

Closed
caleblloyd opened this issue Sep 24, 2016 · 21 comments

@caleblloyd
Contributor

Right now, synchronous methods are implemented as:

Task.GetAwaiter().GetResult()

This causes the running thread to schedule a background thread to do the work, and block until the result is available. This can lead to thread pool starvation, especially in the case of a web server.

For example, if the thread pool size is 25 and a web request comes in, there are 24 available threads. Then Connection.Open gets synchronously called, and that calls Connection.OpenAsync().GetAwaiter().GetResult(). The current thread blocks and starts a background thread to open the connection, so now there are 23 available threads.

If a flood of more requests come in, there can be a situation where there are no available threads to schedule the Connection.OpenAsync().GetAwaiter().GetResult() calls on. The app completely locks up at this point.
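For reference, this is roughly the shape of the wrapper being described (a minimal sketch with hypothetical names, not this library's actual source):

```csharp
using System.Threading.Tasks;

// Minimal sketch of a sync-over-async wrapper (hypothetical names).
class ExampleConnection
{
    public void Open()
    {
        // The calling thread blocks here until the task returned by OpenAsync
        // completes; any continuations of OpenAsync need other thread-pool threads.
        OpenAsync().GetAwaiter().GetResult();
    }

    public async Task OpenAsync()
    {
        // Stand-in for the real socket connect and handshake.
        await Task.Delay(10);
    }
}
```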

I think we should look for a way to run these synchronous tasks on the currently executing thread if possible. Either that or schedule the Async work to run on a reserved background thread. @bgrainger , what's your opinion of the best approach?

You can see the thread pool starvation happen if you run the performance stress test with synchronous targets. I get lockups around 200+ Requests Per Second.

@caleblloyd
Contributor Author

This post describes a Synchronization Context that could be used to run all of the tasks in the executing thread: https://blogs.msdn.microsoft.com/pfxteam/2012/01/20/await-synchronizationcontext-and-console-apps/

I will take a shot at implementing this approach and measure performance
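For context, the linked post builds a single-threaded SynchronizationContext and pumps continuations on the calling thread. A condensed sketch of that idea (illustrative names; note that continuations using ConfigureAwait(false) bypass the context, so this only helps when continuations actually post back to it):

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Queues posted continuations and runs them on the thread that calls RunOnCurrentThread.
sealed class SingleThreadSynchronizationContext : SynchronizationContext
{
    private readonly BlockingCollection<KeyValuePair<SendOrPostCallback, object>> m_queue =
        new BlockingCollection<KeyValuePair<SendOrPostCallback, object>>();

    public override void Post(SendOrPostCallback d, object state) =>
        m_queue.Add(new KeyValuePair<SendOrPostCallback, object>(d, state));

    public void RunOnCurrentThread()
    {
        foreach (var workItem in m_queue.GetConsumingEnumerable())
            workItem.Key(workItem.Value);
    }

    public void Complete() => m_queue.CompleteAdding();
}

static class AsyncPump
{
    // Runs an async function to completion, pumping its continuations on the calling
    // thread instead of blocking a second thread-pool thread.
    public static void Run(Func<Task> func)
    {
        var previous = SynchronizationContext.Current;
        var context = new SingleThreadSynchronizationContext();
        SynchronizationContext.SetSynchronizationContext(context);
        try
        {
            var task = func();
            task.ContinueWith(_ => context.Complete(), TaskScheduler.Default);
            context.RunOnCurrentThread();
            task.GetAwaiter().GetResult();
        }
        finally
        {
            SynchronizationContext.SetSynchronizationContext(previous);
        }
    }
}
```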

@bgrainger
Member

Fundamentally, this is the problem with synchronous I/O, and why async is encouraged. It's not worth addressing in this library because it's a scalability bug in the caller.

This causes the running thread to schedule a background thread to do the work, and block until the result is available.

This is not true (if you mean that a call to Connection.Open() takes two threads, and not just the one that called Open).

For example, if the thread pool size is 25 and a web request comes in, there are 24 available threads. Then Connection.Open gets synchronously called, and that calls Connection.OpenAsync().GetAwaiter().GetResult(). The current thread blocks and starts a background thread to open the connection, so now there are 23 available threads.

Connection.OpenAsync() will run until its first await statement (not strictly true, but "true enough"). At that point, it returns a Task and is ready to yield its thread (which is what would happen in an async scenario). But in this case, GetAwaiter().GetResult() will block on that task, using the thread it was called on. There will be 24 available threads. Any CPU work that needs to happen (e.g., to process the bytes that are received on the socket) will temporarily use an I/O completion port thread. No background thread is started.
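A small console demo of that execution order (assuming no SynchronizationContext, as in Kestrel or a console app):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class AwaitDemo
{
    static async Task OpenAsync()
    {
        // Runs synchronously on the caller's thread up to the first await...
        Console.WriteLine($"before await: thread {Thread.CurrentThread.ManagedThreadId}");
        await Task.Delay(100);
        // ...then the continuation runs on a thread-pool thread.
        Console.WriteLine($"after await:  thread {Thread.CurrentThread.ManagedThreadId}");
    }

    static void Main()
    {
        Console.WriteLine($"caller:       thread {Thread.CurrentThread.ManagedThreadId}");
        // Blocks the caller's thread on the returned task; no extra thread is
        // consumed by the blocking itself.
        OpenAsync().GetAwaiter().GetResult();
    }
}
```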

You can see the thread pool starvation happen if you run the performance stress test with synchronous targets.

That's why scalable web server benchmarks don't use synchronous I/O.

@caleblloyd
Contributor Author

What else would cause thread pool starvation to the point of complete application lock-up with such a low number of requests? (200 RPS isn't that much.) It has to be related to the GetAwaiter().GetResult() calls, because it only happens in the Sync code.

I agree that Async all the way is the correct approach, but we can't force this on all libraries that may want to depend on us. For example, Pomelo was hitting complete app lock-ups in Async performance tests when it was calling all Async code but still using Synchronous transactions, before #60 was merged. Asynchronous transactions aren't in the official DbTransaction interface, so we need to find a way to gracefully support synchronous transactions.

I'll investigate further as to what is causing the app lockups in sync code and see if I can find a graceful workaround. There's probably a lot of apps out there that implement Sync code, and we don't want to cause them to lock up.

@caleblloyd
Contributor Author

caleblloyd commented Sep 25, 2016

@bgrainger I believe I have found the issue. The TryAsyncContinuation in MySqlSession.cs#L230 and MySqlSession.cs#L261 force Async code to run over a Synchronous method.

When Async methods hit this call, they stay in the same thread and complete, but we're not Async all the way through.

When Sync methods (such as Connection.Open) hit this call, they are already in a thread blocking on Connection.OpenAsync().GetAwaiter().GetResult(). This means a new thread from the thread pool must be used to service the task.GetAwaiter().GetResult() in TryAsyncContinuation, which leads to Thread Pool Starvation.

I think the TryAsyncContinuation calls need to be converted to async functions. I took a look at it, but I'm not familiar with ValueTask and think that you may be able to come up with a solution quicker than me. Do you have time to look at a solution, or would you like me to try?

@bgrainger
Member

When Connection.Open() is blocking on GetResult, any other computation (such as deserializing the payload) must run on another thread, by necessity. All that other work should be non-blocking though.

I'm not immediately sure that TryAsyncContinuation is the issue, though. It is queued up with ContinueWith so the antecedent Task should already be completed and GetAwaiter().GetResult() should immediately return that task's result or throw.

I can definitely debug this and verify that assumption.

I should also be able to run the perf harness under IIS on Windows and drop in the (synchronous-only) MySql.Data library to see if it achieves similar (or much higher?) RPS.

@caleblloyd
Contributor Author

I should also be able to run the perf harness under IIS on Windows and drop in the (synchronous-only) MySql.Data library to see if it achieves similar (or much higher?) RPS.

Correct, remove MySqlConnector from project.json and add MySql.Data/7.0.5-IR21

Would also be interesting to see if it locks up at a certain number of RPS or just gets slower.

@bgrainger
Member

Under IIS (with this library), I was able to sustain 825 req/s for 20s. (Xeon E5-1650 v3, local MySQL 5.7.11 for Windows). Thread count peaked at 60.

With dotnet run, I was only able to reach 700 req/s; not as much as IIS, but still significantly more than you were reporting. More investigation is needed.

@caleblloyd
Contributor Author

I opened #63 to make the calls to TryAsyncContinuation asynchronous. It helps performance by reducing Sync calls from Async code and reducing context switching.

As for Sync performance - there was a bug in the stress.ps1 script that caused it to always run against Async targets, the fix is in #64. I re-tested on Linux and Windows, and get lock-ups on both. I have not tried with IIS though, just Kestrel.

@caleblloyd
Contributor Author

Here's my theory on what's happening:

If I start 200 sync calls at once, I am seeing .NET spawn 200 threads (at the rate of about 2 threads per second) and then respond to all of the requests once the 200th thread has been created.

I think that the Managed Thread Pool is queuing requests for threads. When the flood of 200 requests hits, it allocates the entire thread pool (say 10 for example) to those incoming requests. Then it queues the remaining 190 to get threads from the pool as they become available.

Next, the first sync request executes. It runs an async task that gets scheduled to run on a different Managed Thread Pool thread. But there is a queue to get an available thread, so it is 191st in line to get one.

Now some Managed Thread Pool algorithm kicks in and sees that all threads are locked up. It decides to allocate new threads to the Managed Thread Pool, at the rate of about 2 threads per second. The new threads process from the front of the queue, so this must happen 190 times before any async code gets executed.

Finally, after 95 seconds, 190 threads have been spawned. The 191st thread starts executing async code, and flies through the backlog of async code. All of the outstanding requests complete very quickly.
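A standalone repro of that behaviour looks something like this (numbers and timings are illustrative and depend on runtime version and thread-pool settings):

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

class StarvationRepro
{
    static async Task<int> DoAsyncWork()
    {
        await Task.Delay(100);   // the continuation needs a free thread-pool thread
        return 42;
    }

    static void Main()
    {
        var sw = Stopwatch.StartNew();
        var requests = new Task[200];
        for (int i = 0; i < requests.Length; i++)
        {
            // Each "request" blocks a thread-pool thread on the async result, so the
            // continuations queue up behind the blocked callers.
            requests[i] = Task.Run(() => DoAsyncWork().GetAwaiter().GetResult());
        }
        Task.WaitAll(requests);
        // With blocking callers this completes at roughly the thread-injection rate
        // (a couple of new threads per second), not in ~100 ms.
        Console.WriteLine(sw.Elapsed);
    }
}
```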

I have no idea how to fix this besides making Sync code use Sync paths all the way down (a huge effort). Any ideas?

@bgrainger
Member

Here's my theory on what's happening

Makes sense; I've seen that before. (Pretty sure I've seen some good blog posts on it too, but can't find one right now.)

I have no idea how to fix this besides for making Sync code use Sync paths all of the way down (a huge effort). Any ideas?

One thing about await is that it "returns" immediately if the awaited task is already completed. So theoretically, if a sync Socket API was used at the lowest level, the regular call stack would unwind as though everything were synchronous. I've made some comments on this before at #21.

However, if the sync sockets API still needs to schedule some code on a threadpool thread, it won't help.
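A small demo of the "completed task unwinds synchronously" point (illustrative only):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class CompletedAwaitDemo
{
    static async Task<int> ReadLengthAsync()
    {
        // Task.FromResult is already completed, so this await does not yield:
        // execution continues on the same thread with no scheduling.
        int length = await Task.FromResult(42);
        Console.WriteLine($"after await: thread {Thread.CurrentThread.ManagedThreadId}");
        return length;
    }

    static void Main()
    {
        Console.WriteLine($"caller:      thread {Thread.CurrentThread.ManagedThreadId}");
        // Because every awaited task is already completed, this call never waits on
        // work queued to another thread.
        Console.WriteLine(ReadLengthAsync().GetAwaiter().GetResult());
    }
}
```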

@caleblloyd
Contributor Author

One thing about await is that it "returns" immediately if the awaited task is already completed. So theoretically, if a sync Socket API was used at the lowest level, the regular call stack would unwind as though everything were synchronous.

This approach works.

I've got some WIP where I've added an asyncIO argument to MySqlSession.OpenSocketAsync, PacketTransmitter.DoSendAsync, and PacketTransmitter.ReceivePacketAsync

I am able to run Sync at 500RPS for 10 seconds using this WIP.

Now what I don't like about the WIP is that I've exposed a bool asyncIO parameter on every single function that eventually touches the socket. It gets very confusing.

What are your thoughts on putting the MySqlSession into a "Sync Mode", running the synchronous operation through the PacketTransmitter with asyncIO=false, waiting for it, then putting the session back into "Async Mode"?

@bgrainger
Member

Options that come to mind:

  • Write synchronous versions of every method instead of having X delegate to XAsync. Cons: very high code duplication, doubles testing overhead, likely to lead to bugs.
  • Add asyncIO parameter to every method. Pros: made explicit everywhere. Cons: "Infects" the whole code base, less obvious what a boolean parameter means when reading the code.
  • Add "SyncMode" field to MySqlSession. Pros: less "invasive". Cons: Makes XAsync methods behave synchronously depending on state in a different class.

None of the options is particularly appealing, but I think the "Sync Mode" approach is worth trying.

@caleblloyd
Contributor Author

Great, I'll work up a "Sync Mode" solution (unless you want to). May take me a few days.

@bgrainger
Member

Add "SyncMode" field to MySqlSession.

Another "con" for this is that it couldn't be used before a MySqlSession is established, but we might want synchronous I/O for the Open method. That pushes me back to adding an asyncIO parameter for consistency. It also seems cleaner overall to not have the behaviour of code "mysteriously" affected by "global" state.

@bgrainger
Member

Either way, we'd want to introduce more ValueTasks to avoid allocating Tasks on the synchronous code path. (But that would be best as a second, separate commit.)

@bgrainger
Member

we'd want to introduce more ValueTasks

But we can't in methods that use await until dotnet/roslyn#7169 is available.

@bgrainger
Member

@caleblloyd I decided to work on the "asyncIO parameter" approach locally, just to see how big the changes were. It actually wasn't as bad as I was expecting, so I've pushed it here for discussion (and comparison with your WIP): ae30b7f

I haven't actually yet run a stress test to verify that work isn't blocked when the threadpool is exhausted.

doubles testing overhead

This is going to be true for any approach, since we ultimately need to verify that both the complete sync and async code paths work correctly. The most critical test cases will probably be the ones where the code needs to loop to send/receive payloads larger than a single MySQL packet; I need to add variants of those tests for both sync and async.

@caleblloyd
Contributor Author

@bgrainger I like it, the use of the enum makes things less confusing. ReadAsync(asyncIO) was much more confusing due to the dual meanings of the word async.
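For illustration, the enum-based shape under discussion looks roughly like this (hypothetical names and stubs; the actual commit may differ):

```csharp
using System.Threading;
using System.Threading.Tasks;

// Hypothetical enum making the I/O mode explicit at every call site.
enum IOBehavior
{
    Synchronous,
    Asynchronous,
}

class ExamplePacketTransmitter
{
    public async Task<int> ReceivePacketAsync(byte[] buffer, IOBehavior ioBehavior, CancellationToken token)
    {
        if (ioBehavior == IOBehavior.Synchronous)
        {
            // Blocking read on the calling thread; the method returns an
            // already-completed task, so awaiting callers don't hop threads.
            return ReceiveCore(buffer);
        }
        return await ReceiveCoreAsync(buffer, token);
    }

    // Stand-ins for the real socket reads.
    private int ReceiveCore(byte[] buffer) => buffer.Length;
    private Task<int> ReceiveCoreAsync(byte[] buffer, CancellationToken token) => Task.FromResult(buffer.Length);
}
```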

I am curious to know, will Synchronous IO requests that are in progress get interrupted by a CancellationToken? I was trying to figure out how to interrupt a Synchronous OpenSocket function.

We should revert to the Connection Timeout Logic that you wrote in:

@bgrainger
Member

I am curious to know, will Synchronous IO requests that are in progress get interrupted by a CancellationToken?

The synchronous Socket API predates CancellationToken so: not by default. We would have to register a callback and Dispose the socket. However, this probably isn't what the client wants, most of the time. I would assume someone cancelling a query wants to abort that query, but leave the connection (and socket) itself alive.

It's probably useful to cancel socket I/O during Open/OpenAsync, but that might be it?
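For the record, the register-a-callback-and-dispose approach would look roughly like this (illustrative sketch, not the library's code):

```csharp
using System;
using System.Net.Sockets;
using System.Threading;

static class CancellableConnect
{
    public static void Connect(Socket socket, string host, int port, CancellationToken token)
    {
        // Synchronous Socket calls don't accept a CancellationToken, so the only
        // lever is to dispose the socket, which makes the blocked call throw.
        using (token.Register(socket.Dispose))
        {
            try
            {
                socket.Connect(host, port);
            }
            catch (Exception) when (token.IsCancellationRequested)
            {
                // The blocked Connect fails with SocketException/ObjectDisposedException
                // once the socket is disposed; surface it as a cancellation.
                throw new OperationCanceledException(token);
            }
        }
    }
}
```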

@caleblloyd
Contributor Author

caleblloyd commented Sep 27, 2016

It's probably useful to cancel socket I/O during Open/OpenAsync, but that might be it?

  • Socket Open/OpenAsync needs to cancel Socket IO in the case of ConnectionTimeout
  • Socket Send/SendAsync and Receive/ReceiveAsync need to cancel Socket IO in the case of DefaultCommandTimeout (I don't think we implement this yet)

From the MySQL Driver docs:

MySQL Connector/Net 6.2 introduced timeouts that are aligned with how Microsoft handles SqlCommand.CommandTimeout. This property is the cumulative timeout for all network reads and writes during command execution or processing of the results. A timeout can still occur in the MySqlReader.Read method after the first row is returned, and does not include user processing time, only IO operations. The 6.2 implementation uses the underlying stream timeout facility, so is more efficient in that it does not require the additional timer thread as was the case with the previous implementation.

It sounds like Oracle's implementation of CommandTimeout is setting Socket.SendTimeout and Socket.ReceiveTimeout. They start at the timeout value (30 seconds) and subtract only time spent in I/O. This becomes the send or receive timeout for the next socket command.
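A sketch of that cumulative-budget idea, where only time spent in I/O counts against the timeout (illustrative names, not the library's implementation):

```csharp
using System;
using System.Diagnostics;
using System.Net.Sockets;

// Tracks the remaining command timeout, charging it only for blocking socket I/O.
sealed class IoTimeoutBudget
{
    private readonly int m_totalMilliseconds;
    private int m_usedMilliseconds;

    public IoTimeoutBudget(int commandTimeoutSeconds) =>
        m_totalMilliseconds = commandTimeoutSeconds * 1000;

    public int Receive(Socket socket, byte[] buffer)
    {
        int remaining = m_totalMilliseconds - m_usedMilliseconds;
        if (remaining <= 0)
            throw new TimeoutException("Command timeout expired.");

        // Set the per-call timeout to whatever is left of the overall budget.
        socket.ReceiveTimeout = remaining;
        var sw = Stopwatch.StartNew();
        try
        {
            return socket.Receive(buffer);
        }
        finally
        {
            // Only time spent blocked in the receive counts against the budget.
            m_usedMilliseconds += (int)sw.ElapsedMilliseconds;
        }
    }
}
```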

If a timeout is hit, can we reset the session and keep it in the pool?

Also, in the case of ConnectionTimeout, will Socket.ReceiveTimeout throw an exception on a connection if no response is received for the Socket.Open request?

I'll open another issue for DefaultCommandTimeout and CommandTimeout

@bgrainger
Member

We should revert to the Connection Timeout Logic

Done in d65a3ae. This was a prerequisite for switching OpenSocketAsync to synchronous I/O (in 9f5ea21).
