Prevent Thread Pool Starvation in Synchronous Methods #62

Closed
caleblloyd opened this issue Sep 24, 2016 · 21 comments

@caleblloyd
Contributor

Right now, synchronous methods are implemented as:

Task.GetAwaiter().GetResult()

This causes the running thread to schedule a background thread to do the work, and block until the result is available. This can lead to thread pool starvation, especially in the case of a web server.

For example, if the thread pool size is 25 and a web request comes in, there are 24 available threads. Then Connection.Open gets synchronously called, and that calls Connection.OpenAsync().GetAwaiter().GetResult(). The current thread blocks and starts a background thread to open the connection, so now there are 23 available threads.

If a flood of more requests come in, there can be a situation where there are no available threads to schedule the Connection.OpenAsync().GetAwaiter().GetResult() calls on. The app completely locks up at this point.
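For reference, this is roughly the shape of the wrapper being described (a minimal sketch with hypothetical names, not this library's actual source):

```csharp
using System.Threading.Tasks;

// Minimal sketch of a sync-over-async wrapper (hypothetical names).
class ExampleConnection
{
    public void Open()
    {
        // The calling thread blocks here until the task returned by OpenAsync
        // completes; any continuations of OpenAsync need other thread-pool threads.
        OpenAsync().GetAwaiter().GetResult();
    }

    public async Task OpenAsync()
    {
        // Stand-in for the real socket connect and handshake.
        await Task.Delay(10);
    }
}
```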

I think we should look for a way to run these synchronous tasks on the currently executing thread if possible. Either that or schedule the Async work to run on a reserved background thread. @bgrainger , what's your opinion of the best approach?

You can see the thread pool starvation happen if you run the performance stress test with synchronous targets. I get lockups around 200+ Requests Per Second.

@caleblloyd
Contributor Author

This post describes a Synchronization Context that could be used to run all of the tasks in the executing thread: https://blogs.msdn.microsoft.com/pfxteam/2012/01/20/await-synchronizationcontext-and-console-apps/

I will take a shot at implementing this approach and measure performance
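For context, the linked post builds a single-threaded SynchronizationContext and pumps continuations on the calling thread. A condensed sketch of that idea (illustrative names; note that continuations using ConfigureAwait(false) bypass the context, so this only helps when continuations actually post back to it):

```csharp
using System;
using System.Collections.Concurrent;
using System.Collections.Generic;
using System.Threading;
using System.Threading.Tasks;

// Queues posted continuations and runs them on the thread that calls RunOnCurrentThread.
sealed class SingleThreadSynchronizationContext : SynchronizationContext
{
    private readonly BlockingCollection<KeyValuePair<SendOrPostCallback, object>> m_queue =
        new BlockingCollection<KeyValuePair<SendOrPostCallback, object>>();

    public override void Post(SendOrPostCallback d, object state) =>
        m_queue.Add(new KeyValuePair<SendOrPostCallback, object>(d, state));

    public void RunOnCurrentThread()
    {
        foreach (var workItem in m_queue.GetConsumingEnumerable())
            workItem.Key(workItem.Value);
    }

    public void Complete() => m_queue.CompleteAdding();
}

static class AsyncPump
{
    // Runs an async function to completion, pumping its continuations on the calling
    // thread instead of blocking a second thread-pool thread.
    public static void Run(Func<Task> func)
    {
        var previous = SynchronizationContext.Current;
        var context = new SingleThreadSynchronizationContext();
        SynchronizationContext.SetSynchronizationContext(context);
        try
        {
            var task = func();
            task.ContinueWith(_ => context.Complete(), TaskScheduler.Default);
            context.RunOnCurrentThread();
            task.GetAwaiter().GetResult();
        }
        finally
        {
            SynchronizationContext.SetSynchronizationContext(previous);
        }
    }
}
```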

@bgrainger
Member

Fundamentally, this is the problem with synchronous I/O, and why async is encouraged. It's not worth addressing in this library because it's a scalability bug in the caller.

This causes the running thread to schedule a background thread to do the work, and block until the result is available.

This is not true (if you mean that a call to Connection.Open() takes two threads, and not just the one that called Open).

For example, if the thread pool size is 25 and a web request comes in, there are 24 available threads. Then Connection.Open gets synchronously called, and that calls Connection.OpenAsync().GetAwaiter().GetResult(). The current thread blocks and starts a background thread to open the connection, so now there are 23 available threads.

Connection.OpenAsync() will run until its first await statement (not strictly true, but "true enough"). At that point, it returns a Task and is ready to yield its thread (which is what would happen in an async scenario). But in this case, GetAwaiter().GetResult() will block on that task, using the thread it was called on. There will be 24 available threads. Any CPU work that needs to happen (e.g., to process the bytes that are received on the socket) will temporarily use an I/O completion port thread. No background thread is started.
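A small console demo of that execution order (assuming no SynchronizationContext, as in Kestrel or a console app):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class AwaitDemo
{
    static async Task OpenAsync()
    {
        // Runs synchronously on the caller's thread up to the first await...
        Console.WriteLine($"before await: thread {Thread.CurrentThread.ManagedThreadId}");
        await Task.Delay(100);
        // ...then the continuation runs on a thread-pool thread.
        Console.WriteLine($"after await:  thread {Thread.CurrentThread.ManagedThreadId}");
    }

    static void Main()
    {
        Console.WriteLine($"caller:       thread {Thread.CurrentThread.ManagedThreadId}");
        // Blocks the caller's thread on the returned task; no extra thread is
        // consumed by the blocking itself.
        OpenAsync().GetAwaiter().GetResult();
    }
}
```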

You can see the thread pool starvation happen if you run the performance stress test with synchronous targets.

That's why scalable web server benchmarks don't use synchronous I/O.

@caleblloyd
Contributor Author

What else would cause thread pool starvation to the point of complete application lock-up with such a low number of requests? (200 RPS isn't that much.) It has to be related to the GetAwaiter().GetResult() calls, because it only happens in the Sync code.

I agree that Async all the way is the correct approach, but we can't force this on all libraries that may want to depend on us. For example, Pomelo was hitting complete app lock-ups in Async performance tests when it was calling all Async code but still using Synchronous transactions, before #60 was merged. Asynchronous transactions aren't in the official DbTransaction interface, so we need to find a way to gracefully support synchronous transactions.

I'll investigate further as to what is causing the app lockups in sync code and see if I can find a graceful workaround. There's probably a lot of apps out there that implement Sync code, and we don't want to cause them to lock up.

@caleblloyd
Contributor Author

caleblloyd commented Sep 25, 2016

@bgrainger I believe I have found the issue. The TryAsyncContinuation in MySqlSession.cs#L230 and MySqlSession.cs#L261 force Async code to run over a Synchronous method.

When Async methods hit this call, they stay in the same thread and complete, but we're not Async all the way through.

When Sync methods (such as Connection.Open) hit this call, they are already in a thread blocking on Connection.OpenAsync().GetAwaiter().GetResult(). This means a new thread from the thread pool must be used to service the task.GetAwaiter().GetResult() in TryAsyncContinuation, which leads to Thread Pool Starvation.

I think the TryAsyncContinuation calls need to be converted to async functions. I took a look at it, but I'm not familiar with ValueTask and think that you may be able to come up with a solution quicker than me. Do you have time to look at a solution, or would you like me to try?

@bgrainger
Member

When Connection.Open() is blocking on GetResult, any other computation (such as deserializing the payload) must run on another thread, by necessity. All that other work should be non-blocking though.

I'm not immediately sure that TryAsyncContinuation is the issue, though. It is queued up with ContinueWith so the antecedent Task should already be completed and GetAwaiter().GetResult() should immediately return that task's result or throw.

I can definitely debug this and verify that assumption.

I should also be able to run the perf harness under IIS on Windows and drop in the (synchronous-only) MySql.Data library to see if it achieves similar (or much higher?) RPS.

@caleblloyd
Contributor Author

I should also be able to run the perf harness under IIS on Windows and drop in the (synchronous-only) MySql.Data library to see if it achieves similar (or much higher?) RPS.

Correct, remove MySqlConnector from project.json and add MySql.Data/7.0.5-IR21

Would also be interesting to see if it locks up at a certain number of RPS or just gets slower.

@bgrainger
Member

Under IIS (with this library), I was able to sustain 825 req/s for 20s. (Xeon E5-1650 v3, local MySQL 5.7.11 for Windows). Thread count peaked at 60.

With dotnet run, I was only able to reach 700 req/s; not as much as IIS, but still significantly more than you were reporting. More investigation is needed.

@caleblloyd
Contributor Author

I opened #63 to make the calls to TryAsyncContinuation asynchronous. It helps performance by reducing Sync calls from Async code and reducing context switching.

As for Sync performance - there was a bug in the stress.ps1 script that caused it to always run against Async targets, the fix is in #64. I re-tested on Linux and Windows, and get lock-ups on both. I have not tried with IIS though, just Kestrel.

@caleblloyd
Contributor Author

Here's my theory on what's happening:

If I start 200 sync calls at once, I am seeing .NET spawn 200 threads (at the rate of about 2 threads per second) and then respond to all of the requests once the 200th thread has been created.

I think that the Managed Thread Pool is queuing requests for threads. When the flood of 200 requests hits, it allocates the entire thread pool (say 10 for example) to those incoming requests. Then it queues the remaining 190 to get threads from the pool as they become available.

Next, the first sync request executes. It runs an async task that gets scheduled to run on a different Managed Thread Pool thread. But there is a queue to get an available thread, so it is 191st in line to get one.

Now some Managed Thread Pool algorithm kicks in and sees that all threads are locked up. It decides to allocate new threads to the Managed Thread Pool, at the rate of about 2 threads per second. The new threads process from the front of the queue, so this must happen 190 times before any async code gets executed.

Finally, after 95 seconds, 190 threads have been spawned. The 191st thread starts executing async code, and flies through the backlog of async code. All of the outstanding requests complete very quickly.
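A standalone repro of that behaviour looks something like this (numbers and timings are illustrative and depend on runtime version and thread-pool settings):

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

class StarvationRepro
{
    static async Task<int> DoAsyncWork()
    {
        await Task.Delay(100);   // the continuation needs a free thread-pool thread
        return 42;
    }

    static void Main()
    {
        var sw = Stopwatch.StartNew();
        var requests = new Task[200];
        for (int i = 0; i < requests.Length; i++)
        {
            // Each "request" blocks a thread-pool thread on the async result, so the
            // continuations queue up behind the blocked callers.
            requests[i] = Task.Run(() => DoAsyncWork().GetAwaiter().GetResult());
        }
        Task.WaitAll(requests);
        // With blocking callers this completes at roughly the thread-injection rate
        // (a couple of new threads per second), not in ~100 ms.
        Console.WriteLine(sw.Elapsed);
    }
}
```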

I have no idea how to fix this besides making Sync code use Sync paths all the way down (a huge effort). Any ideas?

@bgrainger
Member

Here's my theory on what's happening

Makes sense; I've seen that before. (Pretty sure I've seen some good blog posts on it too, but can't find one right now.)

I have no idea how to fix this besides for making Sync code use Sync paths all of the way down (a huge effort). Any ideas?

One thing about await is that it "returns" immediately if the awaited task is already completed. So theoretically, if a sync Socket API was used at the lowest level, the regular call stack would unwind as though everything were synchronous. I've made some comments on this before at #21.

However, if the sync sockets API still needs to schedule some code on a threadpool thread, it won't help.
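A small demo of the "completed task unwinds synchronously" point (illustrative only):

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class CompletedAwaitDemo
{
    static async Task<int> ReadLengthAsync()
    {
        // Task.FromResult is already completed, so this await does not yield:
        // execution continues on the same thread with no scheduling.
        int length = await Task.FromResult(42);
        Console.WriteLine($"after await: thread {Thread.CurrentThread.ManagedThreadId}");
        return length;
    }

    static void Main()
    {
        Console.WriteLine($"caller:      thread {Thread.CurrentThread.ManagedThreadId}");
        // Because every awaited task is already completed, this call never waits on
        // work queued to another thread.
        Console.WriteLine(ReadLengthAsync().GetAwaiter().GetResult());
    }
}
```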

@caleblloyd
Contributor Author

One thing about await is that it "returns" immediately if the awaited task is already completed. So theoretically, if a sync Socket API was used at the lowest level, the regular call stack would unwind as though everything were synchronous.

This approach works.

I've got some WIP where I've added an asyncIO argument to MySqlSession.OpenSocketAsync, PacketTransmitter.DoSendAsync, and PacketTransmitter.ReceivePacketAsync

I am able to run Sync at 500RPS for 10 seconds using this WIP.

Now what I don't like about the WIP is that I've exposed a bool asyncIO parameter on every single function that eventually touches the socket. It gets very confusing.

What are your thoughts on putting the MySqlSession into a "Sync Mode", running the synchronous operation through the PacketTransmitter with asyncIO=false, waiting for it, then putting the session back into "Async Mode"?

@bgrainger
Member

Options that come to mind:

  • Write synchronous versions of every method instead of having X delegate to XAsync. Cons: very high code duplication, doubles testing overhead, likely to lead to bugs.
  • Add asyncIO parameter to every method. Pros: made explicit everywhere. Cons: "Infects" the whole code base, less obvious what a boolean parameter means when reading the code.
  • Add "SyncMode" field to MySqlSession. Pros: less "invasive". Cons: Makes XAsync methods behave synchronously depending on state in a different class.

None of the options is particularly appealing, but I think the "Sync Mode" approach is worth trying.

@caleblloyd
Contributor Author

Great, I'll work up a "Sync Mode" solution (unless you want to). May take me a few days.

@bgrainger
Member

Add "SyncMode" field to MySqlSession.

Another "con" for this is that it couldn't be used before a MySqlSession is established, but we might want synchronous I/O for the Open method. That pushes me back to adding an asyncIO parameter for consistency. It also seems cleaner overall to not have the behaviour of code "mysteriously" affected by "global" state.

@bgrainger
Member

Either way, we'd want to introduce more ValueTasks to avoid allocating Tasks on the synchronous code path. (But that would be best as a second, separate commit.)

@bgrainger
Member

we'd want to introduce more ValueTasks

But we can't in methods that use await until dotnet/roslyn#7169 is available.

@bgrainger
Member

@caleblloyd I decided to work on the "asyncIO parameter" approach locally, just to see how big the changes were. It actually wasn't as bad as I was expecting, so I've pushed it here for discussion (and comparison with your WIP): ae30b7f

I haven't actually yet run a stress test to verify that work isn't blocked when the threadpool is exhausted.

doubles testing overhead

This is going to be true for any approach, since we ultimately need to verify that both the complete sync and async code paths work correctly. The most critical test cases will probably be the ones where the code needs to loop to send/receive payloads larger than a single MySQL packet; I need to add variants of those tests for both sync and async.

@caleblloyd
Contributor Author

@bgrainger I like it, the use of the enum makes things less confusing. ReadAsync(asyncIO) was much more confusing due to the dual meanings of the word async.
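For illustration, the enum-based shape under discussion looks roughly like this (hypothetical names and stubs; the actual commit may differ):

```csharp
using System.Threading;
using System.Threading.Tasks;

// Hypothetical enum making the I/O mode explicit at every call site.
enum IOBehavior
{
    Synchronous,
    Asynchronous,
}

class ExamplePacketTransmitter
{
    public async Task<int> ReceivePacketAsync(byte[] buffer, IOBehavior ioBehavior, CancellationToken token)
    {
        if (ioBehavior == IOBehavior.Synchronous)
        {
            // Blocking read on the calling thread; the method returns an
            // already-completed task, so awaiting callers don't hop threads.
            return ReceiveCore(buffer);
        }
        return await ReceiveCoreAsync(buffer, token);
    }

    // Stand-ins for the real socket reads.
    private int ReceiveCore(byte[] buffer) => buffer.Length;
    private Task<int> ReceiveCoreAsync(byte[] buffer, CancellationToken token) => Task.FromResult(buffer.Length);
}
```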

I am curious to know, will Synchronous IO requests that are in progress get interrupted by a CancellationToken? I was trying to figure out how to interrupt a Synchronous OpenSocket function.

We should revert to the Connection Timeout Logic that you wrote in:

@bgrainger
Member

I am curious to know, will Synchronous IO requests that are in progress get interrupted by a CancellationToken?

The synchronous Socket API predates CancellationToken so: not by default. We would have to register a callback and Dispose the socket. However, this probably isn't what the client wants, most of the time. I would assume someone cancelling a query wants to abort that query, but leave the connection (and socket) itself alive.

It's probably useful to cancel socket I/O during Open/OpenAsync, but that might be it?
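For the record, the register-a-callback-and-dispose approach would look roughly like this (illustrative sketch, not the library's code):

```csharp
using System;
using System.Net.Sockets;
using System.Threading;

static class CancellableConnect
{
    public static void Connect(Socket socket, string host, int port, CancellationToken token)
    {
        // Synchronous Socket calls don't accept a CancellationToken, so the only
        // lever is to dispose the socket, which makes the blocked call throw.
        using (token.Register(socket.Dispose))
        {
            try
            {
                socket.Connect(host, port);
            }
            catch (Exception) when (token.IsCancellationRequested)
            {
                // The blocked Connect fails with SocketException/ObjectDisposedException
                // once the socket is disposed; surface it as a cancellation.
                throw new OperationCanceledException(token);
            }
        }
    }
}
```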

@caleblloyd
Contributor Author

caleblloyd commented Sep 27, 2016

It's probably useful to cancel socket I/O during Open/OpenAsync, but that might be it?

  • Socket Open/OpenAsync needs to cancel Socket IO in the case of ConnectionTimeout
  • Socket Send/SendAsync and Receive/ReceiveAsync need to cancel Socket IO in the case of DefaultCommandTimeout (I don't think we implement this yet)

From the MySQL Driver docs:

MySQL Connector/Net 6.2 introduced timeouts that are aligned with how Microsoft handles SqlCommand.CommandTimeout. This property is the cumulative timeout for all network reads and writes during command execution or processing of the results. A timeout can still occur in the MySqlReader.Read method after the first row is returned, and does not include user processing time, only IO operations. The 6.2 implementation uses the underlying stream timeout facility, so is more efficient in that it does not require the additional timer thread as was the case with the previous implementation.

It sounds like Oracle's implementation of CommandTimeout is setting Socket.SendTimeout and Socket.ReceiveTimeout. They start at the timeout value (30 seconds) and subtract only time spent in I/O. This becomes the send or receive timeout for the next socket command.
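A sketch of that cumulative-budget idea, where only time spent in I/O counts against the timeout (illustrative names, not the library's implementation):

```csharp
using System;
using System.Diagnostics;
using System.Net.Sockets;

// Tracks the remaining command timeout, charging it only for blocking socket I/O.
sealed class IoTimeoutBudget
{
    private readonly int m_totalMilliseconds;
    private int m_usedMilliseconds;

    public IoTimeoutBudget(int commandTimeoutSeconds) =>
        m_totalMilliseconds = commandTimeoutSeconds * 1000;

    public int Receive(Socket socket, byte[] buffer)
    {
        int remaining = m_totalMilliseconds - m_usedMilliseconds;
        if (remaining <= 0)
            throw new TimeoutException("Command timeout expired.");

        // Set the per-call timeout to whatever is left of the overall budget.
        socket.ReceiveTimeout = remaining;
        var sw = Stopwatch.StartNew();
        try
        {
            return socket.Receive(buffer);
        }
        finally
        {
            // Only time spent blocked in the receive counts against the budget.
            m_usedMilliseconds += (int)sw.ElapsedMilliseconds;
        }
    }
}
```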

If a timeout is hit, can we reset the session and keep it in the pool?

Also, in the case of ConnectionTimeout, will Socket.ReceiveTimeout throw an exception on a connection if no response is received for the Socket.Open request?

I'll open another issue for DefaultCommandTimeout and CommandTimeout

@bgrainger
Member

We should revert to the Connection Timeout Logic

Done in d65a3ae. This was a prerequisite for switching OpenSocketAsync to synchronous I/O (in 9f5ea21).
