Async opening of connections in parallel is slow/blocking #601
Comments
@saurabh500 Thank you for your comment on issue #408; I replied here so I can refer to the code above. I do understand your point: there are valid concerns that warrant rate-limiting connections. But I don't think they fully apply to our case.

The main problem for us is that we are opening connections over a high-latency link to SQL Server (in this case SQL Azure, with a round-trip latency of over 100 ms between SQL Azure and our web server). The vast majority of the waiting is due to network latency, not to load on the SQL server. In our scenario, opening a single connection takes nearly one second (see above). This means that while our web server has been optimized to start (or restart) in less than 3 seconds, it takes several minutes until it can open enough connections to serve the users who have connected in the meantime. During this time, only roughly one connection per second is being opened! Of course SQL Azure is happy to open 10-100 times more connections per second, as is quickly proven by connecting to SQL Azure from within the same datacenter.

Maybe the rate limiting should apply to the total number of connections opened, and be less strict about how many connections are being opened concurrently. Note that even in non-pooled mode only 8 connections at a time seem to be getting opened, so we did not find a good workaround. In non-pooled mode we can therefore only open about 8 connections per second!
@Wraith2 I saw your comment on #408. I answered it here so I can relate it to the example above. In our application we have hundreds of users on a single server and we are able to feed 10-20 queries at once. However, for our servers in locations with high network latency to SQL Azure, we need many more connections to keep the same throughput.

Assume a round-trip network latency of 2 ms in a local datacenter and a query time of 2 ms, so the total time from the client's perspective is 4 ms. Now assume a location with a round-trip latency of 78 ms; the total time from the client's perspective becomes 80 ms. To achieve the same number of queries executed per second on SQL Server, we need 20 times as many connections, because the vast majority of the time is spent waiting on network latency. So where 10 connections might suffice in a low-latency environment, one might need 200 connections in a high-latency environment to achieve the same throughput. (Note that this number of open connections definitely does not overwhelm SQL Azure.)

That is why it would be great to: 1. open new connections at a much faster rate, and 2. open connections in a truly async way (no thread blocking).
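The connection math above can be checked with a back-of-the-envelope calculation (by Little's law, connections needed ≈ target queries per second × time each query holds a connection). This is a small illustrative sketch, not part of the repro program; the latency and throughput figures are the example numbers from the comment.

```csharp
using System;

class ConnectionMath
{
    // Connections needed = throughput (queries/sec) x time a query holds a
    // connection (network round trip + server-side query time), in seconds.
    public static int ConnectionsNeeded(double queriesPerSecond, double roundTripMs, double queryMs)
        => (int)Math.Ceiling(queriesPerSecond * (roundTripMs + queryMs) / 1000.0);

    static void Main()
    {
        // Local datacenter: 2 ms network + 2 ms query = 4 ms per call.
        Console.WriteLine(ConnectionsNeeded(2500, 2, 2));   // 10
        // Remote region: 78 ms network + 2 ms query = 80 ms per call.
        Console.WriteLine(ConnectionsNeeded(2500, 78, 2));  // 200
    }
}
```

The same target throughput thus needs 20 times the connections when per-call latency is 20 times higher.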
@OQO I was able to reproduce the issue. I took the 4-second task delay out and replaced it with an actual query; the result was better afterwards, but as you mentioned, the comparison between pooled and non-pooled connections is not what we expect it to be. We will look into the issue and respond back soon.
@OQO A question to understand this better: why are we clearing the pool each time? Clearing the pool forces the driver to create another pool each time, and that is time consuming. If we only clear the pool before the non-pooled run, as below, it makes a huge difference in reusing pooled connections. Is there a specific reason for clearing the pool?
In this test we want to compare two scenarios of opening multiple connections over a high-latency network at application startup: 1. pooled and 2. non-pooled. To compare these two scenarios fairly, we reset the connection pools between the experiments. If you prefer, you could make separate executables for the two scenarios (the results will be the same).

The WARM UP phase just measures a baseline so that we know how long it takes to set up a single connection. Opening a single connection is done multiple times, since the first time is always slower due to loading the required assemblies etc.; doing it twice more shows a stable time, which proves that our measurements are no longer influenced by assembly loading. Hope this clarifies it!
@OQO I made some changes and the code seems to run fine. Could you kindly make the changes below and test your application again:
Here is the result I got with this run:
This is the updated version of the test program based on your feedback. Min and max pool sizes have been set, and a query is executed on the database. The query consists of a wait so that we can reliably test the connection issues. Tests are run with query times between 0.5 and 3.5 seconds. Please set your own connection strings in the code below. Note that in your last experiment above you were connecting to a database over a low-latency network connection. If you are in the USA, try connecting to a database in East Asia, for example, to check the high-latency scenario.

using Microsoft.Data.SqlClient;
using System;
using System.Collections.Concurrent;
using System.Diagnostics;
using System.IO;
using System.Threading.Tasks;
using System.Transactions;
namespace ConsoleExperiments
{
class Program
{
static void Main(string[] args)
{
TestConnections().Wait();
}
static async Task TestConnections()
{
// Low latency connection
string lowLatencyConnectionString = "Server=tcp:XXXX.database.windows.net,1433;Initial Catalog=XXXXs;Persist Security Info=False;User ID=XXXX;Password=XXXX!;MultipleActiveResultSets=False;Encrypt=True;TrustServerCertificate=False;Connection Timeout=30;";
// High latency connection
string highLatencyConnectionString = "Server=tcp:XXXX.database.windows.net,1433;Initial Catalog=XXXX;Persist Security Info=False;User ID=XXXXr;Password=XXXX;MultipleActiveResultSets=False;Encrypt=True;TrustServerCertificate=False;Connection Timeout=30;";
string[] connectionStrings = new string[] { lowLatencyConnectionString, highLatencyConnectionString };
string connectionType = "LowLatency";
string csv = "";
foreach (string connectionString in connectionStrings)
{
TimeSpan queryTime = TimeSpan.FromSeconds(0.5);
for (int i = 0; i < 7; i++)
{
(TimeSpan pooledTime, TimeSpan nonPooledTime) = await SingleRun(connectionString, queryTime);
csv += $"{connectionType},{queryTime.TotalSeconds},{pooledTime.TotalSeconds},{nonPooledTime.TotalSeconds}{Environment.NewLine}";
queryTime += TimeSpan.FromSeconds(0.5);
}
connectionType = "HighLatency";
}
File.WriteAllText("ConnectionTest.csv", csv);
Console.WriteLine("\nTesting finished");
Console.ReadLine();
}
private static async Task<(TimeSpan pooledTime, TimeSpan nonPooledTime)> SingleRun(string connectionString, TimeSpan queryTime)
{
connectionString += "Min Pool Size=200;Max Pool Size=500;";
Console.WriteLine("WARM UP");
await MeasureSingleConnectionAndReuse(connectionString);
ClearPools();
await MeasureSingleConnectionAndReuse(connectionString);
ClearPools();
await MeasureSingleConnectionAndReuse(connectionString);
ClearPools();
Console.WriteLine("\n\nCONCURRENT POOLED CONNECTIONS");
TimeSpan pooledTime = MeasureParallelConnections(connectionString + "Pooling=true;", queryTime);
ClearPools();
Console.WriteLine("\n\nCONCURRENT NON-POOLED CONNECTIONS");
TimeSpan nonPooledTime = MeasureParallelConnections(connectionString + "Pooling=false;", queryTime);
ClearPools();
return (pooledTime, nonPooledTime);
}
private static void ClearPools()
{
SqlConnection.ClearAllPools();
Console.WriteLine("ALL POOLS CLEARED");
}
static ConcurrentDictionary<Guid, object> _connectionIDs = new ConcurrentDictionary<Guid, object>();
private static TimeSpan MeasureParallelConnections(string connectionString, TimeSpan queryTime)
{
Console.WriteLine("Start delay OpenAsync time Connection ID ReusedFromPool");
Stopwatch sw = new Stopwatch();
sw.Start();
int numOpens = 100;
Task[] tasks = new Task[numOpens];
Stopwatch start = new Stopwatch();
start.Start();
for (int i = 0; i < numOpens; i++)
{
tasks[i] = MeasureSingleConnection(i, start, connectionString, queryTime);
}
Task.WaitAll(tasks);
start.Stop();
Console.WriteLine($"{sw.Elapsed} {numOpens} connections opened in parallel");
return start.Elapsed;
}
private static async Task MeasureSingleConnection(int index, Stopwatch start, string connectionString, TimeSpan queryTime)
{
TimeSpan startDelay = start.Elapsed;
Stopwatch sw = new Stopwatch();
using (SqlConnection connection = new SqlConnection(connectionString))
{
sw.Start();
await connection.OpenAsync();
Console.WriteLine($"{startDelay} {sw.Elapsed} {index} {connection.ClientConnectionId} {IsReuse(connection)}");
//await Task.Delay(4000);
ExecuteQuery(connection, queryTime);
}
}
private static async Task MeasureSingleConnectionAndReuse(string connectionString)
{
Stopwatch sw = new Stopwatch();
using (SqlConnection connection = new SqlConnection(connectionString))
{
sw.Start();
await connection.OpenAsync();
Console.WriteLine($"{sw.Elapsed} {connection.ClientConnectionId} {IsReuse(connection)} Single open time ");
}
using (SqlConnection connection = new SqlConnection(connectionString))
{
sw.Restart();
await connection.OpenAsync();
Console.WriteLine($"{sw.Elapsed} {connection.ClientConnectionId} {IsReuse(connection)} Single open time with one previously opened connection");
}
}
private static bool IsReuse(SqlConnection connection)
{
return !_connectionIDs.TryAdd(connection.ClientConnectionId, null);
}
private static void ExecuteQuery(SqlConnection connection, TimeSpan queryTime)
{
SqlCommand command = connection.CreateCommand();
command.CommandText = $"WAITFOR DELAY '{queryTime:hh\\:mm\\:ss\\:fff}';";
command.ExecuteNonQuery();
}
}
}
Here are the results of a single run (please run it multiple times to get smoother curves). Over the low-latency network connection, opening a single connection takes around 200 ms; over the high-latency connection, around 1000 ms. The results still show that the non-pooled approach is faster in almost all scenarios. Even opening one connection at a time in pooled mode, when connecting to a database over a high-latency connection, is shown to be significantly slower than using non-pooled mode!

As a separate second issue, note that in both modes the opening of connections is done by blocking threads, not by awaiting the network traffic. This needlessly blocks threads instead of awaiting internally. Let me know if you would like me to file a separate issue for this.
@OQO Thanks for the update. Just a quick test recommendation: |
There is another interesting issue here. It only happens with delays. Right now the server waits for the queryTime. When I change the query to a buffer, for example
the code runs as expected again. I will look into whether a single connection is made and the pool is created properly, but a clarifying question: should we put the delay in the code task or on the server? I will take a deeper look tomorrow.
You can also change the query time to 50 milliseconds or a similar value. This will lead to the same results as the query you are substituting for the wait. I prefer to put wait times in so that we get more reliable query times during testing. The problem observed in the graphs only occurs if we keep the connections busy. If the connections are only busy with queries for very short amounts of time, then they can be reused more quickly, and there will be no problems. However, in many real-life scenarios (like ours), queries take more than 500 ms to complete.
Thanks to @OQO for raising this issue and the comment
We ran into an issue in production when switching from System.Data.SqlClient to Microsoft.Data.SqlClient while upgrading to .NET Core 3.1, where we had some sync-over-async code that we had not had time to switch over completely yet (bad, I know). The end result was that when a new server came up under load, a whole bunch of threads got blocked waiting on the connections to be created for the pool, and the number of threads consumed continued to increase on our servers and never recovered (partly also because our sync-over-async increased the number of blocked threads). We are working on going async all the way, but even as we make progress we can still see this issue. We used dotnet-counters to look at the SQL connections, and the active connections ended up pegged at the max and never freed up. The retrieved/released numbers dropped really low when we saw this problem as well. Thanks to @JRahnama for the suggestion to warm up the connection pools with Min Pool Size - that resolved the issue for us. We're now testing what those numbers should be in our multi-tenant environment and how they affect our SQL servers.
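The Min Pool Size warm-up workaround described above can be sketched as follows. This is a minimal illustration, not production code: the connection string values are placeholders, and it assumes the documented SqlClient behavior that once a pool is created, a background process ramps it up toward Min Pool Size.

```csharp
using Microsoft.Data.SqlClient;

class PoolWarmup
{
    // Open (and dispose) one connection at startup so the pool is created
    // early; the pooler then fills the pool toward Min Pool Size in the
    // background instead of paying the open latency on user requests.
    static void WarmUp(string baseConnectionString)
    {
        string cs = baseConnectionString + "Min Pool Size=100;Max Pool Size=500;";
        using (var connection = new SqlConnection(cs))
        {
            connection.Open();
        }
    }
}
```

The right Min Pool Size value depends on the workload; as noted above, it may need tuning per environment (and per tenant in multi-tenant setups).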
* Turned OFF connection pooling on SQL Server connection string for tests - Found several discussions on this, e.g. dotnet/SqlClient#601 - Tried turning off pooling, and it solved the problem... * Upgraded SQL server docker image used in test from 2017-latest to latest (2019 at the time) - didn't make any difference, just to stay current.
It looks like this bug might be significantly improved by #2384?
@benrr101 That definitely sounds very promising for resolving the blocking part of async connection opening; looking forward to testing it once it is available. Note that the changes in #2384 will not solve the other core issue: the connection pool being too conservative about opening new connections in high-latency scenarios between the client and SQL Server.
I am putting some debugging notes here. I looked into the DbConnectionPool code for async scenarios and put a bunch of stopwatches inside to understand what is going on. I had set out to get the following answers:
Before I explain my findings so far, here is how the connection establishment part of connection pool works
Findings
After a request is processed, the thread sets the Task completion
After this investigation, I know that the connections are drained very quickly (I tried this with an Azure SQL DB instance in East US, with my client in Redmond, to simulate the high-latency scenario). The DB is a free-tier DB, to make sure that we don't get any preferential treatment from the server with resource governance.

Ask: any suggestions for debugging techniques to further investigate the context switch? I tried to take PerfView traces and got lost; I had no idea where to investigate. Are there better techniques in modern C# to make sure we can avoid the expense of context switching between threads? I understand that writing the code as async/await, and addressing async scenarios nicely and gracefully, makes sense, but I was wondering if there are cheaper ways to optimize the connection pool in the context of my investigation, before a more expensive engineering exercise is taken on.

Leaving some references here in code, about the spinning up of the background thread that processes connection requests in async Open: SqlClient/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/ProviderBase/DbConnectionPool.cs Lines 1151 to 1156 in 5cb73fd
Few edits:
@Wraith2 as the person with the most perf experience here. Any thoughts?
@saurabh500
Just a correction: I don't think I need to debug the context switch. Sorry, I should have mentioned this earlier. @David-Engel Re: "Open the floodgates": my take is that we make the connection pool configurable, with one connection opened at a time by default, but with that behavior configurable by the app developer. I know of both kinds of users, including ones who are happy with the current behavior, and we wouldn't want any side effects of a behavior change on their SQL connectivity. I still don't have a point of view on whether this opt-in switch should be a number saying how many connections to open in parallel, or a boolean that says, "just open what I am asking for".
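The opt-in knob described above could be sketched as a semaphore capping how many physical connections may be opened in parallel. This is a hypothetical illustration, not SqlClient's actual implementation: `ParallelOpenGate` and `openPhysicalConnection` are invented names, and a count of 1 would reproduce today's one-at-a-time async behavior.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

class ParallelOpenGate
{
    private readonly SemaphoreSlim _openGate;

    // maxParallelOpens = 1 mimics the current async behavior;
    // larger values let more physical opens proceed concurrently.
    public ParallelOpenGate(int maxParallelOpens) =>
        _openGate = new SemaphoreSlim(maxParallelOpens, maxParallelOpens);

    // openPhysicalConnection stands in for the pool's real connection factory.
    public async Task<T> OpenAsync<T>(Func<Task<T>> openPhysicalConnection)
    {
        await _openGate.WaitAsync().ConfigureAwait(false);
        try
        {
            return await openPhysicalConnection().ConfigureAwait(false);
        }
        finally
        {
            _openGate.Release();
        }
    }
}
```

A boolean "open everything I ask for" switch would simply be the degenerate case of a very large semaphore count.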
I've tried to understand the pooling implementation a few times and always given up. It's very complex and confusing. My general understanding matches the description @saurabh500 gives in #601 (comment) earlier. Overall I agree that:
@roji has mentioned a few times in the past that npgsql moved to a lock free pool implementation and had good speed increases. This might be worth investigating. Engineering a new pool entirely and switching to it with a strategy pattern approach is likely to be simpler than trying to carefully tweak new behaviours onto the existing pool implementation. |
Yeah, we had a custom lock-free implementation which itself wasn't trivial; but we switched to a much simpler approach based on System.IO.Channels, where the "idle connector list" is simply a Channel (aka queue) where renting out a connector is implemented as dequeueing, and returning a connection to the pool is enqueueing. It really isn't very complex and performs well.
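The "idle list as a Channel" idea described above can be sketched in a few lines. This is a minimal illustration, not Npgsql's or SqlClient's actual code: `Connector` is a stand-in type, and a real pool would also handle sizing, pruning, and opening new physical connections when below Max Pool Size.

```csharp
using System.Threading.Channels;
using System.Threading.Tasks;

class Connector { /* stand-in for one physical connection */ }

class ChannelPool
{
    private readonly Channel<Connector> _idle =
        Channel.CreateUnbounded<Connector>();

    // Renting a connector is just dequeueing from the channel;
    // the read waits asynchronously if no idle connector is available.
    public ValueTask<Connector> RentAsync() =>
        _idle.Reader.ReadAsync();

    // Returning a connector is just enqueueing it back.
    // TryWrite always succeeds on an unbounded channel.
    public void Return(Connector connector) =>
        _idle.Writer.TryWrite(connector);
}
```

The appeal of this design is that the channel handles the waiting and hand-off between renters and returners without explicit locks.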
I tend to agree... The only possible problem is various "special" behaviors implemented in the current pool that would need to be carried over (or consciously dropped) - that would require at least some level of understanding of how/what the current pool implementation does... |
Or.. a decision and documentation that the new pool strategy does not get configured by the old pool configuration |
@Wraith2 can you chime in with some examples of what is considered "special" behavior? |
@roji we have been looking into Channels as an option for pools as well (inspired by our conversations). We will start posting some design thoughts for discussion as we form points of view. |
Not really; roji mentioned them, I was just responding. My pool configuration tends to start and end with size.
Channels aren't going to help here because the problem isn't how connections are pooled, but how they're opened. SqlClient/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/ProviderBase/DbConnectionPool.cs Lines 1108 to 1132 in 5cb73fd
The main takeaway here is Async open: SqlClient/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/SqlConnection.cs Lines 1688 to 1711 in 5cb73fd
Sync open: SqlClient/src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/SqlConnection.cs Line 1411 in 5cb73fd
I have no idea why someone would want synchronous open to open physical connections concurrently but async open to open them sequentially, but that's the current behavior.
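The cost of that sequential behavior at high latency can be illustrated with a toy timing sketch (using Task.Delay as a stand-in for a ~100 ms network handshake; none of this is SqlClient code): awaiting opens one at a time costs roughly N × latency, while starting them all and awaiting Task.WhenAll costs roughly one latency.

```csharp
using System;
using System.Diagnostics;
using System.Threading.Tasks;

class OpenConcurrency
{
    // Stand-in for a physical open over a 100 ms round-trip link.
    static Task FakePhysicalOpen() => Task.Delay(100);

    static async Task Main()
    {
        var sw = Stopwatch.StartNew();
        for (int i = 0; i < 10; i++)
            await FakePhysicalOpen();           // one at a time: ~1000 ms total
        Console.WriteLine($"Sequential: {sw.ElapsedMilliseconds} ms");

        sw.Restart();
        var opens = new Task[10];
        for (int i = 0; i < 10; i++)
            opens[i] = FakePhysicalOpen();      // all in flight at once
        await Task.WhenAll(opens);              // ~100 ms total
        Console.WriteLine($"Concurrent: {sw.ElapsedMilliseconds} ms");
    }
}
```

This is exactly the gap observed in the measurements earlier in the thread: roughly one pooled connection opened per second over a ~1 s-open link, versus what the network could sustain concurrently.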
You're welcome to take a look at Npgsql's implementation here. Of course, as @vonzshik points out, the actual pool implementation is only part of this larger puzzle, and the problem specifically tracked by this issue may be outside of the pool per se. |
Nothing much to add here, but great to see this happening! |
Describe the bug
Multiple connection opens are very slow and appear to be blocking. In pooled mode only a very limited number of new connections is opened concurrently (in the example below, only 1 at a time). In non-pooled mode more connections are opened concurrently (in the example below, 8 at a time).
The issue has a big impact when connecting to SQL Azure from a location with a higher-latency connection. It has no measurable impact when connecting to SQL Azure from within the same datacenter, due to the very fast connection times there.
This issue greatly impacts the startup time of our servers: it takes up to several minutes before our web servers can handle all requests, due to the slow opening of SQL connections.
To reproduce
Expected behavior
The code first connects sequentially three times to get a good base line of the OpenAsync times of a new connection and a reused connection from the pool.
Then it opens 100 connections in parallel, first in pooled mode, then in non-pooled mode. It keeps every connection open for 4 seconds to simulate a slow query (note that the same problems are observed even with lower query times). Note that the application's thread count stays low the whole time, so this is not a thread pool issue.
In the parallel runs we log the time since the start of the experiment to ensure that delays are not due to slow thread pool scheduling. All these numbers are very low and show that all OpenAsync calls are more or less done at the same time. Note that CPU usage is also extremely low during these tests.
The second displayed time shows how long it took for OpenAsync to return.
Note that running the total test of 100 queries in non-pooled mode is on average twice as fast as in pooled mode, which is very unexpected.
Even in non-pooled mode, the times suggest connections are created in batches (in this case of eight) that block the creation of new connections.
In pooled mode only one new connection to SQL Server seems to get opened at a time.
In pooled mode we observe one worker thread at a time working (and mostly waiting/blocking) on opening new connections. In non-pooled mode we observe 8 threads at a time working on new connections (and mostly waiting/blocking). It seems that the internal OpenAsync code for opening connections blocks worker threads instead of using async to wait for the network. Note, however, that this does not seem to be the root cause of the slowness in opening new connections; otherwise we would see an attempt to use more threads to open connections.
Further technical details
Microsoft.Data.SqlClient version: 2.0.0-preview4.20142.4 (same problem on 1.1.3 and on System.Data.SqlClient)
.NET target: .NET Core 3.1 (same issue on .NET Framework)
SQL Server version: SQL Azure
Operating system: Windows 10 (But also on Windows Server editions and on Ubuntu)
Additional context
Results obtained from test run: