CancellationToken registration leak, regression #1810
Comments
Hi @tlecomte, thanks for bringing this to our attention. We'll look into it soon.
Thanks @lcheunglci.
There's a fairly simple way to solve it: use a mutable ref struct to hold the registration, and let the context Take() it if needed. If the struct still has a value when it is disposed, it disposes the value; if not, the registration has been taken and is now owned by the context, which will clean it up.
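A minimal sketch of that idea, with hypothetical names (this is the shape of the suggestion, not the actual SqlClient code):

```csharp
using System.Threading;

// Owns a CancellationTokenRegistration until someone explicitly takes it.
internal ref struct RegistrationGuard
{
    private CancellationTokenRegistration? _registration;

    public RegistrationGuard(CancellationTokenRegistration registration)
    {
        _registration = registration;
    }

    // Transfers ownership to the caller (e.g. the async context); throws if already taken.
    public CancellationTokenRegistration Take()
    {
        CancellationTokenRegistration value = _registration.Value;
        _registration = null;
        return value;
    }

    // If nothing took the registration, dispose it so it never outlives the call.
    public void Dispose()
    {
        _registration?.Dispose();
        _registration = null;
    }
}
```

A call site would wrap the registration in `using var guard = new RegistrationGuard(token.Register(s => ((SqlCommand)s!).Cancel(), command));`, and only the path that actually goes asynchronous would call `guard.Take()`; every early return or exception then disposes the registration automatically.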
I'm having this issue as well. After moving to 5.0.1, my Linux server would leak memory rapidly until crashing. Analyzing a memory dump revealed millions of instances of SqlCommand rooted only by the cancellation token registration (CallbackNode). Even after just a couple of hours, I can confirm that reverting to 5.0.0 has solved the problem. For an in-depth analysis, including dump screenshots, system configuration, and other information, please see this issue:

(screenshots: Azure App Service - Linux - Memory Working Set, last 30 days and last 3 days)

I recommend marking 5.0.1 as unstable on NuGet for three reasons:
Same here.
It's not really a leak. A leak would be if we somehow had unreferenced memory filling the heap and being unreclaimable; this is unwanted retention. It's also not how it used to work, granted. I do find it odd that you're seeing so big a problem. From my understanding of the change that caused it, the retention should only happen if you pass a CancellationTokenSource into a query execution that then returns before any query is required. Doesn't that mean that large quantities of your queries are pointless, because they're not going to succeed?
I have not been able to understand how the bug affects my code yet. The code snapshot was taken after about an hour of running. There were 400 retained instances of SqlCommand, yet there were about 4.7 million CallbackNode instances at the time of the snapshot (see here for more details). The cancellation token for these references has an application lifetime and is never signaled. The code runs periodically and executes a few queries, perhaps several. Any exceptions in any of this code would have been logged; as far as I know, all the code works just fine. So I don't really see how 4 million of anything could possibly have been executed in that timeframe, unless it was individual packets of data, I suppose, and even that would be a stretch. However, 400 SQL calls over an hour for this code sounds about right. The rest of the codebase stems from ASP.NET Core and seems to be unaffected; perhaps ASP.NET Core releases the reference to the CancellationTokenSource after each request is finished. It's also possible that all the code is affected and I am simply interpreting the memory dump incorrectly. Note that all my data access goes through the linq2db data access layer; I do not use EF in this project, nor do I access SqlClient directly. I'd be happy to help diagnose this further, if I can. So far I have not been able to create a small code sample that reproduces the bug.
You could see if you can reproduce the issue with the artifacts from the PR I opened to fix this, #1818. If you can, I'll need to rework it; otherwise we just wait for it to be merged and released.
We are planning to release the next preview on Tuesday the 8th. If you can provide the feedback before that date, we will be able to include this PR; otherwise it will go to next year, after the GA release of 5.1.
@JRahnama any chance of getting the TimeOnly/DateOnly PR in before the next preview, then?
It is included in the preview2 milestone and will hopefully be merged for the next preview release, unless reviewers find some odd behaviors/regressions in the changes.
I installed the NuGet package and will report tomorrow with a memory usage graph. I can verify that it is loaded in memory on my production server, as this code now returns "5.00-pull.55891" in the live environment:

```csharp
private string GetSqlClientVersion()
{
    // pull a linq2db data context from DI
    using var scope = _serviceScopeFactory.CreateScope();
    var db = scope.ServiceProvider.GetRequiredService<HiveDataContext>();
    // create a DbConnection instance
    using var dbConnection = db.DataProvider.CreateConnection(db.ConnectionString!);
    // get the "ProductVersion" from the assembly that provides the connection type
    var assembly = dbConnection.GetType().Assembly;
    var fileVersionInfo = FileVersionInfo.GetVersionInfo(assembly.Location);
    return fileVersionInfo.ProductVersion;
}
```
I believe I have tracked down the problem. Any call to ReadAsync where SqlClient already has the data buffered leaks a cancellation token registration. This explains how there can be such drastic consequences from a small number of SQL calls; for example, this code might perhaps add a hundred or more "unwanted retentions", due to the high number of integer values that can be contained within a single data packet from the server:

```csharp
await db.Products.Select(x => x.Id).ToListAsync(token);
```

As before, I would strongly urge you to consider deprecating 5.0.1 immediately, for the reasons stated previously. Below is a repro sample of the code necessary to demonstrate the "unwanted retentions":

Sample:

```csharp
Console.WriteLine($"SqlClient version: {FileVersionInfo.GetVersionInfo(typeof(SqlCommand).Assembly.Location).ProductVersion}");

var cts = new CancellationTokenSource();
var token = cts.Token;

while (true)
{
    using (var sqlConnection = new SqlConnection("Server=.\\SQLEXPRESS;Initial Catalog=master;Integrated Security=true;"))
    {
        await sqlConnection.OpenAsync(token);
        using var cmd = sqlConnection.CreateCommand();
        cmd.CommandText = "SELECT n FROM (VALUES (0),(1),(2),(3),(4),(5),(6),(7),(8),(9)) v(n)";
        using var reader = await cmd.ExecuteReaderAsync(token);
        while (await reader.ReadAsync(token))
        {
            // ReadAsync executes 10 times for the 10 data values and one final time to notify the caller there is no more data.
            // Only the first call to ReadAsync actually awaits; the buffer contains data for the other 10 calls.
        }
    }
    GC.Collect(); // for demonstration only; removing this line does not change the output
    Console.WriteLine($"Lost registrations: {CountLostSqlCommands(token)}");
}
```
```csharp
static int CountLostSqlCommands(CancellationToken token)
{
    // Walk the private CallbackNode list of the CancellationTokenSource via reflection
    var cts = (CancellationTokenSource)typeof(CancellationToken).GetField("_source", System.Reflection.BindingFlags.Instance | System.Reflection.BindingFlags.NonPublic)!.GetValue(token)!;
    var registrationsField = typeof(CancellationTokenSource).GetField("_registrations", System.Reflection.BindingFlags.Instance | System.Reflection.BindingFlags.NonPublic)!;
    var registrations = registrationsField.GetValue(cts);
    if (registrations == null)
        return 0;
    var callbacksField = registrationsField.FieldType.GetField("Callbacks", System.Reflection.BindingFlags.Instance | System.Reflection.BindingFlags.Public)!;
    var callbackNode = callbacksField.GetValue(registrations);
    var callbackNodeType = callbacksField.FieldType;
    var nextField = callbackNodeType.GetField("Next", System.Reflection.BindingFlags.Instance | System.Reflection.BindingFlags.Public)!;
    var stateField = callbackNodeType.GetField("CallbackState", System.Reflection.BindingFlags.Instance | System.Reflection.BindingFlags.Public)!;
    int lostSqlCommands = 0;
    while (callbackNode != null)
    {
        // Each CallbackNode whose state is a SqlCommand is a registration that was never disposed
        var callbackState = stateField.GetValue(callbackNode);
        if (typeof(SqlCommand) == callbackState?.GetType())
        {
            lostSqlCommands += 1;
        }
        callbackNode = nextField.GetValue(callbackNode);
    }
    return lostSqlCommands;
}
```

Which produces the following:
With version 5.0.0:
With version 5.0.0-pull.55891:
As promised, here is the memory usage graph for 5.0.0-pull.55891:

(memory usage graph)

PR #1818 seems to solve the problem, although I question the logic of repeatedly registering a cancellation callback only to immediately remove it (the logic introduced in #1781). It would seem that the cancellation callback should only be registered once the short-circuit logic has been evaluated; see the sketch below. However, I have not done a full review of #1781 and #1065.
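In code terms, the suggested ordering is roughly the following, a sketch with hypothetical names rather than the actual SqlClient internals:

```csharp
using System.Threading;
using System.Threading.Tasks;

class ReorderedSketch
{
    private bool _dataAlreadyBuffered = true;             // stands in for "rows already in the packet buffer"
    private CancellationTokenRegistration _contextOwned;  // stands in for the context that disposes it later

    public Task<bool> ReadAsync(CancellationToken token)
    {
        if (_dataAlreadyBuffered)
        {
            // Short-circuit first: no registration is created on the synchronous path,
            // so there is nothing to retain or dispose.
            return Task.FromResult(true);
        }

        // Register only once the call is known to go asynchronous.
        _contextOwned = token.Register(static _ => { /* cancel the command */ }, null);
        return Task.FromResult(false);
    }
}
```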
So the fix is confirmed good. I'll investigate the position of the registration to see if it can be moved without regressing the fix that moved it up. The order of operations is clearly important for catching some timing-related behaviours, so while I agree it would be nice to register as late as possible, that may not be practical.
Thanks @Wraith2! Perhaps it would be wise to merge #1818 (and a similar PR for 5.0.x) as-is, so as not to hold up getting a fix out for this critical issue. We could then move the position of the registration as a performance enhancement in a later PR, if such a change is practical; I'll leave that to you. My only suggestion would be to add tests similar to my sample above, confirming that there are no unwanted retentions after calling ReadAsync in various scenarios; a sketch follows below. Please let me know if I can be of further assistance.
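For example, such a test might look like this, a minimal sketch assuming xUnit, the CountLostSqlCommands reflection helper from the sample above in scope, and an illustrative connection string:

```csharp
using System.Threading;
using System.Threading.Tasks;
using Microsoft.Data.SqlClient;
using Xunit;

public class CancellationRegistrationTests
{
    [Fact]
    public async Task ReadAsync_DoesNotRetainCancellationRegistrations()
    {
        var cts = new CancellationTokenSource();
        using var connection = new SqlConnection("Server=.\\SQLEXPRESS;Integrated Security=true;");
        await connection.OpenAsync(cts.Token);
        using var command = connection.CreateCommand();
        command.CommandText = "SELECT n FROM (VALUES (0),(1),(2)) v(n)";
        using (var reader = await command.ExecuteReaderAsync(cts.Token))
        {
            // Drain the reader: the buffered ReadAsync calls are the leaking path.
            while (await reader.ReadAsync(cts.Token)) { }
        }
        // Every registration created during execution should have been disposed by now.
        Assert.Equal(0, CountLostSqlCommands(cts.Token));
    }
}
```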
Is this going to be released in 5.0.2? I do not see a matching PR for the 5.0 servicing branch. I can post a PR if desired.
In 5.1.0-preview2, which will be released sometime next week; GA of 5.1.0 will be released near the end of the year.
So the plan is to leave 5.0.1 as-is, with a memory leak (unwanted retention) that occurs for almost every call to ReadAsync, affecting probably every single user of this version of the library, and some critically so?
May I suggest releasing a 5.0.2 anyway? You might want to reconsider ;) AFAIK NuGet automatically resolves the lowest version of a package which fits the constraints, so until a new EF Core SQL Server package is released referencing a new version of the client package (be it 5.0.2 or 5.1), everyone needs to manually reference the fixed package.
Sounds like a good idea. We'll have to make sure the EF Core team are aware that they might see some problems with the current version; I'm sure they'll pick up the later version as soon as it's available. /cc @ajcvickers @roji
Yep, we're very aware. I think it makes sense for an EF Core patch release to take a dependency on SqlClient 5.0.2 when that's released; we'll be discussing that.
We are seeing AKS (Kubernetes) pods crashing with OutOfMemory exceptions after upgrading to 5.0.1. Memory usage seems to increase by 10-20 MB for each processing run that involves EF Core. You really can't leave this bug unfixed until 5.1.0 is out. You know the reason and how to fix it, so please get together with the EF Core folks and release both 5.0.2 and an updated EF 7 package that requires 5.0.2 at the same time. Thank you.
@paalandreassen You can use 5.1 preview 2 today and add an explicit reference to it in your EF Core project. |
Is 5.1 preview 2 production ready? I really try to avoid preview releases in production. |
Yes, it looks like what is in now will be in 5.1: https://github.com/dotnet/SqlClient/projects/11 |
Any decision about a 5.0.2? |
@stevendarby 5.1 preview is out today, and 5.1 will appear beginning of next month. |
I know, but as per the above discussions, 5.0.2 would seem like a really good idea, and I wanted to know whether a decision was made about whether that was going to happen.
Just putting this here, as it's a test that seems to work.
FYI @CodeAngry, the bug occurs when SqlClient already has the data cached for the call to ReadAsync. So for a proper test, you'd want to call ReadAsync at least twice; the bug triggers on the second call. In your example, just reading a second time (which should return false) should be enough; see the sketch below. You can also use reflection to determine the exact number of lost registrations without relying on GC behavior; see the code snippet at #1810 (comment). Perhaps your code is triggering the bug in a way I'm unaware of, however.
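For instance, a minimal sketch of the "read at least twice" point (connection string and query are illustrative):

```csharp
using System.Threading;
using Microsoft.Data.SqlClient;

var cts = new CancellationTokenSource();
using var connection = new SqlConnection("Server=.\\SQLEXPRESS;Integrated Security=true;");
await connection.OpenAsync(cts.Token);
using var cmd = connection.CreateCommand();
cmd.CommandText = "SELECT 1"; // a single row is enough
using var reader = await cmd.ExecuteReaderAsync(cts.Token);
await reader.ReadAsync(cts.Token); // first call: may genuinely await the network
await reader.ReadAsync(cts.Token); // second call: returns false from buffered data; this is the leaking path
```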
@Shane32 My StackOverflow post (mentioned right above) has an infinite loop that triggers it. One
Describe the bug
After updating to SqlClient 5.0.1, we observe a memory leak in our application. Memory dumps show a huge number of CancellationTokenSource+CallbackNode entries (we had a dump with 2 GB of those), referencing SqlCommand objects.
We believe this might be due to the changes in:
With the changes in these PRs, it seems the CancellationToken `registration` is not always disposed. There are several `if` blocks that may return before the `registration` object gets saved in the `context`. And exceptions could also be thrown.

To reproduce
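Schematically, the failure mode described above looks like this; a sketch with hypothetical names, not the actual SqlClient source:

```csharp
using System.Threading;
using System.Threading.Tasks;

class LeakySketch
{
    private bool _dataAlreadyBuffered = true;             // stands in for "rows already in the packet buffer"
    private CancellationTokenRegistration _contextOwned;  // stands in for the context that disposes on completion

    public Task<bool> ReadAsync(CancellationToken token)
    {
        // The registration is created unconditionally, before the short-circuit checks...
        CancellationTokenRegistration registration = token.Register(static _ => { /* cancel the command */ }, null);

        if (_dataAlreadyBuffered)
        {
            // ...but this path returns without disposing it, so its CallbackNode
            // stays attached to the CancellationTokenSource for the token's lifetime.
            return Task.FromResult(true);
        }

        _contextOwned = registration; // the async path hands ownership to the context, which disposes it later
        return Task.FromResult(false);
    }
}
```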
Expected behavior
No memory leak, CancellationTokenSource+CallbackNode entries are not present in a dump.
Further technical details
Microsoft.Data.SqlClient version: 5.0.1
.NET target: .NET 6.0.3
SQL Server version: SQL Server 2019
Operating system: Windows 2019
Additional context