Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

System.Data.SqlClient 4.8.2 throws TransactionAbortedException (the connection has been broken) #729

Closed
MichaelLogutov opened this issue Sep 11, 2020 · 40 comments

Comments

@MichaelLogutov
Copy link

Description

After updating System.Data.SqlClient version from old 4.6.1 to latest 4.8.2 we started to see period errors like this:

System.Transactions.TransactionAbortedException: The transaction has aborted. ---> System.InvalidOperationException: The requested operation cannot be completed because the connection has been broken.

   --- End of inner exception stack trace ---
   at System.Transactions.TransactionStateAborted.EndCommit(InternalTransaction tx)
   at System.Transactions.CommittableTransaction.Commit()
   at System.Transactions.TransactionScope.InternalDispose()
   at System.Transactions.TransactionScope.Dispose()
   ...

Those exceptions happening in pure ADO.NET code with TransactionScope created with ReadCommitted isolation level, TransactionScopeOption.Required, 1 minute timeout and TransactionScopeAsyncFlowOption.Enabled.

After rolling back to System.Data.SqlClient 4.8.1 those exceptions dissapeared but we saw another new type of exceptions we've never seen before from cassandra driver:

Exception while trying borrow a connection from a pool

System.Net.Sockets.SocketException: Transport endpoint is not connected
   at Cassandra.Connections.HostConnectionPool.GetExistingConnections()
   at Cassandra.Connections.HostConnectionPool.BorrowExistingConnection()
   at async Cassandra.Connections.HostConnectionPool.GetConnectionFromHostAsync(?)

Downgrading back to 4.6.1 seems to fixed all issues.

Configuration

It happened on aspnet core 3.1 web applications and console netcore 3.1 applications with latest 3.1 images both running in container on Linux.

Regression?

It works on System.Data.SqlClient 4.6.1

Other information

This issue happening on heavy load application and only on production. I can't reproduce this issue locally.
I wish I could find source for System.Data.SqlClient library to see the diff for 4.8.2 version.

@ghost
Copy link

ghost commented Sep 11, 2020

Tagging subscribers to this area: @cheenamalhotra, @David-Engel
See info in area-owners.md if you want to be subscribed.

@Wraith2
Copy link
Contributor

Wraith2 commented Sep 12, 2020

The System version the library has been frozen and moved out of this repository in favour of Microsoft.Data.SqlClient https://github.com/dotnet/SqlClient which can be versioned separately. It contains fixes and improvements beyond those present in the framework provided versions. If possible give it a try and see if the behaviour you're seeing has already been resolved.

@MichaelLogutov
Copy link
Author

Does it requires rewriting all namespaces Microsoft.*? Because we can't afford such rework in our libraries and all microservices.

@Wraith2
Copy link
Contributor

Wraith2 commented Sep 13, 2020

It does. You should probably consider introducing abstractions in your codebase to avoid such things causing agility limitations.

@MichaelLogutov
Copy link
Author

MichaelLogutov commented Sep 13, 2020

Well, it's not that simple when you have large code base and 20 product team developers.

Can I see the source code or at least changelog of what been changed in 4.8.2?

@MichaelLogutov
Copy link
Author

Thanks. I can see the source now but I can't seems to see the changes in 4.8.2
https://github.com/microsoft/dotnet/tree/master/releases do not seems to have release notes for individual nuget packages like System.Data.SqlClient. Or I'm missing something?

@Wraith2
Copy link
Contributor

Wraith2 commented Sep 14, 2020

It's part of the framework so it doesn't get it's own update schedule. You probably want to review all the pull requests which targeted files in that project.

@MichaelLogutov
Copy link
Author

I couldn't find the source code in github for System.Data.SqlClient. I saw only System.Data.Common

@devsko
Copy link

devsko commented Sep 14, 2020

AFAIK System.Data.SqlClient is not open sourced but source open at https://referencesource.microsoft.com/#System.Data as mentioned above.

@Wraith2
Copy link
Contributor

Wraith2 commented Sep 14, 2020

AFAIK System.Data.SqlClient is not open sourced but source open at

It absolutely was and was part of this repository. It was removed in favour of the Microsoft version as I said earlier. To see the code you'll have to look into the history prior to it's removal.

@devsko
Copy link

devsko commented Sep 14, 2020

@Wraith2 Just curious. It was the same code for core / desktop? Where lives the System.Data.SqlClient desktop code now?

@Wraith2
Copy link
Contributor

Wraith2 commented Sep 14, 2020

The desktop version was distinct from the core version. This repo only contained the core version.
The code is still here but not current which I why I say you'll need to look through the repo history to look at it. You can look back at any file that has ever existed by reviewing the git history but because it isn't currently being build and shipped it's not visible in the current master.

The code was forked into Microsoft.Data.SqlClient which also contains the desktop version. The Microsoft version supercedes the System version and is versioned independently so it can be updated without a framework update. It contains performance improvements new features and bug fixes that may not be ported back to core or desktop verisons.

@devsko
Copy link

devsko commented Sep 14, 2020

Thanks.
I totally misunderstood this issue. After reading 4.6.1 and 4.8.2 I thought it is all about .NET Framework - but it isn't. Sorry for the confusion.

@cheenamalhotra
Copy link
Member

Tagging @saurabh500 @corivera for System.Data.SqlClient (.Net Framework)

@cheenamalhotra
Copy link
Member

@MichaelLogutov

Could you please confirm if this issue occurs for SqlClient from .Net Framework or when referencing in a .Net Core Project? As it seems to me your application is ASP.NET Core but there are discussions for desktop version of SqlClient.. Just want to clarify to identify driver ownership.

@cheenamalhotra
Copy link
Member

cheenamalhotra commented Sep 14, 2020

For applications involving .Net Core, one seemingly related change that went in S.D.S. 4.8.2 (for .Net Core) is dotnet/corefx#42937 which fixes a high security issue of data corruption.

Could you also elaborate more about Cassandra driver's role in your application and how 4.8.1 version causes failures?

Is it possible to run some tests without Cassandra driver in picture to really focus on SqlClient behaviour for the failing use case?

@MichaelLogutov
Copy link
Author

@MichaelLogutov

Could you please confirm if this issue occurs for SqlClient from .Net Framework or when referencing in a .Net Core Project? As it seems to me your application is ASP.NET Core but there are discussions for desktop version of SqlClient.. Just want to clarify to identify driver ownership.

The issue is on NETCore app running in Linux container.

@MichaelLogutov
Copy link
Author

MichaelLogutov commented Sep 15, 2020

For applications involving .Net Core, one seemingly related change that went in S.D.S. 4.8.2 (for .Net Core) is dotnet/corefx#42937 which fixes a high security issue of data corruption.

Could you also elaborate more about Cassandra driver's role in your application and how 4.8.1 version causes failures?

Well, Cassandra driver exception is from some background thread and seems does not impact application at all. Maybe it's just some kind of warning from the driver background processing. Since cassandra exception does not impacts my app I don't really have issue with that. It's just I thought it's strange - I only saw those kind exceptions after upgrading System.Data.SqlClient nuget package which also gets "he connection has been broken" exceptions when working with mssql. So I thought that this could somehow be related to some global app IO pooling issues caused by updated package.

Is it possible to run some tests without Cassandra driver in picture to really focus on SqlClient behaviour for the failing use case?

We can't remove Cassandra since it's also the part of our production microservice data storage unfortunately.

@cheenamalhotra
Copy link
Member

cheenamalhotra commented Sep 15, 2020

Hi @MichaelLogutov

From the stack trace it seems like your application is performing "Commit" on an "Aborted/Ended" transaction.
As the only change of behavior from the recent PR I linked above is to kill the physical connection to the database instead of sending it back to the pool when "Transaction Ended" event is captured by the driver. So I believe there's another thread/service that triggers Transaction Ended event whereas a different thread is trying to commit this transaction.

You can try to confirm this theory by either preventing "Transaction Abort" scenarios or checking `Transaction.TransactionInformation.Status" if it's "Active" before triggering operations that perform commit. Doing so should prevent the app from performing unwanted operations and hence the exceptions.

If there's a different use-case in your case, please provide details with a sample code if possible.

@MichaelLogutov
Copy link
Author

Hi @cheenamalhotra

I'll test out your suggestion.

Meanwhile I've looked into the PR and if I understood correctly you now throw a promoteException only if transaction status is aborted. So if transaction is not in aborted state and some exception occur there is a way to still get connection "doomed" and return from method without exception - is this intended?

@cheenamalhotra
Copy link
Member

Meanwhile I've looked into the PR and if I understood correctly you now throw a promoteException only if transaction status is aborted.

The condition checks for
if (promoteException != null && Transaction.TransactionInformation.Status != TransactionStatus.Aborted)

(View Change)
We throw exception only if transaction status is NOT aborted or is an active transaction.

@cheenamalhotra cheenamalhotra transferred this issue from dotnet/runtime Sep 18, 2020
@MichaelLogutov
Copy link
Author

Oh. I must have misread something, sorry about that. Still, with new 4.8.2 code it could be possible to get no exception from Promote and some other methods if transaction is aborted. I wonder if it's the source of my problems.

You've mentioned to check if transaction is active before performing commit - how do I do that? I have code like this:

using (var transaction = TransactionScopeProvider.CreateTransactionScope())
{
    foreach (var x in orderNumbers)
    {
        await this.imagesSqlDataAccess.UpdateAsync(
            new PartialImageUpdateParams
            {
                ImageId = x.ImageId,
                SetOrderNum = x.OrderNumber
            },
            cancellationToken);
    }

    transaction.Complete();
}

Where TransactionScopeProvider.CreateTransactionScope just a helper to create transaction scope with predefined defaults:

var transaction_scope = new TransactionScope(TransactionScopeOption.Required,
    new TransactionOptions
    {
        IsolationLevel = IsolationLevel.ReadCommitted,
        Timeout = TimeSpan.FromMinutes(1)
    },
    TransactionScopeAsyncFlowOption.Enabled);

return transaction_scope;

And TransactionScope does not have Status field.
Or you're saying to look into Transaction.Current.TransactionInformation.Status ?

@cheenamalhotra
Copy link
Member

Hi @MichaelLogutov

Or you're saying to look into Transaction.Current.TransactionInformation.Status ?

Yes, that's right. Did that work for you?

@MichaelLogutov
Copy link
Author

I've updated System.Data.SqlClient to 4.8.2 and changed the code to:

using (var transaction = TransactionScopeProvider.CreateTransactionScope())
{
    foreach (var x in orderNumbers)
    {
        await this.imagesSqlDataAccess.UpdateAsync(
            new PartialImageUpdateParams
            {
                ImageId = x.ImageId,
                SetOrderNum = x.OrderNumber
            },
            cancellationToken);
    }

    LogCurrentTransactionStatus();

    transaction.Complete();
}

...

private static void LogCurrentTransactionStatus()
{
    try
    {
        var t = Transaction.Current;
        if (t == null)
            "Current transaction is null".Log(LogLevel.Warn);
        else if (t.TransactionInformation.Status != TransactionStatus.Active)
            $"Current transaction status is {t.TransactionInformation.Status}".Log(LogLevel.Warn);
    }
    catch (Exception ex)
    {
        ex.Log("Error while getting current transaction", LogLevel.Warn);
    }
}

And I've got exceptions like before:

System.Transactions.TransactionException: The operation is not valid for the state of the transaction.
   at System.Transactions.TransactionStatePSPEOperation.get_Status(InternalTransaction tx)
   at System.Transactions.TransactionInformation.get_Status()
   at System.Data.SqlClient.SqlDelegatedTransaction.Promote()
   at System.Transactions.TransactionStatePSPEOperation.PSPEPromote(InternalTransaction tx)
   at System.Transactions.TransactionStateDelegatedBase.EnterState(InternalTransaction tx)
   at System.Transactions.EnlistableStates.Promote(InternalTransaction tx)
   at System.Transactions.Transaction.Promote()
   at System.Transactions.TransactionInterop.ConvertToDistributedTransaction(Transaction transaction)
   at System.Transactions.TransactionInterop.GetExportCookie(Transaction transaction, Byte[] whereabouts)
   at System.Data.SqlClient.SqlInternalConnection.GetTransactionCookie(Transaction transaction, Byte[] whereAbouts)
   at System.Data.SqlClient.SqlInternalConnection.EnlistNonNull(Transaction tx)
   at System.Data.ProviderBase.DbConnectionPool.PrepareConnection(DbConnection owningObject, DbConnectionInternal obj, Transaction transaction)
   at System.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, UInt32 waitForMultipleObjectsTimeout, Boolean allowCreate, Boolean onlyOneCheckConnection, DbConnectionOptions userOptions, DbConnectionInternal& connection)
   at System.Data.ProviderBase.DbConnectionPool.TryGetConnection(DbConnection owningObject, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal& connection)
   at System.Data.ProviderBase.DbConnectionFactory.TryGetConnection(DbConnection owningConnection, TaskCompletionSource`1 retry, DbConnectionOptions userOptions, DbConnectionInternal oldConnection, DbConnectionInternal& connection)
   at System.Data.ProviderBase.DbConnectionInternal.TryOpenConnectionInternal(DbConnection outerConnection, DbConnectionFactory connectionFactory, TaskCompletionSource`1 retry, DbConnectionOptions userOptions)
   at System.Data.SqlClient.SqlConnection.TryOpen(TaskCompletionSource`1 retry)
   at System.Data.SqlClient.SqlConnection.OpenAsync(CancellationToken cancellationToken)
   at async Common.Database.SqlDataAccess.Implementation.ImagesSqlDataAccessBase.ExecuteAsync(?) in /app/Common/Database/SqlDataAccess/Implementation/ImagesSqlDataAccessBase.generated.cs:line 297
   at async Common.Services.Implementation.ImagesService.SetOrderNumbersAsync(?) in /app/Common/Services/Implementation/ImagesService.cs:line 135
   at async WebApi.Controllers.EditController.SetOrderNumbersAsync(?) in /app/WebApi/Controllers/EditController.cs:line 47
   at lambda_method(Closure , Object )
   at async Microsoft.AspNetCore.Mvc.Infrastructure.ActionMethodExecutor.AwaitableResultExecutor.Execute(?)
   at async Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.<InvokeActionMethodAsync>g__Awaited|12_0(?)
   at async Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.<InvokeNextActionFilterAsync>g__Awaited|10_0(?)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.Rethrow(ActionExecutedContextSealed context)
   at Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.Next(State& next, Scope& scope, Object& state, Boolean& isCompleted)
   at async Microsoft.AspNetCore.Mvc.Infrastructure.ControllerActionInvoker.<InvokeInnerFilterAsync>g__Awaited|13_0(?)
   at async Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.<InvokeFilterPipelineAsync>g__Awaited|19_0(?)
   at async Microsoft.AspNetCore.Mvc.Infrastructure.ResourceInvoker.<InvokeAsync>g__Awaited|17_0(?)
   at async Microsoft.AspNetCore.Routing.EndpointMiddleware.<Invoke>g__AwaitRequestTask|6_0(?)
   at async Microsoft.AspNetCore.Authorization.AuthorizationMiddleware.Invoke(?)
   at async Cian.Web.Facilities.ErrorHandling.ErrorHandlingMiddleware.InvokeAsync(?)

But still no new warnings about transaction state (there are still other old type of warnings present, so logging is working ok).

There are other examples of the same exception but on different method (if it helps):

System.Transactions.TransactionAbortedException: The transaction has aborted. ---> System.InvalidOperationException: The requested operation cannot be completed because the connection has been broken.

   --- End of inner exception stack trace ---
   at System.Transactions.TransactionStateAborted.EndCommit(InternalTransaction tx)
   at System.Transactions.CommittableTransaction.Commit()
   at System.Transactions.TransactionScope.InternalDispose()
   at System.Transactions.TransactionScope.Dispose()
   at async Common.Services.Implementation.ImagesService.AddNewAsync(?) in /app/Common/Services/Implementation/ImagesService.cs:line 295
   at async WebApi.Controllers.AddController.AddAsync(?) in /app/WebApi/Controllers/AddController.cs:line 78

@cheenamalhotra
Copy link
Member

Could you wrap this in a minimal repro to take a closer look at the scenario?

@MichaelLogutov
Copy link
Author

I would gladly, but I'm afraid I can't - the code is simple ADO.NET update but something could be with the all other stuff microservice doing (heavy CPU usage due to image resizing on Skia + cassandra and rabbitmq libraries connected to servers).

@cheenamalhotra
Copy link
Member

cheenamalhotra commented Oct 9, 2020

Ok, I'll try to form a repro based on provided info.

Could you also confirm if your connection is a pooled connection or pooling is disabled?
Does changing that make any difference?

And also do you also know why the transaction aborts at the first place? Are there failing commands or it's something else? Have you been able to gather it from logs and is it possible to prevent the transaction abort?

@MichaelLogutov
Copy link
Author

Could you also confirm if your connection is a pooled connection or pooling is disabled?

It's default connection string so it's pooled by default (if I'm not mistaken).
The only thing that changed - max pool size. So connection string looks like this:
server=my.server.name; database=MyDatabase; User ID=MyUser; Password=mypassword; Max Pool Size=500

Does changing that make any difference?

Before I tested it, just to clarify - should I just add "Pooling=false" or also change min/max pool size in connection string?

And also do you also know why the transaction aborts at the first place? Are there failing commands or it's something else? Have you been able to gather it from logs and is it possible to prevent the transaction abort?

I think it doesn't aborts at all. Because otherwise there should be warning about transaction in not active state before commit.

@cheenamalhotra
Copy link
Member

should I just add "Pooling=false" or also change min/max pool size in connection string

Just changing "Pooling = false" should disable it.

@cheenamalhotra
Copy link
Member

cheenamalhotra commented Oct 13, 2020

I think it doesn't aborts at all. Because otherwise there should be warning about transaction in not active state before commit.

There's certainly an abort triggered somewhere, otherwise you'd never run into this issue, since the change in last release was made specifically on "Transaction Abort" callback API.

You should be able to capture SQL Profiler traces for "Transaction" > "DTC Transaction" events to see which "Application" triggers "Transaction Aborted" events. You should notice something like below in SQL profiler traces:

EventClass TextData TransactionId SPID EventSubClass DatabaseName
DTCTransaction {D3296B44-F55C-4EA2-9B26-3A6011464D6C} 4502378 18 16 - Transaction is aborting master

@wolgasster
Copy link

wolgasster commented Jan 14, 2021

We started getting the same error in .Net Framework 4.8 application under high load (lots of parallel user sessions execute database queries) after upgrade of Microsoft.Data.SqlClient from 1.1.2 to 1.1.3.
Input conditions look the same as in the issue's description. All commands are executed wrapped by a TransactionScope and using connections from ConnectionPool, smth like:

using (var transactionScope = DataAccessFactory.Current.CreateTransactionScope())
{
	using (var connection = DataAccessFactory.Current.CreateConnection())
	{
		// create and execute command ...
	}
	transactionScope.Complete();
}

Update:
Looks like disabling of pooling fixes the issue, but it is not a good solution from performance perspective.

Meanwhile, adding the next code in the SqlDelegatedTransaction.TransactionEnded method looks strange considering if a transaction is successfully committed: (commit 0b20c3b)

// Safest approach is to doom this connection, whose transaction has been aborted externally.
// If we want to avoid dooming the connection for performance, state needs to be properly restored. (future TODO)
connection.DoomThisConnection();

+@cheenamalhotra @David-Engel

@dotWils
Copy link

dotWils commented Feb 1, 2021

Having the same problem here with .NET Framework 4.8 and Microsoft.Data.SqlClient 2.1.1

transaction = new TransactionScope(TransactionScopeOption.RequiresNew, new TransactionOptions()
{
IsolationLevel = System.Transactions.IsolationLevel.ReadUncommitted,
}, TransactionScopeAsyncFlowOption.Enabled);

MSDTC is used, connection pooling is used, SqlConnection is open and closed for every command

at System.Transactions.TransactionStateAborted.EndCommit(InternalTransaction tx)
at System.Transactions.CommittableTransaction.Commit()
at System.Transactions.TransactionScope.InternalDispose()
at System.Transactions.TransactionScope.Dispose()
at Robin.Bll.RunManager.CompleteTransaction()

@cheenamalhotra
Copy link
Member

Hi @wolgasster @dotWils

Is it possible to create a repro application to mimic your scenarios where exception occurs? That would help us in understanding more about your usecase.

The closing of Connection happens only on Transaction Abort event, and that's important to do because the driver does not lose reference to old transaction and continues to reattach it for next connection. If there's a server side delay to abort old transaction, server accepts that request. This in turn causes a security issue since whenever the transaction does end on the server side, the remaining queries on the new connection execute with implicit transactions.

We may have to look at why an aborted transaction is being committed by your application.

@dotWils
Copy link

dotWils commented Feb 2, 2021

Hi @wolgasster @dotWils

Is it possible to create a repro application to mimic your scenarios where exception occurs? That would help us in understanding more about your usecase.

The closing of Connection happens only on Transaction Abort event, and that's important to do because the driver does not lose reference to old transaction and continues to reattach it for next connection. If there's a server side delay to abort old transaction, server accepts that request. This in turn causes a security issue since whenever the transaction does end on the server side, the remaining queries on the new connection execute with implicit transactions.

We may have to look at why an aborted transaction is being committed by your application.

Thx for being willing to look at it.
It will be difficult to send a repro as it's a complete engine running procedures on a database.
Our .NET 4.8 application runs for days on .NET 4.8 System.Data.SqlClient. (No transactions aborted)
When only upgrading client to Microsoft.Data.SqlClient (of course in the whole application) we get this rather fast.
Will talk with client if it's worth the effort...

@csnemes
Copy link

csnemes commented Apr 7, 2021

We have the same problem with Microsoft.Data.SqlClient under Net 5.0. With heavy parallel load the connection pooling/transaction handling seems to break down.
With the attached solution the problem can be reproduced. It runs heavy load in cycles. Sometimes you have to wait till cycle 50 or so for the problem to occur, but it will appear consistently.
SQLProblem.zip
I tried to dig deeper, but unfortunately the System.Transactions.TransactionEventSource is not working at the moment.

@cheenamalhotra
Copy link
Member

@csnemes @dotWils @wolgasster @MichaelLogutov

We have a potential fix for this issue in PR #1042.
I've tested the PR with the repro "SQLproblem" as well from #729 (comment) and this issue seems to be solved. Would appreciate your feedback if you could test and confirm the fix from this package: Build Artifacts

Thank you

@csnemes
Copy link

csnemes commented Apr 26, 2021

Hi, I've managed to try the patch both with the reproduction and within our application (where we originally encountered with this problem), and it works perfectly. Thanks a lot for the fix.

@cheenamalhotra
Copy link
Member

Closing this issue as fixed has been merged for main(3.0), 2.1 and 1.1 servicing.
Next releases will contain fix for this issue.

@dotWils
Copy link

dotWils commented May 7, 2021

Closing this issue as fixed has been merged for main(3.0), 2.1 and 1.1 servicing.
Next releases will contain fix for this issue.

Thx for finding a solution.
Not so clear how this will be released.
The title reads System.Data.SqlClient 4.8.2 will this be version 4.8.3 or will this not be serviced?
Because the merging/servicing you mention here is for Microsoft.Data.SqlClient?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

7 participants