
SqlException: The PROMOTE TRANSACTION request failed because there is no local transaction active #237

Closed
scale-tone opened this issue Sep 30, 2019 · 44 comments · Fixed by #543

@scale-tone commented Sep 30, 2019

Under certain circumstances, when a combination of SqlCommands and transactional WCF calls is run in a massively parallel manner, the current TransactionScope is occasionally "left over": SqlCommands do not get enlisted in it, and therefore their updates are committed even though the TransactionScope is rolled back. This leads to severe data corruption in a line-of-business application of one of our corporate customers.

A stable repro is outlined in this test project: https://github.com/scale-tone/InconsistencyDemo. See the more detailed instructions on how to run it there. To reveal the issue more quickly, start and stop the LoopHammer console app several times.

The LoopHammer console app emulates message processing, consisting of the following steps:

  • Creating a TransactionScope.
  • As part of that TransactionScope, executing a couple of SQL statements against a local DB.
  • As part of that TransactionScope, making a transactional WCF call, which fails with an exception. The call forces the transaction to be promoted to a distributed one. The exception causes the entire TransactionScope to be rolled back (including the above SQL statements).

This algorithm is executed multiple times, in parallel (just the way the real message processing service operates). Note that the repro code only creates one TransactionScope and doesn't explicitly mix it with SqlTransactions or any other nested transactions.
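For illustration, the per-iteration pattern looks roughly like this (a sketch only; the variable names, SQL text and connection string are illustrative and not copied from the repro):

using (var scope = new TransactionScope())
{
    using (var connection = new SqlConnection(connectionString))
    {
        connection.Open(); // auto-enlists in the ambient transaction created by the scope

        using (var command = new SqlCommand("INSERT INTO ImportantBusinessData (...) VALUES (...)", connection))
        {
            command.ExecuteNonQuery();
        }
    }

    // Transactional WCF call: flowing the transaction forces promotion to MSDTC,
    // and the call always throws, so the whole scope is expected to roll back.
    serviceClient.DoWork(correlationId);

    scope.Complete(); // never reached in this scenario
}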

Expected behavior

The distributed transaction is rolled back entirely, nothing is written to DB.

Actual behavior

After seconds to minutes of normal execution, the WCF call fails with the following (unexpected!) exception:

System.ServiceModel.ProtocolException: The transaction has aborted. ---> System.Transactions.TransactionAbortedException: The transaction has aborted. ---> System.Transactions.TransactionPromotionException: Failure while attempting to promote transaction. ---> Microsoft.Data.SqlClient.SqlException: The PROMOTE TRANSACTION request failed because there is no local transaction active.
at Microsoft.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
at Microsoft.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
at Microsoft.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
at Microsoft.Data.SqlClient.TdsParser.Run(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj)
at Microsoft.Data.SqlClient.TdsParser.TdsExecuteTransactionManagerRequest(Byte[] buffer, TransactionManagerRequestType request, String transactionName, TransactionManagerIsolationLevel isoLevel, Int32 timeout, SqlInternalTransaction transaction, TdsParserStateObject stateObj, Boolean isDelegateControlRequest)
at Microsoft.Data.SqlClient.SqlInternalConnectionTds.ExecuteTransactionYukon(TransactionRequest transactionRequest, String transactionName, IsolationLevel iso, SqlInternalTransaction internalTransaction, Boolean isDelegateControlRequest)
at Microsoft.Data.SqlClient.SqlDelegatedTransaction.Promote()
--- End of inner exception stack trace ---
at Microsoft.Data.SqlClient.SqlDelegatedTransaction.Promote()
at System.Transactions.TransactionStatePSPEOperation.PSPEPromote(InternalTransaction tx)
at System.Transactions.TransactionStateDelegatedBase.EnterState(InternalTransaction tx)
--- End of inner exception stack trace ---
at System.Transactions.TransactionStateAborted.CheckForFinishedTransaction(InternalTransaction tx)
at System.Transactions.Transaction.Promote()
at System.Transactions.TransactionInterop.ConvertToOletxTransaction(Transaction transaction)
at System.Transactions.TransactionInterop.GetTransmitterPropagationToken(Transaction transaction)
at System.ServiceModel.Transactions.WsatTransactionFormatter.WriteTransaction(Transaction transaction, Message message)
at System.ServiceModel.Channels.TransactionChannel`1.WriteTransactionToMessage(Message message, TransactionFlowOption txFlowOption)
--- End of inner exception stack trace ---
Server stack trace:
at System.ServiceModel.Channels.TransactionChannel`1.WriteTransactionToMessage(Message message, TransactionFlowOption txFlowOption)
at System.ServiceModel.Channels.TransactionChannel`1.WriteTransactionDataToMessage(Message message, MessageDirection direction)
at System.ServiceModel.Channels.TransactionRequestChannelGeneric`1.Request(Message message, TimeSpan timeout)
at System.ServiceModel.Dispatcher.RequestChannelBinder.Request(Message message, TimeSpan timeout)
at System.ServiceModel.Channels.ServiceChannel.Call(String action, Boolean oneway, ProxyOperationRuntime operation, Object[] ins, Object[] outs, TimeSpan timeout)
at System.ServiceModel.Channels.ServiceChannelProxy.InvokeService(IMethodCallMessage methodCall, ProxyOperationRuntime operation)
at System.ServiceModel.Channels.ServiceChannelProxy.Invoke(IMessage message)
Exception rethrown at [0]:
at System.Runtime.Remoting.Proxies.RealProxy.HandleReturnMessage(IMessage reqMsg, IMessage retMsg)
at System.Runtime.Remoting.Proxies.RealProxy.PrivateInvoke(MessageData& msgData, Int32 type)
at DemoService.WcfAgents.IService1.DoWork(String correlationId)
at DemoService.WcfAgents.Service1Client.DoWork(String correlationId)
at LoopHammer.DostuffHandler.MakeTransactionalWcfCallThatThrows()
at LoopHammer.DostuffHandler.Handle()

This presumably happens because at the moment of the promotion attempt, the TransactionScope (not the scope itself, but some underlying internal transaction object) already appears to be in an aborted state. Because it is in that state, the above SQL statements do not get enlisted in it and are committed immediately. So the TransactionPromotionException is just the visible part of the problem, while the actual issue occurs somewhere between TransactionScope creation and execution of the SqlCommands. We're able to detect that faulty state of the TransactionScope beforehand by executing "select @@trancount" against the DB - in the failure case it is 0 (normally it should be 1).
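For reference, the check is just this query run on a connection that is supposed to be enlisted in the scope (a sketch; the logged text mirrors what the repro prints):

using (var command = new SqlCommand("SELECT @@TRANCOUNT", connection))
{
    var tranCount = (int)command.ExecuteScalar();
    if (tranCount == 0)
    {
        // The connection is not inside the expected local transaction,
        // so any statement executed now is committed immediately.
        Console.WriteLine("WE'RE ABOUT TO CREATE INCONSISTENT DATA");
    }
}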

WORKAROUND: if we disable SqlConnection pooling (add "Pooling=false" to the connection string), the problem disappears. This makes us think the bug is somewhere in the connection-sharing mechanism. Probably a connection is sometimes not fully detached from one thread, which occasionally allows that thread to influence a TransactionScope belonging to another thread. If that is the case, it can not only lead to data corruption but also introduce serious security issues.
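For illustration, the workaround is a plain connection-string change (server and database names here are placeholders):

// Disables connection pooling entirely; avoids the corruption at a significant performance cost.
var connectionString = "Data Source=.;Initial Catalog=SomeDb;Integrated Security=true;Pooling=false";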

The problem is reproducible with the latest Microsoft.Data.SqlClient v1.0.19269.1. Originally it was detected in System.Data.SqlClient coming from .NET Framework - see this variant of the same repro project: https://github.com/samegutt/InconsistencyDemo.
The .NET Framework version doesn't matter (tried with every version from 4.5.1 to 4.8).

When compiled against the sources from this repo, the problem is also reproducible.
If we compile and start LoopHammer.exe in Debug mode, we immediately hit this debug assert. If we disable it, further execution leads either to the same TransactionPromotionException or to other exceptions:

System.ObjectDisposedException: Cannot access a disposed object.
Object name: 'SqlDelegatedTransaction'.
at Microsoft.Data.SqlClient.SqlDelegatedTransaction.GetValidConnection()
at Microsoft.Data.SqlClient.SqlDelegatedTransaction.Rollback(SinglePhaseEnlistment enlistment)
at System.Transactions.TransactionStateDelegatedAborting.EnterState(InternalTransaction tx)
at System.Transactions.Transaction.Rollback()
at System.Transactions.TransactionScope.InternalDispose()
at System.Transactions.TransactionScope.Dispose()
at LoopHammer.DostuffHandler.Handle()

or even to:

Microsoft.Data.SqlClient.SqlException (0x80131904): The Microsoft Distributed Transaction Coordinator (MS DTC) has cancelled the distributed transaction.
at Microsoft.Data.SqlClient.SqlConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction) in C:\projects\CSA\temp\SqlClient\src\Microsoft.Data.SqlClient\netfx\src\Microsoft\Data\SqlClient\SqlConnection.cs:line 2103
at Microsoft.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction) in C:\projects\CSA\temp\SqlClient\src\Microsoft.Data.SqlClient\netfx\src\Microsoft\Data\SqlClient\SqlInternalConnection.cs:line 814
at Microsoft.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose) in C:\projects\CSA\temp\SqlClient\src\Microsoft.Data.SqlClient\netfx\src\Microsoft\Data\SqlClient\TdsParser.cs:line 1572
at Microsoft.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady) in C:\projects\CSA\temp\SqlClient\src\Microsoft.Data.SqlClient\netfx\src\Microsoft\Data\SqlClient\TdsParser.cs:line 2792
at Microsoft.Data.SqlClient.SqlDataReader.TryConsumeMetaData() in C:\projects\CSA\temp\SqlClient\src\Microsoft.Data.SqlClient\netfx\src\Microsoft\Data\SqlClient\SqlDataReader.cs:line 1310
at Microsoft.Data.SqlClient.SqlDataReader.get_MetaData() in C:\projects\CSA\temp\SqlClient\src\Microsoft.Data.SqlClient\netfx\src\Microsoft\Data\SqlClient\SqlDataReader.cs:line 275
at Microsoft.Data.SqlClient.SqlCommand.FinishExecuteReader(SqlDataReader ds, RunBehavior runBehavior, String resetOptionsString, Boolean isInternal, Boolean forDescribeParameterEncryption, Boolean shouldCacheForAlwaysEncrypted) in C:\projects\CSA\temp\SqlClient\src\Microsoft.Data.SqlClient\netfx\src\Microsoft\Data\SqlClient\SqlCommand.cs:line 5550
at Microsoft.Data.SqlClient.SqlCommand.RunExecuteReaderTds(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, Boolean async, Int32 timeout, Task& task, Boolean asyncWrite, Boolean inRetry, SqlDataReader ds, Boolean describeParameterEncryptionRequest) in C:\projects\CSA\temp\SqlClient\src\Microsoft.Data.SqlClient\netfx\src\Microsoft\Data\SqlClient\SqlCommand.cs:line 5312
at Microsoft.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, String method, TaskCompletionSource`1 completion, Int32 timeout, Task& task, Boolean& usedCache, Boolean asyncWrite, Boolean inRetry) in C:\projects\CSA\temp\SqlClient\src\Microsoft.Data.SqlClient\netfx\src\Microsoft\Data\SqlClient\SqlCommand.cs:line 4907
at Microsoft.Data.SqlClient.SqlCommand.RunExecuteReader(CommandBehavior cmdBehavior, RunBehavior runBehavior, Boolean returnStream, String method) in C:\projects\CSA\temp\SqlClient\src\Microsoft.Data.SqlClient\netfx\src\Microsoft\Data\SqlClient\SqlCommand.cs:line 4762
at Microsoft.Data.SqlClient.SqlCommand.ExecuteReader(CommandBehavior behavior, String method) in C:\projects\CSA\temp\SqlClient\src\Microsoft.Data.SqlClient\netfx\src\Microsoft\Data\SqlClient\SqlCommand.cs:line 2401
at Microsoft.Data.SqlClient.SqlCommand.ExecuteReader() in C:\projects\CSA\temp\SqlClient\src\Microsoft.Data.SqlClient\netfx\src\Microsoft\Data\SqlClient\SqlCommand.cs:line 2322
at LoopHammer.DostuffHandler.GetTransactionDetails()
at LoopHammer.DostuffHandler.Handle()
ClientConnectionId:48c113c4-6fcd-4082-85bf-0f39023375bf
Error Number:1206,State:137,Class:18

A transactional WCF call is the essential part of the scenario. Surprisingly, forcing a transaction promotion by other means (e.g. by accessing another DB) does not lead to the same problem.

For some reason, setting MaxDegreeOfParallelism to 4 maximizes the chances of the issue appearing. Experiments with the connection pool size didn't show any difference.

As mentioned above, the issue causes serious damage to the customer's large LoB system. Furthermore, there is concern that it might indicate a severe security issue (multiple threads being able to influence each other's transaction flow).

cheenamalhotra self-assigned this on Oct 1, 2019
@cheenamalhotra (Member)

Hi @scale-tone

Just to let you know, I was able to reproduce the issue with the provided repro and am currently investigating it.

cheenamalhotra changed the title from "System.ServiceModel.ProtocolException ---> System.Transactions.TransactionAbortedException ---> System.Transactions.TransactionPromotionException ---> Microsoft.Data.SqlClient.SqlException: The PROMOTE TRANSACTION request failed because there is no local transaction active" to "SqlException: The PROMOTE TRANSACTION request failed because there is no local transaction active" on Oct 3, 2019
@cheenamalhotra (Member)

Hi @scale-tone

Could you give the latest v1.1.0-preview1 version we released today a try in your repro app? I don't see the error occurring anymore with this release version. I'm wondering if @Wraith2's PR #232 did the trick?

Let me know how that goes!

@samegutt commented Oct 3, 2019

Hi, I've been working with @scale-tone on this issue, and I am still able to reproduce it with the new package. However, it took a few minutes longer to reproduce.

@ramonsmits

I'm currently running your reproduction. How long should it run before it occurs?

@samegutt commented Oct 4, 2019

Originally I was able to reproduce it within anywhere from 20 to 3000 iterations. With the new package, I had to wait until 5000 iterations.

Edit: query the ImportantBusinessData table with NOLOCK and check the Id column to get the current iteration count.

@cheenamalhotra (Member)

I was able to reproduce the error before within a minute (after 100 transactions), but now even after 30 minutes I don't get any errors, and I'm wondering how long I should wait.

As you can see, I've crossed a count of 15000 transactions, but no error so far. The test is still running.

[screenshot: transaction count]

@ramonsmits

My test runs in Release. I added the duration since start to the exception logging and log only the message property. Just after a minute it happened for the first time, and after 14 minutes 15 exceptions had been logged.

MaxDegreeOfParallelism: 4
        eb7b2801-eaf2-40bd-97ec-d81ddf1a381e trandetails is TranCount = 2, Xact = 1, TranId = 4583060, WE'RE ABOUT TO CREATE INCONSISTENT DATA
00:01:02.6942252 eb7b2801-eaf2-40bd-97ec-d81ddf1a381e The transaction has aborted.
        028d4105-15d8-47f3-b7df-ee9548a4ff11 trandetails is TranCount = 2, Xact = 1, TranId = 4585263, WE'RE ABOUT TO CREATE INCONSISTENT DATA
00:02:15.2570129 028d4105-15d8-47f3-b7df-ee9548a4ff11 The transaction has aborted.
        9c6d1c7a-db66-4d03-8643-ee65f9d3a955 trandetails is TranCount = 0, Xact = 1, TranId = 4585284, WE'RE ABOUT TO CREATE INCONSISTENT DATA
00:02:15.6917334 9c6d1c7a-db66-4d03-8643-ee65f9d3a955 The transaction has aborted.
00:03:52.7842527 c464ae07-f309-41cb-a96a-0948bc607aca The Microsoft Distributed Transaction Coordinator (MS DTC) has cancelled the distributed transaction.
        f5fe49bc-c0c9-4e4f-a5fd-b229d1d883dc trandetails is TranCount = 0, Xact = 1, TranId = 4588367, WE'RE ABOUT TO CREATE INCONSISTENT DATA
00:03:52.8276654 f5fe49bc-c0c9-4e4f-a5fd-b229d1d883dc The transaction has aborted.
        ab699c1f-3ecf-4db1-92ed-973a23227adb trandetails is TranCount = 2, Xact = 1, TranId = 4590037, WE'RE ABOUT TO CREATE INCONSISTENT DATA
00:04:46.2503794 ab699c1f-3ecf-4db1-92ed-973a23227adb The transaction has aborted.
        faeeb465-1333-46c4-a0f9-ad09be377ece trandetails is TranCount = 0, Xact = 1, TranId = 4595358, WE'RE ABOUT TO CREATE INCONSISTENT DATA
00:07:37.8367850 faeeb465-1333-46c4-a0f9-ad09be377ece The transaction has aborted.
        58cc35d6-0a32-41d2-b3f9-2d38beecc2f8 trandetails is TranCount = 2, Xact = 1, TranId = 4598136, WE'RE ABOUT TO CREATE INCONSISTENT DATA
00:09:03.3872197 58cc35d6-0a32-41d2-b3f9-2d38beecc2f8 The transaction has aborted.
        cb18d305-e07c-45f5-8a0c-4823a19398e5 trandetails is TranCount = 2, Xact = 1, TranId = 4599774, WE'RE ABOUT TO CREATE INCONSISTENT DATA
        348f98ed-5dca-4d9b-917f-edad7b1d7152 trandetails is TranCount = 0, Xact = 1, TranId = 4599813, WE'RE ABOUT TO CREATE INCONSISTENT DATA
00:10:01.9555238 cb18d305-e07c-45f5-8a0c-4823a19398e5 The transaction has aborted.
00:10:01.9555854 348f98ed-5dca-4d9b-917f-edad7b1d7152 The transaction has aborted.
        801c0ec4-e591-4fca-b3d8-10787e8b01f8 trandetails is TranCount = 2, Xact = 1, TranId = 4599820, WE'RE ABOUT TO CREATE INCONSISTENT DATA
00:10:02.3636293 801c0ec4-e591-4fca-b3d8-10787e8b01f8 The transaction has aborted.
        ebeef3be-20c9-4feb-bfd8-4c3f3b82436e trandetails is TranCount = 0, Xact = 1, TranId = 4601311, WE'RE ABOUT TO CREATE INCONSISTENT DATA
00:10:48.4204056 ebeef3be-20c9-4feb-bfd8-4c3f3b82436e The transaction has aborted.
        61c322f8-6291-457b-b120-953667d4ce64 trandetails is TranCount = 2, Xact = 1, TranId = 4602237, WE'RE ABOUT TO CREATE INCONSISTENT DATA
00:11:16.4128108 61c322f8-6291-457b-b120-953667d4ce64 The transaction has aborted.
00:12:26.9201648 334ba852-adbe-4625-933c-a654bd51bca7 The Microsoft Distributed Transaction Coordinator (MS DTC) has cancelled the distributed transaction.
        093f3d64-87d5-4281-af36-400f0e8d42e6 trandetails is TranCount = 2, Xact = 1, TranId = 4606855, WE'RE ABOUT TO CREATE INCONSISTENT DATA
00:13:40.9761211 093f3d64-87d5-4281-af36-400f0e8d42e6 The transaction has aborted.

@cheenamalhotra (Member)

@ramonsmits The repro uses package reference 1.0.19269.1, which reproduces the issue quickly.
Please test with "1.1.0-preview1.19275.1"

@cheenamalhotra (Member) commented Oct 4, 2019

After a few more minutes (nearly 45 minutes from the beginning), I get this:

132c09b9-7c84-4adc-ab0a-5de35248d8a4 trandetails is TranCount = 2, Xact = 1, TranId = 187726, WE'RE ABOUT TO CREATE INCONSISTENT DATA
132c09b9-7c84-4adc-ab0a-5de35248d8a4 System.ServiceModel.ProtocolException: The transaction has aborted. ---> System.Transactions.TransactionAbortedException: The transaction has aborted. ---> System.Transactions.TransactionPromotionException: Failure while attempting to promote transaction. ---> Microsoft.Data.SqlClient.SqlException: The PROMOTE TRANSACTION request failed because there is no local transaction active.

But here I see TranCount = 2; is that expected?

Interestingly, the database table [ImportantBusinessData] contains this entry:

This should not be committed ever, since the WCF call below always fails.. Transaction details: TranCount = 0, Xact = 1, TranId = 187746

@ramonsmits

Running it with 1.1.0-preview1.19275.1 resulted in 8 inserts in 58 minutes that should not have happened.

@cheenamalhotra (Member)

I got 6 inserts in 2 hours. Looking into it.

@samegutt commented Oct 4, 2019

@cheenamalhotra, TranCount = 2 only means that the current iteration has a transaction session left over from another iteration, and that it is not yet aborted. If you add logging to the database, as I have in my original repo, it becomes clear that different iterations (with different CorrelationIds) have inserted logs with the same TransId (TransId is really select CURRENT_TRANSACTION_ID()). The query that illustrates this is: select TransId, count(*) from LogData group by TransId having count(*) > 1

If the insert into the ImportantBusinessData table were executed when TranCount = 2, it would at least have a transaction, and it would be rolled back when that transaction was aborted. But more often than not, the transaction is already aborted at the time of the insert into ImportantBusinessData. @@trancount is 0 at that moment, and we have no transaction.

@samegutt commented Oct 4, 2019

@cheenamalhotra, the reason you see a different TransId in the console app compared to what was incorrectly inserted into the database is that select CURRENT_TRANSACTION_ID() returns a new value each time when @@trancount = 0.

@samegutt commented Oct 4, 2019

About the time it takes to reproduce: I have noticed that if I leave my computer while the setup is running, it takes noticeably longer to reproduce. When left running overnight, it might run for several hours without any incorrect behavior, only for a burst of inconsistencies to occur the moment I wiggle the mouse and wake the screen. Running the setup on a "busy" computer seems to be the best way to consistently reproduce the problem.

@scale-tone (Author)

@cheenamalhotra ,

I do not see any difference in behavior with the PR you mentioned. The same assert is hit immediately. If it is disabled, the same exception occurs almost instantly.

As I mentioned, I expect you to compile the repro against the latest sources, not against NuGet packages.

As I mentioned, the quickest way to repro is to start and stop LoopHammer.exe multiple times. Please pay attention.

@cheenamalhotra (Member)

Hi @scale-tone

Yes. I looked deeper into the repro later and figured out it's targeting .NET Framework, so the PR has no relation to it.

Also, the NuGet packages only appear to behave differently because of the randomness of the issue. I know nothing has changed on the .NET Framework side between these versions, and the issue does occur with the current source too.

I will continue to investigate; I just wanted to clarify that I'm on the same page: the issue is reproduced, its frequency is intermittent, and it mostly happens on a busy machine.

@ramonsmits

Please validate whether the WCF client needs to be disposed. The catch block does an Abort(), but as I remember from the past, the clients require a Dispose().

finally
{
    ((IDisposable)serviceClient).Dispose();
}

Please make sure that you use a new connection string (and thus a fresh connection pool) so that you will not be using stale, dirty connections.

@scale-tone (Author) commented Oct 7, 2019

Good point, @ramonsmits, but no, this extra explicit Dispose() doesn't help - still the same exceptions.

And can you please elaborate on "new connection pool string"? What's that?

@scale-tone (Author)

Two months have passed. Do we have any progress?

@ramonsmits

And can you please elaborate on "new connection pool string"? What's that?

@scale-tone A SQL connection can be 'invalid' and still be returned to the connection pool. As we are dealing with strange transaction behavior, I made that comment to ensure that this would not be happening. By adding, for example, ;App=2 to the connection string, you make sure that connections come from another pool.
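In other words, something like this (a hypothetical example; any distinct App / Application Name value gives you a separate pool):

// Same server and credentials, but a different application name means a different connection pool,
// so no connections left over from a previous test session are reused.
var connectionString = baseConnectionString + ";App=2";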

@scale-tone (Author)

@ramonsmits, as I understand it, what you're proposing is to get rid of the connection pool as such (by just using a unique connection string every time). A simpler way to achieve that is described in the WORKAROUND section above - by adding "Pooling=false" to the connection string. That helps, but it certainly hurts performance severely.

And I need to underscore again that the root bug has all the signs of a security issue (think of some sort of multi-tenant scenario where tenants might be re-using connections from a pool), therefore I do expect much more attention to this from the team.

@ramonsmits

@scale-tone I'm aware of the workaround you mentioned. I made my comment about the connection pool in relation to my "explicit Dispose()" comment, to ensure it would be tested with a fresh new pool not containing any bad connections from a previous test session.

I agree that this issue seems severe enough to get more attention as it leads to data corruption.

@scale-tone (Author)

@cheenamalhotra ,

Although I can confirm that with your proposed change the issue stops being reproducible, I have great doubts that it is that simple, because:

  1. If you debug the code with your change, you'll immediately see that it keeps producing a large number of the following exceptions (they are being suppressed, but are still very much visible):

[screenshot: suppressed exceptions shown in the debugger]

Which is quite understandable, because what you're now doing is sending something into a transaction that doesn't exist. Are you saying this is expected (or at least was expected by the author of that line)? I really doubt that.

  2. The usage pattern for the EnlistNull() method looks opposite to what you're proposing. Search for usages of SqlInternalConnection.EnlistNull().

  3. Are you saying your change doesn't break any of the numerous and comprehensive tests this codebase has?! I cannot believe that... At the very least, tons of integration tests should be broken by this.

So I would rather say that your proposed change introduces some side effect (probably extra delays due to the many exceptions thrown, or connections being reset) which makes the issue less visible. But the root cause must be something different.

@cheenamalhotra (Member)

@scale-tone

Yes you're right. The usage pattern in other places doesn't suggest this needs any change here.

The change somehow got me confused and amused, as the Debug.Assert stopped failing and so did the repro. Also, I didn't come across any new exceptions; I was running Debug too and will recheck. Thanks for the quick testing!

I'll continue to look further.

@cheenamalhotra (Member)

@ramonsmits

Can this be backtracked to any previous changes, as this issue was not always present AFAIK?

I think, as reported above, this issue is reproduced in earlier versions of .NET Framework too. We will backport the fix to the latest .NET Framework version once the issue is identified and confirmed.

@scale-tone

Are you aware of any working version of SqlClient or .NET Framework that doesn't reproduce this issue? That would make the investigation a bit easier.

@cheenamalhotra (Member) commented Nov 29, 2019

There's something more I'd like to share:

Adding the line below right after creating the TransactionScope in the repro also fixes the issue, as it promotes the transaction to an MSDTC transaction. It's mainly used in the context of flowing transactions across app domains, but maybe it works out in this case of parallelism?

TransactionInterop.GetTransmitterPropagationToken(Transaction.Current);

cc @roji @saurabh500

@cheenamalhotra (Member)

Hi @scale-tone

Interestingly, the stack trace above shows the trigger point of this error to be this method:

System.Transactions.TransactionInterop.GetTransmitterPropagationToken(Transaction transaction)

Which suggests that, by the time this line gets executed on busy machines, the transaction has already been rolled back/aborted. So making this call early is safe, as the token is generated immediately with the TransactionScope instantiation and can be used internally by System.Transactions when needed.

I also found official documentation here for this scenario, which discusses how the promotion of a transaction to a full distributed two-phase commit transaction occurs. Step 5 confirms the need for a propagation token.

I've also found another blog post discussing similar problems with TransactionScope and multi-threading, where a propagation token was needed to promote the transaction to DTC.

I would recommend making this call right after the transaction scope is initiated; that is also confirmed to solve all the issues:

TransactionInterop.GetTransmitterPropagationToken(Transaction.Current);
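That is, roughly (a sketch only; the surrounding code stands in for the repro's handler):

using (var scope = new TransactionScope())
{
    // Force early promotion to a distributed (MSDTC) transaction,
    // before the first SqlConnection enlists in the scope.
    TransactionInterop.GetTransmitterPropagationToken(Transaction.Current);

    // ... SQL statements and the transactional WCF call as before ...

    scope.Complete();
}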

@cheenamalhotra (Member)

Hi @scale-tone

Waiting for your response: does the above comment resolve the issue?

@scale-tone (Author)

Waiting for your response: does the above comment resolve the issue?

No, it doesn't, @cheenamalhotra. What you're proposing is yet another workaround. Effectively, your proposed line promotes the transaction to a distributed one early, while normally that transaction is automatically promoted to a distributed one inside the MakeTransactionalWcfCallThatThrows() call (if it were not promoted at that point, the whole scenario wouldn't work).

SqlClient is expected to do that automatic promotion for us, and it is expected to do it correctly; in particular, it should never lead to data loss (no matter what internal threading issues it experiences).

Besides, OK, we could add that line to this particular place in our code. Would you then expect us to pepper the entire codebase with that line (how would we know when this early promotion is needed and when it is not)?

We do not need another workaround, we already have one. We need an ultimate fix, which is now less important for us, but far more important for other customers, who might not even know that they're losing data due to this!

@samegutt commented Dec 6, 2019

@cheenamalhotra so far, this issue is reproducible on all .NET Framework versions since (and including) 3.0. That is the first framework version with WCF, which is a key component in this setup.

I have pushed a branch to my repo with the changes needed for that version.

The exception thrown in the .NET 3.0 example:

System.ServiceModel.ProtocolException: The transaction has aborted. ---> System.Transactions.TransactionAbortedException: The transaction has aborted. ---> System.Transactions.TransactionPromotionException: Failure while attempting to promote transaction. ---> System.Data.SqlClient.SqlException: The PROMOTE TRANSACTION request failed because there is no local transaction active.
at System.Data.SqlClient.SqlInternalConnection.OnError(SqlException exception, Boolean breakConnection, Action`1 wrapCloseInAction)
at System.Data.SqlClient.TdsParser.ThrowExceptionAndWarning(TdsParserStateObject stateObj, Boolean callerHasConnectionLock, Boolean asyncClose)
at System.Data.SqlClient.TdsParser.TryRun(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj, Boolean& dataReady)
at System.Data.SqlClient.TdsParser.Run(RunBehavior runBehavior, SqlCommand cmdHandler, SqlDataReader dataStream, BulkCopySimpleResultSet bulkCopyHandler, TdsParserStateObject stateObj)
at System.Data.SqlClient.TdsParser.TdsExecuteTransactionManagerRequest(Byte[] buffer, TransactionManagerRequestType request, String transactionName, TransactionManagerIsolationLevel isoLevel, Int32 timeout, SqlInternalTransaction transaction, TdsParserStateObject stateObj, Boolean isDelegateControlRequest)
at System.Data.SqlClient.SqlInternalConnectionTds.ExecuteTransactionYukon(TransactionRequest transactionRequest, String transactionName, IsolationLevel iso, SqlInternalTransaction internalTransaction, Boolean isDelegateControlRequest)
at System.Data.SqlClient.SqlDelegatedTransaction.Promote()
--- End of inner exception stack trace ---
at System.Data.SqlClient.SqlDelegatedTransaction.Promote()
at System.Transactions.TransactionStatePSPEOperation.PSPEPromote(InternalTransaction tx)
at System.Transactions.TransactionStateDelegatedBase.EnterState(InternalTransaction tx)
--- End of inner exception stack trace ---
at System.Transactions.TransactionStateAborted.CheckForFinishedTransaction(InternalTransaction tx)
at System.Transactions.Transaction.Promote()
at System.Transactions.TransactionInterop.ConvertToOletxTransaction(Transaction transaction)
at System.Transactions.TransactionInterop.GetTransmitterPropagationToken(Transaction transaction)
at System.ServiceModel.Transactions.WsatTransactionFormatter.WriteTransaction(Transaction transaction, Message message)
at System.ServiceModel.Channels.TransactionChannel`1.WriteTransactionToMessage(Message message, TransactionFlowOption txFlowOption)
--- End of inner exception stack trace ---

@cheenamalhotra (Member)

Hi @jimcarley @corivera @geleems

Could you take a look and provide some insight into this issue reported by a customer? It seems to be reproducible in System.Data.SqlClient with all .NET Framework versions starting from 3.0, as mentioned above.

@ramonsmits

I tried the TransactionInterop.GetTransmitterPropagationToken(Transaction.Current); approach after connecting to the resource, but it seems that this really must be done before using the first resource manager.

This means that this workaround requires the ability to control the creation of the TransactionScope.

@jimcarley (Member)

I have only skimmed over this thread. But when I saw the comment about "early" promotion of the transaction, it made me think of something.

We had a situation where one of the enlistments, either volatile or durable, was an "EnlistDuringPrepareRequired" enlistment. This means that after Commit processing was started, a new enlistment might be attempted during "Phase 0" of the commit. When the transaction commit was delegated to SQL Server through the EnlistPromotableSinglePhase enlistment, there was a problem. SQL Server would start the commit processing by issuing the commit request to MSDTC and MSDTC would discover the Phase0 enlistment and tell that enlistment to prepare first. As part of the prepare processing, that Phase0 enlistment would try to create a new enlistment with SQL Server. But SQL Server would reject it because it thought the transaction was already in the process of committing. In other words, SQL Server didn't support the idea of a Phase0 enlistment.
I don't know if this was ever resolved. The workaround for this was to force "early" promotion of the transaction, before the first EnlistPromotableSinglePhase was attempted by SqlClient. That way, the commit processing would not be delegated to SQL Server, but rather go directly to MSDTC.

But it doesn't sound like this current issue is related to that. This definitely sounds like some sort of issue with the connection pooling. And it also sounds related to the situation where the transaction aborts before the first SQL operation and so a separate SQL transaction is used for the SQL operations, but when the WCF work is attempted, the aborted System.Transactions transaction is discovered. The SQL transaction(s) aren't affected and therefore commit.

Can someone modify the repro to have the application "register" for the Transaction.Completed event and then log the time and outcome when that event is fired? This might help narrow down why the transaction might be aborting early, before the SQL enlistments are made?
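A minimal sketch of such a registration (the logging format is illustrative):

// Right after creating the TransactionScope, while Transaction.Current is still available:
Transaction.Current.TransactionCompleted += (sender, e) =>
    Console.WriteLine($"{DateTime.UtcNow:O} transaction completed with status {e.Transaction.TransactionInformation.Status}");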

Also, I don't recall if there is some option to specify in the connection string that disables the automatic "switch" to using a SQL Transaction if the System.Transactions transaction is not usable at the time of the enlistment.

@samegutt

@jimcarley
When handling the TransactionCompleted event in my repo, the problem never occurs - meaning all transactions are properly rolled back.

@ramonsmits

I added the following and the issue is still happening:

Transaction.Current.TransactionCompleted += (s, e) => Console.Out.WriteLineAsync($"{Start.Elapsed}: {e.Transaction.TransactionInformation.Status}");

The reported status is always Aborted.

The only way to currently solve this is by adding

TransactionInterop.GetTransmitterPropagationToken(Transaction.Current);

@cheenamalhotra (Member) commented Dec 31, 2019

Hi @jimcarley

I've been debugging this issue over and over, and I've noticed strange behaviour with pooled connections. I captured SQL Profiler traces for one round of the cycle, for the 3 scenarios below:

  • Pooled Connections (Bug) - PooledConnection
  • Non-Pooled Connections - Non-PooledConnection
  • Pooled Connections + Propagation Token generated with Current Transaction - PooledConnection-PropogationToken

All further loops lead to the same pattern over and over again.
As you can see, when working with pooled connections, the local transaction does not get propagated until the end of the cycle. On debugging, that seems to happen when ServiceClient.DoWork() is called. Since the service is made to throw an exception, the cycle terminates with a propagated transaction.

I looked into the ServiceClient code and found this attribute used on IService1.DoWork:

[System.ServiceModel.TransactionFlowAttribute(System.ServiceModel.TransactionFlowOption.Allowed)]
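For context, the operation contract presumably looks roughly like this (a sketch; only the DoWork signature and the attribute above are confirmed by the repro and the stack trace):

[ServiceContract]
public interface IService1
{
    [OperationContract]
    [TransactionFlow(TransactionFlowOption.Allowed)]
    void DoWork(string correlationId);
}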

Upon removing this attribute (only for investigation purposes), I see a different pattern where the propagation does not happen and the transaction is rolled back when the exception occurs:

PooledConnection-DoWorkModified

Which makes sense, since the WCF service forces transaction promotion in this case. I've also attached traces below that can aid the investigation.
Traces.zip

At this point I'm not sure what the expected behaviour is, since this issue has occurred only in WCF applications like the repro, and it has been observed with all versions of SqlClient. To me it seems like the transactions should be promoted to DTC in the first place, but since these are pooled connections, their behaviour is different here.

Appreciate your insights on the same!

@cheenamalhotra (Member) commented Dec 31, 2019

Another note I'd like to add specifically for pooled connections:

[This happens in the second round and onwards - post Promote Tran triggered by WCF Call]
When the connection is grabbed from the pool and re-activated, the server first sends, during activation, "EnvChange token Type 17: Transaction Ended" followed by "EnvChange token Type 8: Begin". As per the Microsoft documentation:

ENVCHANGE type 17 is sent when a batch is used that specified a descriptor for a transaction that has ended. This is only sent in the bound session case.

The driver somehow doesn't know how to handle this state, and marks it as "Unknown" here:
https://github.com/dotnet/SqlClient/blob/master/src/Microsoft.Data.SqlClient/netfx/src/Microsoft/Data/SqlClient/TdsParser.cs#L2504-L2505

After that, the local copy of the pending transaction is re-used for addressing the Begin transaction.

This activation of the pooled connection also produces a "Warning" from SQL Server:
"The transaction active in this session has been committed or aborted by another session." - which is not thrown, since it is not an error.

@ramonsmits

@scale-tone Did you verify whether 2.0.0-preview3 resolves your issue?

@scale-tone (Author)

@ramonsmits, I don't see any data loss happening anymore, yes.
But instead I'm occasionally observing this beautiful NRE:

673705f1-2f3b-44af-a9bd-798d70aaed9d System.NullReferenceException: Object reference not set to an instance of an object.
at Microsoft.Data.SqlClient.SqlDelegatedTransaction.Rollback(SinglePhaseEnlistment enlistment)
at System.Transactions.TransactionStateDelegatedAborting.EnterState(InternalTransaction tx)
at System.Transactions.Transaction.Rollback()
at System.Transactions.TransactionScope.InternalDispose()
at System.Transactions.TransactionScope.Dispose()
at LoopHammer.DostuffHandler.Handle()

@cheenamalhotra, as usual, you can easily see this yourself if you compile the repro against SqlClient 2.0.0-preview3 and run/stop/run it several times.

So the bug still isn't fully fixed.

@cheenamalhotra (Member) commented May 14, 2020

@scale-tone

We found it was introduced with the fix, when logging information; it is now fixed with #563.
Thanks for letting us know!

Please give the current 'master' branch a try and let us know if you find any other concerns.

@rygwdn commented Oct 8, 2020

@cheenamalhotra Do you know if this change will be back-ported to System.Data.SqlClient in .NET Framework (4.8) as well?

@cheenamalhotra (Member)

@cheenamalhotra Do you know if this change will be back-ported to System.Data.SqlClient in .NET Framework (4.8) as well?

@saurabh500 Could you provide an update?
