Attempt to fix Promote Transaction issue #495

cheenamalhotra · 2020-03-28T08:02:29Z

Fixes #237

After more investigations we found that this issue occurs due to SqlClient's inability to handle "Transaction Aborted" callback in SqlDelegatedTransaction. When transaction promotion is triggered by WCF Service after first set of commands, the transaction is also requested to be aborted due to exception scenarios from SysTx.Transactions.

The driver receives callback when transaction gets aborted, but it sends this connection back to the pool, without updating its "Current Transaction" state, which gets picked up later by other new connection requests. In the second round of transaction scope, it does not respect new transaction scope created for the new connection request because there was no "currentTransaction" expected.
Hence the Debug.Assert failed all the time.

Moving on with this "old" transaction, which is aborted externally, we also notice warning from SQL Server as it sends "TransactionEnded" token followed by "Begin" to initiate new transaction request, Everytime! as explained here: #237 (comment)

Most of the times we get lucky, that the transaction was aborted by SQL Server, before we tried to use it again and it always gave us new transaction to work with. But in 1% scenarios, where the error happened, server got a little late to end this transaction and we end up enlisting our connection for 2nd Transaction Scope in the 1st transaction. But an abort is pending on server side, so as soon as it aborts the transaction, all further requests to server are free of the 1st "TransactionScope", 2nd TransactionScope had no links to this round of queries, and queries get executed with "Implicit" transactions, hence the "Commits".

Everything is linked to the fact that this connection was sent back to "GeneralPool" without cleaning it's transaction state where only connections WITHOUT any "Active" transaction reference must reside.

Since this connection has enlisted transaction that got promoted to DTC but was recently aborted externally, safest approach in my opinion is to Doom this connection as soon as we capture the aborted state of current transaction to block sending back to the pool.

I couldn't find any other clean way to reset this connection properly in order to send it back to General Pool. Would welcome suggestions if that can achieved!

Please give the below package a try, that contains fix:
Microsoft.Data.SqlClient.2.0.0-dev-437.nupkg.zip

cc @saurabh500 @David-Engel @scale-tone

Wraith2 · 2020-03-28T08:56:27Z

This must have been a nightmare to trace.

When you say that "The driver receives callback when transaction gets aborted" do you mean that TransactionEnded is called? If so isn't it possible that from there you could call connection.CleanupConnectionOnTransactionCompletion here as is done in the Rollback implementation?

It might also be worth adding some checks in the pool return function for various pieces of dangling state and dooming the connection at the return to pool stage. If done this way any other errors from the inappropriate release could be forced safe without finding all possible caller combinations and we could have have it assert in tests.

scale-tone · 2020-03-28T12:39:22Z

Nice to see progress on this (think it's safe to say) critical issue, @cheenamalhotra . Still we don't seem to have full understanding of the mechanics.

If you go up in the stack, you'll see that another attempt to properly cleanup that connection before returning to the pool is being made:

            CleanupConnectionOnTransactionCompletion(transaction);

Do you have an idea, why that attempt isn't successful? If that line effectively does nothing, then why is it in place after all?

I can confirm that with the proposed DoomThisConnection() line added the issue stops happening. But ObjectDisposedException still occurs. Why is that?

@Wraith2 , CleanupConnectionOnTransactionCompletion() doesn't help. With that line added the issue still occurs (can't explain, why).

saurabh500 · 2020-03-28T17:37:46Z

src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/SqlDelegatedTransaction.cs

@@ -413,6 +413,8 @@ internal void TransactionEnded(Transaction transaction)
                        _active = false;
                        _connection = null;
                    }
+                    // Safest approach is to doom this connection, whose transaction has been aborted externally.
+                    connection.DoomThisConnection();


Does this mean that we doom the connection in all cases of external transaction abort and not just in doubtful cases like the one we were chasing?

Yes, this is required to prevent driver trying to reuse the pooled connection and attempting to reuse Open Transaction at client side. The problem is even if this transaction was aborted, the driver got notified, it still didn't mark it "Aborted" and continues to re-use next time.
[This is current behavior applies to all cases, which should be fixed].

If you can suggest an alternative way to mark "Current Transaction" as "Aborted" before sending connection to "GeneralPool", also ignoring that next time, that would also be a viable solution. I couldn't get this to work due to lot of mixed cases that conflict with each other.

Is it possible that we could store whether we expect the transaction to be aborted or not and then if we find that an abort event is received without us expecting one then we know we need to cleanup? This was what I thought calling connection.CleanupConnectionOnTransactionCompletion would do but apparently it does not which seems strange.

I am trying to understand what the user will see with this fix. If the connection is reused from the pool, and it gets an abort notification later while it is being used, then we are going to doom the connection in the middle of its usage right ?

Before a connection with existing transaction is being reused, I think we should expect the server to drive this transaction to completion either with Abort or Complete or maybe another state. The problem is that we see the abort after we have started using the connection, and perhaps the driver can conclude that for this connection's existing transaction, the driver never received a response from the server about the state of the older transaction, and that could be used as a mechanism to not use the connection.

I am trying to understand what the user will see with this fix. If the connection is reused from the pool, and it gets an abort notification later while it is being used, then we are going to doom the connection in the middle of its usage right ?

No, this connection is picked up from "Transaction Pool" and moved to "General Pool" when "Abort" happens, it's not in-use in that time. User will not notice anything, except performance lag due to a new physical connection being made in this case.

Before a connection with existing transaction is being reused, I think we should expect the server to drive this transaction to completion either with Abort or Complete or maybe another state. The problem is that we see the abort after we have started using the connection...

Since we discussed this later, I'll just put comments to clarify for readers: The service that notifies driver about Abort also notifies server for this Abort action. It just happens that since this is asynchronous callback, and if server is already doing something on that thread, it gets a a little late to perform abort. The problem actually is we see Abort first, but we don't clean the connection's transaction state, which leads to transaction re-use.

cheenamalhotra · 2020-04-02T00:12:04Z

@scale-tone

I looked up more into the CleanupConnectionOnTransactionCompletion method and it does not cleanup completely the "CurrentTransaction" associated with "SqlInternalConnectionTds" object instance. I was able to modify the code to do so, and could clean first connection's transaction reference to be transferred back to "General Pool" safely.

However, doing so does not work well in multi-threaded scenarios, and seems like there's more to this problem than we're looking at from face-value if we try to set things right here.

Alternatively, I'm also going to make this PR more robust to avoid any possible thread-safety issues so we can have one solution to address this problem, while the other can still be investigated, where we can allow re-use of connection.

If that line effectively does nothing, then why is it in place after all?

This is effective to remove the connection from Transaction Pool, but we definitely need to groom it to do only necessary steps in this case.

But ObjectDisposedException still occurs.

I will come back on this, after I sanitize the code to do only required tasks on "Abort". It seems to be related in initial glance. Could you also share your stacktrace to confirm I'm also looking at the same error.

src/Microsoft.Data.SqlClient/netfx/src/Microsoft/Data/SqlClient/SqlDelegatedTransaction.cs

cheenamalhotra · 2020-04-03T01:15:11Z

@scale-tone

I did manage to fix the problem with ObjectDisposedException as it was thrown due to race condition scenarios between Rollback and Abort. I did not find the need to modify code any further, and I also verified with DB Performance Pool Counters we don't have any leaks, the connections are getting cleaned up as expected.

However, there may be cases where connection takes time to connect and may lead to timeout if running too many parallel executions, increasing connection timeout is a viable solution in those scenarios.

The alternate solution of cleaning the connection state and sending it back to the General pool needs deep investigations as that leads to big design changes in the driver. I will get back with progress on that one, but would also request you to verify this option as solution.

Uploading updated NuGet package here for testing:
Microsoft.Data.SqlClient.2.0.0-dev-pr495-1.zip

saurabh500 · 2020-04-03T01:19:50Z

@cheenamalhotra Question about the approach: Is this approach of dooming the connection on ending transaction final?

There has to be a different approach considered, where if the transaction is not ended, then we wait for it to end, and possibly not return this connection to the pool till its related transaction has ended.
The DTC and potentially not DTC customers using SqlTransaction will hit a performance problem while retrieving connections from the pool. The change effectively disables pooling for all transacted connections.

cheenamalhotra · 2020-04-03T01:31:04Z

Hi @saurabh500

I am investigating that option too, but wanted to make 1 solution a complete fix that can be considered as a solution if we don't find a working solution with alternate solution as it's more important to fix the unwanted "Commits" than to worry about performance which is a very edge case scenario. Also this is not going to completely stop Pooling on DTC Transactions. We will only doom the connection if it has been Aborted externally.

In normal Commit/Rollback cases, we will continue to pool the connection as always.

The alternate solution is a can of worms which is more tricky due to the design of "Transaction Pool" that differs from "General Pool". We may have to do a lot of changes to achieve the right solution we're looking for. It's in progress, but is more time consuming too. If it gets too complex, we may have to take a call if it's worth the effort.

saurabh500 · 2020-04-03T01:57:22Z

I see, I agree that the perf problem will hit in case of a DTC transaction abort. Sure.

Here's how I am looking at this situation, which makes performance very important in failure cases.
An application is using DTC to coordinate transactions between SQL Server and another system i.e. a WCF service in this case.

Let's say that the transactions start aborting due to WCF service throwing errors due to some issues (networking / bad configuration deployment / something else not working, failover testing). All the connection participating in those transactions will be doomed and new connections will need to be created the next time a sql interaction is needed. This will overload the connection pool causing timeouts in the connection pool in scenarios where some other part of the service is experiencing problems. So the problem will shift from WCF throwing exceptions to application receiving SqlException during transient errors in some other part of the application.

The connection pool is single threaded for each pool and it will have a hard time providing connections in case of a WCF service or another component participating in DTC, experiencing short durations of outages.
Couple this with the connection pool blocking period where the applications will start seeing a lot of timeouts while trying to get connections from the pool.
These edge case perf issues will cause outages for the applications and investigations about why a call to WCF failed will then become why is SqlClient timing out in case of failures in some other parts of the system even though the root cause of WCF outage was fixed.

saurabh500 · 2020-04-03T02:26:54Z

@cheenamalhotra my comment above was not to dissuade fixing this issue in steps. It was something I wanted to list down before I lost my train of thought.

I do agree that this fix will at guarantee correctness which is high priority item in the light of this bug you have been working on.

By the end of this investigation if you feel that alternate solution is way more disruptive, then could you open an issue to list a performance impact and possible solutions to address the edge case perf case instead of Dooming the connection, that would be nice?
At least folks hitting any issues can search the internet and come to github to find out what is the problem and the related discussion. And if this perf problem becomes a real issue later down the line, a developer looking at this issue will have context. Investigating this problem as a perf issue later is going to be harder than debugging this issue itself.

Since you have rich context around this issue, I think we should definitely list down your findings in another issue as perf issue, and the potential approaches you have tried and why they failed.

scale-tone · 2020-04-03T11:17:40Z

Could you also share your stacktrace to confirm I'm also looking at the same error.

It is clearly stated in the bug:

cheenamalhotra · 2020-04-27T17:48:41Z

Opening a new PR with same branch as CI doesn't seem to trigger here.

Fix Promote Transaction issue

c7026fb

saurabh500 reviewed Mar 28, 2020

View reviewed changes

cheenamalhotra force-pushed the promoteTran branch from 8404641 to 45ee87a Compare April 3, 2020 00:00

Changes to not throw exception is Abort cannot be done.

2f59a6f

cheenamalhotra force-pushed the promoteTran branch from 45ee87a to 2f59a6f Compare April 3, 2020 00:02

saurabh500 reviewed Apr 3, 2020

View reviewed changes

src/Microsoft.Data.SqlClient/netfx/src/Microsoft/Data/SqlClient/SqlDelegatedTransaction.cs Outdated Show resolved Hide resolved

Port changes to NetCore

f34cdf8

cheenamalhotra force-pushed the promoteTran branch 6 times, most recently from 0c17d90 to 3940dbb Compare April 8, 2020 20:05

Additional checks to handle external connection abort.

2fd5e75

cheenamalhotra force-pushed the promoteTran branch from 3940dbb to 2fd5e75 Compare April 8, 2020 22:13

Edit

7259dc2

cheenamalhotra closed this Apr 16, 2020

cheenamalhotra reopened this Apr 16, 2020

cheenamalhotra closed this Apr 27, 2020

cheenamalhotra mentioned this pull request Apr 27, 2020

Fix | Avoid enlistment of pooled connection in aborted transaction #543

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Attempt to fix Promote Transaction issue #495

Attempt to fix Promote Transaction issue #495

cheenamalhotra commented Mar 28, 2020 •

edited

Loading

Wraith2 commented Mar 28, 2020

scale-tone commented Mar 28, 2020

saurabh500 Mar 28, 2020

cheenamalhotra Mar 30, 2020

Wraith2 Mar 30, 2020

saurabh500 Mar 30, 2020

saurabh500 Mar 30, 2020

cheenamalhotra Apr 1, 2020

cheenamalhotra commented Apr 2, 2020 •

edited

Loading

cheenamalhotra commented Apr 3, 2020 •

edited

Loading

saurabh500 commented Apr 3, 2020

cheenamalhotra commented Apr 3, 2020 •

edited

Loading

saurabh500 commented Apr 3, 2020

saurabh500 commented Apr 3, 2020 •

edited

Loading

scale-tone commented Apr 3, 2020

cheenamalhotra commented Apr 27, 2020

Attempt to fix Promote Transaction issue #495

Attempt to fix Promote Transaction issue #495

Conversation

cheenamalhotra commented Mar 28, 2020 • edited Loading

Wraith2 commented Mar 28, 2020

scale-tone commented Mar 28, 2020

saurabh500 Mar 28, 2020

Choose a reason for hiding this comment

cheenamalhotra Mar 30, 2020

Choose a reason for hiding this comment

Wraith2 Mar 30, 2020

Choose a reason for hiding this comment

saurabh500 Mar 30, 2020

Choose a reason for hiding this comment

saurabh500 Mar 30, 2020

Choose a reason for hiding this comment

cheenamalhotra Apr 1, 2020

Choose a reason for hiding this comment

cheenamalhotra commented Apr 2, 2020 • edited Loading

cheenamalhotra commented Apr 3, 2020 • edited Loading

saurabh500 commented Apr 3, 2020

cheenamalhotra commented Apr 3, 2020 • edited Loading

saurabh500 commented Apr 3, 2020

saurabh500 commented Apr 3, 2020 • edited Loading

scale-tone commented Apr 3, 2020

cheenamalhotra commented Apr 27, 2020

cheenamalhotra commented Mar 28, 2020 •

edited

Loading

cheenamalhotra commented Apr 2, 2020 •

edited

Loading

cheenamalhotra commented Apr 3, 2020 •

edited

Loading

cheenamalhotra commented Apr 3, 2020 •

edited

Loading

saurabh500 commented Apr 3, 2020 •

edited

Loading