Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Attempt to fix Promote Transaction issue #495

Closed
wants to merge 5 commits into from

Conversation

cheenamalhotra
Copy link
Member

@cheenamalhotra cheenamalhotra commented Mar 28, 2020

Fixes #237

After more investigations we found that this issue occurs due to SqlClient's inability to handle "Transaction Aborted" callback in SqlDelegatedTransaction. When transaction promotion is triggered by WCF Service after first set of commands, the transaction is also requested to be aborted due to exception scenarios from SysTx.Transactions.

The driver receives callback when transaction gets aborted, but it sends this connection back to the pool, without updating its "Current Transaction" state, which gets picked up later by other new connection requests. In the second round of transaction scope, it does not respect new transaction scope created for the new connection request because there was no "currentTransaction" expected.
Hence the Debug.Assert failed all the time.

Moving on with this "old" transaction, which is aborted externally, we also notice warning from SQL Server as it sends "TransactionEnded" token followed by "Begin" to initiate new transaction request, Everytime! as explained here: #237 (comment)

Most of the times we get lucky, that the transaction was aborted by SQL Server, before we tried to use it again and it always gave us new transaction to work with. But in 1% scenarios, where the error happened, server got a little late to end this transaction and we end up enlisting our connection for 2nd Transaction Scope in the 1st transaction. But an abort is pending on server side, so as soon as it aborts the transaction, all further requests to server are free of the 1st "TransactionScope", 2nd TransactionScope had no links to this round of queries, and queries get executed with "Implicit" transactions, hence the "Commits".

Everything is linked to the fact that this connection was sent back to "GeneralPool" without cleaning it's transaction state where only connections WITHOUT any "Active" transaction reference must reside.

Since this connection has enlisted transaction that got promoted to DTC but was recently aborted externally, safest approach in my opinion is to Doom this connection as soon as we capture the aborted state of current transaction to block sending back to the pool.

I couldn't find any other clean way to reset this connection properly in order to send it back to General Pool. Would welcome suggestions if that can achieved!

Please give the below package a try, that contains fix:
Microsoft.Data.SqlClient.2.0.0-dev-437.nupkg.zip

cc @saurabh500 @David-Engel @scale-tone

@Wraith2
Copy link
Contributor

Wraith2 commented Mar 28, 2020

This must have been a nightmare to trace.

When you say that "The driver receives callback when transaction gets aborted" do you mean that TransactionEnded is called? If so isn't it possible that from there you could call connection.CleanupConnectionOnTransactionCompletion here as is done in the Rollback implementation?

It might also be worth adding some checks in the pool return function for various pieces of dangling state and dooming the connection at the return to pool stage. If done this way any other errors from the inappropriate release could be forced safe without finding all possible caller combinations and we could have have it assert in tests.

@scale-tone
Copy link

Nice to see progress on this (think it's safe to say) critical issue, @cheenamalhotra . Still we don't seem to have full understanding of the mechanics.

If you go up in the stack, you'll see that another attempt to properly cleanup that connection before returning to the pool is being made:

            CleanupConnectionOnTransactionCompletion(transaction);

Do you have an idea, why that attempt isn't successful? If that line effectively does nothing, then why is it in place after all?

I can confirm that with the proposed DoomThisConnection() line added the issue stops happening. But ObjectDisposedException still occurs. Why is that?

@Wraith2 , CleanupConnectionOnTransactionCompletion() doesn't help. With that line added the issue still occurs (can't explain, why).

@@ -413,6 +413,8 @@ internal void TransactionEnded(Transaction transaction)
_active = false;
_connection = null;
}
// Safest approach is to doom this connection, whose transaction has been aborted externally.
connection.DoomThisConnection();
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Does this mean that we doom the connection in all cases of external transaction abort and not just in doubtful cases like the one we were chasing?

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is required to prevent driver trying to reuse the pooled connection and attempting to reuse Open Transaction at client side. The problem is even if this transaction was aborted, the driver got notified, it still didn't mark it "Aborted" and continues to re-use next time.
[This is current behavior applies to all cases, which should be fixed].

If you can suggest an alternative way to mark "Current Transaction" as "Aborted" before sending connection to "GeneralPool", also ignoring that next time, that would also be a viable solution. I couldn't get this to work due to lot of mixed cases that conflict with each other.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is it possible that we could store whether we expect the transaction to be aborted or not and then if we find that an abort event is received without us expecting one then we know we need to cleanup? This was what I thought calling connection.CleanupConnectionOnTransactionCompletion would do but apparently it does not which seems strange.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am trying to understand what the user will see with this fix. If the connection is reused from the pool, and it gets an abort notification later while it is being used, then we are going to doom the connection in the middle of its usage right ?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Before a connection with existing transaction is being reused, I think we should expect the server to drive this transaction to completion either with Abort or Complete or maybe another state. The problem is that we see the abort after we have started using the connection, and perhaps the driver can conclude that for this connection's existing transaction, the driver never received a response from the server about the state of the older transaction, and that could be used as a mechanism to not use the connection.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am trying to understand what the user will see with this fix. If the connection is reused from the pool, and it gets an abort notification later while it is being used, then we are going to doom the connection in the middle of its usage right ?

No, this connection is picked up from "Transaction Pool" and moved to "General Pool" when "Abort" happens, it's not in-use in that time. User will not notice anything, except performance lag due to a new physical connection being made in this case.

Before a connection with existing transaction is being reused, I think we should expect the server to drive this transaction to completion either with Abort or Complete or maybe another state. The problem is that we see the abort after we have started using the connection...

Since we discussed this later, I'll just put comments to clarify for readers: The service that notifies driver about Abort also notifies server for this Abort action. It just happens that since this is asynchronous callback, and if server is already doing something on that thread, it gets a a little late to perform abort. The problem actually is we see Abort first, but we don't clean the connection's transaction state, which leads to transaction re-use.

@cheenamalhotra
Copy link
Member Author

cheenamalhotra commented Apr 2, 2020

@scale-tone

I looked up more into the CleanupConnectionOnTransactionCompletion method and it does not cleanup completely the "CurrentTransaction" associated with "SqlInternalConnectionTds" object instance. I was able to modify the code to do so, and could clean first connection's transaction reference to be transferred back to "General Pool" safely.

However, doing so does not work well in multi-threaded scenarios, and seems like there's more to this problem than we're looking at from face-value if we try to set things right here.

Alternatively, I'm also going to make this PR more robust to avoid any possible thread-safety issues so we can have one solution to address this problem, while the other can still be investigated, where we can allow re-use of connection.

If that line effectively does nothing, then why is it in place after all?

This is effective to remove the connection from Transaction Pool, but we definitely need to groom it to do only necessary steps in this case.

But ObjectDisposedException still occurs.

I will come back on this, after I sanitize the code to do only required tasks on "Abort". It seems to be related in initial glance. Could you also share your stacktrace to confirm I'm also looking at the same error.

@cheenamalhotra
Copy link
Member Author

cheenamalhotra commented Apr 3, 2020

@scale-tone

I did manage to fix the problem with ObjectDisposedException as it was thrown due to race condition scenarios between Rollback and Abort. I did not find the need to modify code any further, and I also verified with DB Performance Pool Counters we don't have any leaks, the connections are getting cleaned up as expected.

However, there may be cases where connection takes time to connect and may lead to timeout if running too many parallel executions, increasing connection timeout is a viable solution in those scenarios.

The alternate solution of cleaning the connection state and sending it back to the General pool needs deep investigations as that leads to big design changes in the driver. I will get back with progress on that one, but would also request you to verify this option as solution.

Uploading updated NuGet package here for testing:
Microsoft.Data.SqlClient.2.0.0-dev-pr495-1.zip

@saurabh500
Copy link
Contributor

@cheenamalhotra Question about the approach: Is this approach of dooming the connection on ending transaction final?

There has to be a different approach considered, where if the transaction is not ended, then we wait for it to end, and possibly not return this connection to the pool till its related transaction has ended.
The DTC and potentially not DTC customers using SqlTransaction will hit a performance problem while retrieving connections from the pool. The change effectively disables pooling for all transacted connections.

@cheenamalhotra
Copy link
Member Author

cheenamalhotra commented Apr 3, 2020

Hi @saurabh500

I am investigating that option too, but wanted to make 1 solution a complete fix that can be considered as a solution if we don't find a working solution with alternate solution as it's more important to fix the unwanted "Commits" than to worry about performance which is a very edge case scenario. Also this is not going to completely stop Pooling on DTC Transactions. We will only doom the connection if it has been Aborted externally.

In normal Commit/Rollback cases, we will continue to pool the connection as always.

The alternate solution is a can of worms which is more tricky due to the design of "Transaction Pool" that differs from "General Pool". We may have to do a lot of changes to achieve the right solution we're looking for. It's in progress, but is more time consuming too. If it gets too complex, we may have to take a call if it's worth the effort.

@saurabh500
Copy link
Contributor

I see, I agree that the perf problem will hit in case of a DTC transaction abort. Sure.

Here's how I am looking at this situation, which makes performance very important in failure cases.
An application is using DTC to coordinate transactions between SQL Server and another system i.e. a WCF service in this case.

Let's say that the transactions start aborting due to WCF service throwing errors due to some issues (networking / bad configuration deployment / something else not working, failover testing). All the connection participating in those transactions will be doomed and new connections will need to be created the next time a sql interaction is needed. This will overload the connection pool causing timeouts in the connection pool in scenarios where some other part of the service is experiencing problems. So the problem will shift from WCF throwing exceptions to application receiving SqlException during transient errors in some other part of the application.

The connection pool is single threaded for each pool and it will have a hard time providing connections in case of a WCF service or another component participating in DTC, experiencing short durations of outages.
Couple this with the connection pool blocking period where the applications will start seeing a lot of timeouts while trying to get connections from the pool.
These edge case perf issues will cause outages for the applications and investigations about why a call to WCF failed will then become why is SqlClient timing out in case of failures in some other parts of the system even though the root cause of WCF outage was fixed.

@saurabh500
Copy link
Contributor

saurabh500 commented Apr 3, 2020

@cheenamalhotra my comment above was not to dissuade fixing this issue in steps. It was something I wanted to list down before I lost my train of thought.

I do agree that this fix will at guarantee correctness which is high priority item in the light of this bug you have been working on.

By the end of this investigation if you feel that alternate solution is way more disruptive, then could you open an issue to list a performance impact and possible solutions to address the edge case perf case instead of Dooming the connection, that would be nice?
At least folks hitting any issues can search the internet and come to github to find out what is the problem and the related discussion. And if this perf problem becomes a real issue later down the line, a developer looking at this issue will have context. Investigating this problem as a perf issue later is going to be harder than debugging this issue itself.

Since you have rich context around this issue, I think we should definitely list down your findings in another issue as perf issue, and the potential approaches you have tried and why they failed.

@scale-tone
Copy link

Could you also share your stacktrace to confirm I'm also looking at the same error.

It is clearly stated in the bug:
image

@cheenamalhotra cheenamalhotra force-pushed the promoteTran branch 6 times, most recently from 0c17d90 to 3940dbb Compare April 8, 2020 20:05
@cheenamalhotra
Copy link
Member Author

Opening a new PR with same branch as CI doesn't seem to trigger here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

SqlException: The PROMOTE TRANSACTION request failed because there is no local transaction active
4 participants