-
Notifications
You must be signed in to change notification settings - Fork 291
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Attempt to fix Promote Transaction issue #495
Conversation
This must have been a nightmare to trace. When you say that "The driver receives callback when transaction gets aborted" do you mean that It might also be worth adding some checks in the pool return function for various pieces of dangling state and dooming the connection at the return to pool stage. If done this way any other errors from the inappropriate release could be forced safe without finding all possible caller combinations and we could have have it assert in tests. |
Nice to see progress on this (think it's safe to say) critical issue, @cheenamalhotra . Still we don't seem to have full understanding of the mechanics. If you go up in the stack, you'll see that another attempt to properly cleanup that connection before returning to the pool is being made:
Do you have an idea, why that attempt isn't successful? If that line effectively does nothing, then why is it in place after all? I can confirm that with the proposed DoomThisConnection() line added the issue stops happening. But ObjectDisposedException still occurs. Why is that? @Wraith2 , CleanupConnectionOnTransactionCompletion() doesn't help. With that line added the issue still occurs (can't explain, why). |
@@ -413,6 +413,8 @@ internal void TransactionEnded(Transaction transaction) | |||
_active = false; | |||
_connection = null; | |||
} | |||
// Safest approach is to doom this connection, whose transaction has been aborted externally. | |||
connection.DoomThisConnection(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Does this mean that we doom the connection in all cases of external transaction abort and not just in doubtful cases like the one we were chasing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is required to prevent driver trying to reuse the pooled connection and attempting to reuse Open Transaction at client side. The problem is even if this transaction was aborted, the driver got notified, it still didn't mark it "Aborted" and continues to re-use next time.
[This is current behavior applies to all cases, which should be fixed].
If you can suggest an alternative way to mark "Current Transaction" as "Aborted" before sending connection to "GeneralPool", also ignoring that next time, that would also be a viable solution. I couldn't get this to work due to lot of mixed cases that conflict with each other.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is it possible that we could store whether we expect the transaction to be aborted or not and then if we find that an abort event is received without us expecting one then we know we need to cleanup? This was what I thought calling connection.CleanupConnectionOnTransactionCompletion
would do but apparently it does not which seems strange.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am trying to understand what the user will see with this fix. If the connection is reused from the pool, and it gets an abort notification later while it is being used, then we are going to doom the connection in the middle of its usage right ?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Before a connection with existing transaction is being reused, I think we should expect the server to drive this transaction to completion either with Abort or Complete or maybe another state. The problem is that we see the abort after we have started using the connection, and perhaps the driver can conclude that for this connection's existing transaction, the driver never received a response from the server about the state of the older transaction, and that could be used as a mechanism to not use the connection.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am trying to understand what the user will see with this fix. If the connection is reused from the pool, and it gets an abort notification later while it is being used, then we are going to doom the connection in the middle of its usage right ?
No, this connection is picked up from "Transaction Pool" and moved to "General Pool" when "Abort" happens, it's not in-use in that time. User will not notice anything, except performance lag due to a new physical connection being made in this case.
Before a connection with existing transaction is being reused, I think we should expect the server to drive this transaction to completion either with Abort or Complete or maybe another state. The problem is that we see the abort after we have started using the connection...
Since we discussed this later, I'll just put comments to clarify for readers: The service that notifies driver about Abort also notifies server for this Abort action. It just happens that since this is asynchronous callback, and if server is already doing something on that thread, it gets a a little late to perform abort. The problem actually is we see Abort first, but we don't clean the connection's transaction state, which leads to transaction re-use.
I looked up more into the However, doing so does not work well in multi-threaded scenarios, and seems like there's more to this problem than we're looking at from face-value if we try to set things right here. Alternatively, I'm also going to make this PR more robust to avoid any possible thread-safety issues so we can have one solution to address this problem, while the other can still be investigated, where we can allow re-use of connection.
This is effective to remove the connection from Transaction Pool, but we definitely need to groom it to do only necessary steps in this case.
I will come back on this, after I sanitize the code to do only required tasks on "Abort". It seems to be related in initial glance. Could you also share your stacktrace to confirm I'm also looking at the same error. |
8404641
to
45ee87a
Compare
45ee87a
to
2f59a6f
Compare
src/Microsoft.Data.SqlClient/netfx/src/Microsoft/Data/SqlClient/SqlDelegatedTransaction.cs
Outdated
Show resolved
Hide resolved
I did manage to fix the problem with However, there may be cases where connection takes time to connect and may lead to timeout if running too many parallel executions, increasing connection timeout is a viable solution in those scenarios. The alternate solution of cleaning the connection state and sending it back to the General pool needs deep investigations as that leads to big design changes in the driver. I will get back with progress on that one, but would also request you to verify this option as solution. Uploading updated NuGet package here for testing: |
@cheenamalhotra Question about the approach: Is this approach of dooming the connection on ending transaction final? There has to be a different approach considered, where if the transaction is not ended, then we wait for it to end, and possibly not return this connection to the pool till its related transaction has ended. |
Hi @saurabh500 I am investigating that option too, but wanted to make 1 solution a complete fix that can be considered as a solution if we don't find a working solution with alternate solution as it's more important to fix the unwanted "Commits" than to worry about performance which is a very edge case scenario. Also this is not going to completely stop Pooling on DTC Transactions. We will only doom the connection if it has been Aborted externally. In normal Commit/Rollback cases, we will continue to pool the connection as always. The alternate solution is a can of worms which is more tricky due to the design of "Transaction Pool" that differs from "General Pool". We may have to do a lot of changes to achieve the right solution we're looking for. It's in progress, but is more time consuming too. If it gets too complex, we may have to take a call if it's worth the effort. |
I see, I agree that the perf problem will hit in case of a DTC transaction abort. Sure. Here's how I am looking at this situation, which makes performance very important in failure cases. Let's say that the transactions start aborting due to WCF service throwing errors due to some issues (networking / bad configuration deployment / something else not working, failover testing). All the connection participating in those transactions will be doomed and new connections will need to be created the next time a sql interaction is needed. This will overload the connection pool causing timeouts in the connection pool in scenarios where some other part of the service is experiencing problems. So the problem will shift from WCF throwing exceptions to application receiving SqlException during transient errors in some other part of the application. The connection pool is single threaded for each pool and it will have a hard time providing connections in case of a WCF service or another component participating in DTC, experiencing short durations of outages. |
@cheenamalhotra my comment above was not to dissuade fixing this issue in steps. It was something I wanted to list down before I lost my train of thought. I do agree that this fix will at guarantee correctness which is high priority item in the light of this bug you have been working on. By the end of this investigation if you feel that alternate solution is way more disruptive, then could you open an issue to list a performance impact and possible solutions to address the edge case perf case instead of Dooming the connection, that would be nice? Since you have rich context around this issue, I think we should definitely list down your findings in another issue as perf issue, and the potential approaches you have tried and why they failed. |
It is clearly stated in the bug: |
0c17d90
to
3940dbb
Compare
3940dbb
to
2fd5e75
Compare
Opening a new PR with same branch as CI doesn't seem to trigger here. |
Fixes #237
After more investigations we found that this issue occurs due to SqlClient's inability to handle "Transaction Aborted" callback in SqlDelegatedTransaction. When transaction promotion is triggered by WCF Service after first set of commands, the transaction is also requested to be aborted due to exception scenarios from
SysTx.Transactions
.The driver receives callback when transaction gets aborted, but it sends this connection back to the pool, without updating its "Current Transaction" state, which gets picked up later by other new connection requests. In the second round of transaction scope, it does not respect new transaction scope created for the new connection request because there was no "currentTransaction" expected.
Hence the Debug.Assert failed all the time.
Moving on with this "old" transaction, which is aborted externally, we also notice warning from SQL Server as it sends "TransactionEnded" token followed by "Begin" to initiate new transaction request, Everytime! as explained here: #237 (comment)
Most of the times we get lucky, that the transaction was aborted by SQL Server, before we tried to use it again and it always gave us new transaction to work with. But in 1% scenarios, where the error happened, server got a little late to end this transaction and we end up enlisting our connection for 2nd Transaction Scope in the 1st transaction. But an abort is pending on server side, so as soon as it aborts the transaction, all further requests to server are free of the 1st "TransactionScope", 2nd TransactionScope had no links to this round of queries, and queries get executed with "Implicit" transactions, hence the "Commits".
Everything is linked to the fact that this connection was sent back to "GeneralPool" without cleaning it's transaction state where only connections WITHOUT any "Active" transaction reference must reside.
Since this connection has enlisted transaction that got promoted to DTC but was recently aborted externally, safest approach in my opinion is to Doom this connection as soon as we capture the aborted state of current transaction to block sending back to the pool.
I couldn't find any other clean way to reset this connection properly in order to send it back to General Pool. Would welcome suggestions if that can achieved!
Please give the below package a try, that contains fix:
Microsoft.Data.SqlClient.2.0.0-dev-437.nupkg.zip
cc @saurabh500 @David-Engel @scale-tone