
Configurable Retry Logic - Preview 1 #693

Merged
merged 48 commits on Mar 4, 2021

Conversation

DavoudEshtehari
Member

@DavoudEshtehari DavoudEshtehari commented Aug 15, 2020

This PR was created to discuss and share the progress of the 'Configurable Retry Logic' feature proposed by Silvano Coriani.
As a first step, it is designed to cover customization of the retry provider by customers.


Overview

An application that communicates with elements running in a service, especially in the cloud, must be sensitive to the transient faults that can occur in that environment. These faults are typically self-correcting, and if the action that triggered the fault is repeated after a suitable delay, it is likely to succeed.
Since this feature is in the preview stage, it must be enabled through the following application context switch:

AppContext.SetSwitch("Switch.Microsoft.Data.SqlClient.EnableRetryLogic", true);

Goal

Apply the retry policy when opening a connection and executing a command, in the expectation that the retried operation will eventually succeed.

Proposal

There are two sections:

1. Core API

Generally, supporting retry logic on an operation requires implementing the following interfaces:
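As a rough, non-authoritative sketch of the shape such a core API could take: only SqlRetryLogicBaseProvider appears elsewhere in this PR; the remaining type and member names below are illustrative assumptions, not the final design.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Illustrative sketch only; names other than SqlRetryLogicBaseProvider are assumptions.
public class SqlRetryingEventArgs : EventArgs
{
    public int RetryCount { get; set; }   // attempts made so far
    public TimeSpan Delay { get; set; }   // delay before the next attempt
    public bool Cancel { get; set; }      // set to true to stop retrying
}

public abstract class SqlRetryLogicBaseProvider
{
    // Raised before each retry so callers can log the attempt or cancel it.
    public event EventHandler<SqlRetryingEventArgs> Retrying;

    // Wraps an operation (e.g. connection open or command execution) in the retry policy.
    public abstract TResult Execute<TResult>(object sender, Func<TResult> function);
    public abstract Task<TResult> ExecuteAsync<TResult>(
        object sender, Func<Task<TResult>> function, CancellationToken cancellationToken = default);

    protected void OnRetrying(SqlRetryingEventArgs e) => Retrying?.Invoke(this, e);
}

public abstract class SqlRetryLogicBase
{
    public int NumberOfTries { get; protected set; }                       // maximum attempts
    public Predicate<Exception> TransientPredicate { get; protected set; } // is this fault retriable?
    public abstract bool TryNextInterval(out TimeSpan intervalTime);       // next delay (fixed/incremental/exponential)
    public abstract void Reset();
}
```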

2. Configurable retry logic

A suggested implementation built on the core API is included in the driver.
This implementation is intended to cover all or part of the transient faults and will improve alongside the discussions here.
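For illustration, here is a minimal usage sketch of the built-in provider path, assuming the factory, option, and property names the feature eventually exposed (SqlConfigurableRetryFactory, SqlRetryLogicOption, RetryLogicProvider); the connection string is a placeholder.

```csharp
using System;
using Microsoft.Data.SqlClient;

class RetryDemo
{
    static void Main()
    {
        // The preview feature is gated behind the AppContext switch shown in the Overview.
        AppContext.SetSwitch("Switch.Microsoft.Data.SqlClient.EnableRetryLogic", true);

        var options = new SqlRetryLogicOption
        {
            NumberOfTries = 5,                          // total attempts, including the first one
            DeltaTime = TimeSpan.FromSeconds(1),        // base gap between attempts
            MaxTimeInterval = TimeSpan.FromSeconds(20)  // upper bound on the computed gap
        };

        // One of the pre-defined strategies included in the driver.
        SqlRetryLogicBaseProvider provider =
            SqlConfigurableRetryFactory.CreateExponentialRetryProvider(options);

        using var connection = new SqlConnection("<connection string>");
        connection.RetryLogicProvider = provider;       // retries SqlConnection.Open()
        connection.Open();

        using var command = connection.CreateCommand();
        command.RetryLogicProvider = provider;          // retries command execution
        command.CommandText = "SELECT 1";
        Console.WriteLine(command.ExecuteScalar());
    }
}
```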

Notes & Thoughts

  • Each environment has its own configuration and probable faults, so it is the customer's responsibility to monitor and detect them in order to arrive at measured configuration values before using the retry logic. It is strongly recommended not to include all transient errors without investigating the environment.

  • Customers can implement their own retry logic against the Core API if the included retry logic does not cover their requirements.

  • The best practice is to apply a specific configuration to a specific transient error according to the fault's cause. This approach gives the best performance, but it increases complexity and decreases usability (a sketch of a narrowly scoped configuration follows this list).

  • On the other hand, to simplify the retry configuration for the user, we must sacrifice some performance and apply a common configuration to all active transient errors.

  • Despite the vital role of transactions, they will initially be treated like regular connections, since full support will probably require work from the SQL Server teams.
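To make the specific-versus-common trade-off concrete, here is a hedged sketch of a narrowly scoped configuration that retries only an explicit list of error numbers. The numbers come from the transient-error table discussed later in this PR, and the option and property names assume the shape the feature eventually shipped with.

```csharp
using System;
using Microsoft.Data.SqlClient;

// Narrow configuration: only faults measured as transient in this environment are retried.
var narrowOptions = new SqlRetryLogicOption
{
    NumberOfTries = 3,
    DeltaTime = TimeSpan.FromSeconds(2),
    MaxTimeInterval = TimeSpan.FromSeconds(30),
    // 233 and 997: "an error occurred during the login process", per the error table
    // discussed in this PR. Investigate your own environment before extending this list.
    TransientErrors = new[] { 233, 997 }
};

var provider = SqlConfigurableRetryFactory.CreateIncrementalRetryProvider(narrowOptions);

// Optional monitoring hook: observe each retry attempt and the delay applied.
provider.Retrying += (sender, args) =>
    Console.WriteLine($"Retry #{args.RetryCount} after waiting {args.Delay}.");
```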

Related discussions

  • WIP: Add retry logic to connection open and command execution #307
  • Introduce SqlState on DbException for standard cross-database errors #35601
  • Database-agnostic way to detect transient database errors #34817

Resources

@ErikEJ
Contributor

ErikEJ commented Aug 15, 2020

Where is the proposal?

@roji
Member

roji commented Aug 15, 2020

There's been some previous conversation involving @scoriani in #416.

As I wrote there (#416 (comment)), I'm personally quite sceptical of the idea of a database driver attempting to internally handle resiliency; the various complexities around what to do when something fails in the middle of a transaction point towards resiliency being handled at a higher layer.

But I'm definitely interested in hearing views to the contrary - any input on how the concerns raised in #416 are to be addressed?

@cheenamalhotra
Member

cheenamalhotra commented Aug 17, 2020

@ErikEJ @roji

This is a rework of #307 - it's a work in progress, so we'll be addressing open concerns/questions here in an attempt to take this feature to completion.

Also worth mentioning: our aim is to make it an on-demand, fully customizable behavior, not something the driver will support by default.

@DavoudEshtehari
Member Author

The main aim of this PR is to identify the best predefined retry solutions and to make sure this design can support them, in addition to supporting customers' customized approaches.
As @roji mentioned, detecting the application logic at the driver level is not possible, but we can provide an infrastructure to support it. To cover these types of concerns, besides using the driver without enabling this option and managing retries at the application level, customers can implement their own logic and inject it through the configurable retry logic provider inside the driver.

@roji
Member

roji commented Aug 21, 2020

To cover these types of concerns, besides using the driver without enabling this option and managing retries at the application level, customers can implement their own logic and inject it through the configurable retry logic provider inside the driver.

I'm still very confused about how this approach will actually work, especially around transactions - how can any sort of driver retry logic handle things, regardless of whether the logic is user-injected or not? I can't see how any meaningful retry strategy can work without an additional layer on top of the driver, at which point any sort of driver support seems unneeded (or rather, unusable).

I may be missing the big picture here (or just misunderstanding what's being attempted)... would it be possible to get some sort of short description of how this concept works around transactions?

timespan & retry event
@David-Engel
Contributor

I may be missing the big picture here (or just misunderstanding what's being attempted)... would it be possible to get some sort of short description of how this concept works around transactions?

@roji The push is coming from application migrations from on-premises databases to cloud databases. There is a strong desire to get something out and provide an infrastructure for extending retry scenarios without major application rewrites. While cloud services offer uptimes like 99.99%, "outages" of a few seconds, due to failover for example, can occur at almost any time. Many of the applications being migrated were designed for a database that used to sit right next to them and rarely suffered from little "blips" like this, so designing for them was overkill.

Transaction scenarios are probably not all going to be solved in this implementation. But depending on the error returned, it may even be safe to retry an operation within a transaction. I think providing retry capabilities at least gives customers potential options short of a major refactor. This will be off by default and give customers new knobs to test with.

@DavoudEshtehari DavoudEshtehari changed the title Configurable Retry Logic - draft design [WIP] Configurable Retry Logic - draft design Sep 25, 2020
@roji
Member

roji commented Sep 25, 2020

@David-Engel I do understand why people are looking for this - it's very tempting to speak of turning a knob and getting 100% resiliency in the face of transient errors, without any code changes. Unfortunately, I think it's simply not possible to do this at such a low level in the stack (ADO.NET driver), and that resiliency is a problem that must be faced at a higher level up the stack (e.g. by something like EF or the application itself, via something like Polly).

Here are a few of the problematic scenarios I'm thinking about:

  • As already discussed, many errors cause the ongoing transaction to be aborted/fail. It is not possible to automatically retry the whole transaction from the beginning. Do we know whether all/most of the cloud failures are of the type which allows the transaction to continue, or fail?
  • Transient issues can generally occur at any time. For example, if one occurs in the middle of consuming a resultset, it does not seem possible to retry anything. The application has already read half of a resultset, and it's not possible to reliably re-execute the query and continue from that point (results may have changed in the meantime, leading to data corruption).
  • Similarly, when a command contains multiple statements (batching), an earlier statement may succeed while a later one fails. It's not possible to just re-execute the command, since the earlier, successful statement may have side effects.

There are other similar issues - this is just off the top of my head. Now, IMHO this is one of those cases where providing a very partial solution which does not cover many cases does more harm than good; people see an official "SqlClient resiliency" knob and get a very false sense of confidence, when in fact they may still be vulnerable to serious issues and data corruption in many cases.

I guess I'd investigate exactly what this would cover, and what not. For example, if SQL Azure transient exceptions do keep the transaction usable (i.e. allow retrying a single failed command), then maybe this makes sense, if the uncovered cases are clearly documented etc. If you've already done such an investigation and I'm unaware, then apologies... But without some knowledge of what exactly this will address, I think this could cause more problems than it solves.

@David-Engel
Contributor

@roji I pretty much agree with you. I think a lot of scenarios will end up being more complex than can be covered by a feature at the level of SqlClient. However, this effort is being pushed from higher up. So we are going to try to move it along to see what works and doesn't work. Please note, this feature will be under a context switch and be considered preview so that users can begin trying it and seeing where it falls down and where it helps. There are currently no plans to make it a GA feature in an upcoming release.

Much of the groundwork and investigation was done previously by @scoriani. He might be able to answer your questions specific to transactions during transient errors against Azure SQL.

@roji
Member

roji commented Sep 25, 2020

Understood. It will certainly be interesting to see how this evolves and what it covers. I'd also be interested in further info from @scoriani or others on what can actually be covered and what can't.

@scoriani

scoriani commented Sep 26, 2020

Apologies for my absence in this thread despite multiple references. @roji, make no mistake, your concerns are extremely valid and shared. And yes, for quite some time we've been recommending that customers implement retry strategies higher in the stack through tools like Polly, TFH, etc., in order to account for app-specific logic and requirements, but the reality is that only a small fraction of them can afford anything beyond connection string or other simple changes in their codebases.
In our mind, this is a first step in a longer journey where a more holistic approach including things like reliable transactions or readers will be part of the bigger picture, but this will require investments on the server-side to make sure session state and context can survive disconnections that are still under evaluation.
The aim of this very first step is to provide an initial solution that customers can use to cover a subset of use cases through a configurable approach: they decide how and where to turn it on, which error codes to retry on, and so on, alongside clear guidance on where this can help and what kind of consequences it can create.
I completely agree that today it won't be applicable to transactional scenarios (other than trivial) or other complex use-cases, but we've got a lot of feedback from many customers that have other scenarios (e.g. not necessarily just reporting) where even a first basic implementation would be extremely valuable.

# Conflicts:
#	src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/SqlCommand.cs
#	src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/SqlConnection.cs
@roji
Member

roji commented Sep 26, 2020

Thanks for your thoughts and for the added details @scoriani. I think we're all more or less aligned with regards to the partiality of the proposed solution, and I understand the customer demand here. A few last comments:

In our mind, this is a first step in a longer journey where a more holistic approach including things like reliable transactions or readers will be part of the bigger picture, but this will require investments on the server-side to make sure session state and context can survive disconnections that are still under evaluation.

That is certainly an ambitious undertaking... And if the goal really is robust resiliency which does not require application awareness, then I agree that the server-side investments would be a prerequisite. FWIW I still think the goal itself is problematic; I know very little about Azure SQL, but I was made to understand that at least earlier iterations shifted the bulk of the reliability/resiliency problem to the client; to me this seems to be the problematic part. I really do believe that resiliency either needs to be handled at the server side (without the driver or application needing to know about it), or by the application.

FWIW I'm not aware of other database/cloud products/services where the bulk of resiliency is shifted to the driver in the way this proposal does it - this seems somewhat unique. But I definitely haven't done a detailed analysis on this.

I completely agree that today it won't be applicable to transactional scenarios (other than trivial) or other complex use-cases, but we've got a lot of feedback from many customers that have other scenarios (e.g. not necessarily just reporting) where even a first basic implementation would be extremely valuable.

OK. I hope the basic coverage that this can provide could be sufficient, and that those customers understand the intricacies of what can be done here and what can't.

Good luck and I'll be curious to follow this!

@scoriani

Thanks @roji, and please keep the great feedback coming!

FWIW I'm not aware of other database/cloud products/services where the bulk of resiliency is shifted to the driver in the way this proposal does it - this seems somewhat unique. But I definitely haven't done a detailed analysis on this.

In my experience, this is true for most database services, as high availability is implemented with remarkably similar design everywhere. Just as a quick example:
"Applications experience minimal interruption of service if they connect using the cluster endpoint and implement connection retry logic. During the failover, AWS modifies the cluster endpoint to point to the newly created/promoted DB instance. Well-architected applications reconnect automatically. The downtime during failover depends on the existence of healthy Read Replicas. If no Read Replicas are configured, or if existing Read Replicas are not healthy, then you might notice increased downtime to create a new instance."
https://aws.amazon.com/blogs/database/failover-with-amazon-aurora-postgresql/

@David-Engel
Contributor

Some additional thoughts on the overall design:

I see automatic retry being problematic for most applications if it is just an on/off thing for all queries. Some other knobs to consider:

  1. Turn retry off if there is an active transaction.
  2. Perhaps we should add a pluggable SQL parser so that certain statements disable the retry logic. We could provide a no-op parser that always returns true, a simple string-match parser, or a regex parser. These default parsers could perhaps be configurable via the config file, taking in a list of patterns, for example (a rough sketch follows this list).
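As a rough sketch of what the regex flavor of that knob could look like, assuming the predicate-style hook (AuthorizedSqlCondition on SqlRetryLogicOption) that the option eventually exposed; the actual plug-in mechanism in this preview may differ.

```csharp
using System;
using System.Text.RegularExpressions;
using Microsoft.Data.SqlClient;

var options = new SqlRetryLogicOption
{
    NumberOfTries = 3,
    DeltaTime = TimeSpan.FromSeconds(1),
    MaxTimeInterval = TimeSpan.FromSeconds(15),
    // Simplest form of a "parser": only statements matching this pattern are eligible for
    // retry; anything else (INSERT/UPDATE/DELETE, transactional batches, ...) bypasses it.
    AuthorizedSqlCondition = sql =>
        sql != null && Regex.IsMatch(sql, @"\A\s*SELECT\b", RegexOptions.IgnoreCase)
};

var provider = SqlConfigurableRetryFactory.CreateFixedRetryProvider(options);
```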

CC: @scoriani

* Skip retry in a transactional command
* Skip retry for unauthorized SQL statements
* Apply a custom transient faults list
* Configurable retry provider factory for pre-defined strategies
Davoud Eshtehari added 4 commits October 12, 2020 20:11
temprary for netcore
# Conflicts:
#	src/Microsoft.Data.SqlClient/netcore/src/Microsoft.Data.SqlClient.csproj
#	src/Microsoft.Data.SqlClient/netfx/src/Microsoft.Data.SqlClient.csproj
@DavoudEshtehari DavoudEshtehari force-pushed the ConfigurableRetry branch 2 times, most recently from 63e7dc3 to 0a576cf on October 15, 2020 23:58
}
// </Snippet1>
// <Snippet2>
private static void RetryCommand(SqlRetryLogicBaseProvider provider)
Member


Can you add some comments to distinguish these private static void RetryCommand(SqlRetryLogicBaseProvider provider) methods?

Member Author


You can find the description in the related SqlCommand.xml file. There is a common explanation for this series of samples. Feel free to suggest here if you need something more.

|10053|Could not convert the data value due to reasons other than sign mismatch or overflow.|
|997|A connection was successfully established with the server, but then an error occurred during the login process. (provider: Named Pipes Provider, error: 0 - Overlapped I/O operation is in progress)|
|233|A connection was successfully established with the server, but then an error occurred during the login process. (provider: Shared Memory Provider, error: 0 - No process is on the other end of the pipe.) (Microsoft SQL Server, Error: 233)|
|64||
Member

@karinazhou karinazhou Mar 2, 2021


No descriptions for 64, 20, and 0?

Member Author


Thank you for bringing it up. I couldn't find an acceptable description for these error numbers, especially error number 0.
Any suggestions @David-Engel @cheenamalhotra?

Contributor


@DavoudEshtehari Did these error numbers come from Silvano?

Member Author

@DavoudEshtehari DavoudEshtehari Mar 3, 2021


Except for number 997, all of them come from Silvano.

Member

@cheenamalhotra cheenamalhotra Mar 3, 2021


For error 0, the exceptions below are seen (researched; it may not be conclusive):

  • Microsoft.Data.SqlClient.SqlException (0x80131904): A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: TCP Provider, error: 0 - No such host is known.)

This is retriable, as "No such host is known" is a known retriable error.

  • A connection was successfully established with the server, but then an error occurred during the login process. (provider: SSL Provider, error: 0 - The target principal name is incorrect.)

This is an error generated by Domain controller during authentication. More info: https://docs.microsoft.com/en-us/troubleshoot/windows-server/identity/target-principal-name-is-incorrect-when-replicating-data
No info on retry options.

  • [Screenshots of server timeout errors]

They are server timeout errors, which are also retriable.

Summarizing
Error 0 is a very generic number and should ideally not be retried on. But looking at the errors that arise with it, it may be okay to retry.

DavoudEshtehari and others added 3 commits March 2, 2021 20:17
Co-authored-by: Karina Zhou <v-jizho2@microsoft.com>
+ improvement