
Configurable Retry Logic - Preview 1 #693

Merged
merged 48 commits on Mar 4, 2021

Conversation

DavoudEshtehari
Member

@DavoudEshtehari DavoudEshtehari commented Aug 15, 2020

This PR was created to discuss and share the progress of the 'Configurable Retry Logic' feature proposed by Silvano Coriani.
As a first step, it is designed to cover customization of the retry provider by customers.


Overview

An application that communicates with elements running in a service, especially in the cloud, must be sensitive to the transient faults that can occur in that environment. These faults are typically self-correcting, and if the action that triggered the fault is repeated after a suitable delay, it is likely to succeed.
Since this feature is in the preview stage, it must be enabled through the following application context switch:

AppContext.SetSwitch("Switch.Microsoft.Data.SqlClient.EnableRetryLogic", true);

Goal

Apply the retry policy when opening a connection and executing a command, in the expectation that the retried operation will eventually succeed.

Proposal

There are two sections:

1. Core API

Generally, supporting retry logic on an operation requires implementing the following interfaces:
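As a rough, non-authoritative sketch of the shape such a core API could take: only SqlRetryLogicBaseProvider appears elsewhere in this PR; the remaining type and member names below are illustrative assumptions, not the final design.

```csharp
using System;
using System.Threading;
using System.Threading.Tasks;

// Illustrative sketch only; names other than SqlRetryLogicBaseProvider are assumptions.
public class SqlRetryingEventArgs : EventArgs
{
    public int RetryCount { get; set; }   // attempts made so far
    public TimeSpan Delay { get; set; }   // delay before the next attempt
    public bool Cancel { get; set; }      // set to true to stop retrying
}

public abstract class SqlRetryLogicBaseProvider
{
    // Raised before each retry so callers can log the attempt or cancel it.
    public event EventHandler<SqlRetryingEventArgs> Retrying;

    // Wraps an operation (e.g. connection open or command execution) in the retry policy.
    public abstract TResult Execute<TResult>(object sender, Func<TResult> function);
    public abstract Task<TResult> ExecuteAsync<TResult>(
        object sender, Func<Task<TResult>> function, CancellationToken cancellationToken = default);

    protected void OnRetrying(SqlRetryingEventArgs e) => Retrying?.Invoke(this, e);
}

public abstract class SqlRetryLogicBase
{
    public int NumberOfTries { get; protected set; }                       // maximum attempts
    public Predicate<Exception> TransientPredicate { get; protected set; } // is this fault retriable?
    public abstract bool TryNextInterval(out TimeSpan intervalTime);       // next delay (fixed/incremental/exponential)
    public abstract void Reset();
}
```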

2. Configurable retry logic

A suggested implementation built on the core API is included in the driver.
This implementation is intended to cover all or part of the transient faults and will improve alongside the discussions here.
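For illustration, here is a minimal usage sketch of the built-in provider path, assuming the factory, option, and property names the feature eventually exposed (SqlConfigurableRetryFactory, SqlRetryLogicOption, RetryLogicProvider); the connection string is a placeholder.

```csharp
using System;
using Microsoft.Data.SqlClient;

class RetryDemo
{
    static void Main()
    {
        // The preview feature is gated behind the AppContext switch shown in the Overview.
        AppContext.SetSwitch("Switch.Microsoft.Data.SqlClient.EnableRetryLogic", true);

        var options = new SqlRetryLogicOption
        {
            NumberOfTries = 5,                          // total attempts, including the first one
            DeltaTime = TimeSpan.FromSeconds(1),        // base gap between attempts
            MaxTimeInterval = TimeSpan.FromSeconds(20)  // upper bound on the computed gap
        };

        // One of the pre-defined strategies included in the driver.
        SqlRetryLogicBaseProvider provider =
            SqlConfigurableRetryFactory.CreateExponentialRetryProvider(options);

        using var connection = new SqlConnection("<connection string>");
        connection.RetryLogicProvider = provider;       // retries SqlConnection.Open()
        connection.Open();

        using var command = connection.CreateCommand();
        command.RetryLogicProvider = provider;          // retries command execution
        command.CommandText = "SELECT 1";
        Console.WriteLine(command.ExecuteScalar());
    }
}
```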

Notes & Thoughts

  • Each environment has its own configuration and probable faults, so it is the customer's responsibility to monitor and detect them in order to arrive at measured configuration values before using the retry logic. It is strongly recommended not to include all transient errors without investigating the environment.

  • Customers can implement their own retry logic against the Core API if the included retry logic does not cover their requirements.

  • The best practice is to apply a specific configuration to a specific transient error according to the fault's cause. This approach gives the best performance, but it increases complexity and decreases usability (a sketch of a narrowly scoped configuration follows this list).

  • On the other hand, to simplify the retry configuration for the user, we must sacrifice some performance and apply a common configuration to all active transient errors.

  • Despite the vital role of transactions, they will initially be treated like regular connections, since full support will probably require work from the SQL Server teams.
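To make the specific-versus-common trade-off concrete, here is a hedged sketch of a narrowly scoped configuration that retries only an explicit list of error numbers. The numbers come from the transient-error table discussed later in this PR, and the option and property names assume the shape the feature eventually shipped with.

```csharp
using System;
using Microsoft.Data.SqlClient;

// Narrow configuration: only faults measured as transient in this environment are retried.
var narrowOptions = new SqlRetryLogicOption
{
    NumberOfTries = 3,
    DeltaTime = TimeSpan.FromSeconds(2),
    MaxTimeInterval = TimeSpan.FromSeconds(30),
    // 233 and 997: "an error occurred during the login process", per the error table
    // discussed in this PR. Investigate your own environment before extending this list.
    TransientErrors = new[] { 233, 997 }
};

var provider = SqlConfigurableRetryFactory.CreateIncrementalRetryProvider(narrowOptions);

// Optional monitoring hook: observe each retry attempt and the delay applied.
provider.Retrying += (sender, args) =>
    Console.WriteLine($"Retry #{args.RetryCount} after waiting {args.Delay}.");
```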

Related discussions

  • WIP: Add retry logic to connection open and command execution #307
  • Introduce SqlState on DbException for standard cross-database errors #35601
  • Database-agnostic way to detect transient database errors #34817

Resources

@ErikEJ
Contributor

ErikEJ commented Aug 15, 2020

Where is the proposal?

@roji
Member

roji commented Aug 15, 2020

There's been some previous conversation involving @scoriani in #416.

As I wrote there (#416 (comment)), I'm personally quite sceptical of the idea of a database driver attempting to internally handle resiliency; the various complexities around what to do when something fails in the middle of a transaction point towards resiliency being handled at a higher layer.

But I'm definitely interested in hearing views to the contrary - any input on how the concerns raised in #416 are to be addressed?

@cheenamalhotra
Member

cheenamalhotra commented Aug 17, 2020

@ErikEJ @roji

This is a rework of #307 - it's a work in progress, so we'll be addressing open concerns/questions here in an attempt to take this feature to completion.

Also worth mentioning: our aim is to make it an on-demand, fully customizable behavior, not something the driver will support by default.

@DavoudEshtehari
Member Author

The main aim of this PR is to identify the best predefined retry solutions and to make sure this design can support them, in addition to supporting customers' customized approaches.
As @roji mentioned, detecting the application logic at the driver level is not possible, but we can provide an infrastructure to support it. To cover these types of concerns, besides using the driver without enabling this option and managing retries at the application level, customers can implement their own logic and inject it through the configurable retry logic provider inside the driver.

@roji
Member

roji commented Aug 21, 2020

To cover these types of concerns, besides using the driver without enabling this option and managing retries at the application level, customers can implement their own logic and inject it through the configurable retry logic provider inside the driver.

I'm still very confused about how this approach will actually work, especially around transactions - how can any sort of driver retry logic handle things, regardless of whether the logic is user-injected or not? I can't see how any meaningful retry strategy can work without an additional layer on top of the driver, at which point any sort of driver support seems unneeded (or rather, unusable).

I may be missing the big picture here (or just misunderstanding what's being attempted)... would it be possible to get some sort of short description of how this concept works around transactions?

timespan & retry event
@David-Engel
Contributor

I may be missing the big picture here (or just misunderstanding what's being attempted)... would it be possible to get some sort of short description of how this concept works around transactions?

@roji The push is coming from application migrations from on-premises databases to cloud databases. There is a strong desire to get something out and provide an infrastructure for extending retry scenarios without major application rewrites. While cloud services offer uptimes like 99.99%, "outages" of a few seconds, due to failover for example, can occur at almost any time. Many of the applications being migrated were designed for a database that used to sit right next to them and rarely suffered from little "blips" like this, so designing for them was overkill.

Transaction scenarios are probably not all going to be solved in this implementation. But depending on the error returned, it may even be safe to retry an operation within a transaction. I think providing retry capabilities at least gives customers potential options short of a major refactor. This will be off by default and give customers new knobs to test with.

@DavoudEshtehari DavoudEshtehari changed the title Configurable Retry Logic - draft design [WIP] Configurable Retry Logic - draft design Sep 25, 2020
@roji
Member

roji commented Sep 25, 2020

@David-Engel I do understand why people are looking for this - it's very tempting to speak of turning a knob and getting 100% resiliency in the face of transient errors, without any code changes. Unfortunately, I think it's simply not possible to do this at such a low level in the stack (ADO.NET driver), and that resiliency is a problem that must be faced at a higher level up the stack (e.g. by something like EF or the application itself, via something like Polly).

Here are a few of the problematic scenarios I'm thinking about:

  • As already discussed, many errors cause the ongoing transaction to be aborted/fail. It is not possible to automatically retry the whole transaction from the beginning. Do we know whether all/most of the cloud failures are of the type which allows the transaction to continue, or fail?
  • Transient issues can generally occur at any time. For example, if one occurs in the middle of consuming a resultset, it does not seem possible to retry anything. The application has already read half of a resultset, and it's not possible to reliably re-execute the query and continue from that point (results may have changed in the meantime, leading to data corruption).
  • Similarly, when a command contains multiple statements (batching), an earlier statement may succeed while a later one fails. It's not possible to just re-execute the command, since the earlier, successful statement may have side effects.

There are other similar issues - this is just off the top of my head. Now, IMHO this is one of those cases where providing a very partial solution which does not cover many cases does more harm than good; people see an official "SqlClient resiliency" knob and get a very false sense of confidence, when in fact they may still be vulnerable to serious issues and data corruption in many cases.

I guess I'd investigate exactly what this would cover, and what not. For example, if SQL Azure transient exceptions do keep the transaction usable (i.e. allow retrying a single failed command), then maybe this makes sense, if the uncovered cases are clearly documented etc. If you've already done such an investigation and I'm unaware, then apologies... But without some knowledge of what exactly this will address, I think this could cause more problems than it solves.

@David-Engel
Contributor

@roji I pretty much agree with you. I think a lot of scenarios will end up being more complex than can be covered by a feature at the level of SqlClient. However, this effort is being pushed from higher up. So we are going to try to move it along to see what works and doesn't work. Please note, this feature will be under a context switch and be considered preview so that users can begin trying it and seeing where it falls down and where it helps. There are currently no plans to make it a GA feature in an upcoming release.

Much of the groundwork and investigation was done previously by @scoriani. He might be able to answer your questions specific to transactions during transient errors against Azure SQL.

@roji
Member

roji commented Sep 25, 2020

Understood. It will certainly be interesting to see how this evolves and what it covers. I'd also be interested in further info from @scoriani or others on what can actually be covered and what can't.

@scoriani

scoriani commented Sep 26, 2020

Apologies for my absence in this thread despite multiple references. @roji, make no mistake, your concerns are extremely valid and shared. And yes, for quite some time we've been recommending that customers implement retry strategies higher in the stack through tools like Polly, TFH, etc., in order to account for app-specific logic and requirements, but the reality is that only a small fraction of them can afford anything beyond connection string or other simple changes in their codebases.
In our mind, this is a first step in a longer journey where a more holistic approach including things like reliable transactions or readers will be part of the bigger picture, but this will require investments on the server-side to make sure session state and context can survive disconnections that are still under evaluation.
The aim of this very first step is to provide an initial solution that customers can use to cover a subset of use cases through a configurable approach: they decide how and where to turn it on, which error codes to retry on, and so on, alongside clear guidance on where this can help and what kind of consequences it can create.
I completely agree that today it won't be applicable to transactional scenarios (other than trivial) or other complex use-cases, but we've got a lot of feedback from many customers that have other scenarios (e.g. not necessarily just reporting) where even a first basic implementation would be extremely valuable.

# Conflicts:
#	src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/SqlCommand.cs
#	src/Microsoft.Data.SqlClient/netcore/src/Microsoft/Data/SqlClient/SqlConnection.cs
@roji
Member

roji commented Sep 26, 2020

Thanks for your thoughts and for the added details @scoriani. I think we're all more or less aligned with regards to the partiality of the proposed solution, and I understand the customer demand here. A few last comments:

In our mind, this is a first step in a longer journey where a more holistic approach including things like reliable transactions or readers will be part of the bigger picture, but this will require investments on the server-side to make sure session state and context can survive disconnections that are still under evaluation.

That is certainly an ambitious undertaking... And if the goal really is robust resiliency which does not require application awareness, then I agree that the server-side investments would be a prerequisite. FWIW I still think the goal itself is problematic; I know very little about Azure SQL, but I was made to understand that at least earlier iterations shifted the bulk of the reliability/resiliency problem to the client; to me this seems to be the problematic part. I really do believe that resiliency either needs to be handled at the server side (without the driver or application needing to know about it), or by the application.

FWIW I'm not aware of other database/cloud products/services where the bulk of resiliency is shifted to the driver in the way this proposal does it - this seems somewhat unique. But I definitely haven't done a detailed analysis on this.

I completely agree that today it won't be applicable to transactional scenarios (other than trivial) or other complex use-cases, but we've got a lot of feedback from many customers that have other scenarios (e.g. not necessarily just reporting) where even a first basic implementation would be extremely valuable.

OK. I hope the basic coverage that this can provide could be sufficient, and that those customers understand the intricacies of what can be done here and what can't.

Good luck and I'll be curious to follow this!

@scoriani

Thanks @roji, and please keep the great feedback coming!

FWIW I'm not aware of other database/cloud products/services where the bulk of resiliency is shifted to the driver in the way this proposal does it - this seems somewhat unique. But I definitely haven't done a detailed analysis on this.

In my experience, this is true for most database services, as high availability is implemented with remarkably similar design everywhere. Just as a quick example:
"Applications experience minimal interruption of service if they connect using the cluster endpoint and implement connection retry logic. During the failover, AWS modifies the cluster endpoint to point to the newly created/promoted DB instance. Well-architected applications reconnect automatically. The downtime during failover depends on the existence of healthy Read Replicas. If no Read Replicas are configured, or if existing Read Replicas are not healthy, then you might notice increased downtime to create a new instance."
https://aws.amazon.com/blogs/database/failover-with-amazon-aurora-postgresql/

@David-Engel
Contributor

Some additional thoughts on the overall design:

I see automatic retry being problematic for most applications if it is just an on/off thing for all queries. Some other knobs to consider:

  1. Turn retry off if there is an active transaction.
  2. Perhaps we should add a pluggable SQL parser so that certain statements disable the retry logic. We could provide a no-op parser that always returns true, a simple string-match parser, or a regex parser. These default parsers could perhaps be configurable via the config file, taking in a list of patterns, for example (a rough sketch follows this list).
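As a rough sketch of what the regex flavor of that knob could look like, assuming the predicate-style hook (AuthorizedSqlCondition on SqlRetryLogicOption) that the option eventually exposed; the actual plug-in mechanism in this preview may differ.

```csharp
using System;
using System.Text.RegularExpressions;
using Microsoft.Data.SqlClient;

var options = new SqlRetryLogicOption
{
    NumberOfTries = 3,
    DeltaTime = TimeSpan.FromSeconds(1),
    MaxTimeInterval = TimeSpan.FromSeconds(15),
    // Simplest form of a "parser": only statements matching this pattern are eligible for
    // retry; anything else (INSERT/UPDATE/DELETE, transactional batches, ...) bypasses it.
    AuthorizedSqlCondition = sql =>
        sql != null && Regex.IsMatch(sql, @"\A\s*SELECT\b", RegexOptions.IgnoreCase)
};

var provider = SqlConfigurableRetryFactory.CreateFixedRetryProvider(options);
```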

CC: @scoriani

* Skip retry in a transactional command
* Skip retry for unauthorized SQL statements
* Apply a custom transient faults list
* Configurable retry provider factory for pre-defined strategies
Davoud Eshtehari added 4 commits October 12, 2020 20:11
temprary for netcore
# Conflicts:
#	src/Microsoft.Data.SqlClient/netcore/src/Microsoft.Data.SqlClient.csproj
#	src/Microsoft.Data.SqlClient/netfx/src/Microsoft.Data.SqlClient.csproj
@DavoudEshtehari DavoudEshtehari force-pushed the ConfigurableRetry branch 2 times, most recently from 63e7dc3 to 0a576cf on October 15, 2020 23:58
}
// </Snippet1>
// <Snippet2>
private static void RetryCommand(SqlRetryLogicBaseProvider provider)
Member


Can you add some comments to distinguish these private static void RetryCommand(SqlRetryLogicBaseProvider provider) methods?

Member Author


You can find the description in the related SqlCommand.xml file. There is a common explanation for this series of samples. Feel free to suggest here if you need something more.

|10053|Could not convert the data value due to reasons other than sign mismatch or overflow.|
|997|A connection was successfully established with the server, but then an error occurred during the login process. (provider: Named Pipes Provider, error: 0 - Overlapped I/O operation is in progress)|
|233|A connection was successfully established with the server, but then an error occurred during the login process. (provider: Shared Memory Provider, error: 0 - No process is on the other end of the pipe.) (Microsoft SQL Server, Error: 233)|
|64||
Member

@karinazhou karinazhou Mar 2, 2021


No descriptions for 64, 20, and 0?

Member Author


Thank you for bringing it up. I couldn't find an acceptable description for these error numbers, especially error number 0.
Any suggestions @David-Engel @cheenamalhotra?

Contributor


@DavoudEshtehari Did these error numbers come from Silvano?

Member Author

@DavoudEshtehari DavoudEshtehari Mar 3, 2021


Except for number 997, all of them come from Silvano.

Member

@cheenamalhotra cheenamalhotra Mar 3, 2021


For error 0, the exceptions below are seen (researched; it may not be conclusive):

  • Microsoft.Data.SqlClient.SqlException (0x80131904): A network-related or instance-specific error occurred while establishing a connection to SQL Server. The server was not found or was not accessible. Verify that the instance name is correct and that SQL Server is configured to allow remote connections. (provider: TCP Provider, error: 0 - No such host is known.)

This is retriable, as "No such host is known" is a known retriable error.

  • A connection was successfully established with the server, but then an error occurred during the login process. (provider: SSL Provider, error: 0 - The target principal name is incorrect.)

This is an error generated by Domain controller during authentication. More info: https://docs.microsoft.com/en-us/troubleshoot/windows-server/identity/target-principal-name-is-incorrect-when-replicating-data
No info on retry options.

  • [Screenshots of server timeout errors]

They are server timeout errors, which are also retriable.

Summarizing
Error 0 is a very generic number and should ideally not be retried on. But looking at the errors that arise with it, it may be okay to retry.

DavoudEshtehari and others added 3 commits March 2, 2021 20:17
Co-authored-by: Karina Zhou <v-jizho2@microsoft.com>
+ improvement