Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[Bug]: ResiliencePipeline CircuitBreaker doesn't handle Half Open State Correctly when probing call throws unhandeled exception - half open state can never be left again #1979

Closed
DominicUllmann opened this issue Feb 20, 2024 · 8 comments
Labels
Milestone

Comments

@DominicUllmann
Copy link
Contributor

Describe the bug

We use a resilience pipeline with a circuit breaker. We distinguish between handled and unhandled exceptions. When the probe call is completed with an unhandled exception, the circuit breaker stays in half open state and no additional probe call is ever allowed.
The issue is probably in https://github.com/App-vNext/Polly/blame/10730b9f723f80559b0c115c2102db974382fdfd/src/Polly.Core/CircuitBreaker/Controller/CircuitStateController.cs#L131. The state controller only handles the case where the state before was open. But it doesn't correctly handle the case where the state was already half-open.
Based on https://github.com/App-vNext/Polly/wiki/Circuit-Breaker#half-open, a non handled exception results in staying in the half open state. A next probe call should then be allowed to either transition to closed or open, but it is not.

Expected behavior

The circuit breaker should not stay forever in half-open state after a non handled exception occured in the probe call used in the half open state. Instead a next probe call must be allowed and should then lead to either a next non handled exception, a handled exception or a success result.

Actual behavior

The circuit breaker stays forever in half open state after the probe call resulted in a non handled exception.

Steps to reproduce

Please have a look at:
https://github.com/DominicUllmann/PollyHalfOpenIssue/tree/main

The issue occurs in the unit test TestResiliencePipelineDirectly2, There we throw an unhandled exception from the probe call.
In this test, we first ensure to open the circuit breaker, then we verify that it is really open. Then we wait for more than the reset time and make a call throwing an unhandled exception (MyUnhandledException). Afterwards we wait again (which should not be necessary in half-open state) and try to send a successful call. But in this state, we can never leave half open state again, neither to open or closed.

The the unit test TestResiliencePipelineDirectly shows what happens when we use a handled exception instead. In this case everything works as expected (i.e. transition from half-open to open and then after the reset time again to closed).

Exception(s) (if any)

No response

Polly version

8.3.0

.NET Version

8.0.102

Anything else?

No response

@martintmk
Copy link
Contributor

martintmk commented Feb 21, 2024

Thanks @DominicUllmann for the report, the repro steps help a lot. I believe I found the culprit:

This line tells CB that the outcome is not handled and therefore it should be considered as success only if the outcome does not contain the exception. And here lies the bug, because the MyUnhandledException is not threated as handled and the else statement is not matched either because the outcome contains an exception.

This is pretty serious bug and can lead to circuit breaker being open forever :/

My thinking is that we should change the code to:

if (await _handler(args).ConfigureAwait(context.ContinueOnCapturedContext))
{
    await _controller.OnActionFailureAsync(outcome, context).ConfigureAwait(context.ContinueOnCapturedContext);
}
else
{
    await _controller.OnActionSuccessAsync(outcome, context).ConfigureAwait(context.ContinueOnCapturedContext);
}

Meaning that if the outcome is handled, the it's treated as a failure and if it's not handled it's treated as a success. Thies applies even if the outcome contains an exception.

Internally we should also change the names of internal methods to:

  • OnActionFailureAsync -> OnHandledOutcomeAsync
  • OnActionSuccessAsync -> OnUnhandledOutcomeAsync

Thoughts?
@martincostello, @peter-csala

This one should be a hotfix though.

@peter-csala
Copy link
Contributor

@martintmk It seems good to me. Regarding to the renaming. I agree with the OnHandled and OnUnhandled prefixes but we pass both the context and the outcome to the methods. So OutcomeAsync might be misleading a bit... Or simply I just overthinking it :)

@martintmk
Copy link
Contributor

martintmk commented Feb 21, 2024

@DominicUllmann Do you want to tackle this and create a PR? Otherwise, I can take a look in my free time over this weekend or next one.

@DominicUllmann
Copy link
Contributor Author

Thanks a lot for the fast feedback @martintmk. I wonder if the suggested fix matches with the documentation here:
https://github.com/App-vNext/Polly/wiki/Circuit-Breaker#half-open

But generally, it would work well, if the circuit breaker is closed in case of success as well as when an unhandled exception occurs. In case of e.g. validation errors, it is as good as a normal success answer and therefore it doesn't need to be handled differently.

I can probably create a PR this evening or tomorrow evening.

@peter-csala
Copy link
Contributor

peter-csala commented Feb 22, 2024

I wonder if the suggested fix matches with the documentation here:
https://github.com/App-vNext/Polly/wiki/Circuit-Breaker#half-open

That page is covering the legacy V7 API. That works slightly differently than the V8 API.

One of differences is covered here: https://www.pollydocs.org/strategies/circuit-breaker.html#reducing-thrown-exceptions (The shown anti-pattern works smoothly in case of V7)

@DominicUllmann
Copy link
Contributor Author

Thanks @peter-csala. Good to know. Then I will try the fix suggested above.

@peter-csala
Copy link
Contributor

@martincostello What's your plan to release the hotfix?

@martincostello
Copy link
Member

I think there was another PR I was waiting on before thinking about preparing a release, but I've forgotten what it was now.

I'll look at doing a 8.3.1 tomorrow.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

No branches or pull requests

4 participants