OPTION/PROPOSAL: Distributed Circuit Breaker #287
Comments
/cc @rahulrai-in @mcquiggd @nareshkhatri81 @tarunp @bunceg A channel has also been created in Slack for more fluid discussion of this proposal: https://pollytalk.slack.com/messages/C6GCFAFKK (any key conclusions from Slack will be ported back here). |
@reisenberger @mcquiggd But I would definitely like to have opt-in transient-fault handling in the circuit breaker, so that a transient error does not cause the breaker to trip. Although I did participate in and support the quorum idea earlier, I think maintaining quorum state is a hard problem to solve and requires internal knowledge of the application and the network, so I would rather not implement it in Polly. Moreover, many microservices platforms, such as Service Fabric, already know how to maintain a quorum, and I believe no one using Polly would want to override that feature with our implementation. |
Ok, as mentioned on the Slack channel, I had deleted my post just as @rahulrai-in was replying to both @reisenberger and myself. I wasn't happy with the clarity of my post; now that I have more time, I will attempt to explain my position in a more structured manner:
1. Infrastructure Solutions
So, let's start by looking at resilience in general, and where the Polly Circuit Breaker can fit in. In terms of my preferred cloud provider, Azure, there are multiple levels at which resiliency and redundancy can be implemented. AWS and Google Cloud offer similar solutions, and third parties offer on-premise solutions. When I plan my app deployments, I use the built-in Azure features of scale sets, fault zones, and update zones to prevent temporary outages causing service interruptions. Take, for example, a backend set. This propagates up to the other 'named service' that originally called the failed service. Its own Circuit Breaker Policies determine that its dependency 'named service' is unavailable, and determine whether a feature should be disabled (e.g. video processing, or new user registrations because email confirmations are not available), and this is reflected in the backend.
In addition to the Load Balancer option, Azure offers Application Gateway (routing of requests to different 'named services' based on URL rules, health, etc.) and Traffic Manager (DNS-based redirection of traffic based on policy rules, which allows failover to a different data center or region, and can also handle services that are not hosted on Azure). I recommend reading Using load-balancing services in Azure. Again, each of these can be configured to probe Circuit Breaker state.
I have worked with Akka.Net, and it has its own official Akka.Net Circuit Breaker that takes advantage of Akka-specific features for state persistence and rehydration; it is far better not to attempt to reinvent the wheel in these circumstances.
2. Quorum Logic Is Hard
The services mentioned in point 1 are aware of nodes, have advanced policy configuration, and IMHO obviate the need for the Polly Circuit Breaker to implement its own quorum logic. We would never be able to implement a solution of sufficient quality. Here is an interesting read if you wish to explore that avenue: http://vldb.org/pvldb/vol5/p776_peterbailis_vldb2012.pdf
3. My Personal Recommendations For State
There is a need for shared / distributed Circuit Breaker state. I would break this down into two options:
Redis
A simple Redis cluster (2 nodes), optionally acting as a pass-through cache, will be sufficient for medium availability and meet 80%+ of use cases, as a guesstimate. That gives you extremely fast response times (especially if deployed to the same virtual network as your core app services, if you are using a cloud provider), and auto failover. Importantly, it is possible to combine Redis with local, in-memory caches of data, which are then synchronised with Redis, and provide resiliency to storage of state data. Here is one library that can achieve this (there are others): CacheManager - open source, can use a variety of backends, but Redis is the most feature-rich. It's worth noting a feature CacheManager offers which is highly relevant:
Note that you can still use this approach with a Redis distributed cache as the source of truth, which is preferable for allowing new nodes to be brought online and query Redis for their initial state. The Redis-based synchronisation of local in-memory caches would be used to prevent the nodes of a 'named service' repeatedly calling a service that was failing under high load and causing it to fail. If Redis goes down / is not contactable, the nodes continue to use their local cached data, which effectively means the centralised control is automatically delegated to the distributed services, allowing them to act independently. When Redis is available again, they can be synchronised. This avoids creating a single point of failure. Individually, they are also still managed by the Load Balancer determining their health and Circuit Breaker status. And the Traffic Manager above them coordinates failover and optimised performance. Essentially you have a self-healing infrastructure.
Serverless
A 'managed microservice' - think Azure Functions or AWS Lambda - which can then call whatever data store you want. As any session state I have is maintained in my client, for many reasons including scale, I certainly don't want to corrupt that architecture by then managing a farm of microservices that are simply there to maintain my Circuit Breaker state.
Summary
My personal preference: just build a very simple interface for any Provider to implement:
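For illustration, a provider contract along these lines might be as small as the following sketch; the ICircuitStateProvider and CircuitStateRecord names are hypothetical, not an existing Polly or CacheManager API.

```csharp
using System;
using System.Threading.Tasks;

// Illustrative only: a deliberately small surface that an 'official' Redis provider
// (or any community provider) could implement to share circuit state.
public interface ICircuitStateProvider
{
    // Returns the last-recorded circuit state for the named service, or null if none exists.
    Task<CircuitStateRecord> GetStateAsync(string serviceName);

    // Records the circuit state for the named service, eg after a local transition.
    Task SetStateAsync(string serviceName, CircuitStateRecord state);
}

public class CircuitStateRecord
{
    public string ServiceName { get; set; }
    public string State { get; set; }                 // eg "Closed", "Open", "HalfOpen"
    public DateTimeOffset LastChangedUtc { get; set; }
}
```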
And then create an 'official' Redis Provider, and let the community create their own for their specific scenarios. For roll-your-own approaches, the configuration of nodes and the logic for any scenario-specific rules that determine state (including reported failure ratios, whether a reported error on A categorically affects B, etc.) can be passed into the Provider instance, but should not be known to Polly, as they are, well, scenario-specific and entirely optional. It simply should not be part of the core Polly Circuit Breaker. The only thing that the Polly Circuit Breaker would know is the serviceName, its State, and its corresponding Policy. The Polly Circuit Breaker would still be highly relevant; I would look at using it to determine if a named service instance is able to connect to its database / storage / email backend, and report to the Load Balancer / Traffic Manager etc. that it is alive and well, or failed. I would use a Circuit Breaker in my main application API to inform my UI if certain features are not available. But it would not be the only tool used to address resilience issues; IMHO it needs to have a clearly defined role, interface, and functionality, and I feel the proposals so far are perhaps over-engineered. Just my 2 centavos. |
Thank you @mcquiggd for your comments. It looks as if elements of the proposal may have come across other than as I intended; I'll aim to clarify. The comments on resilience in Azure are also great perspectives to have; thank you. There is no intention that Polly should overlap with any of that; we see Polly's role in such architectures as providing resilience primitives only. Polly users were asking about a quite specific scenario; I'll aim to focus discussion back on that, if only to move it forward (but light shone from other angles is of course always welcome!).
Summary
What problem are we trying to solve?
Polly users asked here
and here
and here
and I am understanding these requests to mean we are considering a scenario where multiple upstream services/apps/nodes all call a single downstream dependency. Diagrammatically: (The downstream system may have redundancy/horizontal-scaling too, but let's consider that it is hidden behind load-balancing/Azure Traffic Manager/whatever, so we are addressing it via a single endpoint.) (We could also be talking about multiple dissimilar upstream systems A, B, C, D etc all calling M.)
With current Polly, each upstream node/system would have an independent circuit-breaker. I am understanding that Polly users are asking for some way for those circuit-breakers to share state, or share knowledge of each other's state, such that they would (or could choose whether to) break in common.
Why a single shared circuit state can be dangerous: the "one bad node poisons other good nodes" problem
One solution could use a single shared breaker state (the breakers exist as independent software elements in each upstream node/system; the solid box round them here is intended to illustrate that they all consult/operate on the same shared state), and the benefit is that if one upstream caller "knows" the downstream system M is down, the other callers immediately share that knowledge (probably failing faster than if they were fully independent): [*1]
However, as noted in the original proposal and linked threads (here, here, here), and as others also noted, an approach with a single shared circuit state also risks (and by definition embeds) a catastrophic failure mode. If the reason for the circuit-of-A1-governing-M breaking is not that downstream system M is down, but instead a problem local only to A1 (eg resource starvation in A1 affecting those calls) or to the path between A1 and M, then the single shared circuit state inappropriately cuts off all the other nodes/upstream systems from communicating with M: [*2]
(M was healthy (green). Upstream A2, A3 etc were healthy (green) and had good paths (green) to M, giving green local circuit statistics. Without the shared circuit state, the system would have had the redundancy benefit of A2, A3 etc still supporting healthy calls to M. However, because A1 has told the single shared circuit state to break, A2, A3 etc cannot communicate with M, and the redundancy value of A2, A3 etc is thrown away.)
This is a resilience anti-pattern: the single-shared-circuit-state approach inappropriately promotes a localised problem to a global/more widespread one, causing a (horizontally) cascading failure. I call this the "one bad node poisons other good nodes" problem. Any approach taking the word of a single source as enough evidence that the downstream system is down risks this: a failure at a single source will block all consumers. It's rare, but it can happen. Nodes go bad. We all know the fallacies of distributed computing. The problem is not that it happens very often, but that if/when it does, the single-shared-state design (yoking upstream callers together so tightly) has such a catastrophic effect. It's a bit like journalism: don't trust a story from only one source; corroborate it with others before acting.
(So the simplification proposed in your point 3, @mcquiggd, in principle embeds this risk. That would have been the original proposal, were it not for this risk. Of course, the simplification may well be the more appropriate solution if, in a given system, you judge the risk to be acceptable.)
Without the single shared circuit state, we would have continued to have the redundancy benefit of the horizontal scaling: [*3]
So: can we fashion a solution which gives the benefits of [*1] and [*3], but avoids the failure of [*2]?
'Crowd-sourcing' whether to break - some simple quorum logic
The essence of the proposal is that we let users configure the number or proportion of nodes breaking, among the set, that is deemed sufficient to cause a distributed break. This creates a non-blunt instrument. Consistency of experience across callers gives enough confidence that the real-world event is a downstream system failure, not something local to a particular caller. Rather than seeking to impose a single-shared-truth model, the approach models the real-world complexity of multiple truths (that different callers might have different experiences of their call success to M), and provides a user-configurable way to negotiate that. Nodes tell the mediator when their local circuit transitions state, and nodes can ask the mediator the state of all nodes in the set. Based on this, the arbiter then embeds some extremely simple quorum logic (only requiring addition and division):
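For illustration, a minimal sketch of what that quorum arithmetic could look like; the type and member names here are hypothetical, not a proposed Polly API.

```csharp
using System.Collections.Generic;
using System.Linq;

// Hypothetical sketch of the 'addition and division' quorum decision described above:
// break the distributed circuit only when enough of the enrolled nodes report their
// local circuit as broken.
public class ProportionBrokenQuorumArbiter
{
    private readonly double breakIfProportionBrokenAtLeast; // eg 0.5 = at least half the nodes

    public ProportionBrokenQuorumArbiter(double breakIfProportionBrokenAtLeast)
    {
        this.breakIfProportionBrokenAtLeast = breakIfProportionBrokenAtLeast;
    }

    // nodeIsBroken contains one entry per node in the distributed circuit:
    // true if that node has reported its local circuit as open/broken.
    public bool ShouldDistributedBreak(IReadOnlyCollection<bool> nodeIsBroken)
    {
        if (nodeIsBroken.Count == 0) return false;

        int brokenCount = nodeIsBroken.Count(isBroken => isBroken);          // the addition
        double proportionBroken = (double)brokenCount / nodeIsBroken.Count;  // the division

        return proportionBroken >= breakIfProportionBrokenAtLeast;
    }
}
```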
This 'crowd-sources' the wisdom of whether to distributed-break or not, in a non-blunt way. The solution correctly negotiates the scenario which, for the simpler implementation, induced the anti-pattern [*2]. But it also provides the [*1] benefit - if M is down, the quorum logic soon detects this and breaks.
Mechanisms for sharing state
The mechanisms for sharing state information between nodes are intended to be exactly the existing distributed cache technologies such as Redis, NCache etc. (As you say @mcquiggd, all of these, with their dual in-memory/remote-synced caching, are great fits.) It also doesn't have to involve a store: if users already have some asynchronous messaging tech in the mix - Azure Service Bus Queues, Amazon SQS, RabbitMQ, whatever - these are alternatives too.
Polly just provides a primitive
This doesn't make Polly embed any knowledge about the system being governed. As ever, Polly just provides a resilience primitive. Users model it to their own system by choosing which upstream systems to group into which distributed circuits, what quorum thresholds to configure, etc. |
The preceding post aimed to clarify, to further discussion; not to assume what we should implement. From an understanding of the behavioural characteristics of different solutions, we can then discuss the trade-offs between complexity and fidelity/behaviour.
What do others think about the "one bad node poisons other good nodes" problem? @mcquiggd You are deeper into Azure than I: in the Azure IaaS environment, do you see upstream system behaviour as likely to be so consistent that the risk can safely be ignored, or not? @rahulrai-in @nareshkhatri81 / anyone interested: for your intended usages, do you think the "one bad node poisons other good nodes" problem relevant or irrelevant?
My take is that, in principle, the problem exists for any single-shared-circuit-state design. In practice, as an engineer designing for a particular system, it would of course be perfectly reasonable (and very often the right decision) to be making judgments like "For my installation, that is 'not going to happen'; at least, I will sacrifice that level of engineering and take the risk". The trouble is, as a library, we can't foresee and don't have control over the environments in which users use the feature.
Four options to move forward (other suggestions also welcome):
(a) do nothing
Certainly valid to ask: is the feature worth it, for the amount of engineering necessary to get it right? Wouldn't fully independent upstream callers just discover a downstream failure of their own accord, in due course? Early on I was in that camp. However, users have asked for this feature, and it can prevent each upstream caller in turn, eg, waiting on a timeout to discover the same failure. So I have aimed to propose a mature/robust design that at least avoids a catastrophic failure mode.
(b) implement the simple version
Implement in Polly only the simple version that embeds the anti-pattern, but warn users clearly of that failure mode. Caveat emptor! Ie, users should only use it in systems, or with a scope, where they consider that not a risk. API: users have to specify:
(c) the quorum-decision version
The store implementations would be not much more complicated than (b) - it's only sticking objects in a distributed cache. API: users have to specify:
(d) Only adapt Polly to allow injection of new circuit-breaker implementations
Don't attempt to force a decision now between (b) and (c) for core Polly. Instead, simply refactor the circuit-breaker to allow injection of a custom ICircuitController implementation.
Thoughts? |
Seriously impressed with your explanation - I take my hat off to you... you've brought clarity to a subject that is tough to describe. Personally, I had seen some very specific use cases being discussed that I simply would not encounter. But to answer the specific question:
Well, in the systems I typically use for my mission-critical backends, you can 'planet scale' Cosmos DB as long as your budget lasts, and you can autoscale your Redis cluster until your credit card melts. If systems such as these, with millisecond failover times to geographically separate mirrors, are not sufficient, a Polly circuit breaker with distributed state isn't going to add much other than another point of failure. In the scenarios I described with Redis and a local cache, you have built-in resiliency for every node of every type of service. I handle offline capability in my client, and attempt to resolve data issues when the system is available again.
So, I might pull back from this topic, as I don't want to muddy the water. Each of us will have our own requirements; I would just advise that whatever solution is settled on is as pluggable as possible, so users can make their own choice for simple, advanced, and industrial-strength SkyNet-level resiliency of their circuit breaker state. Basically, let the users decide from all the options:
So, my vote is for (d) Only adapt Polly to allow injection of new circuit-breaker implementations. :) Very interested to see what others' opinions are... the more eyes on this, the better... |
Fantastic write-up, guys, of all the points so far. As someone who has implemented something like this, I went down the route suggested by @mcquiggd: I implemented a decorator above Polly that used Redis caching to hold state. It isn't perfect, but it's good enough for our needs. Therefore, although I would love an out-of-the-box solution to take the problem away, I agree that this is hard and there are so many possibilities for how to implement it that I'd rather have an extension point / interface that I can hook into instead. Polly Contrib would be nice and, eventually perhaps, this goes down the NHibernate route of incorporating battle-tested, stable contribs into Polly itself, but that's for further down the road. I'd vote for (d), but a simple OOB contrib to demonstrate an example would also be helpful. |
I am keen on the (d) extension-point idea as a way forward for this too. @rahulrai-in : would that suit your needs too? |
Hi,
I will go through this today and add my inputs.
|
The serverless compute model makes sharing state like this for circuit breakers absolutely necessary. Right now, local circuit-breaking requires each node to learn on its own that a given dependency is down. If your instances are long-lived with persistent in-memory state, that's OK: e.g. on the first 5 requests an instance sees failure, and then circuit-breaks the dependency. Those 5 requests are essentially a learning period. However, with serverless compute models (e.g. AWS Lambda or Azure Functions), each instance's internal memory/state is highly transitory. In the extreme, each instance only survives long enough to make a single request to the dependency, and thus no node would survive long enough to learn the dependency's condition. The system would become equivalent to one with no circuit breaking at all. Sharing state becomes necessary so that all of the nodes contribute their learnings together. |
@heneryville 👍 That's an excellent/compelling use case for this feature. |
I would like to prioritize this for 2018. I'm also interested in taking the lead on this, as I've recently run across a good use case when using a Circuit Breaker in an Azure Function that could automatically scale out to multiple instances. This is where a distributed Circuit Breaker would come in as a useful way to make sure multiple instances aren't overloading a downstream service. Right now, when your Azure Function scales to, say, 20 instances, they each have their own Circuit Breaker instance, so one tripping a breaker won't cause another to stop processing. @reisenberger had suggested calling a Distributed Circuit Breaker a Circuit Breaker as a Service, or "seabass". I like it :) |
Part of the work towards Polly v6.0 envisages refactoring the circuit-breaker to allow injection of custom ICircuitController implementations. |
To follow up on the point well made by @heneryville: Azure Durable Functions now offer state persistence and long-running lifetimes, with triggers to rehydrate, for example to respond to external processes calling back to continue a defined workflow. Microsoft are adding multiple persistence backends for Azure Durable Functions. They could be considered a lightweight alternative to Logic Apps. Personally, I use them for executing other Functions in a defined order - their state is persisted automatically. Perhaps there are lessons we can learn from this approach. |
Is this still up for grabs? I'd love to take this up. A little background: I work at Microsoft and we use Polly in everything we do. Personally, I have a decent foundation in distributed systems theory, and at work I have experience with microservice-based architectures. I'd need a lot of help along the way, but I'm committed to putting in the time and effort to see this through. I went through the thread, and it looks like a good place to start. |
@utkarsh5k That's great to hear! It would be great to have more developer power on this. In (scarce) spare hours in the last few weeks I coincidentally started on point (d) of this comment, which is simpler than the original proposal:
(d) Only adapt Polly to allow injection of new circuit-breaker implementations
That is: refactoring the circuit-breaker engine so that it provides a better ICircuitController seam (or similar) for the injection of custom ICircuitController (or similar) implementations. And simultaneously: refactoring the existing controller of the original circuit-breaker so that it could take (by injection) an IConsecutiveCountCircuitBreakerStateStore (and refactoring the existing in-memory implementation to fulfil this). Building a distributed consecutive-count circuit-breaker would then be a matter of coding, say, a StackExchange.Redis (or similar) implementation of IConsecutiveCountCircuitBreakerStateStore. Comments welcome! Perhaps I should take a few more days to see if I can progress that, but also look for a good place to share the work out? |
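To make that seam a little more concrete, here is a rough, speculative sketch of what an IConsecutiveCountCircuitBreakerStateStore and a StackExchange.Redis-backed implementation might look like; only the interface name comes from the comment above, and the member shapes are guesses, not an agreed Polly API (the Redis calls are real StackExchange.Redis API).

```csharp
using System;
using System.Threading.Tasks;
using StackExchange.Redis;

// Speculative shape for the state a consecutive-count breaker needs to share:
// the consecutive-failure count and the broken-until timestamp.
public interface IConsecutiveCountCircuitBreakerStateStore
{
    Task<int> IncrementConsecutiveFailureCountAsync(string circuitKey);
    Task ResetConsecutiveFailureCountAsync(string circuitKey);
    Task<DateTimeOffset?> GetBrokenUntilAsync(string circuitKey);
    Task SetBrokenUntilAsync(string circuitKey, DateTimeOffset brokenUntilUtc);
}

public class RedisConsecutiveCountCircuitBreakerStateStore : IConsecutiveCountCircuitBreakerStateStore
{
    private readonly IDatabase redis;

    public RedisConsecutiveCountCircuitBreakerStateStore(IConnectionMultiplexer connection)
        => redis = connection.GetDatabase();

    public async Task<int> IncrementConsecutiveFailureCountAsync(string circuitKey)
        => (int)await redis.StringIncrementAsync($"{circuitKey}:failures");

    public Task ResetConsecutiveFailureCountAsync(string circuitKey)
        => redis.StringSetAsync($"{circuitKey}:failures", 0);

    public async Task<DateTimeOffset?> GetBrokenUntilAsync(string circuitKey)
    {
        RedisValue value = await redis.StringGetAsync($"{circuitKey}:brokenUntil");
        return value.HasValue ? DateTimeOffset.Parse((string)value) : (DateTimeOffset?)null;
    }

    public Task SetBrokenUntilAsync(string circuitKey, DateTimeOffset brokenUntilUtc)
        => redis.StringSetAsync($"{circuitKey}:brokenUntil", brokenUntilUtc.ToString("O"));
}
```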
I am all for an extensible model for the circuit breaker. There should be a single interface which the providers can implement.
|
@reisenberger I agree that the Polly interface should be pluggable for the providers. Having said that, would love to see the current work that you have done on:
|
Question: Has anyone tried using Polly exception handling in Azure Durable Functions orchestrator functions? I was thinking this could give us distributed circuit breaking just outside of Polly. |
Hi @harouny, thanks for the question. If anyone has tried this, or can see a workable possibility, please say. Unfortunately, I cannot see that Azure Durable Orchestration Functions with the existing Polly circuit-breaker could offer a path to a Distributed Circuit Breaker:
|
To look at other possibilities: it may be possible to preserve circuit-breaker state across Azure function calls using the approach in this article or this SO discussion. If anyone tries this, please report back. Persisting the circuit stats/state to the Azure function's local file system should work to create a distributed breaker, but it requires similar engineering within Polly to what we need to do to make eg Redis a backing-store option for circuit state. And the network share/file-lock contention issues might well not be favourable 🙂, compared to using (say) Redis. |
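If anyone does experiment with that file-system approach, a very rough sketch of the idea follows; this is a hypothetical helper, not part of Polly, and it assumes the %HOME% content share that Azure Functions instances have in common (the locking/contention caveat above applies).

```csharp
using System;
using System.IO;
using System.Text.Json;

// Hypothetical helper: persists simple circuit stats/state to the function app's shared
// content storage (%HOME%), so state survives individual function invocations/instances.
public static class FileCircuitStateStore
{
    private static string PathFor(string circuitKey) =>
        Path.Combine(Environment.GetEnvironmentVariable("HOME") ?? Path.GetTempPath(),
                     "circuit-state", $"{circuitKey}.json");

    public static void Save(string circuitKey, object stats)
    {
        var path = PathFor(circuitKey);
        Directory.CreateDirectory(Path.GetDirectoryName(path));
        // No file locking shown; share/lock contention is the known caveat of this approach.
        File.WriteAllText(path, JsonSerializer.Serialize(stats));
    }

    public static T Load<T>(string circuitKey) where T : new()
    {
        var path = PathFor(circuitKey);
        return File.Exists(path) ? JsonSerializer.Deserialize<T>(File.ReadAllText(path)) : new T();
    }
}
```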
Is this still being considered as a feature, or possibly even being worked on? I'm looking to implement the pattern in an Azure Function, so am facing the same considerations as @heneryville in that a distributed state is not a nice optimization, but an essential part of having a circuit breaker at all. Using a If this is a capacity issue, I'd be happy to contribute to the development effort, and/or beta-test anything that may already be in a testable state. |
@AnnejanBarelds I have a working prototype for an Azure-functions-native circuit-breaker based on the new (preview) Durable Entity Functions. I'll aim to post something in the next few days. |
@reisenberger Sorry for the delay in response. I'd be interested in seeing how this works when there's something to share. Having said that, I'm not sure that I would move to Durable Functions, just to get a circuit breaker working. So there would still be value in being able to inject just the external state management to a state of the user's choosing, much like your suggestion in #287 (comment). |
Just wondering if there is any update on this? For my own use case of a Circuit Breaker with Azure Functions, I've reached the conclusion that I need more than just to externalize state while the Circuit Breaker logic runs inside the Azure Function. The reason is that I must be able to either stop or disable the entire Function, as opposed to just suspending calls to downstream systems. I'll try to clarify.
Imagine a Function that triggers off of a Service Bus Queue (i.e. just a plain messaging queue), does some processing, and then persists the result in a downstream system (i.e. a database or an API or whatever). If I just suspend calls to that downstream system when it seems to be in trouble, I'll still be processing messages in the Function, which can now no longer complete the message, but must either abandon or expire the lock, or dead-letter the message. Neither is desirable, because abandoning or expiring the lock can only be done a preset number of times for a message before it is automatically dead-lettered. And I don't want to dead-letter; I want to process the queue when the downstream system is healthy again. So instead of just not calling my downstream system, I need to suspend processing messages entirely, i.e. my Function should no longer trigger on new messages on the queue: it should be either stopped or disabled.
But once my Function is stopped or disabled, it will never re-evaluate and possibly close the circuit again, unless that logic runs somewhere outside the Function itself. So even though I can disable the Function from within (which is what I have in place in my own systems right now), a full-blown solution would require some other runtime to close the circuit, i.e. to re-enable the Function.
@reisenberger Do you see any future in which this would somehow become part of Polly, or hook into some sort of extension point? The offer still stands to contribute to this in any sort of way. But I want to create something to solve this one way or another, so if this does not fit Polly, I'll try my hand at an alternative solution. |
Hey @AnnejanBarelds. The description of your use case makes complete sense; I'd adopt the same/similar strategy. I'll aim to comment further by 9 September latest. By 30 September latest, the Polly team intends to publish, as a separate repository, a stateful distributed circuit-breaker: hosted in Azure Functions; implemented using the new durable entity functions; consumable from within any Azure function as a circuit-breaker that is stateful across stateless function invocations; and equally consumable from anywhere outside Azure Functions via an HTTP/S API. |
That sounds like a very promising release @reisenberger ! I'll be sure to take it for a test drive once it's released. Also looking forward to any additional comments you may have on this. |
Sorry, I did not read @reisenberger's last comment. It's now clear to me that you are working on it and continuing to test. Cool! |
@AnnejanBarelds To your use case: it makes complete sense (I've been round exactly this set of reasoning in another context). On circuit breaking, you effectively want to unsubscribe, to avoid rejecting and thus dead-lettering messages. With an Azure Service Bus queue-triggered function, the mechanism to 'unsubscribe' is probably to disable that individual function or function app. You'd then need something else to 'wake up' (re-enable) the function after the circuit break duration expired. Options to re-enable the function could be anything in Azure that you can trigger for a time in the future (invoke a separate durable function which delays until that time; an ASB scheduled message; ...).
One possible concern could be if that single re-enable-the-circuit mechanism fails somehow (not published, missed, lost, fails to process, etc.) - suddenly you have a self-inflicted black-out of the original service. Of course, you could monitor for that (one might be monitoring for an unexpectedly long black-out of the original service anyway). For a background worker processing messages off a queue (meaning that the exact time to re-enable the circuit is not as sensitive as it might be for a user-facing operation), an alternative could be something like a timer-triggered function firing every minute or two to check whether to re-enable the circuit (ie just check periodically, to eliminate relying on a single re-enable-the-circuit event which could be lost). That could read the circuit-is-blocked-until time (the forthcoming Polly durable circuit breaker will allow that), or even check whether the underlying cause (database, API etc.) has recovered. |
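A hedged sketch of that periodic check as a timer-triggered function is below; the state-reading and re-enable helpers are hypothetical placeholders (a real implementation would call whatever store holds the circuit-is-blocked-until time, and the Azure management/Kudu API, respectively), and the "orders-processor" name is illustrative.

```csharp
using System;
using System.Threading.Tasks;
using Microsoft.Azure.WebJobs;
using Microsoft.Extensions.Logging;

public static class CircuitReEnabler
{
    // Fires every two minutes: checking periodically avoids relying on a single
    // re-enable event which could be lost.
    [FunctionName("CircuitReEnabler")]
    public static async Task Run([TimerTrigger("0 */2 * * * *")] TimerInfo timer, ILogger log)
    {
        DateTimeOffset? blockedUntil = await GetBlockedUntilAsync("orders-processor");

        if (blockedUntil == null || blockedUntil <= DateTimeOffset.UtcNow)
        {
            log.LogInformation("Circuit block has expired; re-enabling the queue-triggered function.");
            await EnableFunctionAsync("orders-processor");
        }
    }

    // Hypothetical placeholders: read the blocked-until time from wherever circuit state is
    // kept, and re-enable the function/function app (eg via the Azure management API).
    private static Task<DateTimeOffset?> GetBlockedUntilAsync(string circuitKey) =>
        Task.FromResult<DateTimeOffset?>(null);

    private static Task EnableFunctionAsync(string functionName) =>
        Task.CompletedTask;
}
```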
Polly.Contrib.AzureFunctions.CircuitBreaker is now out in preview!
Comments/questions about that implementation are best raised over on Polly.Contrib.AzureFunctions.CircuitBreaker; x-ref: #687. |
Awesome! I'll take it for a test drive as soon as I have the time; hopefully next week. I'll be sure to provide my feedback, in praise or otherwise ;). |
Excellent news... and perfect timing for me. Will take a look :) |
Thanks @AnnejanBarelds @mcquiggd. It would be awesome to have any additional contributions you guys want to make too 😉. I know you are both interested in this use case; the more power, the further we'll get! (Indeed, the same goes for anyone wanting to contribute ... the Polly team is only a small team.) |
For all interested: Jeff Hollan has now released a project https://github.com/jeffhollan/functions-durable-actor-circuitbreaker, which demonstrates another durable-entity variant on the circuit-breaker pattern in Azure Functions. It differs from Polly.Contrib.AzureFunctions.CircuitBreaker in that when the circuit breaks, it is pre-coded to fire an event which will disable a Function App. If this is your need (/cc @AnnejanBarelds ), then it's a direct example of an implementation. As previously discussed, the pattern would need to be augmented by a further process (for example a separate timer-triggered function or time-delaying orchestrator function) which re-closes the circuit (permits executions) after a given period of time. (A similar pattern - placing a call to disable a function app - could also be adapted onto Polly.Contrib.AzureFunctions.CircuitBreaker, by extending the code which breaks the circuit, here.) |
Closing this now-historic discussion: Polly.Contrib.AzureFunctions.CircuitBreaker was released in September, and is flagged up separately in the readme and in the announcement issue. |
Detailed architecture proposal for a Distributed Circuit Breaker. TL;DR summary will also follow.
Background: What is a Distributed Circuit Breaker?
In a distributed system with multiple instances of an application or service A, there could be an advantage in sharing information about circuit state between instances of upstream system A which all call downstream system B.
The intention is that if the circuit-breaker in node A1 governing calls to system B detects "system B is down" (or at least, breaks), that knowledge could be shared to nodes A2, A3 etc, such that they could also break their circuits governing calls to B if deemed appropriate.
The need for quorum decisions
@reisenberger (and others) observed that a simple implementation - in which all upstream nodes A1, A2, A3 etc simply share the same circuit state - risks a catastrophic failure mode.
The assumption of a simple implementation like this is that if the circuit in A1 governing calls to B breaks, this "means" system B is down. But that is only one of three broad causes for the circuit in A1 governing calls to B breaking. Possibilities include:
(1) node A1 has some internal problem of its own (eg resource starvation, possibly due to unrelated causes)
(2) the network path between node A1 and B is faulting (perhaps only locally; other network paths to B may not be affected)
(3) system B is down.
In scenarios (1) and (2), if A1 wrongly tells A2, A3, A4 etc that they cannot talk to B (when in fact they can), that would entirely negate the redundancy value of having those horizontally scaled nodes A2, A3 etc ... a (self-induced) catastrophic failure.
As commented here, a quorum approach can mitigate this concern.
An extra dependency in the mix
Co-ordinating circuit state between upstream nodes A1, A2, A3 etc also introduces an extra network dependency, which may itself suffer failures or latency.
(Note: I have since discovered the Hystrix team echo these concerns.)
Why then a Distributed Circuit Breaker?
The premise is that when state co-ordination works well - and where empirically in some system, the causes of such circuits breaking are predominantly cause (3) not (1) or (2) - then a distributed circuit breaker may add value.
Without a distributed circuit, upstream nodes A1, A2, A3 etc may each have to endure a number of slow timeouts (say), before they decide to break. With a distributed circuit, knowledge of failing to reach B can be shared, so that other upstream A-nodes may elect to break earlier and fail faster.
The classic application would be in horizontally-scaled microservices systems.
Proposed architecture for DistributedCircuitBreaker
Let us use the terminology DistributedCircuitBreaker (for the distributed collection of circuit breakers acting in concert) and DistributedCircuitBreakerNode for a local circuit-breaker node within that.
DistributedCircuitNodeState
A DistributedCircuitNodeState instance would hold the last-known state for an individual node:
IDistributedCircuitStateMediator
An IDistributedCircuitStateMediator implementation would manage sharing knowledge of node states. Implementations would be in new nuget packages outside the main Polly package (eg Polly.DistributedCircuitStateMediator.Redis), to prevent the main Polly package taking dependencies.
Note: I have avoided calling this a 'state store', as implementations may not always be a store (eg async messaging). However, thinking of a state store may be easier for quick grokking. Suggestions for other names welcome!
IDistributedCircuitStateArbiter
Given knowledge of the states of the nodes active in the distributed circuit, an IDistributedCircuitStateArbiter would take the quorum decision on whether a 'distributed break' should occur (all nodes should break). Implementations might be, for example:
Separation of concerns between IDistributedCircuitStateMediator and IDistributedCircuitStateArbiter should allow clean, decoupled implementations and unit-testing.
DistributedCircuitController: How local nodes take account of distributed circuit state
Local nodes must govern calls both according to their own (local) breaker state/statistics, and according to distributed circuit state. The CircuitController is the existing element which controls circuit behaviour ('Should this call be allowed to proceed?'; 'Does the result of this call mean the circuit needs to transition?'). Decorating the local CircuitController with a DistributedCircuitController fits our needs:
Where the CircuitController instance would check if a call through the circuit is allowed to proceed, the DistributedCircuitController would intercept and, additionally (each call, or at configurable intervals), consult the IDistributedCircuitStateMediator and IDistributedCircuitStateArbiter to see if the local node should break due to distributed state.
Where the local CircuitController transitions to open due to local statistics, the DistributedCircuitController would inform the IDistributedCircuitStateMediator of its break, and until when it is locally broken.
Where the local CircuitController transitions back to Closed, the DistributedCircuitController would/could intercept that and also inform the IDistributedCircuitStateMediator.
(Question: Do other local circuit states/transitions need communicating to the central IDistributedCircuitStateMediator too, or not? For the purposes of controlling distributed breaking, we may only be interested in whether nodes are broken due to local causes and until when; we don't care about other states. However, in a distributed circuit environment, the IDistributedCircuitStateMediator could make a useful central info source for node states, for dashboarding, if it tracks all states.)
Refactoring the existing CircuitController may be needed to expose methods which can be intercepted, for all these events. Using the decorator pattern also meets other requirements:
Preventing self-perpetuating broken states
Code needs to distinguish whether a circuit is broken due to local events, or due to distributed circuit state. This probably implies a new state, CircuitState.DistributedBroken.
Consider that without this, it would be possible to create a system which engendered self-perpetuating broken states. For example, a rule "break all nodes if 50% are broken" would, when triggered, lead to 100% broken. One node recovering might lead to (say) 90% broken, which is still >50%, so the node breaks again ... looping back to (permanently) 100% broken.
Handling CircuitState.Isolated
Should CircuitState.Isolated influence distributed state (cause other nodes to break)? Likely not - users have more control if they can isolate individual nodes separately.
Enrolment syntax
A syntax for enrolling a local circuit-breaker in a distributed circuit is needed. Examples:
Static method:
Fluent postfix:
A further post on possible implementations/patterns for IDistributedCircuitStateMediator will follow.
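To make the shapes above concrete, a hypothetical sketch of the proposed types and an enrolment syntax follows; apart from the type names taken from the proposal, every member name and signature here is illustrative only, not an agreed Polly API.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

// Mirrors Polly's CircuitState values, plus the proposed DistributedBroken state.
public enum CircuitState { Closed, Open, HalfOpen, Isolated, DistributedBroken }

// Last-known state for an individual node in the distributed circuit.
public class DistributedCircuitNodeState
{
    public string NodeId { get; set; }
    public CircuitState State { get; set; }
    public DateTimeOffset LastUpdatedUtc { get; set; }
    public DateTimeOffset? LocallyBrokenUntilUtc { get; set; } // when a locally-broken node expects to re-close
}

// Manages sharing knowledge of node states (eg via Redis, NCache, or async messaging).
public interface IDistributedCircuitStateMediator
{
    Task PublishNodeStateAsync(string distributedCircuitKey, DistributedCircuitNodeState nodeState);
    Task<IReadOnlyCollection<DistributedCircuitNodeState>> GetNodeStatesAsync(string distributedCircuitKey);
}

// Takes the quorum decision: given the known node states, should a distributed break occur?
public interface IDistributedCircuitStateArbiter
{
    bool ShouldDistributedBreak(IReadOnlyCollection<DistributedCircuitNodeState> nodeStates);
}

// Possible fluent-postfix enrolment syntax (illustrative only); the CircuitBreakerAsync call
// is existing Polly API, while the .DistributedBy(...) postfix is hypothetical:
//
// var breaker = Policy
//     .Handle<HttpRequestException>()
//     .CircuitBreakerAsync(exceptionsAllowedBeforeBreaking: 5, durationOfBreak: TimeSpan.FromSeconds(30))
//     .DistributedBy("upstream-A-to-M", mediator, arbiter);
```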