
Explore "Adaptive Safety Control" Approaches #131

Closed
benjchristensen opened this issue Mar 27, 2013 · 14 comments
Comments

@benjchristensen
Contributor

A good article by William Louth (Twitter @williamlouth) talks about adaptive systems versus circuit breaker patterns: http://www.jinspired.com/site/jxinsight-opencore-6-4-ea-11-released-adaptive-safety-control

I'd like to use this issue to explore whether there are principles there that can be applied to Hystrix, since Twitter is too limited for this discussion and that blog doesn't take comments.


Some background for the discussion (brain dump so forgive bad grammar and stupid thoughts):

Circuit Breaker

The circuit breaker pattern gets way too much credit in Hystrix. It is the concept people seem to grab onto, but it is a very minor aspect of the Hystrix implementation, and I have quite publicly stated that it is just "icing on the cake" and a "release valve" when the underlying system is known to be bad.

The reason (as William states in his article) is that it's an all-or-nothing affair and it is reactive.

For example, at the traffic velocity of the most-used HystrixCommands on the Netflix API system, we would be dead by the time a circuit breaker could do anything about it.

The concurrent execution limits via semaphores or threadpools are the actual thing that prevents underlying latency from saturating all application resources.

System Characteristics

Hystrix was obviously designed for the use cases Netflix has and doesn't necessarily apply to different architectures or scale down very well.

Some of these factors are:

  • most backend systems are accessed via synchronous client libraries over blocking IO (some use non-blocking IO which Hystrix will better support at some point as they have different resource utilization and failure scenarios - Asynchronous Executables #11)
  • application clusters scale instance counts from ~600 cores to ~3200 cores each day (number of instances depends on instance type ... but it can be ~100 on the low end to 1200+ on the high end)
  • the backend consists of 150+ different functional services (modeled as HystrixCommand instances) in 40+ groups (representing backend system resources and modeled generally by thread-pool for isolation groups)
  • some functionality has good fallbacks, some can fail and cause graceful degradation of user experience and others must fail fast as they are required

Behavior - Adaptive, Proactive, Reactive?

This depends on which aspect of Hystrix is being looked at ...

_Concurrency Limits and Timeouts_

These are the proactive portion of Hystrix - they prevent anything from going beyond limits and throttle immediately. They don't wait for statistics or for the system to break before doing anything. They are the actual source of protection given by Hystrix - not circuit breakers. (see diagram for the decision flow: https://github.com/Netflix/Hystrix/wiki/How-it-Works#wiki-Flow)

We configure concurrency limits using semaphores or thread-pools based on simple math - 99th percentile latency when the system is healthy * peak RPS = needed concurrency. Timeouts are set with similar logic.

The goal is an order of magnitude, not an exact value. The principle is that if something needs 2-3 concurrent threads/connections, let's give it 10, but not 50, 100 or 200. This constrains what happens when the median becomes the 99th and the 99th multiplies.
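
For concreteness, here is roughly what that sizing looks like when fed into a command's configuration. The numbers and command are hypothetical, and the property setter names are from the newer (1.4.x-style) builders, so treat this as a sketch rather than our actual configuration:

```java
import com.netflix.hystrix.HystrixCommand;
import com.netflix.hystrix.HystrixCommandGroupKey;
import com.netflix.hystrix.HystrixCommandProperties;
import com.netflix.hystrix.HystrixThreadPoolProperties;

public class GetUserProfileCommand extends HystrixCommand<String> {

    public GetUserProfileCommand() {
        super(Setter
                .withGroupKey(HystrixCommandGroupKey.Factory.asKey("UserService"))
                // Hypothetical numbers: healthy 99th percentile ~200ms * peak ~30 RPS
                // => ~6 concurrent executions needed, so give it 10 (not 50 or 100).
                .andThreadPoolPropertiesDefaults(
                        HystrixThreadPoolProperties.Setter().withCoreSize(10))
                // Timeout chosen with the same order-of-magnitude logic.
                .andCommandPropertiesDefaults(
                        HystrixCommandProperties.Setter()
                                .withExecutionTimeoutInMilliseconds(500)));
    }

    @Override
    protected String run() {
        // Placeholder for the blocking client call this command would wrap.
        return "user-profile";
    }
}
```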

We have considered making this value adaptive - using metrics over time to adjust it dynamically. In practice, though, we have not found a need to do this, as systems rarely change behavior enough to need a change in their order of magnitude once first configured, and when they do it is the obvious result of a system modification (a new code push, etc.).

Making it adaptive would have to take into account long-term trends (at least 24 hours, probably longer), otherwise a slow increase in latency and concurrency needs could "boil the frog" and allow the limit to creep up until it has stopped providing the protection it was put there for in the first place.
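
If someone did want to experiment with an adaptive limit, a drift guard could be as simple as clamping the adapted value to a band around a long-term baseline. This is a purely hypothetical sketch (names and numbers made up), not anything Hystrix does:

```java
// Illustrative only. The limit adapts to recent demand, but is clamped to a band
// around a long-term baseline (e.g. the last 30 days) so a slow "boil the frog"
// drift can never quietly raise it past the point where it still sheds load.
public final class DriftGuardedLimit {

    private DriftGuardedLimit() {}

    public static int adaptiveLimit(int recentPeakConcurrency, int longTermBaseline) {
        int proposed = (int) Math.ceil(recentPeakConcurrency * 1.5); // headroom over recent need
        int ceiling = longTermBaseline * 2;                          // never drift above 2x baseline
        int floor = Math.max(1, longTermBaseline / 2);               // never collapse to zero either
        return Math.min(ceiling, Math.max(floor, proposed));
    }
}
```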

We have found, after operating 150+ commands at high volume for over a year, that we don't have a desire for adaptive changes to what the concurrency limits or timeouts should be, as that complicates the system, makes reasoning harder, and opens up the ability for "drift" to just raise the limits over time and expose vulnerability.

_Circuit Breaker_

Circuit breakers are reactive. They kick in after statistics show a HystrixCommand to be in a bad state (resulting from failures, timeouts, concurrency throttling, etc.). It's a release valve to skip what we have statistically determined to be bad instead of trying every time. It helps the underlying system by reducing load and gets the user a response (fallback or failure) faster.
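
For readers who haven't dug into the implementation, the trip/recover decision is driven by a handful of properties like these (illustrative values, close to the defaults):

```java
HystrixCommandProperties.Setter()
        // Only evaluate the breaker once at least 20 requests are seen in the rolling window.
        .withCircuitBreakerRequestVolumeThreshold(20)
        // Trip open when >= 50% of those requests failed, timed out, or were rejected.
        .withCircuitBreakerErrorThresholdPercentage(50)
        // Stay open for 5 seconds, then allow a single trial request through to test recovery.
        .withCircuitBreakerSleepWindowInMilliseconds(5000);
```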

On a single application instance this is not a very "adaptive" thing - it's an on/off switch.

However, at scale it is actually quite adaptive because each HystrixCommand on each instance makes independent decisions.

What we see in practice in production is that when a backend has an error rate high enough to cause issues, but not enough to shut the entire thing down, the circuits on individual instances trip open/closed in a rolling fashion back and forth across the fleet, as each instance uses its own view of the world to make decisions.

This screenshot of one circuit during such a situation demonstrates the behavior. Note the circuit opening and closing, and how the different counts represent different types of throttling and rejection occurring while most traffic still succeeds:

[Screenshot: "Screen Shot 2013-03-27 at 11 58 30 AM" - a single circuit flipping open/closed with throttling and rejection counts while most traffic succeeds]

It very dynamically reduces load so that a percentage of traffic can succeed, backs off, tries again, and so on.

We have many times considered if we should make the logic of the circuit breaker more "adaptive" like a control valve that constrains a percentage of traffic depending on algorithms and statistics.

Every time we consider it, we decide not to, because it makes reasoning about the system harder and because at our scale we already effectively get this behavior due to the large size of the fleet.

When Hystrix Doesn't Work

The principles above will not work well if the cluster size is 1 or a very small number of boxes. In that case a more adaptive algorithm would likely be preferable to the on/off switch of a circuit breaker - or just turn off the circuit breaker and use only the concurrency/timeout protection.
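
That option looks roughly like this with the property builders - keep the proactive protections, disable the breaker. Values are illustrative and the exact setter names vary a bit across Hystrix versions:

```java
// Sketch of the "just turn off the circuit breaker" option for a tiny cluster:
// keep the proactive concurrency/timeout protection, drop the on/off switch.
HystrixCommandProperties.Setter()
        .withExecutionIsolationStrategy(
                HystrixCommandProperties.ExecutionIsolationStrategy.SEMAPHORE)
        .withExecutionIsolationSemaphoreMaxConcurrentRequests(10)
        .withExecutionTimeoutInMilliseconds(500)
        .withCircuitBreakerEnabled(false);
```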

Also, if an application only has 2 or 3 critical backend components without any reasonable fallbacks or graceful degradation then Hystrix won't be able to help much. Constraining the single service your app needs breaks the user experience - the only value it would then give is very detailed and low-latency metrics and quick recovery when the backend service comes back to life - but it won't help you be resilient during the problem since there's nothing else to do.


With the above thoughts on the matter I'm curious as to where more "adaptive" approaches would make sense.

Do they provide benefit to a large system like Netflix or just make a more complicated version of the same end result?

Are they critical to make something like Hystrix work on a smaller system?

Even with an adaptive approach and a "valve" instead of a "circuit", it still means shedding load, failing fast, and doing fallbacks. Is that any different from what already happens with circuits opening/closing independently and rolling around a large fleet?

Other than the circuit breaker (which is already a limited aspect of Hystrix) where else would this concept apply?

Thoughts, insights, (intelligent) opinions etc. welcome ... I'm interested in whatever the best ideas are for operating a resilient system.

@benjchristensen
Contributor Author

If someone reading this hasn't seen the visualization before, it can help demonstrate the behavior of many HystrixCommands operating across a large cluster.

http://www.youtube.com/watch?v=zWM7oAbVL4g

@benjchristensen
Contributor Author

Here is further commentary on this subject via Twitter:

@benjchristensen adaptiveness reflects not just workload dynamics but the env, multi h/w targets for same app, a changes in depend. services

@benjchristensen wrt to your "reasoning" comment...the new data model for mgmt is not metrics but actions by valves on data u need not see

@benjchristensen my definition of scalability includes both up & down movements..the performance of a service should be entirely independent

@benjchristensen adaptive control valves also determine optimal workload levels dynamically http://t.co/l0z2GV7Zo1 http://t.co/oFlyLizxoR

@johngmyers

Have you considered using adaptive approaches in your load balancing layer, applying concurrency limits and circuit breakers to individual service instances? Ribbon appears to depend on a probe function and an optional callback giving it definitive information as to when an instance is down. Would it not make sense for load balancing code to use Hystrix-like techniques to direct traffic away from poor-performing instances?

Admittedly this would behave poorly if the resulting system were overdamped.

@allenxwang

Currently Ribbon is adaptive to server health conditions in the following ways:

  • If consecutive connect/read failures occur on an instance, the instance is taken out of round robin for a period of time, which increases exponentially with further failures (see the sketch after this list).
  • A configurable concurrency limit can be set to prevent an instance from taking further traffic once the limit is reached.
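
A rough sketch of the blackout idea in the first bullet (illustrative only, not Ribbon's actual implementation):

```java
// Illustrative only. Each additional consecutive connect/read failure doubles the
// window during which round robin skips the instance, capped so it is eventually retried.
public final class BlackoutWindow {

    private BlackoutWindow() {}

    public static long blackoutMs(int consecutiveFailures, long baseMs, long maxMs) {
        if (consecutiveFailures <= 0) {
            return 0L;
        }
        long window = baseMs * (1L << Math.min(consecutiveFailures - 1, 16));
        return Math.min(window, maxMs);
    }
}
```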

These limits/thresholds are configurable as dynamic properties which can be changed at runtime without a server restart. However, there is no intelligence for the load balancer itself to automatically adjust them.

We can make these measures more adaptive by detecting changes in certain stats. Any suggestions?

@benjchristensen
Contributor Author

A book just came out applicable to this subject:

Feedback Control for Computer Systems
By Philipp K. Janert
http://shop.oreilly.com/product/0636920028970.do

@johngmyers

In my software load balancer, I'm taking a page from the Hystrix semaphore approach. When selecting an instance from a service pool, it only considers instances with the fewest outstanding concurrent requests. Thus an instance that is slow or hanging gets chosen less often.
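
Roughly this selection rule, sketched with hypothetical types rather than the real load balancer code:

```java
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.atomic.AtomicInteger;

// Hypothetical types for illustration only -- not the actual load balancer code.
class Instance {
    final String address;
    final AtomicInteger inFlight = new AtomicInteger();

    Instance(String address) {
        this.address = address;
    }
}

class LeastOutstandingChooser {
    // Pick the instance with the fewest in-flight requests; a slow or hanging
    // instance accumulates outstanding requests and is naturally chosen less often.
    Instance choose(List<Instance> pool) {
        return pool.stream()
                .min(Comparator.comparingInt((Instance i) -> i.inFlight.get()))
                .orElseThrow(IllegalStateException::new);
    }
}
```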

@mattrjacobs mattrjacobs added this to the 1.5 milestone Dec 19, 2014
@mattrjacobs mattrjacobs modified the milestone: 1.5.x Mar 3, 2015
@Petikoch

Hi @benjchristensen,

It's been a while since you raised the question of using elements of "control theory" in software design/engineering.

This week I came down the same line of thinking (I didn't know about control theory before). Fascinating examples for me are an adaptive thread pool (something like a PID controller adjusting the number of threads within a certain range based on queue length, system load and number of CPU cores) or an adaptive sampler (adjusting the number of samples per time unit based on the "cost" of taking a sample)...
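
Something like this is what I mean for the thread-pool case - just a sketch with made-up gains and bounds, and only the proportional part of a PID controller:

```java
import java.util.concurrent.ThreadPoolExecutor;

// Rough sketch of the adaptive thread-pool idea: plain proportional control of the
// pool size by queue length. A real PID controller would also use the integral and
// derivative of the error, plus system load and CPU count.
class PoolSizeController {

    private final ThreadPoolExecutor pool;
    private final int minThreads;
    private final int maxThreads;
    private final int targetQueueLength;

    PoolSizeController(ThreadPoolExecutor pool, int minThreads, int maxThreads, int targetQueueLength) {
        this.pool = pool;
        this.minThreads = minThreads;       // keep >= 1
        this.maxThreads = maxThreads;
        this.targetQueueLength = targetQueueLength;
    }

    // Call periodically, e.g. once per second from a scheduler.
    void adjust() {
        int error = pool.getQueue().size() - targetQueueLength;  // positive => falling behind
        int delta = error / 10;                                  // proportional gain of 0.1
        int next = Math.max(minThreads, Math.min(maxThreads, pool.getCorePoolSize() + delta));
        pool.setCorePoolSize(next);
    }
}
```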

Do you still think - 2 years after opening this question - that this is something promising for us software-building people?

Thanks for your feedback & best regards,
Peti

@mattrjacobs
Contributor

@Petikoch This is something that we're still interested in pursuing. I've spent enough time manually tuning our production systems that I certainly feel the pain of not having it.

I've got a few ideas that I am hoping to get started on in the next few weeks/months. If you can expand on your above thoughts as a starting point for a design, that would be very helpful.

@benjchristensen
Contributor Author

Hi @Petikoch

Yes, I still think there is something promising here, but I have yet to figure out a concrete solution for the Hystrix use cases. I have spoken with a few Queueing Theory PhDs and gotten general directional guidance, but mostly an agreement that this is a very hard problem.

There are two books of interest on the topic, and @mattrjacobs is interested in and studying this topic as well. One of the books is listed above already. The other one (that I haven't had time to read) is:

Performance Modeling and Design of Computer Systems
by Mor Harchol-Balter
http://www.amazon.com/Performance-Modeling-Design-Computer-Systems-ebook/dp/B00ADP6ZB0/ref=tmm_kin_title_0?_encoding=UTF8&sr=&qid=

Those I've spoken with who do have expertise in this space have not yet been able to provide a direct solution for pool/queue sizes and timeouts that doesn't risk defeating the point of Hystrix (load shedding to protect the system). The best I've seen so far is to use historical data to predict future values for each HystrixCommand. That is something I've seen work elsewhere and can understand, but of course it is limited in what it offers, as it is latent and requires history. I'm also intrigued by the idea of low-latency machine learning and analytics being used, but have not had a chance to try it or see it applied to this use case.

I've even asked some other large companies if they have figured out automated ways of configuring timeouts, and so far the answer has always been "no", much to my disappointment. The solution for timeouts I've seen adopted at Google and elsewhere is instead to use "backup requests" rather than trying to time out.
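
For anyone unfamiliar with the idea, a backup request looks roughly like this (hypothetical helper, not anything in Hystrix; error handling omitted for brevity):

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.Executors;
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.TimeUnit;
import java.util.function.Supplier;

// Sketch of a "backup request": instead of relying on a timeout, fire a second
// attempt once the first has been outstanding longer than (say) the p95 latency,
// and take whichever attempt succeeds first.
final class BackupRequest {

    private static final ScheduledExecutorService SCHEDULER =
            Executors.newSingleThreadScheduledExecutor();

    private BackupRequest() {}

    static <T> CompletableFuture<T> withBackup(Supplier<CompletableFuture<T>> call, long backupAfterMs) {
        CompletableFuture<T> result = new CompletableFuture<>();
        call.get().whenComplete((value, error) -> {
            if (error == null) {
                result.complete(value);
            }
        });
        SCHEDULER.schedule(() -> {
            if (!result.isDone()) {
                // First attempt still outstanding: issue the backup request.
                call.get().whenComplete((value, error) -> {
                    if (error == null) {
                        result.complete(value);
                    }
                });
            }
        }, backupAfterMs, TimeUnit.MILLISECONDS);
        return result;
    }
}
```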

So, I very much want a solution to this that doesn't require human involvement, but I'm still unaware of one. If you can provide assistance on solving this, that would be awesome!

@Petikoch

Hi @benjchristensen and @mattrjacobs ,

Thank you very much for your feedback. I'm very short on time right now, so sorry for the brief answer.

I collected my thoughts today here: https://github.com/Petikoch/java_adaptive_control_experiments
In the next week or two I will start to implement one or more of the experiments. It's a topic which is "hot" for a customer of mine, so there will be results available soon.

I'll keep you updated.

Have a nice weekend and best regards from Switzerland,
Peti

@Petikoch

Hi @benjchristensen and @mattrjacobs ,

Today I learned about the excellent blog post from @rvanheest: http://rvanheest.github.io/Literature-Study-Feedback-Control/Simulation.html. It's heavily based on http://shop.oreilly.com/product/0636920028970.do, and @rvanheest has/had direct contact with the author.

It's an excellent introduction to the topic with a lot of nice illustrations and code examples. Like https://github.com/Petikoch/java_adaptive_control_experiments, but about a lightyear ahead ;-)

Maybe this helps.

Best regards,
Peti

@mattrjacobs
Contributor

@Petikoch Thanks for the pointers, those look very helpful! I need some time to digest all of it. Feel free to drop any other ideas/links in the meantime.

@agentgt
Contributor

agentgt commented Feb 5, 2016

I vaguely remember in college (many years ago) learning about Support Vector Machines (SVMs) being a possible fit for feedback control. SVMs, unlike many other machine learning algorithms, require little data and are supposedly very fast. It's been a long time since I have looked at machine learning, so this could be completely wrong.
