light client : failsafe crate (circuit breaker) #9790
Conversation
Looks good in general 👍 , I've left some questions/comments.
parity/cli/mod.rs
Outdated
ARG arg_on_demand_max_backoff_rounds: (Option<usize>) = None, or |c: &Config| c.light.as_ref()?.on_demand_max_backoff_rounds,
"--on-demand-max-backoff-rounds=[TIMES]",
"Specify light client maximum number of backoff iterations",
Not sure a user wants control over all these knobs; maybe it is fine to just ship good (empirically researched?) defaults, but I don't have a strong opinion on that.
I think it is good to expose the primitives to give full control, but an alternate preset (a basic configuration) could also be a nice thing.
(At this point it probably lacks documentation of the impact of every option, but maybe the wiki or another source is better for that: it seems difficult to explain quickly.)
There are a few things I would need in order to review this PR properly (I left a few review comments in the code, but they may not be pertinent).
A general explanation of how the circuit behaves; more precisely, how many circuit states are managed and what they track (global, per request, per user, ...). At this point it only makes sense to me as a global on_demand circuit state, but at the same time I feel that does not totally match the implementation.
It would also be good to have a short word on the advantages of the approach (but I think that was already discussed in related issues or PRs).
A last thing is the additional dependency on the 'spin' crate: we use parking_lot and had deadlock issues in the past, so I am a bit worried about the impact of the circuit breaker. I did not analyse it further, because I first want to be sure of our strategy when using the circuit breaker (i.e. is it global or local to each request...).
Please rebase
I refactored this, and the major thing to sort out is whether […]. However, the response guard doesn't need to use the failsafe mechanism, as you pointed out earlier, which I can remove if I should continue with this. Dump of the response:
curl --data '{"method":"eth_getTransactionReceipt","params":["0x444172bef57ad978655171a8af2cfd89baa02a97fcb773067aef7794d6913fff"],"id":1,"jsonrpc":"2.0"}' -H "Content-Type: application/json" -X POST localhost:8555
{"jsonrpc":"2.0","error":{"code":-32042,"message":"Bad response on request: [ TransactionIndex ]. Error cause was EmptyResponse, (majority count: 91 / total: 91)"},"id":1}
Mostly LGTM. A few small grumbles.
It indeed seems that the use of failsafe on `ResponseGuard` is optional (and one can also argue that `NoBackoff` doesn't look really nice). But I think that's not a huge issue -- being wrapped in `ResponseGuard` means it can easily be changed to not use failsafe, or to use other circuit breaker policies.
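As an aside, to make the "other circuit breaker policies" point concrete, here is a rough sketch of how the two policies discussed in this thread would be built with failsafe-rs. It is written from memory of the 0.3-era API (`backoff::exponential`, `failure_policy::consecutive_failures`, `failure_policy::success_rate_over_time_window`); exact parameter order and names should be checked against the crate docs, and none of it is code from this PR.

```rust
// Illustrative sketch only, not code from this PR: configuring a failsafe
// circuit breaker with either of the two failure policies discussed in this
// review (consecutive failures vs. success rate over a time window), both
// combined with an exponential backoff.
use std::time::Duration;
use failsafe::{backoff, failure_policy, CircuitBreaker, Config};

fn main() {
    // Policy A: trip the circuit after 3 consecutive failures, backing off
    // exponentially from 1s up to a 100s cap.
    let consecutive = failure_policy::consecutive_failures(
        3,
        backoff::exponential(Duration::from_secs(1), Duration::from_secs(100)),
    );

    // Policy B: trip the circuit when the success rate over a 10s window
    // drops below 50% (with at least 5 recorded calls).
    let success_rate = failure_policy::success_rate_over_time_window(
        0.5,
        5,
        Duration::from_secs(10),
        backoff::exponential(Duration::from_secs(1), Duration::from_secs(100)),
    );

    // Either policy plugs into the same breaker; a wrapper such as the PR's
    // ResponseGuard / RequestGuard hides which one is actually in use.
    let breaker = Config::new().failure_policy(consecutive).build();
    assert!(breaker.is_call_permitted());
    let _ = success_rate; // built only to show the alternative
}
```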
trace!(target: "circuit_breaker", "ResponseGuard: {:?}, responses: {:?}", self.state, self.responses);
let (&err, &max_count) = self.responses.iter().max_by_key(|(_k, v)| *v).expect("got at least one element; qed");
let majority = self.responses.values().filter(|v| **v == max_count).count() == 1;
// FIXME: more efficient with a separate counter
Would be nice if you could create a new issue and include the issue number there!
I will fix this if/when @cheme signs off on this :P
Sorry to ask, did I lock something?
Haha, I was referring to the functionality that this PR breaks, i.e. replacing the response limit per peer with a simple timeout instead. If that is not sorted out, there is no point in implementing other things IMHO!
I must admit those points are a bit old in my mind. The main thing is that LES query results should always be considered unsafe (queries go to random full nodes) and must be checked.
Checking a result for LES is easy (and done before caching, up to the CHT block header reference), but checking the absence of a result is impossible (or at least quite a difficult question).
Therefore, when you query information that does not exist, there are not many ways to accept a 'no result' reply other than saying either 1: 'I queried my entire node pool and did not get new nodes for a certain time', or 2: 'I made N random queries and consider that enough to think that at least one of those replies is accurate'. Those were the two use cases in place before.
So this PR does things differently. At first I felt this was additional complexity and did not really see its immediate usefulness, but on the other hand trying new/different things is often what enables great things in the long term.
So, trying to explain the new 'no reply' conditions, what are the correct formulations? At this point what I read is:
- 'SuccessRateOverTimeWindows' fails the response because there is no first success within the time window (0% success), which amounts to a 'simple timeout'
- request_limit_reached: same as my second condition. It is managed outside the failsafe crate (reading the failsafe crate, I just observed that there is a hard limit of 30 attempts for exponential backoff); here I think using a consecutive-failure strategy could make sense.
So it does not change that much (just a hard timeout on responses instead of my first condition: maybe simpler for the user), so I would not say it 'breaks' things.
Still, I do not really get why we use success rate over consecutive failures: the success rate can only be 0% or 100% (a request ends on its first success, and we use these constructions on a per-request basis), whereas consecutive failures have many possible states and are somewhat managed like the error counter in the request guards (I may be wrong, I did not read the crate implementation).
Edit: thinking twice, a 'no result' in LES could be seen as invalid if the query is valid for most queries. So the previous statement is far from being correct; I should check whether LES queries are validated.
Ok, thanks for taking the time, but to verify that I understood your concerns:
- This PR might still be useful (I want to add that it gives the user some information about why the request to the full node failed, or at least what the majority response was).
- The circuit breaker for responses is useless and we should replace it with a simple timeout, as you pointed out earlier, thanks for that!
- Exponential backoff for requests is useful, but because we only register bad responses here, it is better to use consecutive failures (also easier to understand).
I just observed that there is a hard limit of 30 attempts for exponential backoff; here I think using a consecutive-failure strategy could make sense.
I think you are wrong there: as you can see at https://github.com/dmexe/failsafe-rs/blob/master/src/backoff.rs#L136, the backoff is always regenerated. I think MAX_RETRIES is used to calculate the "binary exponential backoff" by shifting the value 1 << MAX_RETRIES, and a value bigger than 63 would overflow. That's my best guess.
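A small standalone illustration of the shift argument above (not the failsafe crate's implementation, only the arithmetic being discussed): with a 64-bit integer, 1 << n is only defined for n < 64, so the retry exponent has to be capped somewhere.

```rust
// Standalone illustration of the "1 << MAX_RETRIES" point above; this is not
// the failsafe crate's implementation, only the arithmetic being discussed.
fn binary_exponential_delay_ms(base_ms: u64, retries: u32) -> u64 {
    // Cap the exponent: shifting a u64 by 64 or more overflows (and panics in
    // debug builds), so the retry count must stay well below 64.
    let capped = retries.min(30);
    base_ms.saturating_mul(1u64 << capped)
}

fn main() {
    for retries in [0, 1, 5, 10, 30, 100] {
        println!("retries = {:>3} -> delay = {} ms", retries, binary_exponential_delay_ms(10, retries));
    }
}
```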
Still, I do not really get why we use success rate over consecutive failures: the success rate can only be 0% or 100% (a request ends on its first success, and we use these constructions on a per-request basis), whereas consecutive failures have many possible states and are somewhat managed like the error counter in the request guards (I may be wrong, I did not read the crate implementation).
Basically, the request will run forever in the circuit breaker until it gets a successful response. That is why we use an upper bound, max_failures, in the RequestGuard to decide when to drop the request.
It is okay for me to try consecutive failures instead, but then I suggest setting the value to 1 and then backing off and repeating until we either get a successful response or have backed off max_number times, in order to avoid doing too many network requests.
I think you are wrong there: as you can see at https://github.com/dmexe/failsafe-rs/blob/master/src/backoff.rs#L136, the backoff is always regenerated. I think MAX_RETRIES is used to calculate the "binary exponential backoff" by shifting the value 1 << MAX_RETRIES, and a value bigger than 63 would overflow.
Oops, sorry, I should have checked more carefully, thanks for looking at it.
Basically, the request will run forever in the circuit breaker until it gets a successful response. That is why we use an upper bound, max_failures, in the RequestGuard to decide when to drop the request.
That is what I understood :-)
I feel like the number of consecutive failures before backing off could make sense as a percentage of the current number of peers, but I do not really have anything to back that up (except that when testing a few months back I usually got a lot of peers returning invalid values, whereas when we do not have many peers we want to wait). Isn't that a kind of error-rate strategy? I start to wonder whether it may make sense to write our own tailored 'failure policy', like an 'ErrorRateOverTimeWindow' (just thinking out loud).
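To make the 'ErrorRateOverTimeWindow' idea slightly more concrete, here is a minimal standalone sketch of such a counter. It is deliberately not written against the failsafe crate's FailurePolicy trait (whose exact shape is not shown in this thread); it only illustrates the bookkeeping such a policy would need, with hypothetical names.

```rust
// Minimal sketch of an "error rate over a time window" counter, independent of
// the failsafe crate. `record` would be driven by on_demand response
// validation; `should_back_off` is what a failure policy would consult before
// letting another request through. All names are hypothetical.
use std::collections::VecDeque;
use std::time::{Duration, Instant};

struct ErrorRateOverTimeWindow {
    window: Duration,
    max_error_rate: f64,
    // (timestamp, was_error) for every recorded outcome inside the window.
    outcomes: VecDeque<(Instant, bool)>,
}

impl ErrorRateOverTimeWindow {
    fn new(window: Duration, max_error_rate: f64) -> Self {
        Self { window, max_error_rate, outcomes: VecDeque::new() }
    }

    fn record(&mut self, was_error: bool) {
        let now = Instant::now();
        self.outcomes.push_back((now, was_error));
        // Drop outcomes that fell out of the sliding window.
        while let Some(&(t, _)) = self.outcomes.front() {
            if now.duration_since(t) > self.window {
                self.outcomes.pop_front();
            } else {
                break;
            }
        }
    }

    fn should_back_off(&self) -> bool {
        if self.outcomes.is_empty() {
            return false;
        }
        let errors = self.outcomes.iter().filter(|&&(_, e)| e).count();
        (errors as f64 / self.outcomes.len() as f64) > self.max_error_rate
    }
}

fn main() {
    let mut policy = ErrorRateOverTimeWindow::new(Duration::from_secs(10), 0.5);
    for _ in 0..3 { policy.record(true); }  // three bad responses
    policy.record(false);                   // one good one
    println!("back off? {}", policy.should_back_off()); // 3/4 > 0.5 -> true
}
```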
I do not know about dropping response_guard; am I correct in saying that 'request_guard' and 'response_guard' are both in the 'Pending' struct? (Just wondering if they can be merged; it probably only makes sense with a custom 'failure policy'.)
I do not know about dropping response_guard; am I correct in saying that 'request_guard' and 'response_guard' are both in the 'Pending' struct? (Just wondering if they can be merged; it probably only makes sense with a custom 'failure policy'.)
Yes, they are, but they are slightly different: if we get a successful request we insert the same Pending again and wait for responses, and when we get a successful response or a failure we drop the Pending completely. The idea from the beginning was to have two failure policies in the circuit breaker and combine them, but because the state can't be accessed and it would never terminate, I created two wrappers instead. The problem is that we can only query whether a request is permitted or not, so we can't tell whether the circuit will be closed/half-open after backoff or stay open forever. So we can either add a timeout or keep some state ourselves.
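A rough sketch of that last point (why an external bound is needed on top of the breaker), assuming the failsafe-rs CircuitBreaker trait with `call` and `is_call_permitted`; `send_request` is a hypothetical stand-in, and none of this is the PR's actual RequestGuard code:

```rust
// Illustrative only: the breaker alone never says "give up", it only says
// "not now". Termination has to come from an external bound, similar to the
// PR's `max_failures` in RequestGuard.
use std::{thread, time::Duration};
use failsafe::{CircuitBreaker, Config, Error};

// Hypothetical request dispatch; always fails for the sake of the example.
fn send_request() -> Result<(), ()> {
    Err(())
}

fn main() {
    let breaker = Config::new().build();
    let max_failures = 5;
    let mut failures = 0;

    while failures < max_failures {
        if !breaker.is_call_permitted() {
            // Circuit is open: back off for a moment and ask again later.
            thread::sleep(Duration::from_millis(100));
            continue;
        }
        match breaker.call(send_request) {
            Ok(()) => return,                       // successful response: done
            Err(Error::Rejected) => continue,       // raced with the circuit opening
            Err(Error::Inner(())) => failures += 1, // bad response: counted by us
        }
    }
    // Only this external counter lets us drop the request for good.
}
```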
I don't really understand what you are suggesting, or the relationship between requests and responses, when you talk about our own failure policy.
What I can pick out is this:
- We should start with consecutive failures based on the number of peers
- When the number of peers is under some threshold we should back off?
- ?
Well, I think the circuit breaker API doesn't provide the functionality to change the number of peers after the circuit breaker has been created, for example, which makes that solution hard to get right (unless we want to write a circuit breaker ourselves).
Personally, I don't have any intuition about what these values should be and suggest just keeping it simple (e.g. consecutive failures, then backoff, continue backing off, and so on). The number of consecutive failures shall be configurable by the user.
👍 for keeping it simple. I don't have strong intuition on values either, but I identified two cases where my previous configuration was not good:
- start of a light client: there are too few nodes connected and we probably want to wait
- a well-established light client connection: there will be no new nodes, so waiting does not make a lot of sense.
About 'our own failure policy', I was mainly extrapolating that what we need may be some 'RateOverTimeWindow'; since we count errors and not successes, our own implementation of failsafe's FailurePolicy trait could be an idea. Thinking twice, going for the crate's success-rate counting could be another target, but it would involve managing the circuit breaker at a per-peer level instead of a per-request level. So I see two scenarios (I do not say they should be included in this PR):
- we keep per-request management, so we count errors only -> go with an error-count policy or a custom error-rate one.
- we plan to switch to a per-peer-id circuit, so we can count multiple successes (having a mutex in the circuit makes sense in this scenario), and keeping a success rate is explained by this future move. The issue here is that our request handlers are not thread-safe (not sure they need to be; maybe they can stay per-request-specific but use per-user-id circuit breakers). A rough sketch of this per-peer variant follows below.
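For the second scenario, here is a very rough sketch of what per-peer-id circuit breakers could look like, using the failsafe-rs default config, a hypothetical PeerId type, and a plain Mutex around the map; nothing here is from this PR.

```rust
// Rough sketch of per-peer circuit state: one breaker per peer id, shared
// behind a lock so several handlers can consult/record state for the same
// peer. PeerId and all names here are hypothetical, not from this PR.
use std::collections::HashMap;
use std::sync::Mutex;
use failsafe::{CircuitBreaker, Config};

type PeerId = u64;

struct PerPeerBreakers<C, F> {
    breakers: Mutex<HashMap<PeerId, C>>,
    make: F,
}

impl<C: CircuitBreaker, F: Fn() -> C> PerPeerBreakers<C, F> {
    fn new(make: F) -> Self {
        Self { breakers: Mutex::new(HashMap::new()), make }
    }

    // Is this peer currently allowed to receive a request?
    // A breaker is created lazily the first time a peer is seen.
    fn is_peer_permitted(&self, peer: PeerId) -> bool {
        let mut map = self.breakers.lock().expect("not poisoned");
        map.entry(peer).or_insert_with(&self.make).is_call_permitted()
    }
}

fn main() {
    let breakers = PerPeerBreakers::new(|| Config::new().build());
    println!("peer 7 permitted? {}", breakers.is_peer_permitted(7));
}
```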
All right, yeah I think it is out-of-scope for this PR but we can create an issue and investigate it further!
Thanks, Afri, it's far from being merged :P
Bump failsafe to v0.3.0 to enable `parking_lot::Mutex` instead `spin::Mutex`
When a reponse `times-out` provide the actual request or requests that failed
Co-Authored-By: niklasad1 <niklasadolfsson1@gmail.com>
Co-Authored-By: niklasad1 <niklasadolfsson1@gmail.com>
* Use second resolution on CLI args
* Use `consecutive failure policy` instead of `timeOverWindow`
* Add a couple of tests for `request_guard`
/cc @amaurymartiny any thoughts on the format of the error response?
The error format is good for me
I'm in favor of merging this PR. It adds certain benefits outlined in the PR description.
I think using guards at peer level (instead of request level) is about peer ranking and is slightly orthogonal to what we're trying to achieve here.
/// The maximum request interval for OnDemand queries
pub const DEFAULT_REQUEST_MAX_BACKOFF_DURATION: Duration = Duration::from_secs(100);
/// The default window length a response is evaluated
pub const DEFAULT_RESPONSE_TIME_TO_LIVE: Duration = Duration::from_secs(60); |
Why was it changed from 10 secs to 1min, isn't 1min too long for a user to wait?
I don't know, will revert sorry.
I think we still need to revert this line (but haven't yet), right?
Yeah, I should have commented instead of approving the PR to avoid the merge before fixing the comment, sorry.
@@ -1820,8 +1835,11 @@ mod tests {
arg_snapshot_threads: None,

// -- Light options.
arg_on_demand_retry_count: Some(15),
arg_on_demand_inactive_time_limit: Some(15000),
arg_on_demand_response_time_window: Some(2000), |
Could you update these numbers to seconds (/ 1000)?
* refactor(light) : N/W calls with `circuit breaker`
* fix(nits) : forgot to commit new files
* Add tests and change CLI args
* Address grumbles
* fix(failsafe-rs) : Santize input to prevent panics
* chore(failsafe) : bump failsafe, (parking_lot) Bump failsafe to v0.3.0 to enable `parking_lot::Mutex` instead `spin::Mutex`
* Remove `success_rate`
* feat(circuit_breaker logger)
* feat(CLI): separate CLI args request and response
* Fix tests
* Error response provide request kind: When a reponse `times-out` provide the actual request or requests that failed
* Update ethcore/light/src/on_demand/mod.rs Co-Authored-By: niklasad1 <niklasadolfsson1@gmail.com>
* Update ethcore/light/src/on_demand/mod.rs Co-Authored-By: niklasad1 <niklasadolfsson1@gmail.com>
* fix(grumbles): formatting nit
* fix(grumbles)
* Use second resolution on CLI args
* Use `consecutive failure policy` instead of `timeOverWindow`
* Add a couple of tests for `request_guard`
* fix(request_guard): off-by-one error, update tests
Attempt to close #9536
Major changes in this PR:
- An exponentially moving average (provided by failsafe-rs) over a fixed time period determines whether a request is evaluated as faulty or not (but only failures are recorded), plus a simple max_timeout_time for the maximum time that responses are evaluated.
Benefits
- Backs off exponentially upon failures and retries less frequently, to avoid spamming the network.
Circuit breaker
Example