This repository has been archived by the owner on Nov 6, 2020. It is now read-only.

light client : failsafe crate (circuit breaker) #9790

Merged
merged 16 commits into master from light/circuit-breaker
Dec 5, 2018

Conversation

niklasad1
Collaborator

@niklasad1 niklasad1 commented Oct 21, 2018

Attempt to close #9536

Major changes in this PR:

  1. Instead of limiting requests by the number of query retries and empty responses, this introduces an exponentially weighted moving average (provided by failsafe-rs) over a fixed time period in which a request is evaluated as faulty or not (but only failures are recorded), plus a simple max_timeout_time bounding the maximum time for which responses are evaluated
  2. Adds functionality to determine the failure reason for bad responses, which is propagated via the RPC interface
  3. New CLI args:
    • response_time_window
    • request_time_window
    • request_backoff_start
    • request_backoff_max
    • request_backoff_rounds_max
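For illustration only, a light client might be launched with these windows and backoff bounds tuned. The flag spellings and values below are assumptions (modeled on the `--on-demand-max-backoff-rounds` flag visible in the diff excerpts), not the confirmed CLI surface of this PR:

```
# Hypothetical invocation; flag names and values are illustrative assumptions.
parity --light \
  --on-demand-response-time-window=20 \
  --on-demand-request-backoff-start=10 \
  --on-demand-request-backoff-max=100 \
  --on-demand-request-backoff-rounds-max=10
```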

Benefits

  1. A user can decide over how long a time period a request is evaluated as a failure, without specifying the number of responses or queries
  2. The actual failure response is propagated to the user via the RPC interface
  3. The request backs off exponentially upon failures and retries less frequently to avoid spamming the network

Circuit breaker

[figure: circuit-breaker state diagram (closed → open → half-open)]

  • closed - calls are accepted
  • open - failures exceeded a threshold and calls are rejected
  • half-open - calls are accepted again after backing off from failure
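These transitions can be sketched in Rust. This is a minimal illustration only; failsafe-rs's real state machine also carries configurable failure policies and backoff strategies:

```rust
use std::time::{Duration, Instant};

// Illustrative circuit-breaker states; failsafe-rs models these internally.
#[derive(Debug, Clone, Copy, PartialEq)]
enum State {
    Closed,          // calls are accepted
    Open(Instant),   // calls are rejected until the backoff deadline passes
    HalfOpen,        // a trial call is accepted after the backoff
}

struct Breaker {
    state: State,
    consecutive_failures: u32,
    threshold: u32,
    backoff: Duration,
}

impl Breaker {
    fn new(threshold: u32, backoff: Duration) -> Self {
        Breaker { state: State::Closed, consecutive_failures: 0, threshold, backoff }
    }

    fn is_call_permitted(&mut self) -> bool {
        if let State::Open(deadline) = self.state {
            if Instant::now() >= deadline {
                // backoff elapsed: open -> half-open, allow a trial call
                self.state = State::HalfOpen;
            }
        }
        !matches!(self.state, State::Open(_))
    }

    fn on_success(&mut self) {
        // half-open -> closed (or stay closed)
        self.consecutive_failures = 0;
        self.state = State::Closed;
    }

    fn on_failure(&mut self) {
        self.consecutive_failures += 1;
        if self.consecutive_failures >= self.threshold {
            // closed/half-open -> open: reject calls for `backoff`
            self.state = State::Open(Instant::now() + self.backoff);
        }
    }
}

fn main() {
    let mut b = Breaker::new(2, Duration::from_millis(10));
    assert!(b.is_call_permitted());   // closed: accepted
    b.on_failure();
    b.on_failure();                   // threshold reached: circuit opens
    assert!(!b.is_call_permitted());  // open: rejected
    std::thread::sleep(Duration::from_millis(20));
    assert!(b.is_call_permitted());   // backoff elapsed: half-open
    b.on_success();                   // trial succeeded: closed again
    println!("final state: {:?}", b.state);
}
```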

Example

curl --data '{"method":"eth_getTransactionReceipt","params":["0x444172bef57ad978655171a8af2cfd89baa02a97fcb773067aef7794d6913fff"],"id":1,"jsonrpc":"2.0"}' -H "Content-Type: application/json" -X POST localhost:8555

{"jsonrpc":"2.0","error":{"code":-32042,"message":"Bad response on request: [ TransactionIndex ]. Error cause was EmptyResponse, (majority count: 91 / total: 91)"},"id":1}
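The "(majority count: 91 / total: 91)" figure can be sketched as a tally of error causes across peer responses. This is a simplified model; the enum and function names here are assumptions, not the actual on_demand types:

```rust
use std::collections::HashMap;

// Hypothetical error causes, for illustration only.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
enum ResponseError { EmptyResponse, BadProof, Timeout }

/// Return the most common error cause, its count, and the total responses.
fn majority(responses: &[ResponseError]) -> (ResponseError, usize, usize) {
    let mut counts: HashMap<ResponseError, usize> = HashMap::new();
    for &err in responses {
        *counts.entry(err).or_insert(0) += 1;
    }
    let (&err, &count) = counts
        .iter()
        .max_by_key(|(_, v)| **v)
        .expect("at least one response recorded");
    (err, count, responses.len())
}

fn main() {
    let responses = vec![ResponseError::EmptyResponse; 91];
    let (cause, count, total) = majority(&responses);
    println!("Error cause was {:?}, (majority count: {} / total: {})", cause, count, total);
    assert_eq!((cause, count, total), (ResponseError::EmptyResponse, 91, 91));
}
```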

@niklasad1 niklasad1 added A0-pleasereview 🤓 Pull request needs code review. M4-core ⛓ Core client code / Rust. labels Oct 21, 2018
@niklasad1 niklasad1 requested review from cheme and ordian October 21, 2018 20:07
Member

@ordian ordian left a comment

Looks good in general 👍 , I've left some questions/comments.

ethcore/light/src/on_demand/mod.rs (outdated, resolved)
ethcore/light/src/on_demand/tests.rs (outdated, resolved)
ethcore/light/src/on_demand/tests.rs (outdated, resolved)
ethcore/light/src/on_demand/mod.rs (outdated, resolved)
ethcore/light/src/on_demand/request_guard.rs (outdated, resolved)
ethcore/light/src/on_demand/tests.rs (outdated, resolved)
ethcore/light/src/types/request/mod.rs (outdated, resolved)
ARG arg_on_demand_max_backoff_rounds: (Option<usize>) = None, or |c: &Config| c.light.as_ref()?.on_demand_max_backoff_rounds,
"--on-demand-max-backoff-rounds=[TIMES]",
"Specify light client maximum number of backoff iterations",

Member

Not sure a user wants to have control over all those knobs, maybe it's ok to just have good (empirical research?) defaults, but I don't have a strong opinion on that.

Contributor

I think it is good to expose the primitives to give full control, but some alternate preset (a basic configuration) could also be a nice thing.

(At this point it probably lacks documentation of the impact of every option, but maybe the wiki or another source is a better place for that: it seems difficult to explain quickly.)

@niklasad1 niklasad1 force-pushed the light/circuit-breaker branch from 7482c00 to 76fd478 Compare October 23, 2018 14:36
ordian
ordian previously approved these changes Oct 23, 2018
@ordian ordian added A8-looksgood 🦄 Pull request is reviewed well. and removed A0-pleasereview 🤓 Pull request needs code review. labels Oct 23, 2018
@5chdn 5chdn added this to the 2.2 milestone Oct 24, 2018
Contributor

@cheme cheme left a comment

There are a few things I'm missing to review this PR properly (I left a few review comments in the code, but they may not be pertinent).

A general explanation of how the circuit behaves; more precisely, how many circuit states are managed and what they follow (global, per request, or per user ...). At this point it only makes sense to me as a global on_demand circuit state, but at the same time I feel it does not totally match the implementation.

It would also be good to have a short word on the advantages of the approach (but I think this was already discussed in related issues or PRs).

A last thing is the additional dependency on the 'spin' crate. We use parking_lot and had past issues with deadlocks, so I am a bit worried about the impact of the circuit breaker, but I did not analyse further because I first want to be sure of our strategy when using the circuit breaker (i.e., is it global or local to each request ...).

Cargo.lock (outdated, resolved)
ethcore/light/src/on_demand/mod.rs (resolved)
ethcore/light/src/on_demand/mod.rs (resolved)
ethcore/light/src/on_demand/mod.rs (resolved)
ethcore/light/src/on_demand/response_guard.rs (resolved)
ARG arg_on_demand_max_backoff_rounds: (Option<usize>) = None, or |c: &Config| c.light.as_ref()?.on_demand_max_backoff_rounds,
"--on-demand-max-backoff-rounds=[TIMES]",
"Specify light client maximum number of backoff iterations",

Contributor

I think it is good to expose the primitives to give full control, but some alternate preset (a basic configuration) could also be a nice thing.

(At this point it probably lacks documentation of the impact of every option, but maybe the wiki or another source is a better place for that: it seems difficult to explain quickly.)

@5chdn
Contributor

5chdn commented Oct 26, 2018

Please rebase

@niklasad1 niklasad1 force-pushed the light/circuit-breaker branch 4 times, most recently from 863b651 to c6051f5 Compare October 27, 2018 08:20
@niklasad1 niklasad1 added A4-gotissues 💥 Pull request is reviewed and has significant issues which must be addressed. and removed A8-looksgood 🦄 Pull request is reviewed well. labels Oct 27, 2018
@5chdn 5chdn modified the milestones: 2.2, 2.3 Oct 29, 2018
@niklasad1 niklasad1 force-pushed the light/circuit-breaker branch 3 times, most recently from 039e995 to 5ba2ca3 Compare October 31, 2018 17:32
@niklasad1 niklasad1 force-pushed the light/circuit-breaker branch from 5ba2ca3 to c2c53e2 Compare November 9, 2018 16:36
@niklasad1 niklasad1 added A0-pleasereview 🤓 Pull request needs code review. and removed A4-gotissues 💥 Pull request is reviewed and has significant issues which must be addressed. labels Nov 9, 2018
@niklasad1
Collaborator Author

niklasad1 commented Nov 9, 2018

@ordian @cheme

I refactored this, and the major thing to sort out is whether the timeout feature on responses is the way to go. IMHO this is more user-friendly than limiting the number of bad_responses: a user just configures "I want to wait at most x seconds for this request".

However, the response guard doesn't need to use the failsafe mechanism, as you pointed out earlier, which I can remove if I should continue with this.

Dump of the new RPC error response:

curl --data '{"method":"eth_getTransactionReceipt","params":["0x444172bef57ad978655171a8af2cfd89baa02a97fcb773067aef7794d6913fff"],"id":1,"jsonrpc":"2.0"}' -H "Content-Type: application/json" -X POST localhost:8555

{"jsonrpc":"2.0","error":{"code":-32042,"message":"Bad response on request: [ TransactionIndex ]. Error cause was EmptyResponse, (majority count: 91 / total: 91)"},"id":1}

@5chdn
Contributor

5chdn commented Nov 25, 2018

Could you give this a final review @ordian @cheme?

Collaborator

@sorpaas sorpaas left a comment


Mostly LGTM. A few small grumbles.

It indeed seems that the use of failsafe on ResponseGuard is optional (and one can also argue that NoBackoff doesn't look really nice). But I think that's not a huge issue -- being wrapped in ResponseGuard means it can easily be changed to work without failsafe, or to use other circuit-breaker policies.

ethcore/light/src/on_demand/request_guard.rs (outdated, resolved)
trace!(target: "circuit_breaker", "ResponseGuard: {:?}, responses: {:?}", self.state, self.responses);
let (&err, &max_count) = self.responses.iter().max_by_key(|(_k, v)| *v).expect("got at least one element; qed");
let majority = self.responses.values().filter(|v| **v == max_count).count() == 1;
// FIXME: more efficient with a separate counter
Collaborator

Would be nice if you could create a new issue and include the issue number there!

Collaborator Author

I will fix this if/when @cheme signs off on this :P

Contributor

Sorry to ask, did I lock something?

Collaborator Author

@cheme

Haha, I was referring to the functionality that this PR breaks, i.e., replacing the response limit per peer with a simple timeout instead. If that is not sorted, there is no point in implementing other things, IMHO!

Contributor

@cheme cheme Nov 26, 2018

I must admit those points are a bit old in my mind. The main thing is that LES query results should always be considered unsafe (queries go to random full nodes) and be checked.
Checking a result for LES is easy (and done before caching, up to the cht block header reference), but checking the absence of a result is impossible (or at least quite a difficult question).
Therefore, when you query information that does not exist, there are not many ways to accept a 'no result' reply other than saying either 1: 'I queried all the nodes in my pool and did not get new nodes for a certain time', or 2: 'I made N random queries and I consider that enough to think that at least one of those replies is accurate'. Those were the two approaches in place before.

So this PR does things differently. At first I felt this was additional complexity and did not really see its immediate usefulness, but on the other hand, trying new/different things is often what enables great things in the long term.

So, trying to explain the new 'no reply' conditions, what are the correct formulations? At this point, what I read is:

  • 'SuccessRateOverTimeWindows' fails a response due to no first success within the time window (0% success), meaning a 'simple timeout'
  • request_limit_reached: same as my second condition. It is managed outside the failsafe crate (reading the failsafe crate, I observe that there is a hard limit of 30 attempts for exponential backoff); here I think that using a consecutive-failure strategy could make sense.

So it does not change that much (just a hard timeout on responses instead of my first condition, maybe simpler for the user), so I would not say it 'breaks' things.

Still, I do not really get why we use success rate over consecutive failures (the success rate can only be 0 or 100, since a request ends on the first success and we use these constructions on a per-request basis, whereas consecutive failures take many possible states and are somehow managed like the error counter in request_guards; I may be wrong, I did not read the crate implementation).

Edit: thinking twice, no result in LES could be seen as invalid if the query is valid on most nodes. So the previous statement is far from correct; I should check whether LES queries are validated.

Collaborator Author

Ok, thanks for taking the time. To verify that I understood your concerns:

  1. This PR still might be useful (I want to add that it gives the user some information about why the request to the full node failed, or at least what the majority response was)
  2. The circuit breaker for responses is unnecessary, and we should replace it with a simple timeout, as you pointed out earlier; thanks for that!
  3. Exponential backoff for requests is useful, but because we only register bad responses here, it is better to use consecutive failures (also easier to understand)

I just observe that there is a hard limit at 30 attempts for exponential backoff), here I think that using consecutive failure strategy could make sense.

I think you are wrong there: as you can see in https://github.com/dmexe/failsafe-rs/blob/master/src/backoff.rs#L136, the backoff is always regenerated. I think MAX_RETRIES is used to calculate the "binary exponential backoff" by shifting the value (1 << MAX_RETRIES), and a value bigger than 63 would overflow. That's my best guess.
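To make the overflow point concrete, here is a small sketch of why a shift-based backoff factor must cap its retry counter. This models the intent being discussed, not failsafe-rs's actual code; the names and the cap value are illustrative:

```rust
// Sketch: a "binary exponential backoff" computes its delay factor by
// shifting, and 1u64 << n overflows for n >= 64, so the retry counter
// must be capped (illustrative cap of 30, echoing the discussion above).
fn backoff_factor(retries: u32) -> u64 {
    const MAX_RETRIES: u32 = 30; // keeps the shift well inside u64 range
    1u64 << retries.min(MAX_RETRIES)
}

fn backoff_ms(start_ms: u64, retries: u32) -> u64 {
    start_ms.saturating_mul(backoff_factor(retries))
}

fn main() {
    assert_eq!(backoff_factor(0), 1);
    assert_eq!(backoff_factor(3), 8);
    // Without the cap, retries = 100 would overflow; with it, the factor
    // saturates at 2^30.
    assert_eq!(backoff_factor(100), backoff_factor(30));
    println!("backoff after 5 retries from 10 ms: {} ms", backoff_ms(10, 5));
}
```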

Still, I do not really get why we use success rate over consecutive failure (success rate can only be 0 or 100 (request end on first success and we use our constructions on a per request basis) whereas consecutive failure takes a lot of possible states and is somehow managed as the error counter in request_guards (I may be wrong I did not read the crate implementation)).

Basically, the request will run forever until it gets a successful response in the circuit breaker. That is why we use an upper bound, max_failures, in the RequestGuard to decide when to drop the request.

It is okay for me to try consecutive failures instead, but then I suggest setting the value to 1, then backing off, and repeating until we either get a successful response or have backed off #max_number times, in order to avoid making too many network requests.
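The scheme discussed here (fail, back off exponentially, give up after a maximum number of rounds) can be sketched as follows. `RequestGuard` and `Outcome` in this sketch are simplified illustrations, not the PR's actual types:

```rust
use std::time::Duration;

#[derive(Debug, PartialEq)]
enum Outcome {
    Retry(Duration), // wait this long, then retry the request
    Drop,            // backoff rounds exhausted: give up on the request
}

// Simplified model of the RequestGuard idea under discussion.
struct RequestGuard {
    backoff: Duration,
    backoff_max: Duration,
    rounds_left: u32,
}

impl RequestGuard {
    fn new(start: Duration, max: Duration, max_rounds: u32) -> Self {
        RequestGuard { backoff: start, backoff_max: max, rounds_left: max_rounds }
    }

    /// Register a failed request; either back off and retry, or drop it.
    fn register_error(&mut self) -> Outcome {
        if self.rounds_left == 0 {
            return Outcome::Drop;
        }
        self.rounds_left -= 1;
        let wait = self.backoff;
        // double the backoff, capped at the configured maximum
        self.backoff = (self.backoff * 2).min(self.backoff_max);
        Outcome::Retry(wait)
    }
}

fn main() {
    let mut g = RequestGuard::new(Duration::from_secs(1), Duration::from_secs(4), 3);
    assert_eq!(g.register_error(), Outcome::Retry(Duration::from_secs(1)));
    assert_eq!(g.register_error(), Outcome::Retry(Duration::from_secs(2)));
    assert_eq!(g.register_error(), Outcome::Retry(Duration::from_secs(4)));
    assert_eq!(g.register_error(), Outcome::Drop); // rounds exhausted
    println!("request dropped after 3 backoff rounds");
}
```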

Contributor

I think you are wrong there: as you can see in https://github.com/dmexe/failsafe-rs/blob/master/src/backoff.rs#L136, the backoff is always regenerated. I think MAX_RETRIES is used to calculate the "binary exponential backoff" by shifting the value (1 << MAX_RETRIES), and a value bigger than 63 would overflow.

Oops, sorry, I should have checked more carefully; thanks for looking into it.

Basically, the request will run forever until it gets a successful response in the circuit breaker. That is why we use an upperbound max_failures in the RequestGuard when to drop the request

That is what I understood :-)

I feel like the number of consecutive failures before backoff could make sense as a percentage of the current number of peers, but I do not really have anything to back that up (except that when testing a few months back I usually got a lot of peers returning invalid values, but when we do not have many peers we want to wait). Isn't that a kind of error-rate strategy? I am starting to wonder if it may make sense to write our own tailored 'failure policy', like 'ErrorRateOverTimeWindow' (just thinking out loud).

I do not know about dropping response_guard; am I correct in saying that 'request_guard' and 'response_guard' are both in the 'Pending' struct? (Just wondering if they can be merged; it probably only makes sense with a custom 'failure policy'.)

Collaborator Author

@niklasad1 niklasad1 Nov 26, 2018

I do not know about dropping response_guard; am I correct in saying that 'request_guard' and 'response_guard' are both in the 'Pending' struct? (Just wondering if they can be merged; it probably only makes sense with a custom 'failure policy'.)

Yes, they are, but they are slightly different: if we get a successful request we insert the same Pending again and wait for responses, and when we get a successful response or a failure we drop the Pending completely. The idea from the beginning was to have two failure policies in the circuit breaker and combine them, but because the state can't be accessed and it would never terminate, I created two wrappers instead. The problem is that we can only query whether a request is permitted or not, and thus can't tell whether the backoff will be closed/half-open or stay open forever. So we can either add an additional timeout or keep some state ourselves.

I don't really understand what you are suggesting, or the relationship between requests and responses, when you talk about our own failure policy.

What I can pick off is this:

  • We should start with consecutive failures based on the number of peers
  • When the number of peers is under some threshold, we should back off?
  • ?

Well, I think the circuit breaker API doesn't provide the functionality to change the number of peers after the circuit breaker has been created, which makes that solution hard to get right (unless we want to write a circuit breaker ourselves).

Personally, I don't have any intuition about what these values should be, and suggest keeping it simple (e.g., consecutive failures, then backoff, and so on). The number of consecutive failures should be configurable by the user.

Contributor

👍 for keeping it simple. I don't have a strong intuition on the values either, but I identify two cases where my previous configuration was not good:

  • start of a light client: there are too few nodes connected and we probably want to wait
  • well-established connection in a light client: there will be no new nodes, so waiting does not make a lot of sense.

About 'our own failure policy', I was mainly speculating that what we need may be some 'RateOverTimeWindow', but since we count errors and not successes, our own implementation of the failsafe trait FailurePolicy could be an idea. Thinking twice, going for the crate's success-rate counting could be another target, but it would involve managing the circuit breaker per peer instead of per request. So I see two scenarios (I am not saying they should be included in this PR):

  • we keep per-request management, so we count errors only -> go with an error-count policy or a custom error-rate one.
  • we plan to switch to a per-peer-id circuit so we can count multiple successes (having a mutex in the circuit makes sense in this scenario), and keeping a success rate is explained by this future move. The issue here is that our request handlers are not thread-safe (not sure they need to be; maybe they can stay per-request specific but use per-peer-id circuit breakers).

Collaborator Author

All right, yeah I think it is out-of-scope for this PR but we can create an issue and investigate it further!

@sorpaas sorpaas added the A5-grumble 🔥 Pull request has minor issues that must be addressed before merging. label Nov 26, 2018
@niklasad1
Collaborator Author

Thanks, Afri, it's far from being merged :P

niklasad1 and others added 15 commits November 27, 2018 23:24
Bump failsafe to v0.3.0 to enable `parking_lot::Mutex` instead
`spin::Mutex`
When a reponse `times-out` provide the actual request or requests that failed
Co-Authored-By: niklasad1 <niklasadolfsson1@gmail.com>
Co-Authored-By: niklasad1 <niklasadolfsson1@gmail.com>
* Use second resolution on CLI args
* Use `consecutive failure policy` instead of `timeOverWindow`
* Add a couple of tests for `request_guard`
@niklasad1 niklasad1 force-pushed the light/circuit-breaker branch from 768feac to f5b1a66 Compare November 27, 2018 22:25
@niklasad1
Collaborator Author

/cc @amaurymartiny any thoughts on the format of a non-successful request? See example in the PR description.

@amaury1093
Contributor

The error format is good for me

Member

@ordian ordian left a comment

I'm in favor of merging this PR. It adds certain benefits outlined in the PR description.
I think using guards at peer level (instead of request level) is about peer ranking and is slightly orthogonal to what we're trying to achieve here.

/// The maximum request interval for OnDemand queries
pub const DEFAULT_REQUEST_MAX_BACKOFF_DURATION: Duration = Duration::from_secs(100);
/// The default window length a response is evaluated
pub const DEFAULT_RESPONSE_TIME_TO_LIVE: Duration = Duration::from_secs(60);
Member

Why was it changed from 10 secs to 1min, isn't 1min too long for a user to wait?

Collaborator Author

I don't know, will revert sorry.

Collaborator

I think we still need to (but haven't) revert this line at this moment, right?

Member

Yeah, I should have commented instead of approving the PR to avoid the merge before fixing the comment, sorry.

@@ -1820,8 +1835,11 @@ mod tests {
arg_snapshot_threads: None,

// -- Light options.
arg_on_demand_retry_count: Some(15),
arg_on_demand_inactive_time_limit: Some(15000),
arg_on_demand_response_time_window: Some(2000),
Member

Could you update these numbers to seconds (/ 1000)?

@5chdn 5chdn added A8-looksgood 🦄 Pull request is reviewed well. and removed A0-pleasereview 🤓 Pull request needs code review. labels Dec 4, 2018
@5chdn 5chdn merged commit 7fb3379 into master Dec 5, 2018
@5chdn 5chdn deleted the light/circuit-breaker branch December 5, 2018 09:37
niklasad1 added a commit that referenced this pull request Dec 16, 2018
* refactor(light) : N/W calls with `circuit breaker`

* fix(nits) : forgot to commit new files

* Add tests and change CLI args

* Address grumbles

* fix(failsafe-rs) : Santize input to prevent panics

* chore(failsafe) : bump failsafe, (parking_lot)

Bump failsafe to v0.3.0 to enable `parking_lot::Mutex` instead
`spin::Mutex`

* Remove `success_rate`

* feat(circuit_breaker logger)

* feat(CLI): separate CLI args request and response

* Fix tests

* Error response provide request kind

When a reponse `times-out` provide the actual request or requests that failed

* Update ethcore/light/src/on_demand/mod.rs

Co-Authored-By: niklasad1 <niklasadolfsson1@gmail.com>

* Update ethcore/light/src/on_demand/mod.rs

Co-Authored-By: niklasad1 <niklasadolfsson1@gmail.com>

* fix(grumbles): formatting nit

* fix(grumbles)

* Use second resolution on CLI args
* Use `consecutive failure policy` instead of `timeOverWindow`
* Add a couple of tests for `request_guard`

* fix(request_guard): off-by-one error, update tests
insipx pushed a commit to insipx/parity-ethereum that referenced this pull request Dec 17, 2018
Labels
A8-looksgood 🦄 Pull request is reviewed well. M4-core ⛓ Core client code / Rust.
Development

Successfully merging this pull request may close these issues.

Refactor on-demand queries to use failsafe-rs
6 participants